We have here Dan from Facebook. He's going to talk to us about resource control. Thank you.

Thanks so much. So, originally I was supposed to give this talk with Tejun, who works with me at Facebook; we both work out of the New York office. I work on containers, and Tejun works on the Linux kernel, on cgroups in particular, so he's intimately familiar with all the details here. I'll do my best to represent him well, but we'll see where we get.

So... losing signal. All right, hopefully that doesn't happen too much more.

The high-level view of Facebook infrastructure that you need for context on this talk: we run a lot of different services. There are web servers, there are load balancers, and there are thousands and thousands of different microservices that make up Facebook infrastructure. There are also a lot of data stores: databases, key-value stores, caches. You name it, it runs inside Facebook. The vast majority of the servers in our data centers are Linux servers, and the workloads run...

Sorry, I'm not sure if there's an issue on my end here or not. We'll hopefully live with this, but maybe there's an issue to be fixed.

So, yeah, the vast majority of our machines are Linux servers. The workloads run in containers; our container system is called Tupperware, and that's what I work on. And we have a set of system services that run on every machine in the fleet. You can think of these as the Facebook operating system: some number of services like Chef and sshd, common ones you've heard of, and a lot that are custom for our purposes.

Let's see if we can get the slides working again; let me unplug and try plugging in. All right, I'll hold my laptop above my head. We're going to see if we can get another cable. Yeah... all right. Nope. All right, I'll start all over again now.
I'm kidding. So: the Tupperware agent, which is responsible for launching containers, and all of these services, I'll refer to as widely deployed binaries, or WDBs. Again, most of these you've probably heard of; some are Facebook-custom.

The whole idea here is: what are we talking about when we mention resource control? The vast majority of software developers at Facebook write services that run in their containers, we have a fairly wide set of services that run on the machine itself, and we have a lot of questions about how we manage the resources of a single machine: memory, compute, and IO, primarily.

One question someone might have is: how do we ensure that a service gets enough resources to run successfully? What guarantees can we provide to service owners about how much they should expect to be able to use on a particular machine? What happens if one of these widely deployed binaries regresses in its memory consumption? This is something that has happened, and how do we prevent it from causing fleet-wide failures, where suddenly we have no memory on any of our machines?
Every machine starts thrashing, and we're done. Another case that has happened a bunch: a misbehaving workload suddenly starts consuming a lot of memory, the machine starts acting up, there isn't enough IO to go around, and suddenly the host is unresponsive; then the workload fails over to another machine and does the same thing, and we start seeing these cascading failures.

Finally, there's a whole slew of things we want to look at, like workload stacking: prioritizing multiple workloads on the same machine, dealing with latency-sensitive versus batch jobs. All of this loosely falls under the domain of resource control, and in this talk I'm going to cover a lot of how we approach these problems.

I'm going to talk about cgroups, and when I do, I'll exclusively mean cgroup v2. Actually, before I get going: show of hands, who here is familiar with cgroups? All right, a good chunk of people. Who is familiar with the distinction between cgroup v2 and cgroup v1? All right, also a pretty good set. If you want to know that distinction, please see Chris Down's talk from FOSDEM 2017, where he goes into it in depth. I'll give a very high-level overview of cgroups: they're basically the Linux kernel mechanism for doing resource control.

Tejun couldn't make it, but I stole an image of him off the internet, and a quote from him: he refers to cgroup as "a mechanism to organize processes hierarchically and distribute system resources along the hierarchy." The way I translate this (and I'm lying a little about some of the details, just to simplify things): a cgroup is a tree. You have one root cgroup, and cgroups can have children; processes belong to one and only one leaf cgroup. So it's a hierarchical structure that we can use to do resource control. systemd calls non-leaf cgroups "slices."
You'll hear me use that term a bunch as well. Each cgroup can be used to measure the consumption of resources: at any point in the tree, you can ask how much CPU, memory, IO, or any number of other things is being consumed. And you can also control the distribution of resources: you can set resource limits and a bunch of other configuration at various points in the tree.

I'll explicitly call out that this is not a security mechanism. It's totally orthogonal to namespaces and chroots, but when we construct containers, for example, we use both cgroups and those other concepts.

This is all configurable through systemd; I recommend you look at the systemd resource control man page for the details of how to do that. I'm going to pretty much exclusively refer to the kernel cgroup mechanism names; there's usually a one-to-one mapping to systemd configuration, and I'm not going to call those out. Just look at the man page if you're interested.

So, cgroups have a bunch of control mechanisms, and I thought I'd give a really brief overview of how that looks. One is that for each cgroup you can set weights, like the cpu.weight or io.weight that I'm going to talk about. Weights basically allow you to give out proportional amounts of CPU, or whatever the resource is, across the tree. Usually only sibling cgroups really compete on a resource like that. Weights work nicely because as long as nothing is contended, anyone who wants to consume more CPU can; as soon as the resource becomes contended, the weights determine who gets how much.
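To make the weight semantics concrete, here's a small illustrative model (my own sketch, not Facebook code and not the kernel's actual scheduler): under contention, each competing sibling gets the resource in proportion to its weight, and whatever an idle sibling leaves on the table is redistributed among the siblings that still want more.

```python
def weighted_shares(capacity, siblings):
    """Toy model of cgroup v2 weight semantics.

    siblings: {name: (weight, demand)} -> {name: allocation}.
    Proportional under contention, work-conserving otherwise:
    capacity a sibling doesn't want flows to the ones that do.
    """
    alloc = {name: 0.0 for name in siblings}
    active = dict(siblings)
    while active:
        remaining = capacity - sum(alloc.values())
        if remaining <= 1e-9:
            break
        total_weight = sum(w for w, _ in active.values())
        satisfied = []
        for name, (weight, demand) in active.items():
            fair = remaining * weight / total_weight
            alloc[name] += min(fair, demand - alloc[name])
            if alloc[name] >= demand - 1e-9:
                satisfied.append(name)
        if not satisfied:   # everyone took a full fair share
            break
        for name in satisfied:
            del active[name]
    return alloc

# All three siblings want more CPU than exists: pure weight split.
contended = weighted_shares(100, {
    "workload": (90, 500), "system": (9, 500), "hostcritical": (1, 500)})
# The workload only wants 20 units: its leftover is redistributed
# to the other siblings at their 9:1 ratio.
idle = weighted_shares(100, {
    "workload": (90, 20), "system": (9, 500), "hostcritical": (1, 500)})
```

In the contended case the split is exactly 90/9/1; in the idle case the workload takes its 20 and system/hostcritical absorb the slack, which is the work-conserving property the weights buy you.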
Then there are limits, which I think most people are familiar with: things like cpu.max and memory.max. I can say this cgroup, and all the processes that live within it, can only take five gigabytes of memory. That's a hard limit; if it reaches that and can't get any more memory, you run out, you invoke the OOM killer, and you kill things.

There are a few kinds of allocations, which I'm not really going to cover. These reserve resources, setting them aside so they can only be used exclusively for a single purpose. This is mostly useful in real-time use cases; we don't play with it much at Facebook.

And protections are the last category of control mechanism. This is something like memory.low. The way memory works is that once you start hitting contention, the kernel begins to look at what memory can be reclaimed: file cache can be flushed, all that sort of stuff. memory.low allows us to protect a cgroup's memory from reclaim, to say: if you are under four gigs of memory, you will not have your memory reclaimed; we'll prefer for other services, in other cgroups, to take that hit.

Our resource control philosophy, which is how we approach all this work, I've broken down into three topics. One, which I think is quite obvious: we want our fleet to be as homogeneous as possible. The second is that we try to avoid setting limits, for a number of reasons. And finally, we depend heavily on monitoring and visibility to make all of this function well.

I'll start with homogeneity. This is probably the most obvious one, but having configuration that differs depending on where you run, which data center, or what kind of workload you run really makes things difficult for us, because configuration can bit rot.
When we want to update something, suddenly we have to update it in several places. Making changes to this configuration is also costly: you often don't know the effect, whether a change was positive or not, until you've run it at scale for quite a while. And the third thing is that most developers at Facebook shouldn't need to care what machine they're running on: they get a container, they get some guarantees, and they go forth from there. That's why homogeneous resource control is really a goal for us. We do have to make some compromises here, but this is part of our philosophy.

The other thing is avoiding limits, which may be somewhat controversial. There are a number of reasons for it. Every time you hit a cgroup limit, you are saying, "I'm okay with wasting resources here"; that's fundamentally the case. If I hit cpu.max while there is idle CPU, that means I'm simply prevented from consuming that CPU. To use the operating-systems-theory term, this is non-work-conserving: there's more work to do, and we can't use the resource. We want to avoid limits in part because we want to increase the utilization of our data centers.

There's another big reason, which is that hitting a limit is a really, really heavy-handed way to budget resources. Suddenly my service is working at 99% of its limit with no issues, right? I increase utilization 2% more, and now I'm out of memory and killing everything. That can turn into some pretty major issues across the whole fleet. Because of this, in many cases we need warning well before we hit the limit, and actually applying one can cause the kind of cascading failures we see, where one thing changed and suddenly, across the fleet.
We're hitting issues everywhere. Another thing, of course, is that configuring limits just requires a lot of tuning: suddenly you're bumping up against the limit and have to decide, should I raise it? Is this limit too high, should I lower it? And the final reason, a more pragmatic aspect, is that we've hit a lot of priority inversions with limits. As soon as something starts getting throttled, there are many cases where another process can get blocked behind it. In many cases these are bugs we can go and fix in the kernel, but we sort of feel we'll always be finding these kinds of cases.

I'll give one war story about this. There's one system service in particular that we run on pretty much every machine in the fleet, and among other things, it logs /proc/<pid>/cmdline for all processes. The thing about doing that is that the command line of a process lives in its address space, so in order to read /proc/<pid>/cmdline for all processes, the kernel acquires mmap_sem, the semaphore that protects a process's address space from being mutated while we're looking at it, and then goes and reads the command line. This isn't exactly how it happened, but it's an equivalent scenario: someone added cpu.max to this system service for safety, so that if something changed in it, it would be limited from taking up all the CPU on the machine. Suddenly we found cases where the service would acquire mmap_sem and then relinquish the CPU, because it was at its CPU limit. So whatever that pid was, it could be trying to spawn a thread, and it's blocked on its own mmap_sem, because this other, throttled process had acquired it. This is just a classic priority inversion: the CPU limit on the WDB is preventing other services from making forward progress. For this particular one, I
think we can probably find a solution in the Linux kernel, by having a process not relinquish the CPU while it's holding mmap_sem. But we've found many of these kinds of cases. If we don't apply limits, the only case where this is really an issue is if we're simply out of CPU on the machine, and then you kind of expect to see some stalls anyway; whether that's because a process is holding a lock or not doesn't really matter. As soon as there's enough free CPU, you make forward progress.

Monitoring and visibility is a tricky topic, but basically we use cgroups as a way to classify things: what is the "tax," across all resources, of running the Facebook OS, that whole collection of widely deployed binaries and everything else we have? How do we track that and drive it down, so workloads can use as much as they want? Similarly, how can we provide guarantees to service owners by saying, "here's how much we're using across the fleet, here's what you can expect to use safely," and make that feasible? This is a really hard problem. Johannes is giving a talk tomorrow about memory sizing in particular, where this has been a challenge, and I encourage you to take a look at it; I won't go into further details here.

So, I wanted to get pretty concrete here: what does the Facebook cgroup hierarchy look like, and how do we apply our various resource control mechanisms to it?
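Before getting to the hierarchy, a note on the measurement side: per-cgroup accounting is exposed as plain-text stat files under the cgroup2 mount (for example `cpu.stat` at any level of the tree), in a flat "key value" format that is trivial to parse. A minimal sketch, with made-up sample contents:

```python
def parse_flat_keyed(text):
    """Parse a cgroup v2 flat-keyed file (cpu.stat, memory.stat, ...):
    one "key value" pair per line, integer values."""
    stats = {}
    for line in text.splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats

# On a real system you would read e.g.
# /sys/fs/cgroup/system.slice/cpu.stat; this sample is illustrative.
sample_cpu_stat = """\
usage_usec 8335146
user_usec 6021370
system_usec 2313776"""

stats = parse_flat_keyed(sample_cpu_stat)
```

Summing `memory.current` or `cpu.stat` usage over a slice's subtree is exactly the kind of classification the talk describes: the slice boundary is the accounting boundary.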
At the root of the cgroup tree we have system.slice, which is where a systemd service runs by default; that part should be common to a lot of different systems. We create two additional top-level slices: one is the workload slice, and the other is the hostcritical slice. The idea is that almost everything we just run on the system, these WDBs, you name it, Chef, or a lot of the monitoring we do, all runs in system.slice. I'll dive a little deeper into the workload slice, but that's where we run containers and anything really responsible for whatever workload we're running: the web servers, the databases, you name it. The hostcritical slice is the stuff we need to have up for the host to be operational at all: that's sshd, it's our userspace OOM killer, it's the Tupperware agent, and a couple of other things like D-Bus.

This high-level categorization is enough for us to apply a bunch of logical resource control policies. We would say hostcritical should be protected at all costs: anything that interferes with it, we will take action on. system.slice can oftentimes be delayed or stalled if it means the workload is protected from any ill effects; it's fine if Chef runs, you know, 5% slower, but it's often not fine if a web server runs 5% slower.

The workload slice we further... oh, sorry, skipped ahead; covered those. The workload slice we further divide into three things. workload-tw.slice is basically the slice Tupperware owns: when the container system launches containers, it always puts them in workload-tw.slice. The workload-wdb slice is another place where we run some widely deployed binaries; they live on the host, but they provide some functionality that the workload depends on, oftentimes.
These are things like caching configuration data, or a number of things that don't live inside the container but that the workload induces a lot of resource consumption from and is highly dependent on working well. There aren't many of these, but this is a concession to the fact that some things on the host do not always live inside the container yet provide useful functionality for the workload. And the final one is workload-tw-commands.slice: this is Tupperware, the container agent, doing some particularly resource-intensive work on behalf of the workload, like setting up a chroot or fetching packages for a container. All of that we want to make sure is properly isolated, and accounted for as part of the workload, really.

So the next question is how we configure it all, and as I said, we avoid limits as much as possible. The three things we've relied on are, first, memory.low, which, as I said, is a way to say: as long as you are below this watermark, you are not getting reclaimed from.
We prefer to reclaim memory from other cgroups. cpu.weight, similarly, is just a way for us to divide up CPU. I'm not going to get into the specific numbers here, just because I don't think they're particularly interesting, and a lot of this is just experimentally validated. The third thing we've used, which I haven't talked about yet, is io.latency.

One thing that may not be obvious: if you start protecting memory, meaning this cgroup, the workload slice for example, gets most of the memory on the machine protected, then when we hit a memory shortage, we start reclaiming heavily from system.slice. As soon as you do that, you start inducing a lot of IO from system.slice, because it starts flushing the page cache and then having to read a lot back from disk. So if you're protecting memory, you also need to protect IO, or you're just translating the problem into an IO shortage very quickly. Anyone who's run a Linux machine with processes that consume a lot of memory knows this: the first thing that really starts limiting you is the IO. You're just thrashing on the disk and not making forward progress. So IO control is really important for us.

Historically, we've used io.latency as our control. The way it works is you say, "the workload slice should hit a target of 40-millisecond IO completions," and in the kernel, after it issues IOs, it checks how long they took and decides: if we are over our target, we need to throttle some other cgroup that has a higher io.latency target set for it. This works okay, but we oftentimes hit cases where we're throttling even though there's plenty of available IO on the system. This has to do with the fact that you see very different behavior depending on the device: some devices see IO latency spikes not because the disk is overutilized, but because, say, the disk had to seek a bunch to complete certain operations.
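As a toy model of the throttling decision just described (a deliberately simplified sketch of the idea, not the kernel's io.latency implementation): when a cgroup misses its latency target, the candidates for throttling are the cgroups configured with looser targets.

```python
def throttle_victims(targets_ms, observed_ms):
    """targets_ms: {cgroup: io.latency target}; observed_ms: measured
    average completion latencies. Returns the cgroups to throttle:
    anything with a looser (higher) target than a cgroup that is
    currently missing its own."""
    victims = set()
    for name, target in targets_ms.items():
        if observed_ms.get(name, 0.0) > target:
            victims |= {other for other, t in targets_ms.items()
                        if t > target}
    return victims

targets = {"workload.slice": 40, "system.slice": 200}
# Workload IOs complete in 55 ms on average: over its 40 ms target,
# so the loosely-targeted system.slice gets throttled.
victims = throttle_victims(targets, {"workload.slice": 55.0})
```

Note that nothing in this model looks at how busy the disk actually is, which is precisely the weakness described next: a latency spike from a seek-heavy pattern triggers throttling even when plenty of IO capacity remains.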
So We've used IO latency a lot. It is not great though and one new thing. We're pretty excited about is IO wait this I don't know this exact kernel version it came out in but it's been this year developed And what this does is it creates a cost model for each IO you can say every IO for however many bytes it reads or writes and however many operations you're making you can have some predictive cost of how much this Consumes out of our total IO budget And use that then like you would any other like scheduling algorithm on top of that to say all right This C group gets this percentage of this the the IO available. This one does not And with that in place we really have all work conserving Controls right memory dot low CPU dot weight and IO dot weight as long as there is available resource on machine a C group gets it Regardless of how these are configured We also only set these on the slices in the containers. We don't do individual system services Configuration here and the idea here is just to simplify our configuration. We don't have to tune a lot of things constantly Works pretty well. We do have to change our configuration depending on the hardware This is like obvious with memory rate machines with different memory. You need to protect different amounts The way IO weight works is you have to parameterize the cost of IO depending on your disk And and so we do have different configurations But the number of different hardware configurations is not that high compared to the number of different workloads we run So like the outcomes of this approach, I guess I should say before going forward like We set CPU dot weight so that you know system slice gets some forward progress guarantees and Workload slice gets most of the CPU and similarly host critical gets its its share same with IO weight We set some configuration memory dot low. 
We typically set On the host critical slice in the workload slice to make sure their memory is protected And if we need to reclaim anything we take it from the system slice as long as everyone is operating well So the outcomes of this are that like our configuration doesn't need to be really precise All the different controls we use have proportional behavior What I mean by that is let's say like the ideal CPU division for this particular Workload and everything else running on the machine is that system slice gets 25% of the CPU workload slice gets 80% And host critical is 5% or something If I'm off by a few percentage on how that's configured It's fine. It's like the the the cost of being off is proportional to how much I'm off by So giving a workload 80% of the CPU instead of 75% of the CPU is only a proportionally not so bad On the other hand like limits is don't work this way, right? If you set a limit that is 5% too low, you your your system is broken, right? And there's a lot of why we we take this approach The other nice thing is that this only really takes effect when the machine is out of resources so A kind of Interesting property here is that when we start to apply all this resource control the vast majority of machines don't see any issue It's purely protecting what happens when we are contended for some resource Or there's memory compute or IO and deciding who gets what at that point What that means is that like a widely deployed binary that has a bug in it that starts leaking memory or consuming a lot of IO Because of the way we configure everything it can't harm the workload We cap the amount that it can in effect The truth is that the workload We don't limit anything unless the the actual machine is contended But as soon as it is and the workload wants to consume the kind of resources that we guaranteed it That's when we start really harming this this System slice and any WDB here Similarly because we protect host critical a misbehaving workload 
cannot take down a host. We will always be able to SSH into a host, always be able to start or stop containers on it, and a number of other properties we want to guarantee, just because rebooting being our only option is pretty terrible.

Finally, one pretty important thing here: because these controls only apply when the machine is contended, small regressions do not have a large blast radius. If some WDB starts consuming 5% more memory, we're not running out of memory across the whole fleet. It is consuming 5% more memory, and maybe on the most loaded hosts it gets harmed a little bit, but we're not seeing widespread failures across the fleet when something regresses. Our enforcement mechanism, how we enforce the guarantees here, needs to be a little less heavy-handed than suddenly killing everything. When to kill something is policy for us: as long as we're protecting everything from a misbehaving workload or a misbehaving system service, the guarantees we believe we're providing are upheld, and when we want to kill something, it's really a matter of "this machine isn't doing useful work anyway; we should kill it." I won't go into further details, but Daniel and Anita are giving a talk about oomd, our userspace OOM killer, and how we apply a lot of policy there; they're giving that talk tomorrow as well.

So, I'm getting close to wrapping up, but there are a bunch of aspects of resource control that we still care about quite a bit and haven't figured out. One is: how do we know that this all works? An interesting thing about it is that if the machine is underutilized, none of the resource control takes any effect, right? Everything we're using is work-conserving.
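The proportional-versus-cliff argument from a moment ago can be put in toy-model form (my illustration, not Facebook tooling): mis-sizing a weight by five points costs at most about five points of throughput, while mis-sizing a hard memory limit by the same five points means the workload gets OOM-killed and does no useful work at all.

```python
def throughput_with_weight(demand, share):
    """Weight-based control: under contention you get your
    proportional share; mis-sizing the share degrades throughput
    gracefully, by about the amount of the mis-sizing."""
    return min(demand, share)

def throughput_with_memory_limit(demand, limit):
    """memory.max is a cliff: exceed it under pressure and the OOM
    killer removes the workload entirely."""
    return 0.0 if demand > limit else demand

demand = 0.75                                        # workload wants 75%
graceful = throughput_with_weight(demand, 0.70)      # weight 5 pts low
cliff = throughput_with_memory_limit(demand, 0.70)   # limit 5 pts low
```

`graceful` comes out at 0.70 (a five-point haircut), `cliff` at 0.0 (the workload is dead), which is the asymmetry behind wanting warning well before a limit is ever hit.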
So if you make a change, you oftentimes only see in outlier cases that something is not behaving the way we expect, and it's very hard for us to validate that our configurations are behaving correctly. So we're investing a lot into setting up various synthetic tests, and other ways we can load-test systems at Facebook, so that we can validate that our guarantees are actually being upheld and that it's safe to make configuration changes.

There are a lot of other resources I didn't talk about. Everyone thinks compute, memory, and IO is everything, but obviously networking is something I didn't cover, and all the IO control here is for block IO. Consuming disk space is something we are always concerned about, and we don't have good solutions there yet. Then there are what I'd call micro-resources: power, cache, memory bandwidth. BPF programs, all these sorts of things, can consume a lot of resources, and we don't yet have good controls for them.

Finally, one thing I didn't really talk about is how container users get to say how much resource they consume. Right now the interface we use is, I think, the fairly standard thing you would see with any cloud provider: you say, "I need a container with X gigabytes of RAM and Y CPUs." But personally, I'm not sure that's the best interface. For one, how do you know how much you should ask for? But also, that's not what a developer often cares about, right? They really just want their workload to run, wherever it lands. I think we have a lot of introspection to do on our current interface and how we can expose it better for users. And the final thing that keeps me awake at night is just: how would we improve the visibility of all of this?
Developers don't develop their software well for cases where they're suddenly low on resources. You hit timeouts, you hit all sorts of problems, and that's oftentimes exactly when they need to debug and understand what happened on the machine. Exposing as much as we can to users so that they can develop for this is really important for us, and something we continue to iterate on. That's all I have, so I'm happy to take questions from here. Thanks so much.

Hi. So you're putting all the containers under one root, basically, which is the workload slice. How do you do resource management inside that slice, so that one container does not take down the other containers?

Yeah, so from the host's perspective, Tupperware is what controls that. Within the workload slice, specifically workload-tw.slice, a container can say, "I want two gigabytes of RAM," and we can set memory limits there, or set memory.low, depending on how the user configured their container. And if we run multiple containers underneath the slice, each of these has its own controls, configured by Tupperware. Again, depending on the case, we prefer to use the work-conserving controls, but a service owner can say, "no, actually enforce my limits; I want you to cap me at four gigabytes of memory." If they don't do that, it's still fine: you can run other containers alongside it, and because of memory.low and the other proportional controls, it just works out. It's often that users want the determinism of "I actually run out of memory at two gigabytes, and my workload dies." But that's all through the Tupperware interfaces.

All right. Well, you mentioned something about fault injection. How do you inject faults into cgroups and things like that,
like specifically trying to force slices to run out of memory, or to contend on something else?

I'll only briefly talk about that, because I'm sure Daniel and Anita are going to cover it tomorrow when they talk about oomd. But basically: synthetic workloads that consume a resource. You can add them to a cgroup and see what happens. This is a test we've done a bunch, and you can see the effect.

Any other questions? No? Thank you, Dan. Thank you.
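A synthetic memory consumer of the kind mentioned in that answer is simple to sketch (a generic illustration, not Facebook's actual test tooling): allocate a fixed amount of memory and keep it resident, then place the process in the target cgroup (for example by writing its PID into that cgroup's `cgroup.procs`) and watch how the controllers react.

```python
def hog_memory(mebibytes):
    """Allocate `mebibytes` MiB and keep it resident. bytearray
    zero-fills its buffer, so every page is actually written, not
    just reserved; holding the returned list keeps the pressure on."""
    return [bytearray(1024 * 1024) for _ in range(mebibytes)]

ballast = hog_memory(64)   # ~64 MiB of synthetic memory pressure
```

Running several of these at increasing sizes inside a slice is enough to observe memory.low protection and reclaim behavior kicking in on the siblings.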