Hi, everyone. Thanks for showing up to listen to my talk. My name is Daniel. I work at Facebook on a team called Kernel Applications. The charter of our team is pretty much to make the Linux kernel more usable from the user space side of things, and one of the projects I work on is oomd. I want to apologize in advance if I sound kind of weird; my throat feels a little funny, so if I need to cough, that's probably why.

So what is oomd? oomd is a user space out-of-memory killer, and I'll go into a little more detail about what an OOM killer is for those of you who are not familiar with that. oomd lives in user space. It's also mostly dependency free, and when I say dependency free I mean that once you have a static binary and it's running, you don't really need any other system services to be available. All you really need is the latest Linux kernel, and when I say latest I mean really the latest, plus some special patches that Johannes at Facebook has created. But I think the patches are going to be upstream pretty soon, so in a couple of months you should be able to just pull the latest upstream kernel and things should be fine.

oomd, I posit, is more deterministic, faster, and more flexible than the kernel OOM killer, and I'll talk about that in some later slides as well. oomd is also open source. It's licensed under GPLv2. You can view the code at the first link, and you can read the very nice documentation that a guy named Thomas from Facebook wrote at the second link.

So, the agenda for this talk. It's fairly short, actually. I'm going to go over the motivation, the mechanism, and then the results behind oomd, and then I'll leave some time at the end for questions. If no one asks questions, you can get some of your day back.

So, motivation: why create oomd? First I think we need to back up a little bit and go over why OOMs actually happen. On most Linux hosts you typically overcommit memory, and overcommitting memory pretty much means that memory allocations do not fail. Sometimes they can fail, depending on how you have configured memory overcommit. The thinking behind memory overcommit is that most applications that allocate memory don't necessarily use all of the memory they allocate. The classic example is sparse arrays: an application might allocate a bunch of memory for a big array and then not actually fill in all the entries of that array. However, just because the kernel returns you a pointer, and not a null pointer, doesn't mean the memory is always available. So if the system actually runs out of physical memory, something has to happen. The kernel will typically try to free up whatever pages it can; for example, it will try to flush any dirty pages in the page cache, or it will try to swap some anonymous memory out to swap. But if that fails, the kernel has to come in and do something, and that usually means it will go pick a process, usually the biggest one, and just kill it.

There are a couple of ways you can configure kernel OOM killing, and I've listed a few of them. There's a bunch of procfs knobs that you can tune to tell the kernel what to kill in the event of an OOM. Personally, I think it's a little bit confusing. The knobs are all pretty much numbers; I think some of them go from like negative 16 to positive 15, and some of them go from zero to a thousand. In my opinion, it's not very intuitive. If you read the oom_kill.c file in the Linux source tree, I think it makes a good deal of sense, but only if you know the implementation. From the user side of things, I don't think it's very ergonomic, and I think there could be a better way to do it.
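To make those knobs concrete, here is a minimal sketch (not from the talk) of what tuning one looks like. It writes to /proc/self/oom_score_adj, the modern knob, which ranges from -1000 (never kill this process) to +1000 (prefer killing it); the legacy oom_adj knob used the -17 to +15 range the speaker alludes to.

```c
/* Minimal sketch: bias the kernel OOM killer away from this process
 * by writing to /proc/self/oom_score_adj (range -1000..+1000).
 * Note: lowering the value usually requires root or CAP_SYS_RESOURCE. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/self/oom_score_adj", "w");
    if (!f) {
        perror("fopen");
        return 1;
    }
    /* Strongly discourage the kernel from picking this process. */
    fprintf(f, "%d\n", -999);
    fclose(f);
    return 0;
}
```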
The kernel OOM killer is also pretty slow to act. If you were here for Tejun and Johannes's resource control at Facebook talk, they alluded to this somewhat briefly: by the time the kernel OOM killer actually kicks in, things are probably already too late for user space. That's usually because the kernel OOM killer tries to protect kernel health, so it doesn't really care too much about what user space is doing so long as the kernel is making forward progress. One thing we have seen at Facebook is that under heavy memory pressure, the kernel executes some instructions, faults in a memory page to access some memory, and then after that instruction is done it faults out that page and faults back in the code page, and it just repeats that over and over. So an operation that usually doesn't take much time starts taking like five to ten minutes, and all the while user space is livelocked and your application's health has tanked.

The kernel also doesn't have any good context into the logical composition of a system. For example, there could be two processes on a system that you really do want killed at the same time: if one dies, the other really should be dead too, because it doesn't do anything without the first one. Or there might be another case where two processes should never be killed at the same time, so one process should always be alive while the other is dead. It's hard to tell the kernel to do this.

There's also not a really great way to customize kill actions. You can't really say: hey, I don't actually want you to SIGKILL or SIGTERM something, I want you to send an RPC, or I want you to send some kind of notification, for example to something like dockerd. You don't actually want to kill dockerd; you want dockerd to start reaping containers. There's no great way to do that other than using something like eventfd.
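For reference, here is roughly what that eventfd path looks like: a minimal sketch (not from the talk) of the cgroup v1 memory.pressure_level notification interface, assuming a v1 memory controller mounted at /sys/fs/cgroup/memory and a placeholder cgroup named mygroup.

```c
/* Minimal sketch: block until a cgroup v1 memory controller reports
 * "critical" memory pressure. Paths and the cgroup name are placeholders. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
    int efd = eventfd(0, 0);
    int pfd = open("/sys/fs/cgroup/memory/mygroup/memory.pressure_level",
                   O_RDONLY);
    int cfd = open("/sys/fs/cgroup/memory/mygroup/cgroup.event_control",
                   O_WRONLY);
    if (efd < 0 || pfd < 0 || cfd < 0) {
        perror("setup");
        return 1;
    }

    /* Register for events: "<eventfd> <pressure_fd> <level>" */
    char buf[64];
    snprintf(buf, sizeof(buf), "%d %d critical", efd, pfd);
    if (write(cfd, buf, strlen(buf)) < 0) {
        perror("register");
        return 1;
    }

    /* Blocks until the kernel signals critical pressure; by then it is
     * often already quite late, which is the point being made above. */
    uint64_t count;
    read(efd, &count, sizeof(count));
    printf("critical memory pressure event received\n");
    return 0;
}
```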
But again, that runs into the first thing I mentioned: the kernel is pretty slow to react in these kinds of scenarios.

The kernel OOM killer is also somewhat non-deterministic. You have to tune all the knobs in procfs, and it's not to say it's impossible to get it right, but if you have a service that forks a lot of processes off, you're kind of racing against the system to set the correct oom_score_adj knobs and whatnot.

So, OOM problems at Facebook. Facebook actually runs into a bunch of out-of-memory problems, and I've listed some of them here. One of the platforms that suffers from out-of-memory issues is our build and test platform, which we call Sandcastle. Essentially, every time a Facebook developer uploads some code to be reviewed, Sandcastle builds and tests that code, and Sandcastle typically co-locates these build jobs onto a small group of shared hosts. Building arbitrary (well, not really arbitrary) code can sometimes lead to issues, because, you know, linking takes a lot of memory. Plus, if you build everything in tmpfs, which Sandcastle does happen to do, it can eat up a lot of memory. And so bugs and accidents do happen, and when they do, you can OOM a box, and ideally you don't want to take down the whole box for an extended period of time.

We also have a container and service platform called Tupperware, where developers and operators can run services in containers, much like Kubernetes. Bugs do happen, memory leaks happen, and if you use a shared pool of hardware, you don't really want to take out your neighbors because one developer from one service had a somewhat nasty bug.

We also have a somewhat more esoteric environment: commodity top-of-rack switches that we call FBOSS. It's a very resource-constrained environment; the boxes only have 8 gigs of RAM, so it's really easy to OOM the box, actually. For example, if Chef comes along and runs an update, it'll do a bunch of I/O and use a bunch of memory, and then maybe another package update happens at the same time asynchronously, and maybe the rack switch is serving a lot of traffic. It's really easy for the box to run out of memory, and in these cases you don't want the host to lock up or freeze, because you'll take down a whole rack, or the networking for an entire rack. So you'd like to gracefully shut down some things, such as the Chef run or the package update.

Pretty much all multi-tenant platforms suffer from out-of-memory issues, because bugs and mistakes do happen. A lot of these platforms choose to turn on panic_on_oom, because in these scenarios you don't want something non-deterministic happening. For example, going back to the example of Docker: you don't want to accidentally kill the management daemon and let the tasks or containers run without any management oversight; that could lead to some pretty nasty bugs. So some of these services turn on panic_on_oom, so that if the host runs out of memory, it shuts down the entire box, and the containers get reassigned to another box somewhere else. While this is logically pretty correct, it's suboptimal in that it wastes resources: there are just servers in the datacenter spinning, rebooting, and not really doing much else.
So there could be a better solution.

oomd is also used for fbtax2. Tejun and Johannes talked about it briefly in their earlier talk. If you weren't here to see it, the summary is that they want fully work-conserving OS resource isolation across applications. In short, it means two workloads should be able to coexist on a machine, and if one starts doing bad things, the other one shouldn't really be affected. oomd plays a part in rectifying the pathological cases where the kernel isn't able to protect everything. There's a bunch of links at the bottom if you want to check it out later; a lot of cool stuff is going on there.

So, moving on to mechanisms: how does oomd actually work? oomd heavily leverages a new kernel feature called PSI (pressure stall information) that Johannes wrote. Essentially, PSI gives you a number between zero and a hundred that tells you how much wall clock time you have lost due to resource shortages. If it says zero, it means you have not lost any wall clock time, so your workload should be theoretically healthy, barring any bugs you may have introduced yourself. A hundred means you're not making any forward progress, and something is terribly wrong with resources on your system. oomd keeps a time series of the memory pressure or I/O pressure, and if it's trending upwards or trending really high, it starts performing corrective actions.
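As a concrete illustration (not from the talk): on a PSI-enabled kernel (mainline 4.20, or the patched kernels mentioned earlier), pressure shows up in files like /proc/pressure/memory, and a minimal reader might look like the sketch below. The file format is the one PSI exposes upstream; the parsing code itself is just an example.

```c
/* Minimal sketch: read memory pressure from PSI. Each line of
 * /proc/pressure/memory looks like:
 *   some avg10=0.22 avg60=0.17 avg300=0.05 total=123456
 * "some" means at least one task was stalled on memory during that
 * window; "full" means all non-idle tasks were stalled at once. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/pressure/memory", "r");
    if (!f) {
        perror("fopen"); /* likely not a PSI-enabled kernel */
        return 1;
    }
    char type[8];
    double avg10, avg60, avg300;
    unsigned long long total;
    while (fscanf(f, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
                  type, &avg10, &avg60, &avg300, &total) == 5) {
        printf("%s: %.2f%% of the last 10s lost to memory stalls\n",
               type, avg10);
    }
    fclose(f);
    return 0;
}
```

A daemon like oomd would sample these numbers periodically to build the time series described above, rather than reading them once.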
At the core of oomd is the plugin system, which means it's designed so that people can customize detection and kill actions. We provide default detector and killer plugins that are pretty sensible and work across a variety of platforms; we've deployed them to a bunch of tiers and they work really great out of the box. If you want to change the behavior, you can subclass these plugins and override the methods you want, pretty standard stuff.

oomd doesn't just monitor memory; it also monitors I/O pressure, because PSI covers I/O pressure as well. oomd also monitors swap, because swap is pretty essential for oomd to have enough runway to detect building memory pressure. If you don't have swap, anonymous memory is effectively locked into RAM, and you can very suddenly spike from zero to a hundred memory pressure really quickly.

This is the original oomd config, and it's what we're running in production these days. As you can see, it's mostly just JSON; it's pretty straightforward. We're monitoring system.slice, which holds the pretty non-essential system services, and we have a kill list, which is ordered. What it says is: if the host OOMs and Chef is using more than a thousand megabytes of memory, please kill Chef; and never kill sshd, because you don't want to lose SSH access. And then we're using the default detector and killer plugins; if you had custom plugins, you would put the custom plugins' names in there.

So oomd as it currently exists works pretty well. We have it in production in a bunch of places, and it helps prevent a lot of really bad pathological cases when a host runs out of memory. However, as we've onboarded more and more customers and experienced different use cases, it's become apparent that we need to iterate on oomd. So we're changing the config file language, and we're changing how the thing works on the back end. We're still iterating and playing around with the details, but I think we're onto something really nice here, and it works really great at helping protect hosts from OOMs.

What I have here is the oomd v2 config; this is the next iteration of the config. It's mostly pseudocode here; I've circled the pseudocode in a yellow box. What it's essentially saying is: if the workload slows down by more than five percent, or if system.slice slows down by more than forty percent, please kill something that hogs a lot of memory in system.slice. In other words, if your workload experiences even a little bit of slowdown, please do something about it; if the non-essential stuff experiences a good amount of slowdown, that doesn't really matter to us, as long as the workload is healthy.
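To make that rule concrete, here is a JSON-flavored sketch of it. This is only a reconstruction of the pseudocode from the slide: the ruleset/detector/action structure and every key, plugin, and cgroup name in it are illustrative assumptions, not the shipping schema.

```json
{
  "rulesets": [
    {
      "name": "protect the workload",
      "_comment": "illustrative reconstruction; names and keys are invented",
      "detectors": [
        ["workload is slowing down",
         { "name": "pressure_above",
           "args": { "cgroup": "workload.slice", "threshold": "5%" } }],
        ["non-essential services are slowing down a lot",
         { "name": "pressure_above",
           "args": { "cgroup": "system.slice", "threshold": "40%" } }]
      ],
      "actions": [
        { "name": "kill_by_memory_size",
          "args": { "cgroup": "system.slice" } }
      ]
    }
  ]
}
```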
I'm going to flash by the actual config that would work. I'm not going to leave it up, because then you'll just squint at it, and it's not really important, because what it essentially says is what I have outlined here. You might have noticed that the actual config is pretty long and verbose. That's because the oomd v2 config isn't necessarily meant to be written by end users. It's designed with two use cases in mind. The first use case is that a workload-aware application, such as the orchestration layer or control plane of a container platform, would dynamically generate these oomd v2 configs so that they protect the workload as best they can. The second use case is, say you're not running a shared multi-tenant service; you're running a single-platform thing with your custom software on bare metal. There, one operator might sit down for a couple of days and write a config that works well across all those machines, and then you really don't need people tweaking it every day. It's not really meant to be used by, like, desktop Linux users, for example. I wouldn't really put oomd on my personal machine, as I don't really do things on my box other than sometimes build things that take too much memory to link.

One interesting implementation detail, not that it's super important, is that there is an intermediate representation layer in oomd, which means we're not fully locked into JSON here. We could theoretically spend a couple of hours and add in a YAML interface, or TOML, or maybe an iptables-like interface where you can have a config all in one line, super concise.

So, results: how well does oomd actually work? Here we have a graph of memory usage over time on a single host; this is one of the hosts in our build-and-test fleet for Sandcastle. You can see that at some point a build starts and memory usage spikes really high, and then at another point memory usage dips really fast. The dip is because oomd came along and decided to kill something, because it detected that memory pressure was too high, and so it prevented the box from being locked up for an extended period of time and essentially going unutilized. Those of you who are very perceptive might notice that the y-axis is missing labels and units; that's because the lawyers said I couldn't have numbers. But I'm sure you can figure out what it means.

This is another graph, of the panic-on-OOM rate before and after an oomd rollout. You'll notice the y-axis doesn't have numbers again, and this one probably makes more sense, as we don't want to expose (or I'm not allowed to expose) how many hosts we have running this kind of stuff. But you can see that the rate at which hosts panic on OOM dips pretty significantly at a certain point in time, and that's when the oomd rollout occurred. And it was 8 a.m. on a Friday, so we had a full day to figure out if there were bugs.

So yeah, there's time for questions if anyone has any; otherwise you get some of your day back. Do we need the mic?

Q: In one of your first slides you mentioned btrfs. Does oomd require btrfs, or can I use it without it?

A: You do not need btrfs, no. It should be filesystem agnostic.

Q: Are there any, like, priority inversions that you've hit?

A: Yeah, one of the interesting bugs was the mmap_sem thing. It would put processes into uninterruptible sleep under high memory pressure, because a process holds mmap_sem while it tries to do the readahead thing, and then even though oomd tries to SIGKILL it, it just won't die, because it's stuck doing I/O. Which is kind of nasty, but I think it's been fixed. Yeah, all right. Awesome.