All right, hi everyone, I'm Daniel. I work at Facebook; I do various Linux-related things. I work on oomd and also bpftrace.

Hi, I'm Anita. I'm a software engineer on the containers team.

All right, so this talk is on oomd. Oh man, no, it's being slow. I think I skipped a slide. All right, cool. So here's a brief overview of this presentation. We're going to cover motivations and past development, pretty much why we made oomd and what the state of it was up until last year. Then I'm going to cover the present state, which covers up until now, then future plans and the direction we want to take oomd, and then we have some time for questions.

Cool. All right, so: motivations and past development. I think it's important to back up and talk about resource control at Facebook. For those of you who weren't at the resource control at Facebook talk that Dan gave yesterday, this is a brief overview. The main goal of resource control at Facebook is to isolate resources across applications, and it turns out this is a pretty active area of development, and it's kind of hard to make guarantees in a sane manner. The main use case we're targeting is primarily protecting the workload. So imagine you have a server and you want to serve web traffic; the web server is the workload, and you typically want to protect it from anything that's less important than serving web traffic, because if the machine isn't serving web traffic, then why did you buy it? Another use case we've been looking into is sideloading batch workloads. For example, if you have extra compute capacity, maybe you opportunistically start transcoding video and then back off when there aren't enough resources. We haven't done too much in that area yet, not too much investment, but that's on the roadmap, I think.
So the way oomd fits into all this is that oomd steps in when kernel resource isolation isn't enough, and I'll talk more about that in later slides.

Okay, so what is oomd? oomd is out-of-memory killing in user space. We claim it's faster and more accurate, and based on our deployments we've seen that to be true in production. Under the hood it uses cgroup2, PSI, and various other more traditional system stats. It's open source under GPLv2; there's a link to the GitHub. We recently also packaged it for Fedora; there's a link to the Copr repo. Hopefully it gets included in Fedora proper, or whatever the terminology is; I'm not too familiar. But yeah, we packaged it, and there's the proof.

So why did we create oomd? Well, there are a lot of reasons, mainly that the kernel OOM killer is somewhat insufficient. For example, the kernel OOM killer's configuration is not very intuitive. There's a bunch of control files, which is kind of archaic, and all these knobs you tune with bare numbers. Some of these files go from negative 16 to positive 15, some go from negative 1000 to positive 1000, and what do these numbers mean, right? They're totally arbitrary, and this gets especially confusing when you have multiple teams working on a shared system. If someone sets one number to 100 and another person sets another number to 200, well, if you weren't there for the decision-making, the numbers don't really tell you much. Is one of them a hundred times more important than the other, or just twice as important? I don't know.
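The knobs in question are per-process files under /proc; the ranges below come from proc(5). As a minimal sanity check, assuming a Linux /proc (the value is whatever this process inherited):

```python
# The modern knob is /proc/<pid>/oom_score_adj, ranging from -1000
# (never OOM-kill this process) to +1000 (prefer killing it). The older
# /proc/<pid>/oom_adj ran from -16 to +15, with -17 meaning "never kill".
# Neither number means anything beyond "bigger is more killable", which
# is the opacity being complained about here.

def read_knob(path: str) -> int:
    """Return the integer stored in a /proc OOM-tuning file."""
    with open(path) as f:
        return int(f.read().strip())

if __name__ == "__main__":
    score_adj = read_knob("/proc/self/oom_score_adj")
    assert -1000 <= score_adj <= 1000  # documented range in proc(5)
    print("oom_score_adj:", score_adj)
```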
Sorry to say, the kernel OOM killer is also somewhat slow to act. By the time it kicks in, user space is already kind of screwed. The reason is that the kernel OOM killer is there to protect kernel health, so if it thinks the kernel is making forward progress, it won't do anything. A typical example is the kernel sitting there thrashing pages, constantly refaulting them. Technically that's forward progress, but user space may not be getting anything done; it could be spending 80% of its time just waiting on pages. So the workload is already pretty messed up by that point.

The kernel OOM killer also doesn't have much context on the logical composition of the system, because to the kernel everything is just some program. For example, there's stuff that you can imagine should always be killed together: if one thing depends on another and you kill the base dependency, there's no reason not to kill the dependent one too, because it's not doing anything useful anymore. And there are other situations, say two redundant services, where maybe you shouldn't ever kill both at the same time.
Those examples are pretty hand-wavy, but I'm sure you can come up with actual scenarios.

It's also pretty hard to customize kill actions. There's the eventfd-based OOM notification, which suffers the same problems I mentioned on the last slide. For some processes, a plain SIGTERM or SIGKILL is totally okay; others might want some kind of song and dance. For example, one thing I used to work on implemented hot restart: instead of dropping client connections, you stay on the same machine, pass the file descriptors to another process, and re-set up the world. In cases where you can predict an OOM coming, say you knew there was a memory leak, it would be optimal to restart that way. And for some people at Facebook, that's actually what they configure oomd to do: restart a specific systemd service when a certain condition is met. So yeah, if you can do a hot restart instead of dropping connections, that's obviously preferable, and it's kind of hard to set up without oomd.

The kernel OOM killer is also somewhat non-deterministic. There's probably a way to make it more deterministic, and I'm really curious to hear about it if anyone's figured it out, but in general at Facebook people have sort of given up on that and just turn on panic on OOM. If you don't know what the kernel OOM killer is going to kill, it could kill something totally random, leaving your system in a non-deterministic state; it's better to just restart the machine, because you can fail over to some other machine somewhere. That's super suboptimal, and it's what we're trying to prevent by making things more deterministic.

So that covers up until about last year.
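For reference, the panic-on-OOM behavior is the vm.panic_on_oom sysctl: 0 means don't panic (the default), 1 panics except for cpuset/mempolicy-constrained OOMs, and 2 always panics. A minimal check of the current setting, assuming a Linux /proc:

```python
# Read vm.panic_on_oom through its procfs file; `sysctl -n vm.panic_on_oom`
# shows the same value. Fleets that have given up on predicting the kernel
# OOM killer set this nonzero so the machine reboots and fails over instead.
with open("/proc/sys/vm/panic_on_oom") as f:
    panic_on_oom = int(f.read().strip())

assert panic_on_oom in (0, 1, 2)  # the only documented values
print("vm.panic_on_oom =", panic_on_oom)
```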
So, on to the recent developments: oomd 2 happened. Essentially we turned oomd into a rule engine; it's just a bunch of "if this and this and this, then do that".

There were a bunch of things we tried unsuccessfully first. The first was a monolithic config. The idea was that you don't want people writing configurations if you can avoid it, so we were hoping we could be smart enough to figure out what to do in any given situation. Turns out that didn't work; it wasn't flexible enough. You have to have some sort of configuration to tell the daemon what you want done in which situations.

Then we pivoted to plugin-only, which was very short-lived. We tried to get everyone to write their own plugin: here's a hook, please write some code to handle the OOM situation. Turns out no one wants to do that; writing code is really annoying, and no one wants to understand yet another framework.

So we went somewhere in the middle with core plugins. We ship a set of plugins that we wrote; they're pretty small, each does one thing, they're pretty orthogonal, and there's a self-consistent interface, so if you understand how to use one plugin, the rest are pretty similar. That actually worked really well, and we're still using it today. It's super flexible, we can do all sorts of things, and if you need more functionality, you just add another plugin that does one thing.

What that enables is what I sometimes call gotcha-free configuration. The idea is that you can still make mistakes, but they should be obvious mistakes; you shouldn't be burned by things you didn't know. This is nice because you can inherently encode domain knowledge. One example is the swap-free plugin.
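The swap-free plugin has to read swap state out of /proc. As an illustration of the kind of parsing that involves, here's a small sketch, not oomd's actual code, with sample /proc/swaps content inlined so it runs anywhere; note the mix of tabs and spaces between columns:

```python
# /proc/swaps mixes tabs and spaces between columns, so splitting on a
# single delimiter breaks; str.split() with no argument splits on any
# run of whitespace. Sizes are in KiB, as in the real file.
SAMPLE = (
    "Filename\t\t\t\tType\t\tSize\tUsed\tPriority\n"
    "/dev/sda2                               partition\t4194300\t12345\t-2\n"
)

def parse_swaps(text):
    """Return {device: (size_kib, used_kib)} from /proc/swaps-style text."""
    devices = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()           # tolerant of mixed tabs/spaces
        name, _type, size, used, _prio = fields
        devices[name] = (int(size), int(used))
    return devices

swaps = parse_swaps(SAMPLE)
print(swaps)  # {'/dev/sda2': (4194300, 12345)}
```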
So we ship one plugin called swap-free. Essentially, it tells you how much swap is left on the system. But there's this weird case: if you have swap turned on with some pages in swap, and then you turn swap off, the system has to bring those pages back into main memory, and depending on how fast your system is, that can take a while. During that period, /proc/meminfo and /proc/swaps show slightly different information. /proc/meminfo is somewhat misleading because it shows total swap as zero, which is technically true, but that can get you into weird spots while the swap is draining: there are still pages in swap, but the total says zero, so the math gets weird. meminfo isn't wrong, but it gives you an incomplete picture, because it doesn't tell you there's actually still swap in use, whereas /proc/swaps does. But then parsing /proc/swaps is also kind of a mess, because there are tabs and spaces mixed together in the format, so it's just really annoying to parse. This is one of the places where you can encode the domain knowledge: okay, we're going to look at /proc/swaps, we're going to parse it correctly, and then no one else needs to worry about it; they just use the plugin like they always have.

So here's an example of a ruleset config. It's kind of pointless to squint at it, so I've simplified it into some pseudocode. Note that this isn't a full config.
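For context, oomd 2 rulesets are written as JSON. A sketch of the rule described next might look roughly like the following; the plugin names, argument names, and overall field layout are approximations of oomd's format, not a verified schema. Detector groups are OR'd together, and a ruleset's actions run when one of them fires:

```json
{
  "rulesets": [
    {
      "name": "protect the main workload",
      "detectors": [
        [
          "workload is being slowed down",
          {
            "name": "pressure_above",
            "args": { "cgroup": "workload.slice", "resource": "memory", "threshold": "60" }
          }
        ],
        [
          "system is being slowed down",
          {
            "name": "pressure_above",
            "args": { "cgroup": "system.slice", "resource": "memory", "threshold": "80" }
          }
        ]
      ],
      "acts": [
        {
          "name": "kill_by_memory_size_or_growth",
          "args": { "cgroup": "*" }
        }
      ]
    }
  ]
}
```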
It's just one ruleset of the arbitrarily many you can have. What this does is: if workload.slice/www.slice is slowed by over 60 percent, or system.slice is slowed by over 80 percent, then kill the largest memory hog on the system. This is somewhat contrived, but you can do very specific things in one ruleset and then combine rulesets orthogonally to cover complex use cases on your system.

One other thing we added was drop-in configurations. If you're familiar with systemd drop-in configurations, which I assume most of you are, it's pretty much the same thing: it lets you modify base configuration settings without having to modify the base config file. This is useful when containers move in and out of systems. If you have shared compute infrastructure and a container comes in with a specialized config, because it does something interesting with its cgroup setup, with whatever's delegated to it, you drop this configuration in, it applies to just that container, and when the container leaves, everything is cleaned up. That's really nice.

One thing we considered but didn't do was in-container oomd. The idea was: instead of writing all this code to support drop-in configurations, why not just run an oomd instance in every container, plus one on the root host to protect the root host? Well, there were a couple of problems with that. The first, and this is perhaps a Facebook thing, is that the monitoring inside a container is slightly different from the monitoring on the root host, so we'd have to maintain two sets of monitoring configs that did the same thing, but differently. The second problem is that coordinating rollouts is actually pretty tricky, because if you update the software on the root host, you shouldn't necessarily update the software
in the container as well, right? So if there's a bug fix that's applied on the root host but not in the container, debugging can get super weird. The third reason is that you could have a split-brain issue: the root host oomd instance could race with the in-container oomd instance to make a kill, and that could get super weird too. And it's also kind of hard to get information flowing out of a container, because that's really not how containers work.

So we learned a couple of lessons from deploying this at a somewhat wide scale. The first is that most people, including myself, are pretty hazy on memory management internals, so it's important that someone does it right and the work can be reused. It's kind of pointless to get everyone really familiar with this space, because most people just want their code to run, right? They don't really want to care about the details too much. The second is that OOMing is not a solved problem: if you're dealing with infrastructure at a pretty large scale, OOMs happen all the time, because of various errors and bugs and new, interesting workloads. The third is that a lot of things can trigger an OOM, and they're usually pretty unexpected. One interesting thing we ran into: in some networking code, if it can't allocate atomic memory, it OOMs the system. I didn't know that could happen, but it did. So understandable diagnostics are pretty crucial. The kernel dumps the meminfo dump into dmesg, or kmsg I guess, when the kernel OOM killer makes a kill, and if you know what you're looking at, that's super useful, but if you don't, it's pretty much useless, because you have no idea what it's saying.

So, future improvements. There are a couple of improvements I've been thinking about. A while ago someone added epoll support to the PSI pressure files. This is
interesting because then you might be able to short-circuit some logic and save on CPU cycles instead of polling, because some people complained about high CPU usage with oomd. When we profiled it, it turned out most of the cost was coming from accessing memory.stat, which used to be an O(n) operation in the number of cgroups on the system, plus the number of dead cgroups that have not yet been reclaimed, which can grow pretty large on some systems. So sometimes it was really expensive to access memory.stat. On newer kernels, though, this is an O(1) operation, because they do passive, per-CPU accounting and then smush it all together when you access the file.

One interesting development is io.cost. Right now oomd actually monitors for IO issues across cgroups, across slices, and that's kind of complicated; io.cost could fix that issue, so it'll be interesting to see how it simplifies oomd configs.

So I'm going to close the talk with a proposal for where we see the future of oomd. If you've ever taken a look at how oomd is set up on a host, you'll notice that it has a pretty tight coupling with systemd: we expect you to turn on resource accounting with systemd, and the oomd plugins actually need to understand slices in order to work. So why don't we bring oomd to systemd? As Daniel mentioned before, kernel OOM killing is complicated, slow to act, and pretty inflexible in terms of what you can actually configure. However, we've shown that user-space OOM killing with oomd is both performant and flexible, and so we believe that user space is the right place for OOM killing, because it provides the best insight into service-level resource shortages. OOM issues plague any sizable fleet, so we want to make a solution like oomd as accessible as possible.
Systemd is well positioned between the kernel and the application to make well-informed OOM-killing decisions, and thus, if oomd were to be shipped with systemd, it would be well positioned to provide sane defaults for all hosts running systemd. And if oomd were shipped with systemd, it would provide a cohesive configuration experience, using the syntax you're already familiar with for configuring units.

So what are we actually proposing? Well, you'd need the systemd-oomd binary, of course, but we'd also ship a core set of plugins, following the oomd 2 model rather than the monolithic configuration model of oomd 1. We'd of course provide a way to configure oomd through unit files as well as through D-Bus, and a tool to view the host-wide oomd configuration, so you don't need to shuffle through different files to figure out what oomd is doing.

Here's a mock-up that Daniel and I came up with for how we might potentially configure oomd on a host. Here we have system.slice, and in addition to the usual [Slice] section, we would have an [Oomd] section.
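Pieced together from the description, the mock-up might look something like the following; every Oomd* property and plugin name here is hypothetical, reconstructed from the talk, and nothing like this shipped in systemd at the time:

```ini
# Hypothetical drop-in for system.slice (not a real systemd interface)
[Slice]
MemoryAccounting=yes

[Oomd]
# <condition name>:<plugin name>:<plugin argument>
OomdDetector=system_large:memory_usage_above:5G for 5s
OomdDetector=system_tiny:memory_usage_above:1G for 5s
```

and, in a separate configuration, the matching condition and actions:

```ini
[Oomd]
OomdCondition=system_large
# <plugin name>:<plugin argument>; tried in order until one succeeds
OomdAction=kill_memory_hog:system.slice
OomdAction=kill_memory_hog:user.slice
```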
Under the [Oomd] section there's an OomdDetector property, which is split into three subsections separated by colons. The first subsection is the name of the condition; in this case we've named them system_large and system_tiny. The second subsection is the name of a plugin, and the last subsection is the argument to the plugin. So in this case we have two detectors. What system_large says is: if the total memory usage is greater than five gigs for longer than five seconds, then system_large fires. system_tiny is the same thing, but for one gig. To configure the oomd actions, we propose a configuration with an OomdCondition property and an OomdAction property. The action is split into two subsections, like the detectors were: the first subsection is the plugin name, and the second is the argument to the plugin. So in this case, our example shows that if system_large fires, we'll try to kill the biggest memory hog in system.slice, and if that fails, we'll try to kill the next biggest thing in user.slice.

Of course, this is all just in the discussion phase, so if you're interested in talking to us about it, we'll be at the conference and at the hackfest tomorrow.

Okay, that's all we had, so we're going to open the floor for questions now. Any questions?

You mentioned io.cost as a possible improvement. How would oomd take io.cost into account?

Right now the oomd configs we have are pretty complicated, because they have to take IO into account: they watch IO between different slices and cgroups. That's because there wasn't a mechanism to enforce that in the kernel, and it actually worked really well. So if the kernel adds that, which it's trying to do with io.cost, then maybe you can simplify the oomd configs a lot and get them much shorter, because right now they're pretty long, with a bunch of rulesets. That'd be the main improvement.
Hi, in systemd 243, I think, there's now a property for unit files or service files with which you can also configure OOM behavior a little bit. Do you know more about it, or how this would integrate with your daemon?

I believe the OOMPolicy stuff is for the kernel OOM killer. This would be a user-space OOM killer, so you could actually choose to kill specific services rather than whatever the kernel is going to do.

Actually, I like that; it's a very good question, because most of the OOMPolicy stuff in systemd is actually bound to the fact that the kernel tells us that there was an OOM event on a specific service, so that we can log it and track it in the state of the service, so that the administrator can eventually learn about it, because that's useful information. Have you thought about how precisely you actually kill? Do you just call the kill system call, or do you do anything different? Because it would be interesting if, when you kill due to OOM from user space, you could somehow tell us that you did it for OOM reasons. Then systemd could track it as if it were the kernel side, because from systemd's point of view it shouldn't matter whether the kernel or user space or whoever did it; it just wants to know that that's the reason it happened.

Yeah, I'd have to look at the mechanism more closely, but I think so. All we did in the past was set an xattr on the cgroups we killed, and then when our container agent detects that processes died, it checks the xattr on the cgroup to see whether oomd made the kill. Obviously that's not unified, right? That's not how the kernel tells you it made an OOM kill; I think it does that via memory.events or whatever. I think it'd be interesting to investigate and explore what the options are.

How exactly do you kill a container?
Do you execute a command to kill the container, or do you just start shooting the processes in the cgroup in the head?

It's a while loop with SIGKILLs. It's not optimal; I mean, I think we should probably freeze the cgroup and then start killing things, but we haven't run into issues with that yet.

But then you don't send the processes in the container a SIGTERM to allow them to attempt to clean up? It's just too late at that point?

I mean, there are different plugins that send a SIGTERM first and then wait a bit before sending a SIGKILL, which is pretty typical behavior. Right now it's just SIGKILL.

Okay, if there are no more questions, then thank you for your presentation.