Hey there. Good morning, good afternoon or good evening depending on where you are in the world. I'm Chris Down. I'm a kernel engineer at Facebook. I work on memory management, mostly in the area of control groups, or cgroups, which are one of the building blocks for modern Linux containers. I'm also a maintainer of the systemd project, a small project which some of you may have heard of. In general, my job is thinking about how we can help make Linux operate more efficiently and more scalably. It's quite important because at Facebook we now have a million plus machines. We can't just buy more RAM. We can't just throw hardware at all of our problems. We need to extract the absolute maximum out of each machine safely. So that's one of the things which I'm going to be talking about today.

I know that nobody really gives a single toss about that though. What they really care about is that I've been an Arch Linux user since 2008. So I remember the bad old days of init scripts and all kinds of stuff. So yeah, nobody can bullshit me telling me those days were better. Especially as someone who works on upstream software, who works on quite core software, I find Arch Linux incredibly useful: having the latest core software available, having a great build system. All of that makes Arch really, for me, the perfect distribution for anyone who works on the kernel, for anyone who works on systemd, for anyone who works on upstream software. It's a great distribution if you are a developer and you work on those kinds of things. Keeping recent, keeping rolling makes creating and testing new features and fixes really, really straightforward. So I definitely give a big thumbs up to Arch Linux, and that's one of the reasons I've been using it for so long. Thank you very much to everyone who's been involved in it.

So memory management in general is the kind of topic which I want to talk to you about today. I want to help give you more of the tools and information which you need to manage memory at scale. There are many, many misconceptions, even among senior members of our community, about the primitives which we provide as the kernel, and what they are good and not good for. And memory management in general is a very inexact science, where basically everything is a trade-off between performance and accuracy; it's not a mathematically exact science. You've really got to trade off. So I want to give you a better feeling for what trade-offs and what decisions might be applicable to you and your workloads. Hopefully you'll come out of this talk with some ideas that you might want to take away and apply as part of your memory management strategy, and help make it more efficient or more reliable for your specific case.

First, before we even go into that, let's cast our minds back a few years to 2017 and the years before the big rona hit us and left me stranded trying to record this in my own house with the kitten behind me. I went to a number of conferences that year talking about cgroup v2, and cgroup v2 is our big bet as an industry for resource control and management in the Linux kernel. cgroups are essentially a kernel mechanism which we're building to balance and limit and throttle things like memory, CPU and IO: things which you share across a system and across applications. We have pretty good problems at Facebook.
Our user base is increasing, our product range is diversifying, but with that growth comes scaling problems of a class which we've never really had to deal with before. And in the next few years we're really going to feel this crunch for capacity, and we simply can't effectively solve this problem by throwing more and more hardware at it. Not only because of expense, but because of availability; it's really starting to become a problem. And we have hundreds of thousands of machines. We just cannot afford to waste capacity. A small loss on every machine translates to a huge absolute loss across the fleet. So we need to use resources more efficiently, and cgroups are one of the things which fundamentally allow us to do that.

This is also really important because many huge site outages and incidents are caused by resource control issues. Not being able to readily control things like CPU, memory and IO can really either cause issues or make existing issues worse, and we've seen this not only at Facebook but at a lot of our competitors like Google, Amazon, LinkedIn, all of those kinds of companies. We are having a lot of problems as an industry dealing with these resource control problems. So we need an industry-wide initiative to solve this problem, and that's a large reason why cgroups by their nature exist.

Two years ago it was a totally different story though. We had really only one user of cgroup v2, which was us. Even Google, which is now a fairly large user, hadn't really started using it at that point. And it took a lot of work and overcoming obstacles to reach the point where we are confident saying these are now ready for use in production generically. In general, in the old days, what happened was we had a whole bunch of cool new primitives which we invented, which we were very excited to use and which seemed correct. But when we actually went to use them, it ended up being something kind of like this instead. The primitives aren't really wrong. The problem is just that they are kind of on the right track, but we need to know in which cases we should make the primitives, or the operating system, a little bit more square, and the primitives maybe a little bit more round, to achieve what we want. And knowing which of these makes sense in each situation requires a huge amount of testing and experimentation.

Another problem in general is that limiting one resource artificially can, if you're not careful, just result in transferring all of its contention onto another resource. For example, if you limit memory too tightly for an application, that contention will likely just become a huge amount of disk IO, which is obviously not desirable. This is because we will start doing things like evicting the page cache, and we might also start to swap pages which are hotter, or more in use, than we would ideally like to swap. As such, we need an underlying system which mitigates these issues, and we simply didn't have that back in 2017. This is also why you'll see me talking a number of times in this talk about IO. You might wonder: why is he talking about IO when the talk is ostensibly about memory? The reason is that without talking about IO we can only really have an academic discussion of limiting memory. To be practically useful, you always have to have controls on IO when talking about controlling memory, for the reasons we mentioned just before, like evicting page cache, swapping out, that kind of stuff.
Another problem is that while we have had cgroup v2 technically working for a while, for the past few years the major focus has been on issues which it has brought up elsewhere. Because cgroup v2 limits resources in a way that was never really done before in the kernel, it has exposed a lot of long-standing issues. They're not new issues, but we've never really had fine-grained prioritization, so they've never been clearly exposed in such a way before. There's not enough time in these 45 minutes to go over every single issue which we found, but I did go into them in quite some detail in my talk at SREcon last year. If you look for me at SREcon 2019, you'll find a lot more detail about the problems we had with ext4, with journalling, with the mmap semaphore in the kernel, and so on and so forth. In general, the point is we found and fixed a humongous number of issues across the kernel, from memory management to IO to file systems, all over the place, and a huge amount of our time has really been invested in improving resource control in these subsystems so they don't break down under scrutiny.

So let's kick off the main part of the talk by discussing some of the fundamentals of Linux memory management, and this is really important to go over at this point of the talk, because if we don't agree about these, then you're just going to spend the rest of the talk shaking your head. We should at least agree on what we're trying to achieve in the first place. One thing that's really critical there is understanding that Linux has concepts of different types of memory. From the CPU's perspective they're really all kind of the same: it has some concept of different permissions, but it doesn't know anything about the semantic intent of the pages. For example, in Linux, anonymous memory is, as the name would imply, not backed by any backing store. Memory explicitly allocated by malloc and so on during a program's lifetime is typically anonymous, because it doesn't have anywhere else to go. Most people also know about caches and buffers, which are kind of two sides of the same coin. Nowadays they're both part of the unified page cache in Linux, so I'll probably just refer to them interchangeably as file pages, or the unified page cache, or something like that.

One of the problems is that if you ask most Linux users, or most Linux admins, or even most senior Linux administrators, they will say file caches and buffers are reclaimable: they can be trivially freed when things go awry. And they're not entirely wrong in that, but the problem is that people misunderstand what reclaimable means. Reclaimable doesn't mean you can reclaim it. It means you might be able to reclaim it, if you ask at the right time, and we happen to be feeling particularly nice at that particular moment. So for example, if some application is absolutely hammering on some file on the disk, it's very unlikely that we would evict it from the file cache, because we would just destroy the application's performance. So while these can under some circumstances be freed trivially, it doesn't always have to be the case, and this can cause some confusion when people ask, you know, why did my application run out of memory when I still had this "free" memory? We'll come back to some more of these unexpected foibles in a little bit.
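To make the distinction between these memory types concrete, here's a minimal sketch in Python that breaks system memory down by type using the standard /proc/meminfo fields; it's purely illustrative:

```python
#!/usr/bin/env python3
# Minimal sketch: break system memory down by type using /proc/meminfo.

def meminfo():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":")
            info[key] = int(rest.split()[0])  # values are in kB
    return info

m = meminfo()
print(f"anonymous:  {m['AnonPages']} kB (no backing store; unreclaimable without swap)")
print(f"page cache: {m['Cached']} kB (file-backed; 'reclaimable', but maybe not right now)")
print(f"buffers:    {m['Buffers']} kB (block device side of the unified page cache)")
print(f"free:       {m['MemFree']} kB (tells you surprisingly little on its own)")
```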
The fact that these caches might not be evictable at a particular point in time is another example of why the resident set size, or RSS, a metric that we love to measure, is really kind of bullshit. RSS unduly skews a lot of attention towards a few singular types of memory, anonymous memory and mapped file memory, and we forget that many workloads simply cannot operate in any performant way without extensive file and buffer caches, or other system-wide shared memory. So measuring RSS is kind of a losing game, and the reason that we measure it is not because it's a good measurement; the reason is that it's really, really fucking easy to measure, which is a nonsensical reason, right? It doesn't make any sense; I don't understand how we've got to this point. So when somebody asks how much memory your application or service uses, the only reasonable answer, unless you've compressed it beyond belief and found out the answer for yourself using some kind of workload compression, is that you have absolutely no clue how much it actually uses. In one case inside Facebook, just as an example, a team that had for years believed their memory footprint was on the order of 100 to 150 megabytes found out that it was more like 2 to 3 gigabytes of critical memory working set, just because of file caches and buffers and so on. So you really can't just exclude those from your memory calculations, because that's not a reasonable way to operate your service or application.

This is one reason why in modern resource control, and in cgroup v2, we limit all types of memory together, instead of just trying to limit anonymous memory, or one type of memory, or so on. The problem there is that if you limit just anonymous memory, or you do what cgroup v1 did and limit every single type of memory separately, you're trivially going to allow the application to still cause problems on the system, by thrashing the file cache or by using some type of pages which you're not accounting for, so it doesn't really work very well in the long term.

This whole discussion also brings us somewhat neatly on to swap. Swap in general is a very widely misunderstood concept and a highly controversial topic. A lot of people think that swap is mostly irrelevant nowadays, with tens of gigabytes of memory. It's a little bit strange though, because for the things that swap is good at, you really cannot get them any other way, and for the things which it's bad at, it's usually entirely possible to mitigate them. So usually these discussions hinge on some very large misunderstandings about what swap is actually intended for, especially since it has almost absolutely nothing to do with the popular misconception of it being emergency memory, or kind of slower RAM, as some people believe it is. People also have this really weird idea when talking about swap that they should disable it, because disabling it will somehow magically stop memory contention from turning into disk operations. But this is obviously not true, right? Just because you can't measure it doesn't mean it's not happening. It's patently bullshit. If we can't evict anonymous pages, which is what happens if you don't have swap, then we're just going to thrash the file cache; we're just going to throw out the file cache, and even worse, we might be forced to reclaim hotter file cache pages than we would actually want to reclaim.
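Coming back to the unified accounting point for a second, here's a minimal sketch of what cgroup v2 actually charges to a group, versus the anonymous-centric picture RSS gives you. The workload.slice path is a hypothetical example; adjust it for your own system:

```python
#!/usr/bin/env python3
# Minimal sketch: everything cgroup v2 accounts to a group, not just RSS.
# The cgroup path below is a hypothetical example.

CG = "/sys/fs/cgroup/workload.slice"
MIB = 1 << 20

def read_stat(path):
    with open(path) as f:
        return {k: int(v) for k, v in (line.split() for line in f)}

current = int(open(f"{CG}/memory.current").read())
stat = read_stat(f"{CG}/memory.stat")

print(f"memory.current: {current / MIB:.0f} MiB (everything charged to this cgroup)")
for key in ("anon", "file", "slab"):  # a few of the many memory.stat keys
    print(f"  {key}: {stat[key] / MIB:.0f} MiB")
# 'file' and 'slab' are the parts RSS-style accounting tends to miss.
```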
So, coming back to swap: without it, we end up doing something even less desirable, in terms of disk IO, than what would happen if you had swap. These kinds of misunderstandings have really hobbled swap's reputation, and they've led to this misconception that it's not useful if you have a lot of memory, or some bullshit like that. So if swap isn't just some mechanism to expand your RAM, what is it? Well, swap allows reclaim of types of memory which would otherwise effectively be mlocked. That is, it provides a backing store for anonymous memory which would otherwise be locked into RAM. It also allows us, by its slow nature, to ramp up memory contention more gradually, which is really important in order to catch and kill processes which are affecting the system, rather than the system just falling over and dying. Without swap it's really hard to run with positive memory pressure; it's really hard to run hot on memory, or in a memory-bound system, because we almost immediately go from using all of the available pages with everything working hunky-dory, to oops, you requested one too many pages, and now the system is completely falling over and we're completely deadlocked. This is because running really hot on memory and running out of memory become really binary without swap. It just happens way too fast to act on in any reasonable way.

Another thing is that if you've compiled things, you've probably seen this make -j cores-plus-one stuff. Why do we do that? We do that because we want to fill up any of the remaining slack left by work appearing or going away, from increases and decreases in demand. And we want to do this with memory too. We want to always be pushing it just a little bit, so that we can make the absolute maximum use of everything that we have. This comes back to what I mentioned earlier about us, as the operating system, as the Linux kernel, wanting to promote the most efficient use of memory possible. And we want to do this without impacting system latency too much, so we do need to be careful in the way we do that. Without swap, memory contention increases are, by comparison, just very, very sudden. Often systems aren't really recoverable from that state, and I'll go into an example case of that later. That is, when we go over the edge, we end up recycling pages so fast, in and out from disk into memory, that we can't make real forward progress anymore. Swap does come with some trade-offs, but in general it does prevent a lot of these situations from occurring in the first place. And if you're generally of the opinion that swap isn't useful, or that it's some relic from the past, I really do recommend reading the post at this link and taking a look at what's contained there. I hope it will give you some insights into how we want to use it.

One reason that people particularly don't like swap is that it can delay invocation of the OOM killer. The kernel OOM killer is essentially what's invoked when the system is ostensibly out of memory. It's essentially a massive cannon that you fire at some process, and you pray that you fired it in the right direction, when usually you've fired it in kind of the wrong place. And this is part of the two fairly major constraints of the OOM killer. Number one is that if you invoke it, you've probably already lost: pretty much everything on the system is queued up by that point. Due to the fact that memory hotness and memory access information is hidden behind the CPU's memory management unit,
we don't actually know, as the kernel, when we are out of physical memory in a readily available way. This might come as a surprise, but it is true. Linux really has absolutely no idea when you are out of memory. It only knows when it has tried really, really, really hard to go and free some pages, and it couldn't do it for quite a while. Generally this is because only the CPU propagates this MMU page hit information, that's this PTE young stuff, and calculating it ourselves would be way too expensive, so we have to pull this information when we are trying to free pages; that's what reclaim means. So the problem here isn't knowing when our memory is full. We know that, because our goal is basically to have our memory as full as possible all the time. It's easy to know how many pages are resident. What's not so easy is knowing how many pages we could easily free if required, and as discussed, we don't really know when we are out of memory, because we need to reclaim first to know that information, which can take a really, really long time. It's a very expensive process, potentially involving all kinds of IO. As you'd also imagine, we only want to invoke the OOM killer when we actually are out of memory. So this means there can be a really long time, after totally running out of memory, that you wait before the OOM killer is actually invoked. We already have to go through a lot of reclaim attempts; we might have to do a lot of disk IO; we might have to look at the MMU hotness statistics. So relying on the OOM killer to do something on your system in a timely fashion is really a fool's errand. It might happen to work for you right now, but there are a million other cases in which it will not work for you, and in which your system will grind to a halt. So our goal in general should be to avoid having to invoke the OOM killer at all, and to avoid memory starvation in the first place, and I'll come back to how we go about doing that reliably at scale in a slide or two's time.

A second problem with the OOM killer is that it's not really configurable. We have this thing called OOM scores. I love anything in the kernel called a "score", because it means that when we designed the interface we really had no fucking clue how it was supposed to be used. In general, nobody knows how this is supposed to work; even the people who work on memory don't know how this is supposed to work. You either set it to like minus a thousand, or plus a thousand, or I think 999 is probably the limit, you set it to the maximum limits or something, or you set it somewhere in the middle, and you kind of pray that it's higher or lower or somehow different from some other application. It does play into the OOM killer's heuristics about what to kill, but aside from the absolute absolutes of minus 1000 or 1000 or whatever, it's really hard to reason about what it's actually going to make the OOM killer do, how exactly the OOM killer is going to behave, and whether it's going to have the desired effect or not. So this often means that the OOM killer will just kill completely the wrong thing on the system. It often just kills the largest thing on the system, or kills things until it randomly finds the one which was causing the problem. Obviously not a great solution to the actual problem there.
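For reference, this is the knob being described: a minimal sketch of poking /proc/&lt;pid&gt;/oom_score_adj, where -1000 effectively exempts a process and 1000 makes it the preferred victim; everything in between is the hard-to-reason-about part:

```python
#!/usr/bin/env python3
# Minimal sketch: adjust a process's standing with the kernel OOM killer.

import os

def set_oom_score_adj(pid, adj):
    # Valid range is -1000 (never kill) to 1000 (kill me first).
    assert -1000 <= adj <= 1000
    with open(f"/proc/{pid}/oom_score_adj", "w") as f:
        f.write(str(adj))

# Make the current process a preferred OOM victim:
set_oom_score_adj(os.getpid(), 1000)
# The derived score the OOM killer actually uses is visible read-only:
print(open(f"/proc/{os.getpid()}/oom_score").read().strip())
```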
So I keep on bringing up this thing called reclaim; it's probably worth going a little bit into how that works. Reclaim is just a fancy word for this process of trying to free pages. There are many different ways in which reclaim can happen. Two common ways are kswapd reclaim and direct reclaim. kswapd reclaim is done by a background kernel thread, which tries to proactively free some memory when we are reaching some threshold, for example if you've used 95% of memory or something like that. kswapd reclaim avoids going into the next and scarier phase of reclaim, which is called direct reclaim, which is what happens if your application requests memory but there isn't any available right now. This can result in your service having an actually measurable freeze, which obviously sucks, because what's going to happen is we're going to actually suspend the application from execution while we go and try to get some memory. So that's why we have this other proactive kswapd reclaim thread, which avoids getting into the direct reclaim state in the first place.

Reclaiming some pages may also be easier or harder than reclaiming others. We talked a little bit earlier about this "reclaimable" memory stuff and what it actually means. Some pages may be reclaimable, but just not right now. For example, some cache pages may be so hot that it makes no sense to evict them right now; the applications accessing those pages would just be driven into the ground if we tried to do so. The same goes for anonymous pages if you don't have swap: if there's no swap free, or no swap available at all, then without a backing store for that memory it's just locked in memory and you can't do anything with it, which kind of breaks down memory management, because you simply cannot reclaim a huge class of pages. Some page types may also be reclaimable right now, but take some effort to reclaim. For example, dirty pages, which are pages in the page cache which have been modified and are waiting to be written back to disk, need to be flushed to the disk first before we can actually proceed to free them, right? Otherwise we'd either lose the modifications to the file, or even potentially corrupt the file system or other critical parts of the system. In practice, this variance and unpredictability in reclaim means that it's typically really hard to tell ahead of time when you're actually running out of physical memory. But if we wanted to know that, how would we go about it?

If I were to ask you to tell me how you would tell if your machine was oversubscribed on memory, what kind of metrics would you traditionally look at? Well, when I ask this question at conferences, typically the answer I get is something like "look at the free memory". But the problem with looking at free memory is that it doesn't really tell you anything, because in general we want to use all of the available memory on the system as much of the time as possible, so it's going to say it's full almost all of the time. And then usually somebody says, okay, we'll just count the caches and buffers as free memory too. But again, some caches and buffers might actually be needed; they might not be able to be reclaimed at this particular point in time, so you really can't just count them as free and expect that to make sense.
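For what it's worth, the kernel does publish its own estimate of this: MemAvailable in /proc/meminfo is its guess at how much memory could be used without swapping. Here's a minimal sketch putting the naive calculations side by side with it; the point stands that even MemAvailable is only an estimate:

```python
#!/usr/bin/env python3
# Minimal sketch: the naive "free" metrics next to MemAvailable, the
# kernel's own (still approximate) estimate of usable memory.

m = {line.split(":")[0]: int(line.split()[1])
     for line in open("/proc/meminfo")}

naive = m["MemFree"] + m["Cached"] + m["Buffers"]
print(f"MemFree:      {m['MemFree']} kB")
print(f"free+caches:  {naive} kB (assumes all cache is freeable; it isn't)")
print(f"MemAvailable: {m['MemAvailable']} kB (kernel's estimate, still a guess)")
```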
Another metric that some more senior engineers often bring up is page scan rate. Page scan rate is essentially the rate at which we are going through pages trying to reclaim them, trying to free them. If it's very large, it means we are having to scan a long way through the memory lists to try to find some pages which are eligible to be freed. The problem with page scans is that it's very hard to differentiate a very efficient use of the system from oversubscription of the system, so it's not really easy to tell, just from these metrics, whether you're tipping over the edge or whether you're in a perfectly steady state. There are other metrics, of course; we could spend all day going through them, but all of them up to this point have their caveats and limitations, and they can all trigger under normal operation and useful system states, as well as under actual system instability and oversaturation. So really, all of these metrics which we come up with are just approximations of memory pressure.

So what is memory pressure? Well, we've never really had a metric like memory pressure in the kernel before. We have many related metrics, like the ones we've just been over, but even with all of those metrics it's really hard to tell what is real pressure and what is just a very efficient use of the system. For memory, for example, we might want to be able to tell how much of the time threads on the system are stuck doing memory work. And we have this new kernel feature called PSI, which for memory can tell you the amount of time that threads on the system were stuck doing some kind of non-ideal memory work, like faulting pages back in, or doing some kind of work which, if we had more memory, we probably wouldn't need to do. The "some" line here means some threads in that cgroup were stuck doing this non-ideal memory work for, say, 0.21% of the time over the measurement window, and "full" means the same, except it's all threads in the cgroup that were stuck doing that. This could be things like waiting for a kernel memory lock, being throttled, waiting for reclaim to happen; it could be waiting for disk IO, waiting for things to be swapped in, waiting for page cache faulting, stuff like that. So pressure is essentially saying: if I had more of this resource, I could probably run 0.21% faster. It's quite a useful metric if you're developing high-availability or high-reliability applications and that kind of thing; it really gives you an overview of how your system is operating, not just in an ideal circumstance, but also whether it's being oversubscribed and so on. For example, for the high-availability and high-reliability use case, we at Facebook use this to know in advance whether we're about to use too much memory for async jobs, for background jobs on the web servers, and we do load shedding; we can say, you know, we couldn't take that much more memory. Using this metric, we know ahead of time when we are actually reaching memory saturation, which we don't know from any of the other metrics. And you can't measure this just by looking at resident memory, because again, you simply don't know ahead of time what's reclaimable and what's not.
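The pressure files being described look like what this minimal sketch parses. It reads the system-wide file; under cgroup v2 the same format shows up per cgroup as memory.pressure, and cpu and io have equivalents:

```python
#!/usr/bin/env python3
# Minimal sketch: parse PSI memory pressure from /proc/pressure/memory.
# "some" = at least one task stalled on memory; "full" = all tasks stalled.
# avgN fields are percentages over the last N seconds; total is microseconds.

def parse_pressure(path="/proc/pressure/memory"):
    out = {}
    with open(path) as f:
        for line in f:
            kind, *fields = line.split()
            out[kind] = {k: float(v) for k, v in (kv.split("=") for kv in fields)}
    return out

p = parse_pressure()
print(f"some avg10: {p['some']['avg10']}%  (0.21 would mean ~0.21% of time stalled)")
print(f"full avg10: {p['full']['avg10']}%")
```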
These PSI metrics are also what powers our pre-OOM detection at Facebook, and we do this as part of a project which we've open sourced called oomd. oomd is essentially a userspace OOM killer with a really fine-grained policy engine, and it allows you to encode policies of what to do in certain memory pressure situations. For example, we run Chef on our machines as part of configuration management, and while Chef is important, if we're running tight on memory we can live without it for a few minutes, and the same goes for other background work which isn't customer facing. oomd allows you to encode things like that based on these pressure metrics. For example, we might measure a best-effort application's pressure metrics, and if the best-effort application starts to cause contention for others, or cause system-wide contention, we can either kill it before OOM or performance degradation happens, or throttle it. Using these metrics, we can prevent OOM entirely while still preserving system stability and avoiding grinding to a halt system-wide. The same cannot be said when the kernel OOM killer is used, because again, it often results in a significantly degraded system for long periods of time, simply because it has to go through so many reclaim passes, whereas we can passively monitor the pressure the whole time.

Another thing which you might have seen me talking about in the old days of cgroup v2 was limits, that's this memory.high and memory.max stuff, and the initial proposal for cgroup v2 was to limit non-essential work on the machine, kind of a traditional approach. But this has, among a few other things, a few major problems. One fundamental problem is that it doesn't handle applications which have highly variable memory requirements. For example, if you have an application which uses 900 megabytes one second and 1.5 gigabytes another second, then if you set the value too low, towards the low amount, the application might just get OOM killed unnecessarily, but if you set it too high, the application for the majority of its lifetime has completely free rein over the system with no real consequences. So that becomes a bit of a problem: just using limits doesn't really allow for any ballparking. Another critical problem is that the resource usage of some system-wide daemons can be very heavily tied to the core workload on the machine. For example, if you have a daemon locally available which does tier lookup or service lookup, that kind of stuff, and it has a cache, how big that cache gets largely depends on what it's asked to do by the main workload. If the web server requests some kinds of tiers which are very large, maybe the cache gets very large. And this is a problem, because most fleet-wide service owners for these kinds of daemons don't necessarily know what reasonable values for their service's memory should be, because it's so variable depending on what they're asked to do. So even if they can record it right now, it could change five minutes in the future when they're asked to do something else.

Something that we can reason about, though, is how much memory we actually want to reserve for the primary workload, like the web server or whatever. It's much more easily configurable to make some kind of guarantee about what we will give the workload, rather than acting punitively against other non-essential services when maybe we really didn't need to. Generally, large service owners, like web tier owners, do know roughly how much their service requires, because they know what their user base looks like and they know what their requests look like, and that's something which is a lot more uniform across the fleet. So for this reason we've largely shifted to using and improving our protection tunables, that's this memory.min and memory.low stuff, which define hard and best-effort protection respectively. When you're below your protection threshold, what happens is that we don't reclaim memory from you:
in the case of memory.min, we will never breach it, we will never reclaim from you as long as you're below your threshold; and for memory.low, we will only reclaim from you below your threshold if we are under system-wide memory contention anyway. This allows significantly more work conservation; it allows a lot more work to happen on the machine without punishing people unnecessarily. Daemons can use as much memory as they like, as long as they don't affect overall system stability, and that's one thing which has been very helpful in developing this overall strategy for resource control.
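Here's roughly what setting those protections looks like, as a minimal sketch against a hypothetical workload.slice cgroup; on a systemd machine you'd more likely set MemoryLow= and MemoryMin= in a unit file and let systemd write these for you:

```python
#!/usr/bin/env python3
# Minimal sketch: protect the workload's memory instead of limiting
# everyone else. The cgroup path is a hypothetical example.

CG = "/sys/fs/cgroup/workload.slice"
GIB = 1 << 30

def write(knob, value):
    with open(f"{CG}/{knob}", "w") as f:
        f.write(str(value))

write("memory.low", 8 * GIB)  # best effort: reclaimed only under contention
write("memory.min", 2 * GIB)  # hard guarantee: never reclaimed below this
# Note: protections are hierarchical; ancestor cgroups must grant them too.
```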
So this is all very well and good; I've gone through a whole lot of primitives, but how do we actually intend to use them? Well, this is where fbtax2 comes in. fbtax2 is our overall project for resource control at Facebook. cgroup v2 is certainly one of the primary things we need to achieve that; however, as we've gone over in this talk, we also need a lot of other supporting elements to work and be in place for it all to compose into a functional and coherent system. One concern, of course, is stopping background services that the main workload isn't dependent on from affecting the workload, for example metric aggregation or configuration management going crazy and stealing memory from the web server, or running the machine out of memory entirely. We also want to have reasonable behaviour if we start to exceed the capabilities of the hardware, if we try to use too much memory. Again, to scale to our future capacity needs we do need to have this high resource density; we need to be running with high memory pressure and making the best use of positive pressure. Of course, this does come with the risk of going over the edge, of pushing a little bit too far, right? So we need to have graceful behaviour when we do inevitably push it a little bit too far. It's also really important that our efforts are lightweight: it's no good if we can now load machines 10% more, if we also add 10% of load to the machine to do it, if you get my point. It's also no good if we produce a technically incredible solution, but it's such a burden on site operations or maintenance or service operation that nobody actually wants to use it in practice. So we have to make it usable, and that's one thing we've been very conscious of when designing cgroup v2.

fbtax2 in general comprises a really wide range of solutions which compose into a usable system. We do need to tune the base OS and be opinionated, to make sure that it's capable of isolating resources; otherwise, on non-compliant systems, we might actually end up breaking down and making things worse than doing nothing at all. As mentioned earlier, we also have the userspace OOM killer, oomd, running on fbtax2 machines. On those machines it monitors for threats to the workload, and for misbehaving workloads on shared machines; it prevents these very obviously misbehaving applications from posing a threat to overall system health, based on the policies which you've assigned.

As for how this actually looks at the base OS layer: we use Btrfs as the root file system. This is needed because, as mentioned, we have some fairly insurmountable priority inversions elsewhere. We found one of those was in ext4, in the way in which it does journalling: you have to flush out all of the entries in the journal at once, and one of the problems then is that you cannot limit IO for any application, because all applications become contingent on that flush completing. So if you have some low priority application with entries in the journal, and some high priority application with entries in the journal, the high priority one becomes contingent on the low priority application's journal entries being completely flushed to disk, and because we're throttling the low priority application, that's not going to happen. Fixing these kinds of things wasn't really possible on ext4, but it has been possible on Btrfs. We do have a lot of the Btrfs core maintainers at Facebook, which is obviously not a coincidence given why we've been able to do a lot of these things. But in general, Btrfs's design is far away from the memes of like 10 years ago, where everyone was complaining "Btrfs ate my data". Nowadays we use it on tons and tons of systems, I can't say an exact number, but a lot of systems, a lot of digits of systems, and it works just fine. In fact, it detects a lot of hardware issues, due to the way it does checksumming, which are not detectable by things like ext4. So that's one reason people often complain "Btrfs ate my data": no, it actually detects these issues which other file systems are not detecting.

I also mentioned earlier the importance of swap. Usually we might disable swap for the main workload; it depends a little bit on what it does, and it may or may not be reasonable to have it enabled for the workload. But we do need swap system-wide, to make sure that we're efficiently able to reclaim all types of memory, and not just a limited subset. We're also opinionated about some kernel settings. One of them is WBT, or writeback throttling. If you've used desktop Linux in the past, and I assume you have since you're at ArchConf and most people here are probably using it on the desktop, you may have encountered the situation in the bad old days where you plugged in a USB drive, you tried to copy some files to or from it, and the whole system would just freeze completely. The reason for that is that writeback is generally not stoppable: once you've issued writeback, the forward progress of the system becomes contingent on those writebacks completing, and your whole system stalls because this IO is the most critical thing it's doing at that point in time. It's a little bit complex to explain, but the general principle is you cannot slow it down, you cannot stop it. This is what writeback throttling is designed to address: since writeback has this special property that, due to the way it's implemented, we cannot slow it down or stop it once issued, writeback throttling is a way of avoiding issuing the writeback in the first place, slowing it down if we are exceeding some threshold, or if we know we're causing the system pain.

cgroups are also the bread and butter of resource control, so as such it does make sense to go into our choices in terms of configuration there. To get sensible resource control, it's very important to have clear definitions of roles from the top level. For example, we have system.slice, which is a slice for work which is not time sensitive, for best-effort, nice-to-have services which can be killed or throttled in an emergency, like Chef or some metric aggregation, that kind of stuff. We also have hostcritical.slice, which is for daemons which are required for the system to run. These are typically things that we need for debugging, like SSH and logging, and also some systemd daemons, that kind of stuff, and we want all of these to work even if the system is generally not having a very good time. We also have workload.slice, which is
for the key workloads which are running on the machine, for example HHVM on a web server, or MySQL on a database server. Having these really clear distinctions allows us to know exactly how we are going to propagate resources from the top down based on service role, which makes life a lot easier.

This is how things used to look: punitively limiting the best-effort work, as we mentioned earlier. This is fairly brittle. system.slice memory, for example, might legitimately spike, and that might be okay in some circumstances, yet we might end up slowing down or killing things when we really didn't need to. It's also really absolute; it doesn't allow for any ballparking. If you set a memory limit, we're going to enforce it absolutely; we're going to take some punitive action against the application if you exceed it. So it's not really great for applications which have highly variable resource usage, and we need something that can help with that. That's why we've changed from using limits, this memory.high and memory.max stuff, to using protection, this memory.low and memory.min stuff. Again, these are kind of a guarantee that we will reserve some memory for that process; we don't prohibit any other application from using it, but we will aggressively take it away from them if the workload does end up needing it. This allows ballparking significantly more easily, because you can set these kinds of thresholds without affecting overall system throughput and stability.

You'll also notice the addition of io.latency. This is a new kernel feature which we've been working on at Facebook, but it's available in modern kernels everywhere; you use Arch Linux, so you probably have it. We don't really have time to deep dive into how this works, but essentially it works by providing a disk IO completion latency threshold: you guarantee some amount of IO completion responsiveness to the main workload, some other level to the best-effort services, and some other level to the host-critical services, and it keeps on degrading lower-priority groups until each of those thresholds is met. IO control is really important when limiting memory, as we discussed, and if an application is misbehaving, it can cause huge IO storms when you're limiting its memory, so this is one reason why we really need it. Having these completion latency guarantees, the writeback throttling, and oomd really permits stopping these bad applications from affecting system stability overall, and this happens before they affect system throughput and stability at large.
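Pulling the pieces together, here's a minimal sketch of the shape of that top-level configuration. The slice names follow the talk; the device numbers and latency targets are made-up illustrations (check cgroup-v2.rst in the kernel documentation for the exact io.latency format and units on your kernel), and on a real systemd box you'd express this through unit files rather than writing the files directly:

```python
#!/usr/bin/env python3
# Minimal sketch: tightest IO latency target for the workload, looser
# targets for host-critical and best-effort slices. Values illustrative.

ROOT = "/sys/fs/cgroup"

def write(cg, knob, value):
    with open(f"{ROOT}/{cg}/{knob}", "w") as f:
        f.write(value)

# "8:0" is the MAJOR:MINOR of the protected disk (see /proc/partitions).
write("workload.slice", "io.latency", "8:0 target=50")
write("hostcritical.slice", "io.latency", "8:0 target=100")
write("system.slice", "io.latency", "8:0 target=500")
```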
I also mentioned that I don't want to just talk about abstract theory; I also want to talk about production success stories. In this case we intentionally caused a fast memory leak in system.slice, the best-effort slice, on Facebook web servers. Don't worry about all of the colours and lines and little bars yet; we'll go over those. Essentially, the purple line is without fbtax2 and the green line is with it. With a normal operating system setup, the crux is basically that it takes on the order of minutes to recover from a 10 megabyte a second leak, and that leak is basically nothing, it's pretty tame compared to a lot of real-life situations. This is another example of why, when people think the OOM killer will come to their rescue, it simply won't. This all works based on reclaim, so again, if the system is making progress doing reclaim system-wide, it will not invoke the OOM killer, even though your application might be being completely driven into the ground. Only after an incredibly long time, sometimes on the order of a double-digit number of minutes, does the system realize it's not actually making any forward progress, long after a fleet-wide event could easily have taken out your entire service. Contrast that with the fbtax2 case, where we quickly isolate and stop the leak before it actually affects machine or workload stability. Now we have significantly better control over resources, and can not only increase fleet-wide stability but also fleet-wide efficiency, which is quite important considering the capacity crunch I mentioned earlier. In fact, we can repeatedly relaunch the same memory-leaking application over and over and over, and it has no effect on the workload whatsoever, due to the fact that oomd is killing it, and the fact that we are throttling it beforehand. This is made possible by all of the primitives which I've been talking about in this talk working in tandem; no one of them could do this alone, and this is what I was talking about when I said we had to build a complete and compliant operating system in order to perform resource control reliably: none of these can do it by themselves.

On tiers which run some of the background jobs at Facebook, some of the primitives mentioned were even part of permitting them to run on smaller and more efficient machines. Previously they couldn't move, because of so much uncertainty about the variable memory use of their workloads, but now they can use some of these primitives to reliably measure whether they can take on more load, or whether they should do load shedding, which was a huge part of allowing them to migrate to smaller and more efficient machines. That's another huge win from these efforts. One thing I'm pretty excited about is that the new technologies which I've presented in this talk are allowing us to do things which we've simply never been able to do before as engineers or in Linux, so I really hope that I've been able to give you some idea of how these tools and technologies might be able to help you in your specific line of work, and I'm really looking forward to seeing what new, novel things people at this conference might be able to build using what's presented here in the coming years. I've been Chris Down, and this has been Linux Memory Management at Scale. Thank you very much and have a great day.

Welcome to this Q&A. My name is Marcus, aka Michelinu, and I have Chris Down with me. How are you doing, man? I'm good, you? I'm doing good. Awesome, awesome. I've got a few questions for you regarding your talk, obviously, and let's start with this one, which is asked by me: what are the pros and cons of swap files versus swap partitions, basically? Yeah, it's quite a detailed question, because I think a lot of people have this conception that there should be some performance loss if you use a swap file. It intuitively makes sense, right, because people think you have this separate layer of the file system and so on and so forth. But inside the actual swap code, if you look at mm/swapfile.c, I think it is, I'm not sure which file exactly, but if you look, the way it works is we generically have an inode, which may or may not be a block device, and we do exactly the same thing more or less.
In fact, for a swap partition we actually do more stuff: we have to pretend we are a real file system, we have to do bad block detection, we have to do all kinds of stuff which you wouldn't usually have to do. So nowadays there's no reason not to use a swap file. We also added Btrfs swap file support quite a while ago, Omar from my team added that, so yeah, I think nowadays you might as well use the file.

Alright, awesome, thanks for the answer. So the next question is by Dr. Hashimoto, and he asks how you got into this area of kernel development, like, what was your journey, basically? So for a long time I've been, I guess, kind of an expert user of the Linux kernel. I think that goes for a lot of us who have used Linux for a long time, especially people who use Arch Linux; it tends to be a distribution used by people who are quite exacting in what they want. That feels like an "I use Arch, by the way" comment, but let me just tone down the elitism. But yeah, I think the answer is: find something which annoys you and have a think about how to fix it. The things which always stopped me were (a) my C was very bad, and (b) I felt like, you know, if there's a problem to be solved, probably somebody's already solved it, so why should I go into that? Truth is, both of those are not that difficult. C might seem like a very intimidating language; it's actually a really simple language. And as for all problems being solved: I think once you've got into kernel development, you'll discover there are a lot of outstanding problems which really just need people with the drive and energy to solve them. I would also, for me personally, like to give a big thank you to Johannes Weiner, who was my friend and someone who helped me a lot when I got started in kernel development. It can be kind of a hostile community; I guess it has a reputation, and it has it for a reason. Having someone who can be there to tell you, you know, that guy's just being an asshole, your patch is good and that person is just being kind of a dick, I think that's really helpful, having that kind of support. So those are the two things I would really recommend.

Awesome. So the next question is from Darnimator, and he says he runs all his production systems with vm.overcommit_memory=2, and he asks why you shouldn't do that, or alternatively, why doesn't everyone else do it? So vm.overcommit_memory is probably the worst sysctl we've ever made, because 2 has like a special meaning, and 0 means totally the opposite of what you would think, and everything... I think 2 is the value which means allow overcommit but also do the heuristic check. I'm sorry if I'm getting that wrong. If so, then the answer is that the heuristic check is actually not particularly useful. The way it works is, I think it does something like: if your virtually allocated memory is some order of magnitude larger than the available system memory, then we will refuse to overcommit any more, essentially with the idea that if everyone came back and demanded their memory now, we'd need to actually provide it to them, so we don't want to let it get too large. In reality it doesn't really prevent against that, because... oh, hold on, yeah, I'm back, my legs are going mental, and God knows what that's about. Yeah, I think the main thing is that it doesn't really prevent the situations it was supposed to prevent, so it's fine; you can also just run with overcommit on. I mean, I don't think the protections do a whole lot, because they're all based on virtual memory instead of resident memory.
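For reference, the modes as documented in the kernel's overcommit accounting docs are: 0 = heuristic overcommit (the default), 1 = always overcommit, 2 = don't overcommit past a commit limit. A minimal sketch of inspecting the relevant state:

```python
#!/usr/bin/env python3
# Minimal sketch: inspect the overcommit mode and the commit accounting it
# checks against. Note it's all virtual memory, not resident memory.

mode = open("/proc/sys/vm/overcommit_memory").read().strip()
print(f"vm.overcommit_memory = {mode}")  # 0 heuristic, 1 always, 2 strict

for line in open("/proc/meminfo"):
    if line.startswith(("CommitLimit", "Committed_AS")):
        print(line.rstrip())
```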
Alright, so the next question is by KGZ, and he asks: do you need to be using cgroup v2 for workloads to make use of any of the PSI metrics? So it kind of depends. For PSI we have the availability of these /proc/pressure files; those are available whether you use cgroup v2 or cgroup v1. You do need to use cgroup v2 to get pressure per cgroup, the reason being that we have some dependencies for PSI which are cgroup v2 specific. In cgroup v2, and this could be a whole other talk in itself so I'll try to keep it brief, there are a whole bunch of evolutions which we've made in how we manage how individual resources are accounted, and we need those in order to be able to say, on a per-cgroup level, how does this break down. So if you want granular metrics, yes, you need cgroup v2. If you just want system-wide metrics, which may or may not be enough for you, then you can get them either way, but I do recommend cgroup v2.

Alright, so the next question is by mark mark mark. A good name. Yeah, very good. And he asks if Facebook uses Arch, or are you just... I'm on the call with Mark Zuckerberg as we speak, telling him we should migrate. No, let's put it this way, as an individual, speaking as somebody who works in the operating system space: I'll answer a different question than the one you asked, and I think it's the one you want answered. I wish that we had Arch on a lot of servers. I think a lot of people say, you know, oh, I wouldn't use Arch on a server, but the reality of using Arch on a server, I think, is that the paradigm isn't any different from using most other distributions. If you're running at scale, you have to have some kind of testing, some kind of rollout, some kind of strategy for testing changes; nothing really changes that. I mean, we run CentOS, and the strategy is exactly the same: you have to treat every change as if it's really risky. So I don't think anything really changes. The only bit which is kind of more annoying is you have to pick and choose which things you get more modern versions of. So we run CentOS, but it's really almost like Fedora; we've backported so much stuff from Fedora. And for those who don't know, Fedora is like a more modern... it's almost like what Ubuntu is to Debian, kind of thing. And I wish we ran it, because I think it would be a lot easier to manage, but, well, I don't see that happening any time in the future. As for my personal use, I've been using Arch since 2008. In my opinion it's amazing for anyone who works on upstream software: just having all the latest libraries, having the latest glibc, the latest systemd. No matter what I'm testing, I know all the other components are as I want them to be, and it just makes things super easy for me.

Awesome. So that was all the questions we had time for. Thank you so much for answering them. Yeah, no worries at all. Thank you everyone for attending, and thank you Rhys Parry for the great memes in chat. Awesome, thank you.