All right, next up: Chris Down is going to be talking to us about Linux memory management at scale.

Hello there, you lovely people. Hope you've had a good lunch. I'm Chris Down. I work at Facebook on the kernel team, where I mostly work on improving memory management within the Linux kernel. I'm also a maintainer of the systemd project, so you can come heckle me later. Mostly I spend time thinking about how we can make Linux more reliable and usable at scale, and that kind of feeds into what I want to talk about today: I want to help you get more of the tools and information that you need to manage memory at scale. There are many misconceptions, even among senior engineers, about the primitives that we provide around memory in the kernel, and what they are good for and what they are not good for. Memory management in general is a really, really inexact science where basically everything is a trade-off, so I want to try and give you some information that might help you make the right trade-offs for your services and your workloads. Hopefully you'll come out of this talk with a little bit better understanding of how you might make your memory management more efficient for your specific case.

So, a talk within a talk first, before we go into that: how about we go back to 2017? I came here, to this containers devroom, with exactly the same people running it (although this photo is from QCon), and I gave a talk about a thing called cgroup v2. cgroup v2 has been, for the last few years, our big bet for resource control and management. cgroups are a kernel mechanism that we're essentially building to balance and limit shared resources like memory, CPU, and IO across a machine. We have some pretty good problems at Facebook, and as an industry as a whole: our user base is increasing, more people are on the internet, our product range is diversifying. But with that growth come scaling problems that we've never really had to deal with as an industry before, and in the next few years, especially at Facebook, we're going to feel this massive crunch for capacity. We simply can't solve this by throwing more computers or more RAM at the problem. We have hundreds of thousands of machines; any small loss in one of those machines represents a huge absolute loss in resources for the company. So we need to use our resources more efficiently, and cgroups are a huge part of that.

It's also really important because a lot of huge site incidents and outages, not just for Facebook but for people like Google and LinkedIn, all the big sites, are caused by lacking resource control. Not being able to readily control things like memory, CPU, and IO can either cause problems or make existing problems significantly worse or longer. Resource control issues cause some of the most pervasive problems in our industry, and we need an industry-wide initiative to solve this; that's largely why cgroups exist.

Two years ago, though, it was a totally different story, right? (I think we're in Europe, so this is an appropriate slide to show.)
We really only had one user of cgroup v2, which was us. Not even Google was using it, and they like to use everything; we were really the only people using it. We had a bunch of different primitives that we had just thought up and were using and inventing, but when we actually came to use them, it went a little bit like this: it's not completely wrong, what we've got is somewhat there, but we've got to work out in which cases we want to make the primitives a little bit more round, and in which cases we want to make the operating system a little bit more square. Finding the right trade-off takes a lot of production experience and a lot of production experimentation.

Another problem is that one of cgroups' jobs is literally to limit resources artificially within the machine, and one of the things that happens is that this can cause another resource to become oversubscribed as a result. For example, say you artificially limit memory very tightly: you take an application that should use one gigabyte and you put it in, say, 200 megabytes. All that's going to happen is you start hitting the disk a lot, because you're going to have to swap things out. Even if you don't use swap, you're going to end up evicting things from the page cache, so you'll hit the disk more often to read them back. You end up with essentially the shittiest memory that money can buy, in the form of a disk.

As such, we need an underlying operating system that mitigates these problems and allows us to limit these things without affecting other parts of the system. This is also why, in a talk about memory, you'll hear me talking a lot about IO. Disk IO is something you really need to have control over if you're going to limit memory. Without IO control, memory control is really just an academic concept: if you have effective memory control and no IO control at all, you're just going to destroy the machine. And as for the existing approaches to memory control that don't take IO into account? Mostly, they just don't work.
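To make that concrete, here's a minimal sketch of what that kind of tight limit looks like through the cgroup v2 filesystem interface. It assumes cgroup2 is mounted at /sys/fs/cgroup (the usual systemd layout); the "demo" group name and the 200MB figure are just illustrative:

```python
import os

CG = "/sys/fs/cgroup/demo"
os.makedirs(CG, exist_ok=True)

# Cap the group at 200MB. If the workload inside really needs ~1GB, it won't
# use less memory overall: the cost just shifts to disk IO, as pages get
# swapped out or dropped from the page cache and read back in later.
with open(os.path.join(CG, "memory.max"), "w") as f:
    f.write(str(200 * 1024 * 1024))

# Move the current process (and its future children) into the group.
with open(os.path.join(CG, "cgroup.procs"), "w") as f:
    f.write(str(os.getpid()))
```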
Another problem is that, while we've had cgroup v2 kind of academically working like this for a while, for the past few years a major focus has been the issues it brought up elsewhere in the kernel. Because cgroup v2 actively limits things effectively, in a way we've never really had before, it has surfaced a lot of issues and limitations in the kernel that we've had to either fix or work around. There's not enough time in these 30 minutes to go over those cases, but if you're interested, I go into them in detail in my talk at SREcon from last year. We found and fixed issues all across the kernel, from the IO stack to filesystems to memory, just to make resource control work and not break down under scrutiny, and a huge amount of our time has been invested in that kind of stuff.

So how about we kick off the main part of the talk, now that I've got all the disclaimers out of the way, by discussing some of the fundamentals of Linux memory management? These are pretty important to go over, even though some of you may already know most of this, because if we don't agree about this, then the whole rest of the talk is going to be completely useless.

One thing that's really critical to understanding Linux memory management is that Linux has these different types of memory. From the CPU's perspective there's really not much difference: it knows a little bit about permissions, it knows something in the MMU, but it doesn't really know what the semantic intent of your memory is. For Linux, anonymous memory is, as the name would imply, memory which has no backing storage; it has no name. Usually that's things you allocate with malloc, or mmap with MAP_ANONYMOUS, stuff like that. Most people also know about caches and buffers, two sides of the same coin. But if you ask most Linux users what they think about buffers and caches, they will say "they're reclaimable", and I'm sure most of you would agree with that. The problem is that most people's understanding of what "reclaimable" means is a little bit off. It doesn't mean that you can reclaim it right now; it means that you might be able to reclaim it, if you ask nicely, in about five minutes, maybe. For example, if some application is hammering the shit out of some file, we're very unlikely to evict that from the cache, and the kernel knows this: it's not going to keep evicting it from the cache when doing so means your system won't be able to make any forward progress. This can cause some confusion when people inevitably ask, and they do ask a lot, "why did my system run out of memory when I have so much free memory available?" And the answer is: well, it's not actually free. These things serve a purpose, and we can't always trivially free them. We'll come back to more of these non-intuitive cases in a moment.

The fact that caches can be essential is another example of why RSS, the resident set size, a metric that we love to measure, is pretty much bullshit. We measure RSS because it's really easy to measure, not because it represents anything useful. I don't know what that's about, but as an industry, we've got to stop this shit.
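If you want to see what RSS-style accounting misses, cgroup v2's memory.stat breaks a group's usage down by type. A minimal sketch, assuming a systemd-style cgroup layout; "myapp.service" is a hypothetical unit name:

```python
def read_memory_stat(cgroup):
    # memory.stat is one "key value" pair per line.
    stats = {}
    with open(cgroup + "/memory.stat") as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats

stats = read_memory_stat("/sys/fs/cgroup/system.slice/myapp.service")
# "anon" is roughly what RSS-watchers look at; "file" is the page cache the
# workload depends on, which RSS-based accounting largely ignores.
print("anonymous: %d MB" % (stats["anon"] // 2**20))
print("page cache: %d MB" % (stats["file"] // 2**20))
```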
RSS skews a lot of attention towards very specific types of memory: it skews towards file-mapped memory and anonymous memory. We forget, though, that many workloads simply cannot operate without their disk caches, their page caches; if you take those away, they wouldn't be able to run. In one case inside Facebook, a team which for years had thought that their operating footprint was 100 to 200 megabytes discovered, through the metrics I'm introducing in this talk, that their actual footprint was on the order of more like two gigabytes. This is one of the reasons why RSS as a metric just doesn't make a whole lot of sense.

This brings us somewhat neatly on to swap, a completely uncontroversial topic on which no one has any strong opinions. Swap is really widely misunderstood. A lot of people think that swap is pretty much irrelevant nowadays, with gigabytes of memory. But this is a strange opinion, because swap still does a lot of good stuff: most of the things it does well, you really can't get any other way, and the stuff it does poorly, and it does do some stuff poorly, you can mitigate; there are ways to get around that. Usually these discussions hinge on misunderstandings about what swap is actually for. Using swap is almost all about promoting positive memory pressure, about using as much memory as possible; it has almost nothing at all to do with being a slower RAM, or "emergency memory", which is what people tend to codify it as.

People also have this really, really strange idea that if you run without swap, then when you encounter memory contention these pathologies don't happen. I don't know how people got this idea, but obviously that's not true, right? We have to evict something, and what we're going to evict is really hot page cache pages, so you're going to end up hitting the disk for those instead. You just end up making reclaim less efficient. Even worse, we end up reclaiming things which we otherwise wouldn't want to reclaim, and you're reducing the total amount of things which we can reclaim. These misunderstandings have really hobbled swap's reputation and led to this belief that it's not useful in 2020, which is not really true.

So if swap isn't a mechanism to expand your RAM, then what is it? Well, swap allows reclaim of types of memory which would otherwise be locked in memory. That is, it provides the backing store for things like anonymous memory, which we simply don't have anywhere else to store except physical RAM. Without swap, it's really, really hard to run hot on memory. It's hard to run memory-bound workloads, because we almost immediately go from a state of the system being totally fine, running at maximum capacity, to "oops, you've gone one page over and now the system is in the ground". That's not really how anyone wants to run a system.

Another thing: if you've been compiling things, you've probably seen this `make -j` cores-plus-one stuff; it's pretty much how everyone compiles. Why do we run with cores plus one? Why don't we run with just cores? The reason is that we want to promote a little bit of positive pressure: in case one of the threads starts doing a little bit less, we want to make sure the cores are being fully utilized all the time. Swap allows us to do the same thing with memory.
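Going back to the "locked in memory" point for a second: you can get a feel for how much memory is in that state on a swapless box from /proc/meminfo. A minimal sketch; the field names are standard meminfo fields, and the interpretation at the end assumes there's no swap configured:

```python
def meminfo():
    # /proc/meminfo lines look like "AnonPages:   123456 kB".
    out = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            out[key] = int(rest.split()[0])  # values are in kB
    return out

m = meminfo()
print("SwapTotal:", m["SwapTotal"], "kB")
print("AnonPages:", m["AnonPages"], "kB")
if m["SwapTotal"] == 0:
    # Without swap, anonymous pages have no backing store: reclaim can only
    # go after the page cache, even if these anon pages are stone cold.
    print("no swap: ~%d kB of anon memory is unreclaimable" % m["AnonPages"])
```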
This comes back to what I was saying about the operating system: we want to make the most efficient use of memory possible, and we want to do this without impacting system latency too much, which is kind of impossible if you don't have swap. Without swap, these memory contention increases happen really, really suddenly, and often there's no way for your system to recover from that state. That is, when we go over the edge, we go really over the edge, and your system can often enter some pathological state as a result.

Swap does come with some trade-offs; I mentioned there are some downsides. I've written a really, really long post about this, so if you have the perception that swap is not useful, I really recommend reading it, and feel free to send me an email or whatever if you have comments.

One reason why people don't like swap is because they think the OOM killer will come to their rescue. The OOM killer handles out-of-memory situations in Linux, and it's essentially a massive, massive fucking cannon that you point in the direction of some process, and you pray that it was facing the right way; usually, it turns out it wasn't. It really doesn't do what it says on the tin. If you're running a web server, for example, well, the web server itself is usually the largest thing on the system, so it will just kill the web server. Fucking genius.

The OOM killer also has some fundamental constraints. One is that by the time you use it, you've pretty much already lost: usually, by the point the OOM killer is invoked, all of the applications on the system have already gummed up to hell. This is due to the fact that memory hotness is hidden behind the CPU's memory management unit: we actually have no fucking idea when you're out of memory. This is not a Linux thing; this is the same for basically every modern operating system, every modern CPU, and every modern memory management unit. We have no clue when we're out of memory. This might come as a shock to you, but the reality is we just have to eventually work out that we didn't get pages for a while, and that probably means we're out of memory.
Generally, as I said, this information lives behind the CPU's memory management unit and isn't propagated to us, which means we have to poll for it: we have to walk the page tables to find out whether a page is actually hot or not. Polling periodically is really, really expensive, so we can only do it when we're already doing reclaim. The problem here isn't really knowing when our memory is full, because full is the state we want to be in most of the time. What we want to know is not what's resident, but what we could take out of memory if we had to, and that's something which is hard to know ahead of time.

As you'd imagine, we also only really want to invoke the OOM killer when we really are out of memory. This means there can be a really, really long delay between you running out of memory and the OOM killer actually being invoked: we have to go through a lot of reclaim attempts, a lot of iterations, for the system to finally decide "we're out of memory and I should actually kill something". So relying on the OOM killer to act reliably, just because you don't have swap, really doesn't work, and I'll show some graphs at the end to demonstrate that. Our goal in general should be to avoid invoking the OOM killer at all, and I'll go over how we achieve that in a slide or two.

The second problem with the OOM killer is that it's not really configurable. We have these amazing proc files called oom_score and oom_score_adj, for OOM score adjust. I love anything in the kernel which is called a "score", because it means nobody knows what the fuck the numbers mean. And this is kind of true: you set it to, like, a thousand, or minus a thousand, or some value, and you pray that it's bigger or lower than some other one, but you really don't know how the OOM killer will react or what it will kill. It often results in the OOM killer completely killing the wrong thing, and only eventually, randomly, getting to the real thing that's causing the problem on the system.

I mentioned this whole OOM killer business works based on reclaim. Well, how does reclaim work? Reclaim is the process of trying to free pages. There are multiple ways reclaim can happen; two common ones are kswapd reclaim and direct reclaim. kswapd reclaim is done in a background kernel thread; it's not part of any part of the application lifecycle. Essentially, we try to proactively free memory when we go over some threshold of used memory on the system. This happens passively, when you're past, say, 97 percent or whatever of your memory, and we try to prune it back down a bit. Having this happen prevents going into the next, kind of scary, phase of reclaim, called direct reclaim. Direct reclaim requires suspending the application which is requesting memory, which is really bad for latency: there are no free pages available, so your application has to wait, suspended, while we try to go and get some. This can result in your services having actually measurable freezes, or, if you're an Android user, you tap "Like" and your whole phone freezes. That's not a good experience, right? So you want to avoid that where possible, and that's what the kswapd thread is for.
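You can watch these two kinds of reclaim on a live system via the pgscan counters in /proc/vmstat. A minimal sketch; the counter names may carry extra suffixes on some kernels, hence the prefix match, and a sustained rise in pgscan_direct is the one that means applications are being stalled:

```python
import time

def scan_counters():
    # /proc/vmstat is one "key value" pair per line.
    counts = {}
    with open("/proc/vmstat") as f:
        for line in f:
            key, value = line.split()
            if key.startswith(("pgscan_kswapd", "pgscan_direct")):
                counts[key] = int(value)
    return counts

before = scan_counters()
time.sleep(10)
after = scan_counters()
for key in sorted(after):
    # kswapd deltas are background reclaim; direct deltas are applications
    # being suspended while the kernel hunts for pages on their behalf.
    print(key, "+%d in 10s" % (after[key] - before.get(key, 0)))
```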
Reclaiming some pages may also be easier or harder than reclaiming others. When we talk about reclaimable memory, like I mentioned, some page types may be reclaimable, just not right now. For example, some cache pages may be so hot that we don't want to evict them, because we know the system's performance will tank. The same goes for anonymous pages: if you have no swap free, they have no place to go, so they're essentially locked in memory unless the application itself does something about them. Some page types might also be reclaimable right now, but we can't just reclaim them; we have to do something else first. For example, for dirty pages, we have to flush the modified data out to disk before we can reliably reclaim the page; otherwise we'd lose the modifications, and then we'd have data loss, which is not great. So sometimes it's not as easy as just reclaiming pages; it gets a little bit complicated in practice.

Given this variance and unpredictability in reclaiming, it's typically really hard to tell ahead of time that we're running out of physical memory. But if we wanted to know now, how would we go about it? If I were to ask you whether your machine was overloaded, there are various things you might say to look at, depending on your level of seniority and experience with Linux. The most basic one is free memory. But free memory is kind of a lie, because we don't really know how much of it we can take away; we don't know how much we could reclaim. The same goes for the slightly more correct, but still incorrect, MemAvailable in /proc/meminfo, because that is basically an estimation based on page type alone. If you're a bit more senior, you might talk about something like page scans. The problem with page scanning, which is how often we're iterating to try to free pages, is that you can't really distinguish pathological behavior, where the system is about to fall over, from fairly efficient use of the system: if you are efficiently using the memory on the system, you will also have a really high page scan number, so it's really hard to tell what the outcome will be. Usually, all of these metrics we come up with are just approximations of memory pressure.
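As a quick aside on the free-memory point, here's a minimal sketch of pulling those two numbers out of /proc/meminfo; just remember that MemAvailable is an estimate, not a promise:

```python
def meminfo_kb(field):
    # Scan /proc/meminfo for a single field; values are reported in kB.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)

print("MemFree:     ", meminfo_kb("MemFree"), "kB")
print("MemAvailable:", meminfo_kb("MemAvailable"), "kB  (a heuristic!)")
```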
So what is memory pressure? What do we really mean by that? Well, we've never really had a metric like this in the kernel before. We have many related metrics, like the ones I've just gone over, but even with all of these, it's really hard to tell real pressure from efficient use of the system. So PSI, pressure stall information, uses a metric which is specific to a particular resource to tell you whether that resource is becoming oversubscribed. For memory, we use the amount of time in which we were stuck doing pure memory-management work, to work out things like: if I had more memory, I could probably have done 0.21 percent more work in the last ten seconds. This is a lot easier to reason about than something like page scans, because it tells us this thing is actively becoming the bottleneck for the system. Mostly the stalls it measures are things like faulting: doing IO because we have to refault pages which we've just paged out, because the system is churning over and over again. And it's not only for memory; we also have it for CPU and for IO as well.

These metrics can also be really useful to you in developing highly reliable and highly available applications. For example, if you want to know in advance whether your system is about to run out of memory, you might want to do load shedding, blocking new requests from coming in, without having to pause the progress of your application entirely. It can be really useful to do that, and you can't do it just by looking at free memory or page scans or stuff like that.

These PSI metrics are also what powers our pre-OOM detection. We do this as part of a project which we've open sourced called oomd. oomd is a userspace OOM killer with a really fine-grained policy engine: no more of this oom_score stuff. It allows you to encode policies about what to do in certain situations. For example, we run Chef on our machines to ensure that we have an up-to-date system. Chef is important, but if my web server is tanking because we're running out of system-wide memory, I am okay if we don't run Chef for, like, five minutes. That's fine; that's not going to be a problem. The same goes for other background work which isn't in the critical path. oomd allows you to encode these kinds of rules about what to do based on these pressure metrics. For example, we can monitor a best-effort application's memory pressure, and if it starts to cause contention for others on the machine, we can dial it back or kill it completely before performance degrades elsewhere. Using these metrics, we can prevent OOMs, avoid invoking the kernel OOM killer entirely, and get much more reliable behavior when the system is at peak memory usage.
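To give a flavor of what an oomd-style policy looks like, here's a toy sketch that polls the system-wide PSI file. The file format below is the real PSI format (the same format also appears per-cgroup in memory.pressure); the 10 percent threshold and the "shed load" reaction are made up for illustration, and a real policy engine like oomd is far more sophisticated about picking victims:

```python
import time

def full_avg10(path="/proc/pressure/memory"):
    # The "full" line reports the share of the last 10s in which *all*
    # non-idle tasks were stalled on memory at once.
    with open(path) as f:
        for line in f:
            if line.startswith("full"):
                fields = dict(kv.split("=") for kv in line.split()[1:])
                return float(fields["avg10"])
    return 0.0

THRESHOLD = 10.0  # made-up threshold: sustained 10% full stall is already dire
while True:
    if full_avg10() > THRESHOLD:
        # A real policy engine would pick a victim cgroup by priority and
        # per-cgroup pressure; this print is purely illustrative.
        print("memory pressure critical: shed load / kill best-effort work")
    time.sleep(1)
```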
Sometimes those applications do you know genuinely for some genuine reason have Variable resource usage, but it's hard to set a limit for those because if you set it too high most of the time The application is essentially unprotected and if you set it too low Once the memory spikes you'll end up umkilling the application So it's really hard to work out what to do there another similar problem is that the resource users of some processes is Really heavily tied to some other process on the machine For example, if you have an application which has a dedicated cache say It's like a something which does service lookups and it has a cache of the lookups It's done recently how much memory it uses is probably a factor of what it was actually asked to do by something else on the machine So it's kind of hard for those service owners who own a service Which is so dependent on another process on the machine to reason about how much memory should my application use What people really know is how much memory does the thing which I want to run on the server use How much memory does a web server need how much memory does a database need those are things we know We don't really know how much does this service lookup demon need if I do this query and so on and so forth So for that reason we've been moving more and more towards these protections. That's this memory low and memory dot min stuff Usually these are kind of more uniform and a little bit more static The way these work is that you define a threshold say like 28g say you want to protect 28g for your main workload So say HSV on my web server Needs that much memory you set memory dot low in there and we will if HFM requests it aggressively take memory away from other things on the system until HHVM gets that amount of memory We don't prohibit other applications from using it It's not a hard reservation But if it turned out that HFM does start to need them We will aggressively start to give it back memory from other people and this allows significantly more work conservation Demons can use as much memory as they like as long as they don't intrude on the needs of the main workload Which is running on the system So this is all well and good. I've gone through a whole lot of primitives But how do we actually intend to use them to build a coherent system? Well, this is where this FB tax to ask the risk. We don't really know what we're gonna call it This is where FB tax to comes in every tax to is our overall project for resource control at Facebook Secretly two is certainly one of the things which we need to do that But it needs to come with an operating system which supports its goals, right? One concern in FB tax to is to stop background services that the main workload doesn't rely on from affecting the workload running For example, if metric collection or chef go crazy We want to make sure that we have them dial back and we can deal with that We also want to have reasonable behavior if we start to exceed the capacity of the host I mentioned we want to run as hot on memory as possible, right? 
So this is all well and good; I've gone through a whole lot of primitives, but how do we actually intend to use them to build a coherent system? Well, this is where fbtax2 comes in (that's the internal name; we don't really know what we're going to call it). fbtax2 is our overall project for resource control at Facebook. cgroup v2 is certainly one of the things we need for that, but it needs to come with an operating system which supports its goals, right? One concern in fbtax2 is to stop background services that the main workload doesn't rely on from affecting the running workload. For example, if metric collection or Chef goes crazy, we want to make sure that we can dial them back and deal with that. We also want reasonable behavior if we start to exceed the capacity of the host: I mentioned we want to run as hot on memory as possible, right? That comes with the risk of going over the edge, of going a little bit too far and taking too much memory, so we need reliable behavior when we do that. It's also really important that we keep our efforts usable and lightweight. It's no good if I can give you 10 percent more ability to load your machine, if our solution also takes 10 percent of the machine to do it. And it's also no good if I produce a technically amazing solution, but nobody wants to use it because it's garbage. I think these are the main three things we've got to think about when implementing a system like this.

So fbtax2 comprises a wide range of solutions composed into a usable system. We do need to be opinionated about the base OS and make sure it's capable of isolating resources; if we're not sure about that, we might actually end up in a worse situation than the one we started out with. We also have the early OOM killer, oomd, running on every fbtax2 machine. It monitors for threats to the workload and prevents misbehaving workloads from affecting each other or the rest of the machine, and it also prevents obviously misbehaving system applications from affecting the performance of the main thing which is running.

So let's take a look at how this actually looks at the base OS layer. We use btrfs as the root filesystem. This is needed because, as mentioned in the SREcon talk (which we didn't quite get to go over here), ext4 has some fairly insurmountable priority inversions, and the btrfs developers have been very receptive to fixing these; it's not a coincidence that a lot of them work at Facebook. We had a lot of problems getting these things fixed in ext4, so that's one of the reasons why we've gone with btrfs.

I also mentioned earlier the importance of swap. We usually disable swap for the main workload on the machine, but it really depends on what it does; if it can tolerate it, then it's reasonable to have it on as well. We do need swap to make sure that we're able to efficiently reclaim all types of memory, not only page caches.

We're also opinionated about some kernel tunables. A big one is writeback throttling, also known as WBT. A lot of you probably used desktop Linux in the early 2000s, right? Do you remember the situation where you'd plug a USB drive in, copy some stuff over, and your whole system would grind to a halt? Yeah, I think we all went through that, and that's a result of writeback. Writeback is a kind of IO which you can't really slow down, so when it happens, that IO becomes the most critical thing on the system, blocking everything else from operating. Writeback throttling is something that enables stopping those IOs from piling up before they're even issued.
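Writeback throttling is switched on per block device by giving it a completion latency target; a minimal sketch, where "sda" and the 75ms value are placeholders, and the knob requires a kernel built with WBT support:

```python
# Target ~75ms completion latency on sda: background writeback backs off
# whenever it would push foreground read latencies past this.
with open("/sys/block/sda/queue/wbt_lat_usec", "w") as f:
    f.write(str(75 * 1000))  # value is in microseconds
```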
cgroups are the bread and butter of resource control, so it also makes sense to go into our default configuration choices there. To get sensible resource control, it's important to have clearly defined roles from the top level, so that you can delegate resources effectively. For example, we have system.slice, which is a cgroup for best-effort services: things which are nice to have, but which we could dial back at any moment if the machine needs it. We also have hostcritical.slice: things the host needs to operate, and things we might need if we have to go and debug something in an emergency, even if the machine is overall unhealthy. And then we have workload.slice. workload.slice is where the main thing you're running on the machine lives, for example HHVM for a web server, and it's the thing whose running we really want to prioritize on this machine.

This is how it used to look. You might notice this memory.high and memory.max stuff back here: we were only punitively limiting work. However, this is fairly brittle. system.slice's memory could legitimately spike at any point, and we might end up slowing down or killing things in system.slice even when we didn't really need to, even when the workload didn't need that memory. It's also really absolute: it doesn't allow ballparking any configuration. You don't get to say "my thing needs about four gigabytes of memory to run"; you say "my thing needs four gigabytes of memory to run, or I die". That's not usually how anyone wants to live their life, right? You need to be able to have some wiggle room and some amount of configurability there.

So this is why we've changed towards using protections instead of these artificial limits: that's the memory.low and memory.min stuff. This is a kind of guarantee that we have this memory available for the workload. We don't prohibit others from using it, but if the workload were to need that memory, we will aggressively take it away from others. You'll also notice the addition of io.latency. I didn't really have time to go into it in this talk, but as I mentioned, you need IO control as a corollary to memory control: if you don't have it, your memory pressure will just turn into IO, and then the whole system grinds to a halt. Having these completion latency guarantees, writeback throttling, and oomd prevents badly behaved applications from overrunning the machine.
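Putting the slice layout and the protections together, here's a minimal sketch; the slice names match the ones above, but the numbers are illustrative, not Facebook's actual configuration:

```python
import os

CGROOT = "/sys/fs/cgroup"
GB = 1024**3

def set_attr(slice_name, attr, value):
    with open(os.path.join(CGROOT, slice_name, attr), "w") as f:
        f.write(value)

set_attr("workload.slice", "memory.low", str(28 * GB))     # main workload
set_attr("hostcritical.slice", "memory.min", str(2 * GB))  # sshd, oomd, debug tools
# system.slice deliberately gets no protection: under memory pressure, it's
# the first thing reclaim squeezes. An io.latency target on workload.slice
# (written as "MAJOR:MINOR target=..." into io.latency; see the cgroup v2
# docs for the units) completes the picture, since memory pressure turns
# into IO.
```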
As the final thing, I mentioned I wanted to talk about production success stories. This is a little bit of a scary graph, but I'll go over it with you. The purple line is without fbtax2, and the green line is with fbtax2. The essential point is this: with a normal OS setup, with no swap and just using the kernel OOM killer, the system stops working for literally on the order of 20 minutes. And this is not even anything weird or strange; this is just a web server. With fbtax2, we actually launch a memory leak repeatedly, and the system is not affected at all. What we're doing here is launching this memory leak inside system.slice repeatedly and trying to get it to take down the main workload, but it cannot do so, because oomd just keeps killing it over and over based on the pressure metrics. In the other case, the kernel OOM killer thinks the system is making forward progress, because it is killing something, but it's killing the wrong things repeatedly, and even using oom_score_adj this doesn't really get any better.

One thing I'm pretty excited about is that the tools and techniques in this talk are things that we've never really had in Linux or Unix before, and I'm hoping that some people in the audience here will be able to build some of the container engine stuff and some of the process management stuff which we need based on them. This is one of the first places where we've presented this work cohesively like this, so I'm really looking forward to seeing what we can build in the future based on these. I've been Chris Down, and this has been Linux memory management at scale.