So now Johannes Weiner will present Senpai, automatic memory sizing for containers, a tool to find the right cgroup memory limits. Let's give him a hand.

Thank you. My name is Johannes. I work on the Facebook kernel team, mostly on memory management and cgroups. I usually work in kernel space; Senpai is a user space tool, although it's fairly low level, and it's something I've been working on. It's a tool to automatically configure the memory limits and protections for workloads running in cgroups.

The background for this is a fairly simple premise: we have large data centers, and RAM is really expensive. We want to pack our workloads, all the stuff that's running in the Facebook fleet, as tightly as possible to maximize our resources. If we over-provision them, we're obviously wasting resources; if we under-provision them, we get stability problems or issues during peak load. So the main thing is that we want to pack as tightly as possible, and in order to do that we have to know exactly how much memory, how many resources, a workload needs before we fire it up. We have a lot of workloads, so covering that manually is hard. But even if you wanted to do it manually, it's actually really hard for people to estimate their memory requirements. We have a lot of people writing high-level applications, and if you ask them how much memory they need to run, they don't really know. Even for somebody who works on the lower part of the stack, it's quite tricky to estimate the exact memory requirements of a workload.

I'm going to show this with an example. Here's a simple kernel compile job, because I'm a kernel developer. I put it into a cgroup, not for control, just for accounting, just to track what it allocates, and I let it run. While it runs, I sample the memory.current file of the cgroup, which gives you the total memory consumption, everything that's allocated to that cgroup. After four minutes it's done, and the peak consumption in the log file shows around 800 megabytes. That includes everything: the compiler, the source tree, all of it. Now, I have a suspicion that this is not exactly the amount of memory the job actually needs. So I set a limit of 600 megabytes and let it run again, and it takes the exact same amount of time. The workload would allocate 800 megabytes, but it clearly doesn't need it. So what's going on there?
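For reference, the sampling itself is just a loop over the cgroup's memory.current file. A minimal sketch, assuming a cgroup2 hierarchy mounted at /sys/fs/cgroup and a hypothetical cgroup named kernelbuild that the compile job runs in (an illustration, not the tooling used in the talk):

```python
# Sample a cgroup2 memory.current file once a second and report the peak,
# while the job runs elsewhere. Cgroup path and name are illustrative.
import time

CGROUP = "/sys/fs/cgroup/kernelbuild"  # hypothetical cgroup for the compile job

def read_current():
    with open(f"{CGROUP}/memory.current") as f:
        return int(f.read())

peak = 0
try:
    while True:
        peak = max(peak, read_current())
        time.sleep(1)
except (FileNotFoundError, KeyboardInterrupt):
    # stop when the cgroup disappears or on Ctrl-C
    pass

print(f"peak consumption: {peak / (1 << 20):.0f} MiB")
```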
To understand that, you have to look at the memory access distribution of a workload. In the graph at the bottom, the x-axis is the unique data a workload accesses during its lifetime, and the y-axis is the access frequency. Not everything that is allocated is used at the same frequency. On the left, where the access frequency is high, the compile job has things like GCC and glibc, all the stuff that runs on every single source file, so it's pretty hot; basically every instruction is touching that memory to execute the next line. As you move to the right, you get things like make's startup, or in the case of the kernel, the configuration system: it gets parsed once when you start the make job, and once it has figured out which source files it needs to compile, that memory is never touched again. And then of course there are the source files themselves: the compiler walks through the tree, builds one C file into an object file, and never looks back.

So what happened when I set the limit to 600 megabytes is this: when the compiler moves on to the next source file, instead of allocating more memory to cache it, the kernel hits the 600 megabyte limit, has to reclaim something, and reclaims the memory holding the previous source file, which isn't being used anymore. Even with less memory, the job can basically time-share a smaller amount of memory and use it sequentially. Once you see that, the obvious question is: how far can you reduce the limit before you hit that knee and start cutting into memory that's really frequently used? So I run it again, set it to 400 megabytes this time, and it still completes in about the same amount of time. Then the question is, how far can we go? At 300 megabytes I eventually aborted the job, because it didn't look like it was going to finish and it was IO-bound the whole time. After 10 minutes I decided this was not going to finish.

So the takeaway is that it needs somewhere between 300 and 400 megabytes to complete normally, which is a lot less than the 800 megabytes we initially thought. And obviously this is a piece of data we would like to have for basically all Facebook jobs, because the question is: how much memory are we actually wasting?
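For a reproducible batch job like this compile, the manual probing described above is easy to script. A rough sketch, assuming a pre-created cgroup2 group; the talk doesn't say which limit file was used for these runs, so this uses memory.max, and the build command is only illustrative:

```python
# Manual trial-and-error: set a cgroup memory limit, run the job inside the
# cgroup, and time it. At limits that are too low, the job may thrash on IO
# for a very long time (or be OOM killed under memory.max).
import os
import subprocess
import time

CGROUP = "/sys/fs/cgroup/kernelbuild"  # hypothetical, pre-created cgroup

def run_with_limit(limit_bytes, cmd):
    with open(f"{CGROUP}/memory.max", "w") as f:
        f.write(str(limit_bytes))

    def enter_cgroup():
        # Runs in the child after fork, before exec: move it into the cgroup.
        with open(f"{CGROUP}/cgroup.procs", "w") as f:
            f.write(str(os.getpid()))

    start = time.time()
    subprocess.run(cmd, preexec_fn=enter_cgroup, check=True)
    return time.time() - start

for limit_mb in (800, 600, 400, 300):
    elapsed = run_with_limit(limit_mb << 20, ["make", "-j4", "-C", "linux"])
    print(f"{limit_mb} MiB limit: finished in {elapsed:.0f}s")
```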
The tricky bit is doing something like this at scale. One problem is that a trial-and-error process like this is really tedious at scale. The other problem is that you can't really do it with a constantly changing software implementation and variable user activity. The kernel job compiles the same files every single time; I can run it as many times as I want, it's the same input over and over, and I can modify one parameter and see what happens. But for a long-running service like a web server at Facebook that is completely driven by user activity, you really can't do trial and error.

This is where Senpai comes in. The basic idea behind Senpai is that you create artificial memory pressure on a workload, and then you monitor memory health as it's running to identify where you are on the graph I showed earlier. Are you over on the right, just cutting off memory that is rarely used or never reused, or are you cutting into that hot set on the left? The question then is how you identify the memory health of millions of different applications. This is based on something I talked about last year, a kernel feature called PSI, the pressure stall information metrics. The way they work is that they record the time a process that wants to run has to spend waiting for resources that are congested.

For example, if you have a cold start of an application that's never run before, you'll encounter a bunch of page faults. But those page faults would happen whether you have infinite memory or not; the data has simply never been accessed, never been cached. If, however, you wait on a page fault for something that was very recently kicked out of the cache, that's called a refault, and that would not happen if you had infinite memory. So when a task enters a page fault and we can identify that the page was only recently evicted from the cache, we record the time it takes to get that page back as a stall event. We can say this is time being spent only because there aren't enough resources. By doing this, we can profile the productivity of any given task in the system: it's spending X percent of its time waiting for resources, or it's running fast and it's fine.

The reason we originally developed this was to root-cause regressions. We have machines where many things change during the day; different parts of the software stack get updated, sometimes things run slower, and it's actually really hard to say why. There are some indications, like the page fault rate, but you're never sure of the exact root cause. PSI was developed to tell you: you're waiting for IO, you're waiting for memory. For example, if the memory access pattern changed, you're now waiting for memory, and you can tell exactly that you're waiting 10% or 20% of your total runtime. So quickly identifying where the time goes in a regression was one reason. The other was to detect problems from total overcommit and automatically remedy them. That's what oomd does, for example: when memory pressure gets too high and we're spending double-digit percentages of the entire time just waiting on memory, we say, okay, this is extreme, kill the workload. That's the high end of pressure, but at the very low end PSI is actually quite sensitive; it can record events that take only microseconds. And this is where Senpai makes use of it.

Once we have something like PSI in place, we can continuously modify the cgroup's memory allowance and monitor the PSI pressure in a feedback loop. That's how we can tell when we're approaching that knee: when we see pressure kick up, we back off immediately, and we know, okay, this is the line. The idea is to apply enough pressure for PSI and Senpai to detect, but within the tolerance of the workload, before latencies go up too far or throughput drops. So here is the same kernel job run with Senpai. You can see the time is still around four minutes; I set it fairly aggressively, so there are a couple of extra seconds, but for most batch workloads you probably wouldn't care. And recording memory.current, it takes about 335 to 340 megabytes of memory.
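The feedback loop just described can be sketched in a few lines. This is a toy version, not the actual Senpai implementation (see the GitHub repo mentioned later for that); the cgroup path, pressure target, step size, and interval are made up for illustration:

```python
# Toy Senpai-style loop: watch the "some" stall counter in memory.pressure
# and nudge memory.high down while observed stall time stays below a target,
# backing off when it doesn't.
import time

CGROUP = "/sys/fs/cgroup/workload"      # hypothetical cgroup
INTERVAL = 6                            # seconds between adjustments
PRESSURE_TARGET_US = 10_000             # tolerated stall time per interval
STEP = 16 << 20                         # adjust memory.high in 16 MiB steps

def read_some_total_us():
    # First line of memory.pressure: "some avg10=... avg60=... avg300=... total=<us>"
    with open(f"{CGROUP}/memory.pressure") as f:
        some = f.readline().split()
    return int(some[-1].split("=")[1])

def read_current():
    with open(f"{CGROUP}/memory.current") as f:
        return int(f.read())

def write_high(value):
    with open(f"{CGROUP}/memory.high", "w") as f:
        f.write(str(value))

last_total = read_some_total_us()
while True:
    time.sleep(INTERVAL)
    total = read_some_total_us()
    stalled = total - last_total
    last_total = total
    current = read_current()
    if stalled < PRESSURE_TARGET_US:
        write_high(max(current - STEP, STEP))   # probe: take some memory away
    else:
        write_high(current + STEP)              # back off: give memory back
```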
Of course, memory consumption isn't a single value. In the graph, the blue line is the memory.current log of a completely unconstrained kernel build, and you can see that at the very beginning it reads a bunch of data into the cache that it never ends up using again. The red line shows Senpai putting pressure on it, cutting away a whole lot of memory that is apparently not needed for the duration of the workload.

We also put this on some web servers at Facebook. The blue line's exact value doesn't matter all that much; it mostly indicates the average requests per second coming into those machines. The yellow line, the memory consumption of the web server software, drops from 15 gigabytes to below 10, and the requests per second are unaffected. The load balancer doesn't see the machines struggling to handle requests; it just keeps giving them the same amount of work. What's also interesting is not just the reduction in what we think the workload is using: if you look at the yellow line, on the left it's kind of noisy, and on the right, where Senpai kicks in, the memory footprint follows the load the machine is experiencing. So it's not just a reduction, it's also much more accurate.

That ties into another project we've been working on; Dan Schatzberg, who was talking about resource control yesterday, has also been working on this. We have a whole bunch of widely deployed binaries that run on every single machine at Facebook. Because they're relatively small compared to the host, their exact footprint can vary without affecting the main workload all that much, but for development reasons the owners want to know if they suddenly need more memory than before. So they were interested in using Senpai to get an exact measure of how much they're actually consuming, how much they're actually taking out of the resource pool. One binary runs periodically to collect a bunch of statistics and put them into nice graphs; it logs memory consumption, logs CPU utilization, all of that. What the owners were doing was looking at the RSS of the main process to estimate how they were doing memory-wise, whether they were regressing, using more or less. Their own estimate was that they were using about 200 megabytes. We put all of it into a cgroup, put Senpai on it, and it showed that their actual footprint was about seven times larger, something like one and a half gigabytes. The memory they were missing: they were touching files in the file system, so they were allocating page cache that an RSS-based estimate doesn't include; they were forking off collectors; they were using the network. All of that memory is tracked by cgroups, and they weren't tracking it.
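The gap they saw can be illustrated with a comparison along these lines: summing the RSS of the processes in a cgroup versus reading the cgroup's own accounting, which also covers page cache, kernel memory, and child processes. A sketch with an illustrative cgroup path:

```python
# Compare the sum of per-process RSS with the cgroup's memory.current.
CGROUP = "/sys/fs/cgroup/collector"     # hypothetical cgroup for the service

def cgroup_pids():
    with open(f"{CGROUP}/cgroup.procs") as f:
        return [int(line) for line in f if line.strip()]

def rss_bytes(pid):
    # VmRSS from /proc/<pid>/status, reported in kB; 0 if the process is gone
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) * 1024
    except FileNotFoundError:
        pass
    return 0

rss_total = sum(rss_bytes(pid) for pid in cgroup_pids())
with open(f"{CGROUP}/memory.current") as f:
    charged = int(f.read())

print(f"sum of RSS:      {rss_total / (1 << 20):.0f} MiB")
print(f"memory.current:  {charged / (1 << 20):.0f} MiB")
```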
So, the current state of Senpai: it's a proof of concept that's growing into a production piece of software. Right now there's a Python implementation, and we've been working on an oomd plugin to make it much easier to deploy. And there are a couple of plans that are more medium to long-term. One is that the window between PSI sampling and making adjustments is fairly short right now, and the idea is to be able to learn from longer-term trends. If there's a bad pressure event that indicates we're way too low on memory right now, it shouldn't be forgotten two or three sampling periods down the line; we'd like long-term trend tracking, which it doesn't do yet.

Then there's compressed RAM, so we don't have to go to secondary storage if we push a workload too low on memory. That would allow us to be more aggressive in tuning the memory limit: right now, if we tune the limit too aggressively, we have to go to disk before we detect the mistake, and once you go to disk the minimum time you're waiting is a secondary storage IO, which is pretty costly. So we have to converge fairly slowly. With compressed backing storage, we could shrink memory aggressively; if it goes wrong, it wouldn't be that costly, but it would still be detectable. Then there's a bunch of stuff we could do on the kernel side, like PSI annotations. For example, if you're causing memory pressure, that causes more paging, which takes away IO bandwidth, so unrelated IO that isn't memory related can also be slowed down. That isn't currently being tracked. Right now the way we're using it is completely fine, because we're applying pressure at a scale where the IO impact is negligible, but all of these things would allow us to move more aggressively and converge on the actual memory consumption faster.

This is the GitHub repo where the current Python implementation sits, if you want to check it out. And that's it. Questions?

Is it dependent on Tupperware, or is it looking directly at cgroups, and does it use cgroups v1 or v2? Thanks, that's a good question. I try to keep the dependencies very low, especially in the Python implementation. It works directly on the cgroup2 interface, and the reason it's cgroup2 is that there's no PSI in cgroup1. Another feature it uses in the cgroup interface is memory.high, which is a memory limit that only throttles but doesn't OOM kill, because we would never want Senpai to cause kills. We want it to be as unobtrusive an observer as possible. So we only use memory.high, which exists in cgroup2. Other than that, it's just Python's standard library.

Just a quick question: basically, how tight does the loop have to be to apply memory pressure to processes that are only running for microseconds? So the default sampling period is six seconds. It reads and monitors pressure every second, but it doesn't make adjustments more often than every six seconds, because when you take memory away, it's completely dependent on the workload when it will notice. You can take something away, and it might not access that cache until a minute later; you just don't know. Right now it defaults to six seconds, which seems to work pretty well in practice. That's something that could be sped up if we had compressed RAM as backing storage, where we could move more aggressively and mistakes would be more forgiving.
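To make the memory.high point from the earlier answer concrete, here is a small sketch of the two limit files on cgroup2. The path is illustrative and this isn't part of Senpai itself:

```python
# On cgroup2, memory.high throttles and reclaims when exceeded but never
# OOM kills, while memory.max is the hard limit that can trigger the OOM
# killer. Senpai only ever touches memory.high for that reason.
CGROUP = "/sys/fs/cgroup/workload"      # hypothetical cgroup

def set_limit(filename, value):
    with open(f"{CGROUP}/{filename}", "w") as f:
        f.write(str(value))

set_limit("memory.high", 512 << 20)   # soft ceiling: reclaim + throttle above 512 MiB
set_limit("memory.max", "max")        # leave the hard (OOM-killing) limit alone
```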
Kind of a related question: can you describe the refaulting behavior? And the related part is, if we're looking at something every six seconds or whatever the period is, and the process was restarting every seven seconds, would that mean that all the... thanks, I've forgotten my question now. Refaulting.

Yes, so the refaulting mechanism: the kernel remembers when it kicks entries out of the cache, and when they come back we can detect that this was kicked out very recently and somebody is reading it back immediately. That tells us there's an event that means the cache is thrashing, and PSI can then measure how long it takes, so we can conclude that this is taking time out of the productivity of the task. Independent of the process? Yes, a refault is technically a process-independent thing; it's something the cache is experiencing. But we can detect an individual task waiting for a specific cache entry to come back, so you can have one refault and multiple tasks waiting for that same thing at the same time, each experiencing their own memory pressure. What was the second part of your question? I don't remember anymore. Okay. So that's it. Thanks a lot.