Okay, so hi, my name is Chris. As already mentioned, I work as a production engineer at Facebook London, as a member of Web Foundation. I'm going to be giving a whistle-stop tour of the new version of control groups added in Linux 4.5. Don't worry if you haven't the faintest clue what control groups are yet; I guess since we're in the containers devroom, probably some of you have at least some kind of idea. But I'll go into what they are, where you may have used them, and a comparison of the old and the new in the next few slides.

So like I said, in this talk I'm going to be giving an introduction to control groups: what they are, where you may have used them, where you might have encountered them before. If you already know something called control groups, if you've fished around previously or you've used them before, you've almost certainly been interacting with version one of control groups. It has existed in the kernel since a long, long time ago, since around 2008, and it's one of the building blocks of containers as we know them. So it's definitely got a whole bunch of good things in it. It's also got a whole bunch of problems: usability foibles, other kinds of things which are not so great. I want to go into what those are, when you might encounter them, and how we try to improve on them in cgroup v2. cgroup v1 is now mostly in maintenance mode, whereas cgroup v2 is in more active development. They share the same core; it's mostly the user-facing API which is different. I'm also going to go over what's still to be done and what's already been done. The core of cgroup v2 has been stable since kernel 4.5, but there's a whole bunch of work we still want to do; a lot of the core work has been to enable future work, to enable new features in cgroups which we're already working on.

First, a little bit about me. I've been working at Facebook for about three years. Like I said, I work in this team called Web Foundation. Technically, Web Foundation is the team responsible for the web servers at Facebook, but web servers are usually not extremely complicated: they're stateless, they serve you cat pictures, and that's about it. So we actually ended up becoming a team which deals with general reliability at Facebook. This means we get involved with production incident discussions and all sorts; we generally act as the guardians of reliability at Facebook. Like I said, I work at Facebook London; we have an office there, and we also have offices in Dublin and in Tel Aviv. We have a whole bunch of different kinds of people in Web Foundation: we have people who are experts in the Linux side of the stack, which is basically my expertise, and we also have people who are experts in our cache architecture, RPC, the web push side, and stuff like this. It's essentially a cross-functional group of experts which all comes together when shit really hits the fan.

So that kind of brings up the question: why do we as a team care about cgroups, and why does Facebook as a company care about cgroups? We have many hundreds of thousands of servers, and we have a bunch of services which run on those servers, some of them co-located on the same server, some of them spread across servers.
It's all kind of all over the place. There are a few kinds of outages at Facebook, but one very common one is failure across multiple systems, and there are a few things you need to do to mitigate that. One of them, of course, is actively mapping out your dependencies and making sure you have an understanding of them. But another big one is making sure that when two services run on a single machine, you don't end up with a situation where one service completely overwhelms the other and results in it becoming completely useless. That's one of my main concerns around cgroups, and one of the reasons why I'm super interested in them.

A typical use case for this: say, on a normal web server, we have three types of running processes; the same goes for most other kinds of web services and other kinds of servers as well. First, you have your core workload. Your core workload is the thing which, if you were to describe to somebody else what your server was actually doing, you would say: it does this thing, it serves web requests, it does load balancing. But usually you don't end up with only this on your server. You usually end up with a whole bunch of other stuff, especially if your company has been around for a while or you have some ideas about how you should architect things. You end up with a whole bunch of non-core services, a term which can probably be used interchangeably with "system services". This might be stuff which just comes with Linux, like the kernel workers; it could be various daemons; or it could be stuff related to your business needs, say metric collection, so that we can work out what's going on with our server, like whether we're managing to serve users correctly. But it's really, really bad if this metric collection then decides it's going to take up all the available memory on your server, and then you can't actually serve users. Sure, it can tell me it did that, but I don't give a fuck if it completely fucked up the whole web server, right?
The same goes for cron jobs and Chef. I care about my server being up to date, but I would rather have a system which actually acts in a reasonable manner: if Chef starts taking a bunch of memory, I would rather it got degraded service and still managed to run than have it take all of the memory, or all of the CPU, and then everything is fucked. I would much rather have that outcome.

Then we have this third class: ad hoc queries and debugging. These are typically things you don't know that you need, and they don't end up getting run on the majority of servers. These are the things you realize you need only when an incident is already happening, and we want people to be able to dynamically determine the importance of those things as the incident is going on: whether they want them to take precedence over the core workload of the machine or not.

So cgroups is a very good use case... sorry, this is a very good use case for cgroups. Like I mentioned, if you've had some interaction with cgroups, you've almost certainly been interacting with version one. Version two has been in development for about five years now; it only just became stable in the Linux kernel. But even on recent kernels, version one is typically what's mounted by default, and the reason for that, as you'd imagine with a different version number, is that the changes are backwards incompatible. I'm going to go over why we've made them backwards incompatible in a moment. The fact that we typically boot with the init system only mounting the version one hierarchy is a testament to why I'm doing this talk: I know we have a whole room full of container experts here, and this is also kind of a sell to you about why you should give a fuck about cgroup v2, and why you should take the time to care about it and invest in it in your products.

In the previous slide we talked about multiple processes fitting into each of these cgroups. A cgroup can be scoped as tightly or as flexibly as you like: it can be all of the processes related to one service, or it can be a single process; it can be however many processes you like, from zero upwards. The idea here is that we don't impose a structure on you. One of the guiding principles behind v2 has been that we want you to be able to choose how to use it: we don't want a hierarchy which imposes how you should do things, and we want the easy way to be the correct way.

So, a cgroup is a control group; they're one and the same. They're a system for resource management on Linux. Like I mentioned, a resource here means things like CPU, IO and memory, and management can mean, for example, accounting: we know how much memory some process in some particular cgroup is using. It can also mean limiting: we hit a particular threshold and we take some fairly violent action to curb that. In v2 there's also been some work towards throttling, which I'll go into as well. Generally, killing or whatever is quite violent, so we want to have some remediative actions instead of just straight-up killing stuff.

The way that cgroup v2 and v1 both work is essentially that you have this hierarchy at /sys/fs/cgroup. It's just a bunch of files and directories.
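Just to make that concrete, here's the sort of thing I mean; a minimal sketch, and exactly which entries you'll see depends on what your distro has mounted:

```python
# The whole cgroup interface is just files and directories under /sys/fs/cgroup,
# so any language that can list a directory and read a file can inspect it.
import os

for entry in sorted(os.listdir("/sys/fs/cgroup")):
    print(entry)   # on a typical v1 setup these are controllers: cpu, memory, pids, ...
```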
We don't have a system call interface. There may well be one in future, but we don't currently have one. The reason this is kind of a good thing is that it's really easy, from some user space application, even if it's not written in C or C++ and you don't have easy access to C library functions or system calls, to go and find out the state of your system when it comes to cgroups. You can just look at some file. I would hope that all of you are using languages which can open a file, make a directory, remove a directory; I would hope that whatever the new hip kids are using still supports those kinds of things.

Each resource interface is provided by a controller. You'll probably hear me use the words controller, resource and domain kind of interchangeably. The idea is that you write to files which apply to some particular controller; the controller takes the values you provided and hands them to the kernel, and the kernel makes some kind of decisions based on the values you gave it. So it's essentially the backing store for all of the stuff which you input to cgroups.

As mentioned previously, workload isolation is a really big use case for cgroups. You might have many background jobs on a machine, but you don't want them to override the main workload; that's the case we talked about first. There are also other cases. Say you have a tier which runs asynchronous jobs in the background — we have that at Facebook — and some jobs have higher priority than others. Priority is a very abstract concept, but it usually has something to do with resources: it usually means you expect something to get more of some resource, or less. It's very much up to you what exactly that priority means. You also have shared environments: say you're a VPS provider or something like that, and you don't want some particular user to be able to use their container to steal all the resources from all your other customers, who are now going to go and leave you a bad review.

So there are a lot of use cases for cgroups here, and you might ask: hey, my favourite application X already does this, why should I give a single shit about cgroups? Well, the answer is, if you've been using any kind of software which does this in the last eight years, I would really hope that it does it through cgroups. It probably does it transparently, and you don't ever have to talk to cgroups directly, but the backing for all of this is cgroups; that's how they do resource limiting.

So let's go concretely over how this works in version one. In v1, /sys/fs/cgroup contains the names of all the resources which have a controller, so this might be cpu, memory, pids, that kind of stuff. Inside these resource directories there's another set of directories, which are the cgroups themselves. These cgroups exist in the context of that resource, and you put processes into those cgroups. For example, take the pids one at the bottom.
Because that one is to do with PID resources, the files in those directories are related to things like how many PIDs you can have in a cgroup. We also have the concept of each resource having its own hierarchy for resource distribution: in cgroup v1 you have a resource, and then you have a cgroup hierarchy underneath each resource. So even if cgroup 3 here were called the same as cgroup 1 — say they were both called foo — from the kernel's perspective they have absolutely no relation to each other. This is important, because if you look at how, for example, systemd typically lays out cgroups, you often end up with quite similar-looking hierarchies under different resources, and you might even be inclined to believe they have some relation to each other. From systemd's perspective I'm sure they do, but from the kernel's perspective they do not, and that results in a whole slew of subtle issues which cause problems at scale, which I'll go into in a moment.

cgroups are nested inside each other in this example. When a cgroup is nested inside another, typically what it means is that it can control some limited amount, up to the maximum of its parent. So if you have a memory cgroup, and then you have a child of that cgroup, which is also another cgroup, then you can set limits up to the maximum of its parent.

The resource that controls these cgroups — the cgroup hierarchy they're in — determines what kinds of files there are. I already mentioned that if you're in the memory hierarchy, you can only access files related to memory, like, for example, this file memory.limit_in_bytes: you can read the current limit from it, or you can write to it and set a new limit. And one PID is in exactly one cgroup per resource in cgroup v1. So p2 here is explicitly assigned in resources A and C, to cgroups 1 and 5 respectively, but because we don't assign it to anything in resource B, it actually goes to the root cgroup there. The root cgroup is a special concept in cgroups: it's essentially unmanaged territory. How exactly it's managed is up to the controller, but the idea is that you don't really get the opportunity to set any limits in the root cgroup, because it's just the starting point for the distribution of that resource across your whole machine. You do get some kind of accounting there, but in terms of limiting it's basically useless.

So here's a concrete look at how this appears in cgroup v1. You have /sys/fs/cgroup, then you have the resources, which are things like blkio, memory and pids, and then you have the cgroup names. And we have nested cgroups here: A nested inside bg for two resources, and B inside adhoc for two resources. Once again, just as reiteration, because it's really important that you get this: from the kernel's perspective, naming has no meaning. If it's in a different resource, even if it has the same name, it has no relation, and that also has some weird implications.
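To make that a bit more concrete, here's roughly what driving the v1 interface looks like from any language that can make directories and write files. This is just a sketch: the cgroup name is made up, it assumes the memory hierarchy is mounted in the usual place, and you'd need to be root.

```python
# Rough sketch of the v1 flow: one hierarchy per resource, so this cgroup only
# exists in the context of the memory controller. Names and values are examples.
import os

cg = "/sys/fs/cgroup/memory/demo"   # hypothetical cgroup, memory resource only
os.makedirs(cg, exist_ok=True)      # making the directory creates the cgroup

# Limit the cgroup to 512MiB; this file only exists because we're under memory/.
with open(os.path.join(cg, "memory.limit_in_bytes"), "w") as f:
    f.write(str(512 * 1024 * 1024))

# Move ourselves in. In v1, "tasks" takes individual thread IDs, while
# cgroup.procs moves the whole thread group.
with open(os.path.join(cg, "cgroup.procs"), "w") as f:
    f.write(str(os.getpid()))
```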
Here's how it looks in cgroup v2 by comparison. In cgroup v2 we actually don't see the resources any more. If you remember how it looked in version one, we had the resources under /sys/fs/cgroup; now we have the cgroups themselves directly under /sys/fs/cgroup. So how do these cgroups understand which resources they're supposed to apply to, if they're not inside a resource hierarchy? Well, the answer is that cgroups are kind of global now. They're essentially one global set of cgroups, and you enable resources inside the cgroups. This means we have one hierarchy to rule them all, and the idea is that you write to a special file, you tell us which particular controllers you want to enable, and we enable them for your cgroup. We don't require you to create disparate hierarchies each time; you instead create one hierarchy and enable cgroup controllers at will.

So this is how the previous example now looks in cgroup v2. Like I mentioned, we now have the cgroups directly in the hierarchy, and you write to this special file, cgroup.subtree_control, which enables those controllers in the children of that cgroup. Essentially, if you were not to enable them at some level, but they were enabled the next level up, the children would just compete freely for those resources you didn't enable.

Here's the version one hierarchy again for comparison. As you can see, we have resources first, and remember that in version one, cgroups with the same name don't have any relation to each other. In cgroup v2 we have this unified hierarchy, and you enable resources for a cgroup's children by writing "+memory +pids +cpu +io", that kind of stuff, to cgroup.subtree_control, and when you do this the files appear in those directories instantaneously. Oh, and another thing to mention: in real life you also need to enable the memory, pids and IO controllers at the top level for this to work, but for the sake of simplicity I've left that out here.

So the fundamental differences are, obviously, the unified hierarchy: resources apply to cgroups now, instead of cgroups hanging off a separate hierarchy for each resource. This is very important for some kinds of common operations in Linux, for example page cache writeback. Page cache writeback transcends one particular resource; it happens across a whole bunch of different resources, and we need to be able to consider these actions together, in unity, to be able to do reasonable limiting or take other actions.
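Here's a rough sketch of what that whole flow looks like, using made-up cgroup names, assuming you have a cgroup2 mount at /sys/fs/cgroup all to yourself (in real life your init system probably owns it) and that you're root:

```python
# Rough sketch of the v2 flow: one hierarchy, controllers enabled per cgroup.
import os

root = "/sys/fs/cgroup"

# Enable the memory and pids controllers for the children of the root cgroup.
with open(os.path.join(root, "cgroup.subtree_control"), "w") as f:
    f.write("+memory +pids")

# Create a cgroup; its memory.* and pids.* files appear immediately.
demo = os.path.join(root, "demo")
os.makedirs(demo, exist_ok=True)

# Move ourselves in; in v2 this always moves the whole thread group.
with open(os.path.join(demo, "cgroup.procs"), "w") as f:
    f.write(str(os.getpid()))
```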
We also now have granularity at the thread group ID level, not the thread ID level. This is a contentious point, but it's kind of important. In cgroup v1 you could essentially put different threads from the same process into separate cgroups. This has a whole bunch of weird implications. For example, people would put different threads from the same process into different memory cgroups. I don't know how the fuck that's supposed to work: you have basically the entire memory shared between these two fucking cgroups. I know people have done insane shit with cgroups, and the main thing is that we want to guide people towards a reasonable implementation, because it's not that these people are stupid, that's not the problem; it's just that cgroups in v1 were quite overcomplicated, so it made people do some insane shit. Limiting at the process level gets us a more reasonable approximation of what people generally want. Also, without extensive cooperation, even for the resources where it would maybe in theory make sense to treat different threads from the same process differently, it often ends up that you have to have some way to communicate which thread in your process is doing what, and there's no standardized way to do that in Linux. Sure, you can set the comm of your thread to some value and then go and look at that value somewhere, but it's not standardized and it's really, really fucking hard to reason about. So it usually doesn't act in any reasonable way in general.

There's been a focus on simplicity and clarity in v2 over ultimate flexibility. v1 was invented at kind of the dawn of containerization: people didn't know what they wanted, they just knew that they wanted something and they wanted it now. v1 was a solution to that problem, and v2 is a more developed approach to the problems we now know for sure we're having.

Another new feature in v2 is the addition of this "no internal processes" constraint. This means that a cgroup can't both contain processes and hand controllers down to child cgroups. In simpler words, these red cgroups either have to have no processes, or have to have no controllers enabled in that part of the hierarchy. This is for a few reasons. It's kind of hard to reason about how the other arrangement should act. In v1 this was allowed, and the problem is that you then have two different types of object competing against each other. Say you have processes in I, and then you have some child cgroups underneath I. Now you have processes which are in I competing against cgroups which are its children, and you have to make some kind of judgment about how to treat processes compared to cgroups. Maybe you consider each process its own cgroup; maybe you consider them together as I-prime, a separate cgroup. It's quite hard to reason about, and for most cases the better solution is just to create another cgroup. So this is another guide towards helping people create a sane hierarchy. The root cgroup is a special case: the controllers themselves have to decide how they're going to handle resources in the root.
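You can see the constraint for yourself if you carry on from the earlier sketch, where we created a "demo" cgroup, enabled controllers for it from above, and put a process into it; again, the names are made up and the exact error is from memory:

```python
# Rough sketch: a cgroup that contains processes can't also enable controllers
# for its children. "demo" already holds our process (from the earlier sketch).
import os

demo = "/sys/fs/cgroup/demo"
os.makedirs(os.path.join(demo, "inner"), exist_ok=True)   # a child cgroup

try:
    # Try to hand the memory controller down to demo's children while demo
    # itself still has processes in it...
    with open(os.path.join(demo, "cgroup.subtree_control"), "w") as f:
        f.write("+memory")
except OSError as e:
    # ...and the kernel refuses (EBUSY on the kernels I've tried), because
    # otherwise demo's processes would compete directly with its child cgroups.
    print("refused:", e)
```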
Clearly, breaking the API is kind of a big deal. The cgroup API is a pretty fucking big deal, so the fact that we wanted to create v2 instead of just improving v1 obviously needs some good reasoning behind it. v1 works okay for basic situations, but it gets kind of exponentially complicated as things get more and more complex. As I mentioned, in v1, design often followed implementation, and trying to rework kernel APIs after the fact is really, really, really hard. You can't change the fundamental nature of a kernel API which people rely on day to day in production; that's just not something you can do. Even for the stuff which was designed up front, like I mentioned, the use cases for containers and cgroups were not really that well fleshed out yet. They originally started as something just for CPU, and then it grew and grew and grew, kind of naturally. It was generally hard to gauge at the time how cgroups would be used, so this is an opportunity to redesign them and rework them into how we actually think they should be. To fix these kinds of fundamental issues you need an API break, and that's why v2 was created.

So I want to go over some of the actual practical improvements, because I've talked a lot about the theoretical side, how we've designed it, but I also want to go over why we've designed it that way and what that actually means.

Okay, pop quiz: when you write to a file in Linux, what happens? It's not a trick question; there's somewhere I'm going with this. Okay, you get a file descriptor; okay, that was possibly before the point I wanted. Okay, so does it write directly to the disk? Okay, so where does your data go? I got about five different answers, and I'm not sure what most of them were. Okay: you write a dirty page into the page cache. You have some set of dirty pages now, and your write syscall came back and said everything's great, and from your application's perspective you can now go on pretending: yes, it was written to disk, my write syscall succeeded, so it must have been written to disk. You can have this whole class of wonderful beliefs, but ultimately it's not been written to disk, right? It's still sitting in memory somewhere, and if you shut down the machine right now, shit's gonna go haywire.

So there are multiple operations here to be considered. First you have your write syscall, which goes and writes the dirty pages and then returns to you. Then later, some kernel worker, like pdflush, comes along and says: okay, now is the time; by some magical standard, like the dirty ratio, I've decided I'm going to flush these out to disk. In v1 we don't have any accounting, any tracking of this. So if you wrote dirty pages, we don't know where they came from afterwards, when we flush them to disk. That IO goes to the root cgroup, and we can't account it to your process or your cgroup, simply because it wasn't tracked where the page came from. In v2 it is tracked, and we can actually count these towards your limits, and we can also make reasonable decisions like: oh, you have IO contention, and this is what I should do based on that; or you have memory contention, and this is what I should do based on that, when you're trying to do a page cache writeback.
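If you want to see the distinction in a couple of lines, here's the smallest version of it I can think of; nothing cgroup-specific, just the point that the syscall returning doesn't mean the data is on disk:

```python
# write() coming back only means the data is sitting in the page cache as
# dirty pages; it's the kernel's later writeback that puts it on disk.
import os

fd = os.open("/tmp/scratch", os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"hello")   # returns as soon as the page cache has the data

# If the machine dies here, "hello" may never have reached the disk.
os.fsync(fd)             # force it out explicitly, rather than waiting for writeback
os.close(fd)
```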
v2 is also generally better integrated with the subsystems. Most of the actions we could take based on thresholds in version one were kind of crude, or, in the case of the memory subsystem, quite violent. You set a limit with memory.limit_in_bytes, and what happens is your application has a one-minute spike in memory usage, and then the OOM killer comes along and goes: oh, I'm gonna kill you. And that's the standard method of dealing with things. Usually processes don't particularly like being kill -9'd — I don't know, maybe there's some kind of process out there which likes that — but ultimately it's not a very good way of limiting resource usage. What you really want to do is tell it: okay, calm the fuck down and stop allocating memory. Or, in the case where you can't tell it to calm the fuck down, you want to tell the operating system: hey, that guy's going nuts, I would like you to take some action against him now, but that action doesn't have to be "slay him where he stands". I would like to think that in human society we've come past the point where the penalty for any kind of failure is instantaneous death.

So in cgroup v2 we now generally have better thresholds here. We have a new thing called memory.high. We still have memory.max, which is very analogous to memory.limit_in_bytes and which OOM-kills your process, but memory.high, instead of killing the process, starts doing throttling and reclaim on every memory allocation once you pass the threshold. So when you're over memory.high and you want to do another malloc, or grab some more memory some other way, we break into a separate path in the kernel and say: hey, I would like to dial this user back, so what I'm going to do is go to the tail of the inactive list and start reclaiming pages. If you fail to reclaim any pages, it's still kind of good, because it took you a while to scan the page cache, so we've slowed your application down, and we've done it in a way which is transparent to your application. And if you do manage to reclaim pages, then we also win, because now you've reclaimed some pages and managed to get some memory free again. This is generally a much saner way of doing things, and using this on web servers was a big win when you see these spikes in resource usage.
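In terms of the actual knobs, it's just two more files on the cgroup; a sketch with made-up paths and numbers, assuming the memory controller is enabled for that cgroup:

```python
# memory.high: past this, allocations in the cgroup get throttled and reclaimed.
# memory.max: past this, the OOM killer gets involved, like v1's limit_in_bytes.
import os

cg = "/sys/fs/cgroup/demo"

with open(os.path.join(cg, "memory.high"), "w") as f:
    f.write(str(1 * 1024 ** 3))    # 1GiB: start dialling the workload back here

with open(os.path.join(cg, "memory.max"), "w") as f:
    f.write(str(2 * 1024 ** 3))    # 2GiB: hard ceiling, the violent option
```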
So, notifications. Notifications are one of the more edge-case uses of cgroups, since it usually ends up being things like systemd which use them. But notifications are essentially a way to say: hey, something in my cgroup changed state. It could be: oh, I have no more processes in my cgroup, so all of them have finished. It could be: oh, one of my processes ran out of memory, and I'm going to take some action based on that. Ultimately, it's a way to get information about what is happening in your cgroup. systemd uses this, for example, to track which processes are running, and the state of your system and of the services you're running.

The problem is that in v1, for release notifications — which are the notifications sent when your cgroup has no more processes, which for example means "oh, we exited" — you have to designate what's called a release agent. Just like with a core dump utility, you tell it: here's the path to my executable, and when there are no more processes, go and execute this thing with these arguments. The problem is, if you're using cgroups in a way where, say, cgroups are expiring a thousand times a second, you're now also forking a thousand times a second, which is pretty bad; it's generally pretty expensive. And it doesn't make a whole lot of sense, since the rest of the cgroup API uses slightly saner methods, like using an eventfd.

So now we have inotify support everywhere. Since a cgroup looks like a filesystem, it kind of makes sense that it supports inotify. We still have eventfd support, so you can poll and find out what the answer is. The idea now is that you can have one process which monitors everything you like; you don't have to fork a new process every time an event is created. This is just a straight-up upgrade, really.
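As a sketch of what that one monitoring process can look like in v2: the cgroup path is made up, I'm using poll() on the events file here for brevity (inotify-watching the same file works too), and the exact file contents are from memory, so check the docs.

```python
# Wait for a (hypothetical) cgroup to empty out by watching its cgroup.events
# file, instead of having the kernel fork a release agent for us.
import select

events = open("/sys/fs/cgroup/demo/cgroup.events")
p = select.poll()
p.register(events.fileno(), select.POLLPRI)

while True:
    p.poll()                  # wakes up when the file's contents change
    events.seek(0)
    state = events.read()     # e.g. "populated 0\n" once no processes remain
    print(state.strip())
    if "populated 0" in state:
        break                 # the cgroup is now empty; do whatever cleanup we want
```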
Utility controllers are another thing. Most controllers in cgroups relate to some particular resource: you have memory, CPU, IO. There are other ones, however, like perf, or the freezer, which I'll also go into in a second, which are not related to a resource; you put processes into that cgroup based on some action you want to take on them as a group. The idea is basically that you want to group them together so that some user space utility can take some action based on that group.

perf is a tool for performance monitoring and tracing, and I guess quite a few people have probably heard of it. The way it works, when you say "I want to look at a certain set of cgroups", is that you give it a cgroup path, and it says: okay, here is some particular set of processes which I'm going to map into my own cgroup hierarchy. So now you have /sys/fs/cgroup/perf_event, and inside there is a completely separate cgroup hierarchy which only relates to perf. This usually doesn't make a whole lot of sense: usually what people want to do is monitor an existing cgroup hierarchy, not create a new one. So people ended up resorting to tons of hacks, like: oh, I want to copy this hierarchy over to the other one. They would run a tool to copy it from one hierarchy to another, and you end up with all these race conditions, and it was horrible; it was really, really bad. Now, having a unified hierarchy means we don't have to sync anything: you only have one hierarchy, so there's no way this could possibly go wrong, touch wood.

In v1 there's also a lot of inconsistency between controllers. This usually comes in two forms. One: you have inconsistent APIs between controllers which do exactly the same thing. Both CPU and IO are essentially weight-based, or share-based: you give out a certain amount of some resource relative to the amount some other cgroup gets. But the APIs were completely different; you had to learn two APIs to do one thing, which is really, really not ideal. So there's been a lot of focus on trying to unify the APIs, and also unify the naming, because generally it was a bit of a crapshoot in v1, and now we have the opportunity to rethink those names and standardize them a bit more. So v2 is generally more intuitive up front.

Another one is inconsistent cgroup semantics. For example, most cgroups inherit their parents' limits: when you create a child cgroup of some particular cgroup, it can usually only use up to its parent's limits. But some controllers didn't do that; some controllers did their own thing. For some controllers this whole idea of a hierarchy was an imaginary thing: they just created a new cgroup and didn't care where it was. It was all a bit of a crapshoot. So now, with one unified hierarchy, it's more difficult to fuck it up, I guess.

v1's over-flexibility also contributed to a whole bunch of API problems. For example, when memory limits were first created, they only limited a couple of types of memory, and they lived in this file memory.limit_in_bytes. Then, as more and more memory types were added, they ended up getting their own files, one by one. So you ended up with memory.limit_in_bytes, memory.kmem.limit_in_bytes, memory.kmem.tcp.limit_in_bytes, memory.memsw.limit_in_bytes. And the really bad thing about this is: yes, I have very granular control, but it's not very useful. Say I want to set a limit on the maximum amount of TCP buffers, which is set with memory.kmem.tcp.limit_in_bytes, and say I have ten gigabytes of page cache free. If I said you should only get X amount of TCP buffers, and you go one over, we will kill you — even though you had ten gigabytes of page cache free or something like that. That's not a reasonable way of operating. Most people don't care that you allocated one TCP buffer too many; they want to give some kind of idea about overall memory use, about reasonable memory use. Generally, unified limits are the only reasonable way to approach that. So yeah, again, another trade-off in favour of usability over ultimate flexibility.

If you do want to limit some very particular kind of resource, the pids controller is a very good example. In the early days of cgroups it was considered that maybe we could limit the number of PIDs by limiting certain kinds of kernel memory, but that turns out to be really, really fucking hard. The way that was fixed is that we now have a pids controller, and that specifically controls this resource. So if you do want to do a very specific kind of limiting, like some TCP buffer thing or something else, then you should do that through a new controller. That's the reasonable way to do it.
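The pids controller is also a nice illustration of the "one controller, one obvious knob" idea; a quick sketch, same made-up cgroup as before, assuming pids is enabled for it:

```python
# Cap the number of tasks in the cgroup and see how many it currently has.
cg = "/sys/fs/cgroup/demo"

with open(cg + "/pids.max", "w") as f:
    f.write("512")                                      # at most 512 tasks here

print(open(cg + "/pids.current").read().strip())        # current task count
```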
So: if you go to facebook.com right now, there is a one in ten chance you are going to hit a server running cgroup v2. We're running a cgroup v2 pool of tens of thousands of machines now, and we're investing heavily in cgroup v2 for a number of reasons. My main concern, like I mentioned at the beginning, is limiting the failure domains between services. I really care a lot about making sure we don't have cascading failures or anything like that on a machine, and also about being able to manage the resource allocation in your data centre, which at Facebook's scale is really important: if we can squeeze out that little bit more resource efficiency in the data centre, that's a really big win.

We run cgroup v2 managed with systemd. One of my teammates, Davide Cavalca, who's sitting back there, did a talk about this at systemd.conf last year; I believe it was called "Deploying systemd at scale" — correct me if I'm wrong. Yes, it was. We're also a really big contributor to the core of cgroup v2: we have two of the core maintainers working at Facebook, and we will continue to drive innovation here. This is a big thing we're working on right now.

I already mentioned that cgroup v2 is kind of new. It has been usable for a little while now, and the core is all very stable, but that doesn't mean there isn't still work to be done; a lot of this is building the building blocks for future work. The core API is stable, but there's definitely functionality still to be worked on. When thinking about cgroups, most people think of three things, I guess: IO, memory and CPU. Those are pretty much the biggest ones, and two out of three of those are merged currently. The pids controller is also merged. The CPU controller is very important too, but there have been some disagreements with the CPU subsystem folks about how to merge it: they have some disagreements about the no-internal-process constraint, and also about having process granularity instead of thread granularity, and stuff like this. There's a very juicy, drama-filled thread at that link — as all Linux kernel mailing list threads are, though it's probably still better than usual. We also may end up with some thread-based API for particular kinds of thread operations where that makes sense; that would be in the works.

Another big bet we're making right now: one thing Linux has never really had is a good metric for memory pressure. We have many related metrics, like the amount of memory free, or the amount of memory used. You can also look at stuff like certain kinds of page scans, but it's all very heuristic, and ultimately none of these metrics prove that you're actually encountering memory pressure, because they can all also happen in a bunch of perfectly normal scenarios.

So our proposed measure is to track page refaulting: essentially, we track pages which are consistently refaulted back into the active list. To explain how this whole thing works: you have the inactive set, which is pages the kernel considers are probably not being actively used by any process, and then you have the active set, which are pages considered more likely to be used by some process. It's essentially one big list, and the way it works is that when you have a page fault, the page goes to the head of the inactive list. If that page gets accessed again, it gets moved to the head of the active list, which means it's protected from reclaim.
When we do reclaim, we go from the tail of the inactive list, so we take the pages which we consider the least likely to be used. So what happens if we keep on faulting pages in, and they get pushed so far towards the end that we keep reclaiming them, and then they immediately fault back in again? That probably means we have too many pages in use at once for our system to handle: we end up pushing them off the edge so fast that we simply don't have the resources to deal with this number of pages. That's probably not a bad metric for memory pressure, and it's one which is currently being worked on as part of the cgroup v2 effort, because we do want to have metrics around memory pressure, not just memory usage, which is only tangential to the thing you really want to know.

Sorry, was there a question? I thought I saw a question.

We also now have tracking of page cache writeback. What the first point here means is that we have this kind of tracking for memory and IO, but we don't have tracking for CPU yet. Say you spend some time on CPU waiting for network packets to come in off the network: we don't know who that work is for yet, because the packets haven't been routed to anyone yet, so we can't account for it. And we also can't yet account for the CPU we spend doing a page cache writeback. That's something which is going to take quite a bit of effort, but it's something we're definitely working on at this point. As for the second point, I already mentioned the refaulting: the idea there is that we want better metrics for memory pressure, because right now we only have something tangential.

Those of you who've used v1 probably also know about the freezer. The freezer is an alternative to OOM killing, for example: you can freeze some set of processes in their current state and then go and decide, oh, I want to raise the memory limit, or I want to kill them, or I want to stop some new processes. It's essentially a way of freezing them in time and having some other process come along and decide what to do about them. In cgroup v1 this basically didn't work at all. If you used the freezer, one of the most common things you might want to do is go and get a stack trace and look at what the processes were doing when they kept on allocating memory, or whatever it was you froze them for. But it was a very common situation that if you tried to attach GDB, say, it would end up in D state, which is not really the ideal result if you want to find out the stack of some process you froze; that's generally the complete opposite of what I'd like. The reason is that the freezer implementation in v1 doesn't guarantee that we stop anywhere reasonable: we often stop with a stack which makes absolutely no sense, somewhere in the kernel. So in v2 the idea is to have a more SIGSTOP-style mechanism. SIGSTOP is very well defined, and where it stops is very well defined as well, so it's a more reasonable model to use for stopping processes. The v2 implementation of the freezer will be more along those kinds of semantics.

So, I've talked a lot here about trying to sell you cgroup v2, but I should probably actually tell you how to get it at some point during this talk, in case you're interested in trying it out yourself. Here's what you need to get started with version 2.
First, you need a kernel above 4.5. Before that we do have a developer flag which you can go and find; I won't tell you what it is, because it'll eat you if you try to use it and you don't know what you're doing, and I really wouldn't recommend using it before 4.5. 4.5 is the first point where we have a stable API. Once that's done, there are more or less two things to do: you need to turn off all of the controllers for v1, and you need to turn on and mount the filesystem for v2. Typically you want your init system to do this; for systemd you use it with this flag, and you basically put both of these on the kernel command line. But if you're crazy, or you want to try it yourself, you can also manually mount it, and cry when things break.

So, if you're interested in hearing more about cgroups, come talk to me. I'm happy to go over any of what I've been talking about, on v1 or v2, and I'll be happy to go over any questions you might have. And if you've used v1 in the past, which I guess many of you have, and you found it lacking in some areas, please do try out v2 and let us know what you think. Thanks.