Okay, so hi, my name is Chris Down. I work at Facebook London as a production engineer, and I'm going to be giving a kind of whistle-stop tour of the new version of control groups added in Linux 4.5. Don't worry if you haven't the faintest idea what a control group is yet, or you only vaguely know; we'll go over some basics in the next couple of slides.

So like I said, in this talk I'm going to be going over the next version of cgroups. I'll give a short introduction to what they're for and where you may have seen them. If you already know something about cgroups, you've almost certainly been using version one. Version one has been out since 2008, and in many ways it helped to kickstart our love of containerization and process management; it's the backbone of a lot of systems like systemd and Docker and that kind of stuff. We've been using it all over the place since then, so obviously it has a bunch of good functionality. Unfortunately, it also has a ton of caveats, issues and usability shenanigans which make it really difficult to use sensibly. cgroup v2 is our attempt to fix these and improve it, and it's under a lot of active development now; cgroup v1 is mostly in maintenance mode. So I want to go over why we needed to introduce a new major version of cgroups, and why we couldn't just do more improvements to version one. I also want to go over some of the fundamental design decisions in control group v2. Another thing is that control group v2 is also being made to enable a bunch of future improvements, so I want to go over what's ready for use in production and what is still in the pipeline. The general idea is that the core is ready, but there's still a whole bunch more goodness yet to come.

First, a little bit about me. I've been working at Facebook for about three and a half years now, in a team called Web Foundation. Technically Web Foundation, as you'd expect from the name, is responsible for the web servers at Facebook, but web servers are generally not a super complicated thing, so we also act as probably the closest thing Facebook has to an SRE team. We delve into the whole stack at Facebook, we deal with production issues, basically all kinds of issues across the stack, and we own incident resolution in production at Facebook. As you'd imagine, if we own this very large piece of Facebook in general, we have a whole bunch of different types of people to support us in that. We have system debuggers, which is mostly what I do; most of my background is in system administration and system debugging. We also have the main experts in our cache architecture and RPC task scheduling, and experts in things like Hack and HHVM, which is our JIT compiler for PHP. When we're not working on cgroups, most of my time is spent dealing with these kinds of systemic issues across Facebook and dealing with service issues as they come up.

So this brings me on to why I give a shit about cgroups. We have many, many, many hundreds of thousands of servers at Facebook, and we run a bunch of services on those servers, and I care a lot about limiting the failure domains that we have across Facebook.
The reality at Facebook is that most outages are not a single service having problems. They tend to be failures across multiple services, sometimes cascading failures, and we want to be able to restrict those; we don't want multiple services to be able to affect each other like that. There are a lot of things you need to do to get to the stage where you're comfortable saying cascading failures are somewhat mitigated. One thing is to understand and isolate your dependencies and try to minimize the dependency chain of applications. But another huge thing is being able to stop other processes on the machine from interfering with the thing that machine is actually there to do.

So if you look at a typical server: you have the core workload, on pretty much every machine, which is the main thing that you want to do on that machine. For example, on our web servers that would be HHVM, which runs our PHP code, or on our load balancers it would be something like HAProxy or Proxygen, which is our load balancer. This is the thing that you do on that machine; if you were to describe to somebody else what that machine does, you would say it was this. There are also a bunch of system processes. Most large companies, and even small companies nowadays, have this kind of tax: processes which you have to run to work inside your infrastructure. These typically help the core workload in some way; they might be a dependency for it, or they might be run to keep the system working, but they tend to be less important than the main workload. For example, Chef is really vital for an up-to-date machine, but if there's some bug in your cookbooks and it can't run successfully and is constantly thrashing the machine, you don't want web requests to stop being served because of that; you want to deal with that separately. Then you have these rarer things, like ad hoc queries and debugging. These are things you typically only know you need reactively, when you're dealing with an incident, so they can vary in importance: some of them you want to run in the background, some of them you actively want to interrupt the main workload for, and we want to give people the power to dynamically determine the importance of these things.

So this is the kind of problem that is a really good use case for cgroups. In the previous slide we talked about multiple processes fitting into each of these groups. A control group can consist of as many or as few processes as you like, and you can set limits and thresholds as tightly or as flexibly as you like for a service or application. For example, you can have all processes which relate to a particular service in one cgroup, or do whatever you like; we don't impose a structure on you. That's the idea of cgroups: the framework should be flexible to your requirements and not impose a particular structure on you.

So a cgroup is a control group; they mean the same thing, as you've probably guessed by now. They are a system for resource management on Linux, where "resource" means something that processes share: CPU, memory, IO, that kind of stuff. "Management" is a bit more complicated.
You're probably already thinking of things like the OOM killer; that is one option you have. We also have some more subtle things you can do with cgroups, like throttling. We also provide a bunch of accounting, so every single part of the cgroup hierarchy exposes metrics about what's going on, which makes it much easier to debug.

cgroups are typically not a very complicated thing to manage, because they are directories at /sys/fs/cgroup. We don't have a system call interface for the very core parts of cgroups (we do for some more esoteric parts), but this means that creation, deletion, modification, that kind of stuff, is all doable by your typical application. All you need to do is mkdir, rmdir and write; I hope whatever hip language people are using nowadays still supports those things. This makes it really, really trivial to interact with cgroups: no matter what you're using, you can just go into your shell and cat some files or list some directories, and you get information about the cgroup hierarchy.

Each resource interface is provided by what's called a controller. A controller essentially provides files which you can manipulate, and it interacts with the kernel. Say the memory controller provides a file called memory.limit_in_bytes, which allows you to set a memory limit: when you set a value inside that file, just by printing a string to this particular file, you change the way the kernel will behave. You tell the kernel, hey, I want you to do this for this particular cgroup.

As mentioned previously, workload isolation is a very large use case for cgroups: you might have one thing on your machine which you want to run, and a bunch of background services which you want to de-prioritize. The same is true for things like asynchronous jobs. At Facebook we have a large asynchronous tier which runs things that can be processed in the background in a different queue, and some jobs may have higher priority than others, or be longer running than others. What priority means here is typically very case specific: it might be that we give it more CPU, or allow it more access to the disk, or allow it more access to memory. Priority is generally expressible in terms of resources. A third use case is shared environments like VPS providers, which run containers where you don't want one particular customer to be able to override the needs of another customer and start to dip into their allocation.

So you might be thinking at this point: hey, what the fuck is this guy talking about, my favorite product already has this functionality, why do I need to care about cgroups? Well, it might be true that your favorite product has this functionality, but if it's been made in the last eight or nine years, it almost certainly uses cgroups under the hood. cgroups are the most mature interface we have in the kernel for managing resource allocations, and they are generally the way forward. I think it's pretty accepted by now that, despite cgroups being the feature which kernel developers love to hate, they are the way forward, and even if your product doesn't use them, it almost certainly should be at this point.
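Since everything is just files and directories, by the way, driving this interface from any language is trivial. Here's a minimal sketch in Python of the kind of thing I mean; the "webserver" cgroup name, the 512 MiB limit and the use of the script's own PID are purely illustrative, and it assumes the v1 memory hierarchy is mounted at the usual /sys/fs/cgroup/memory location:

    import os

    base = "/sys/fs/cgroup/memory/webserver"   # hypothetical cgroup name

    # Creating a cgroup is just mkdir; the kernel populates the control files.
    os.makedirs(base, exist_ok=True)

    # Set a limit by writing to a file the memory controller provides.
    with open(os.path.join(base, "memory.limit_in_bytes"), "w") as f:
        f.write(str(512 * 1024 * 1024))        # 512 MiB, purely illustrative

    # Move this process into the cgroup by writing its PID to the tasks file.
    with open(os.path.join(base, "tasks"), "w") as f:
        f.write(str(os.getpid()))

That really is the whole core interface: mkdir to create, write to configure, rmdir to remove.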
So let's take a look at how this works in version one, because it's very important to understand how version one works to be able to understand how version one doesn't work. Like I mentioned, if you've had some interaction with cgroups in the past, it's almost certainly been with version one. Version two has been in development for over five years now, six years I think, and it only became stable quite recently, in kernel 4.5. Even on recent kernels you'll discover version one is typically used by default. What I mean by that is that the kernel boots supporting both, but typically your init system only mounts the version one hierarchy. The kernel by default also typically only enables the controllers, like the memory controller or the CPU controller, for the version one hierarchy, and these controllers can only exist in either version one or version two at any one time, so you can run in this kind of mixed mode. In recent versions of systemd we actually mount both the version one and version two hierarchies, but we don't actually use version two for resource control; we only use it for some systemd-internal stuff. What we really need is that resource control, which is what we're working towards. The reason we still mount the version one hierarchy, and still mostly use it, is backwards compatibility. Most applications don't give a shit which cgroup hierarchy is in use; they never look at it. But applications like Docker, or systemd as well, have to support cgroup v2 if they're going to actually use the hierarchy to do things. For most applications it's completely transparent, but for those lower-level applications it tends to be quite important. So this talk is also kind of a sell on why you should care about cgroup v2 and why you should work to support and understand it, and understanding how version one works is really key to understanding the improvements that have been made in version two.

So in version one, /sys/fs/cgroup contains controller names, that is resources, as directories at the top level: resources like CPU, memory, PIDs, IO, that kind of stuff. Inside these directories are hierarchies for each resource. You can see inside here we have the PID controller, which contains a bunch of different slices ("slice" is just systemd terminology, but these are all essentially cgroups: they are directories which are cgroups), and each directory inside here contains files related to the business of controlling process IDs. So each resource here has its own resource distribution hierarchy: resource A here could be memory, resource B could be CPU. One thing to note is that even if cgroup 3 here in resource B had the same name as cgroup 1 in resource A, say they're both called foo.slice, from the kernel's perspective they have absolutely no relation to each other, even if they contain the same processes, which has some really interesting and somewhat negative implications that I'll come back to later. You might also notice that the cgroups are nested inside each other in this example; for example, cgroup 2 is a child of cgroup 1. Generally what this means is that cgroup 2 inherits the properties of cgroup 1 and can set more restrictive limits inside its own cgroup. One PID is in exactly one cgroup per resource in cgroup v1. So PID 2 here is explicitly assigned to resources A and C, but we didn't explicitly assign it in resource B, so it's in what's called the root cgroup.
The root cgroup is the base directory for that resource controller; for memory it would be /sys/fs/cgroup/memory. The root cgroup is essentially limitless. It's not very useful; it's generally for things which we've not categorized at all. You still get some kind of accounting, but that's basically it; these things are essentially unlimited.

So here's a concrete look at how this looks in cgroup v1. Like I say, I really want to reiterate this, because otherwise the rest of this talk is going to make no fucking sense. The cgroup filesystem is typically mounted at /sys/fs/cgroup, and inside you have these resources: memory, CPU, that kind of stuff. You can have a single PID in cgroup foo in one resource but cgroup bar in another; you don't have to have it in the same cgroup in different resources. And again, even though we seemingly have two cgroups here, one named adhoc and one named bg, there are actually four cgroups. From the kernel's perspective, even if they have the same name, they're completely unrelated, and this has a bunch of negative effects.

So let's take a look at how this works in cgroup v2, now that we've talked about v1. In cgroup v2 you might notice that at /sys/fs/cgroup we no longer see the names of resources. We used to see memory, CPU, IO, that kind of stuff; now we just see background.slice, workload.slice. We just see the cgroups themselves. So how does the cgroup know which resource it should apply to? The answer is: it doesn't. The way this works is almost entirely inverted. Now cgroups are not created for a particular resource; resources instead are enabled or disabled in a particular part of the cgroup hierarchy. This means we have a single hierarchy to rule them all; we don't need disparate hierarchies for every single resource, which has a bunch of positive effects I'll go into in a moment. It means you explicitly opt in to, say, having the CPU controller enabled in a particular subtree of the cgroup hierarchy, and once you've opted in, we give you files like how much CPU we should give this application compared to other applications.

So in cgroup v2 we have a similar hierarchy here, but note the differences: instead of having four cgroups like this, we now have two like this. Instead of having a cgroup per resource, we now have resources per cgroup, which allows us to opt in to the resources we care about on the fly; you don't have to build all of these things up as you go along. As you can see here, in version one we have a cgroup hierarchy per resource, that is, cgroups only exist in the unique context of a particular resource; they are not universal. And remember again that those cgroups, even though they have the same name, have no relation to each other. The way this works in cgroup v2 is that you write to this magical file called cgroup.subtree_control, and you write, say, "+memory +cpu", whichever particular resources you want to enable, and when you do this, files related to those resources appear in that cgroup's children for use.
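To make that inversion concrete, here's a rough sketch of the equivalent against a v2 hierarchy; it assumes cgroup v2 is mounted at /sys/fs/cgroup, and the workload.slice name is again just made up:

    import os

    root = "/sys/fs/cgroup"
    child = os.path.join(root, "workload.slice")   # hypothetical cgroup

    os.makedirs(child, exist_ok=True)

    # Opt this part of the tree in to the memory and IO controllers; the
    # memory.* and io.* files then appear in the children of this cgroup.
    with open(os.path.join(root, "cgroup.subtree_control"), "w") as f:
        f.write("+memory +io")

    # Move this process in by writing its PID to cgroup.procs.
    with open(os.path.join(child, "cgroup.procs"), "w") as f:
        f.write(str(os.getpid()))

The point is that you enable resources for a subtree, and only then do the resource-specific files show up in its children.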
So what are the fundamental differences we're talking about here? Obviously the big one is this unified hierarchy, where resources apply to cgroups now, instead of cgroups applying to resources. This is extremely important for some extremely common operations in Linux. A classic case is page cache writeback: an operation which transcends a single resource, because a page cache writeback is CPU, IO and memory all at the same time, and it was previously really difficult to decide what operations are sensible to perform to reduce pressure. It's also really difficult to account for these things when we have different hierarchies for each resource: in version one we can't tie one cgroup's actions in one resource to another cgroup's actions in another resource, because they're not required to contain the same processes. With the single hierarchy we now have one thing to rule them all, and we can make decisions with much better context across the system.

We also now, in version two, have granularity at the TGID level, not the TID level. The reason for that is that without extensive cooperation it generally doesn't make sense to have thread granularity for cgroup control. Generally you need a cgroup manager, a single thing in your system which does cgroup distribution across the system, and you need to expose your in-program intention somehow: you need to say "this thread does this and you should put it in this cgroup, and that thread does that and you should put it in that cgroup", and so forth. There's no real standardized way to do that in Linux. You can, for example, set the comm of your thread and have something regex match on it, but this is all kind of sideways, because the real problem is also that a lot of resources don't make any sense at the TID level. In version one there was a non-trivial number of people setting different memory cgroups for different threads of the same process, which doesn't make any fucking sense in the vast majority of cases; it is vaguely deterministic, but it generally doesn't work and doesn't do what you would expect. We do actually have, in version two, some more restricted APIs for thread control where it makes sense. Tejun, who is one of the primary authors of cgroup v2, recently introduced this thing called rgroup, which is essentially a way to do thread control for the resources where it makes sense, but these things are local to the process. So this is limited to those use cases where it makes sense, and it has to be implemented per controller.
You can't just willy-nilly enable it for a controller where it doesn't make any sense.

We also have this major focus on simplicity and clarity over ultimate flexibility. In many places in version one, design followed the implementation, because it wasn't clearly known at the time what the use cases were. Some of the flexibility in version one made implementation really, really difficult, for example this per-thread control, people putting threads of the same process in different memory cgroups, and trying to account for things that cross multiple resource domains. The idea here is that we should provide a framework that guides you towards a correct solution by default; you shouldn't have to muck around in the documentation forever just to work out how your thing is even going to basically work.

Another new feature in cgroup v2 is the "no internal process" constraint. This essentially means that a cgroup can't both contain processes and have controllers enabled for its children. To put it another way, the cgroups in red here either have to be empty of processes, or they have to have no controllers enabled at all: no memory, no IO, that kind of stuff. This is for a number of reasons. One of the primary ones is that it generally doesn't make sense for child processes to compete with their parent for resources, and doing that can be technically quite hard. Another reason is that we would have to make implicit decisions about what this means. Say I put a bunch of processes in cgroup I here, and then I also put a bunch of processes in cgroup J. Now we have to make a decision about how we're going to weigh two different types of objects against each other: one, a child cgroup, which is J, and two, a single process which is contained directly within the I cgroup. We could do all sorts of things, and one of the things we did in version one was implicit cgroup creation: if you put some set of processes in I, there would be this implicitly created I-prime cgroup which would contain those processes and take part in the contention. But it doesn't make any sense to do this implicitly, and it's usually not what anyone thought would happen, which is why we've moved to making you explicitly put things at the leaves. You might also notice that the root cgroup is not red. The reason is that the root cgroup is a special case, for general system consumption, for things which we have not categorized; how the root cgroup is handled is entirely up to the controller. The controller has to make a decision about how to prioritize the things which have not been categorized at all against the things which have been.
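As a sketch of how the constraint plays out in practice: processes live in the leaves, and controllers are distributed by the parents. The names here are made up, it assumes the memory controller is already listed in the parent's cgroup.controllers, and the exact errno in the comment is my reading of current kernels rather than something from the talk:

    import os

    parent = "/sys/fs/cgroup/workload.slice"
    leaf = os.path.join(parent, "main")        # hypothetical leaf cgroup

    os.makedirs(leaf, exist_ok=True)

    # Processes go into the leaf, so the parent has no processes of its own...
    with open(os.path.join(leaf, "cgroup.procs"), "w") as f:
        f.write(str(os.getpid()))

    # ...which is what allows the parent to enable controllers for its
    # children. If the parent still contained processes directly, this write
    # would be refused (with EBUSY, as far as I can tell).
    with open(os.path.join(parent, "cgroup.subtree_control"), "w") as f:
        f.write("+memory")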
Obviously, breaking the API is kind of a big deal; this is a very major kernel API, so you need a good reason to do it. The reasoning here is that version one worked acceptably in some basic scenarios, but it gets exponentially complicated and not very usable in complex use. As I mentioned before, in version one design often followed implementation, and the problem with that is that reworking kernel APIs after the fact is really, really hard: you generally cannot change kernel APIs after you've clearly defined them. So we kind of needed the API break, even for the stuff which was designed up front with explicit design goals. The use cases for cgroups in 2008, when they were invented, were not really that well fleshed out yet; it was hard to work out at the time how cgroups would eventually be used. That led to a bunch of over-flexibility in places where you don't want it, and it also led to a whole bunch of complexity in places which should be simple, even in the basic building blocks of cgroups. To fix these fundamental issues we kind of had to create cgroup v2, because it fundamentally changes the way we think about resource control.

I'm hoping you're still with me, because I've gone over a lot of what we've changed but not a lot of why we've changed it, and it really is important to understand what we've changed, because otherwise the next section is not going to make any sense at all. So I want to go not only into what we've done, but why we've done it: what does cgroup v2 bring us that we didn't have in version one?

So, pop quiz, it's Q&A time: when you write to a file in Linux, what happens? Don't be scared. Kyle, would you like to give an answer? He would not like to give an answer. Someone over there does: there are user space buffers in the program, then things can get buffered at the kernel level, and then usually, after the buffers trickle down, there are the block device writes. Uh-huh, absolutely correct. Did everyone get that? The basic principle is that there are a lot of layers of caching and buffering, and the main one we're looking at here is the page cache. When you write to a file in Linux, you issue a write syscall or whatever, and your write syscall may return almost immediately. That's because what you've actually done is not write a file to the disk; you've written a dirty page, or some dirty pages, into the page cache, into memory. At this point your write syscall has returned with success, so hooray, your process can continue. But of course, in the real world it's not actually done; your application can continue pretending it's done, but it's not. Eventually those dirty pages need to make their way back to the disk, back to the storage device they're supposed to go to.
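Just to illustrate how disconnected those two events are, here's a tiny sketch; the path and the 64 MiB size are made up and the numbers will vary wildly, but write() comes back as soon as the pages are dirty in memory, and it's only something like fsync() that waits for the storage device:

    import os, time

    fd = os.open("/tmp/scratch.dat", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    buf = b"x" * (64 * 1024 * 1024)     # 64 MiB of junk, purely illustrative

    t0 = time.monotonic()
    os.write(fd, buf)                   # returns once the pages are dirty in the page cache
    t1 = time.monotonic()
    os.fsync(fd)                        # blocks until the device actually has the data
    t2 = time.monotonic()
    os.close(fd)

    print(f"write() took {t1 - t0:.3f}s, fsync() took {t2 - t1:.3f}s")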
But when does that happen? Who actually writes it to disk? Well, the dirty pages here were made on behalf of your application, but the flush to disk could happen an indefinite amount of time afterwards, depending on your particular sysctls. The main point is that these two actions are disconnected: the eventual write to disk is completely disconnected from the write which you first made. In cgroup v1, these page cache writebacks went to the root cgroup, so they were essentially completely limitless. For some workloads this can be a huge amount of IO and memory; some workloads do little other than page cache writeback. We couldn't account for them, and we couldn't even tie them back to your application. Not accounting for these means not only that a bunch of IO is unavailable for accounting, but that a bunch of memory is unavailable for accounting too: we can't account for those dirty pages, and we can't tie them back to your application.

In cgroup v2 we actually track these actively and map the request back to the original cgroup, so we're able to account page cache writebacks back to your application, say "this application was responsible for the pages which are now being written to disk", and charge it, say, to your IO controller. Now we can also understand the relation of IO and memory for a writeback, which you previously couldn't do, since we had different hierarchies for every single resource. This also applies to some other kinds of things. Imagine you're receiving a lot of packets from the network: that takes a non-trivial amount of kernel CPU. In some circumstances a lot of it can be offloaded, but in general it does take some amount of CPU, and it's also difficult to account for that in cgroup v1, because we simply cannot say "this action which occurred in the past is related to your process": we had no way of tagging those packets, and the CPU that was spent on them, as eventually being related to your process. In cgroup v2 we can now do these things and perform some reasonable kind of reclaim, or whatever you want to do to your process, based on its limits; more things can be accounted towards your process's limits.

V2 is also generally better integrated with its subsystems. In version one, most of the actions we could take, for example in the memory controller, were pretty violent, pretty crude. Pretty much the only sensible action you could take against a process that violated some memory limit you'd set was to OOM kill it, which is generally not what applications like; applications generally don't respond very well to being kill -9'd. That's not usually the way they like to be treated. There was another way: we also had this thing called freezer. What freezer would do, instead of OOM killing a process, is say, OK, we're going to freeze it at this point in time. We would essentially leave it there, and some other system with some other context would come along and make a decision: do we unfreeze the process by raising the limits, or do we do things like kill the process, or get a stack trace from it and then kill it? It was totally up to you. The problem was that freezer in v1 more or less literally stopped you at whatever stack you were in: you could be in some very deep kernel stack and you would just be told "stop". And the problem is that a lot of these things are not resumable.
You can't just stop and expect things to go well after you start again. We also had a whole bunch of problems where one of the key things people wanted to do with freezer was to go and grab a stack trace and then kill the thing, but in a lot of cases these processes would just go into D state and never come out of it again, so when you tried to attach gdb to the process, gdb would also go into D state, which is not at all what you wanted. So there was all sorts of fuckery and shenanigans going on with freezer in v1, and it was just not really workable, so really the only option you had was to kill shit outright, which was not ideal.

There's a tiny note at the bottom: we did actually have a soft limit in version one. However, it doesn't work. It's very difficult to reason about how it will behave at any point; it has a bunch of heuristics around local cgroup memory pressure, global memory pressure, the phase of the sun, that kind of shit. It's basically impossible to reason about, so we can just pretend for the time being that it doesn't exist.

In cgroup v2 we instead have much clearer thresholds on these hard-limitable resources. For example, we have memory.low, memory.high and memory.max, where low and high are best effort and max is an absolute, hard threshold. On memory.high we do direct reclaim. Direct reclaim is essentially where we try to scan the page tables and find some pages to reclaim, and we do it when you allocate some more memory: say you malloc or sbrk or whatever, and you were already above the memory.high threshold, we will try to scan and reclaim pages from the working set. This works whether or not we actually manage to reclaim pages, because if we do successfully reclaim pages then good, we're back under the limit again, it doesn't matter, and it's like nothing ever happened.
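As a sketch of what those thresholds look like from user space; the cgroup path and the values are completely made up, and it assumes the memory controller is enabled for that part of the hierarchy:

    import os

    cg = "/sys/fs/cgroup/workload.slice/main"   # hypothetical cgroup

    def set_knob(name, value):
        # Each threshold is just another file provided by the memory controller.
        with open(os.path.join(cg, name), "w") as f:
            f.write(str(value))

    gib = 1024 ** 3
    set_knob("memory.low", 1 * gib)    # best effort: try not to reclaim below this
    set_knob("memory.high", 4 * gib)   # best effort: reclaim and slow down above this
    set_knob("memory.max", 5 * gib)    # hard limit: never allow usage beyond this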
If we don't manage to reclaim pages, we have to scan a whole bunch of the page tables before we allow your application to continue, so it acts as a kind of primitive slowdown for your application, one which is agnostic to the application itself. This works well in some scenarios and not very well in others, but it gives you more granular control over how you want to treat applications which don't behave as you expect. One way you can use this is to deal with temporary spikes in resource usage by slowing an application down instead of just killing it outright: for example, if your application always spikes to a certain level of resource usage at a certain point of execution, instead of killing it every time it gets there, you can slow it down for that short period and then let it continue running.

We also have a new notification API. Notifications are essentially a way to tell something when a cgroup has changed state: it could be that there are no more processes in the cgroup, which means all of the processes there have ended, or that something OOMed, or generally that some action occurred in your cgroup. systemd uses this under the hood to track processes and process state; it basically uses it to manage which services are in a particular state and to keep track of the system. In version one we do actually support this, but it can get really expensive. In version one, to know when you have no more processes left, you have to specify what's called a release agent ahead of time. This release agent is literally a binary: you give a path to cgroups, and it will exec that path every single time a cgroup ends up with no more processes in it. The problem is there are some asynchronous workloads which will, quite legitimately, create thousands and thousands of cgroups a second, and that means you have to do thousands and thousands of clones and execs a second as well, which is a non-trivial amount of resource usage just on cloning shit. We also have other events you can look at in cgroup v1, say if something OOMed in the cgroup; those are done through poll and the eventfd interface, and that generally works. But since these are files, it also makes sense to support inotify, so now we also support inotify events, which makes sense given we're treating the cgroup hierarchy as a bunch of files and directories, and is generally a more intuitive API.
We do still have the old ways of doing this, but inotify is generally a more sensible way of doing it overall, and it makes getting these notifications way less expensive than they were in v1.

Utility controllers also kind of make sense now. Utility controllers are controllers that don't manage a resource directly but, for whatever reason, want to have their own cgroup hierarchy; generally they allow a user space utility to take some kind of action based on the hierarchy. For example, in version one the perf tool has a cgroup controller called perf_event. perf is the tool which does performance tracing in Linux, and the perf_event controller has its own hierarchy to monitor and collect events for processes in it. The same goes for freezer, which also had its own hierarchy. And that ran into problems, because typically what you actually want to do is take the cgroup hierarchy from some other particular resource and mimic it in the perf_event hierarchy, or mimic it in the freezer hierarchy, so you would have to do all sorts of crazy things like copying over all the different processes. It was prone to a bunch of race conditions and generally didn't work very well. In version two this is not a thing anymore, because we have one hierarchy: perf and freezer and everything else all share the same hierarchy, and you don't have to do any copying anymore, whereas in version one this was prone to failure and esoteric bug reports on mailing lists.

In version one we also have a bunch of inconsistency between controllers. This manifests in two typical forms. One is inconsistent APIs between controllers which do almost exactly the same thing: for example, for CPU we have this "shares" API, and for block IO we have this "weight" API, and they're completely unrelated to each other even though they do basically exactly the same thing. In version two we've made an explicit effort to have APIs with similar capabilities be as similar as possible; this has been both an intentional goal, and something that having a unified hierarchy makes an obvious path to take. The second form is inconsistent semantics between different kinds of resources. Most controllers, especially the core ones, inherit their parents' limits like I mentioned: if you have a child of a cgroup, it inherits its parent's limits and can set more restrictive limits of its own. But some resources treated the cgroup hierarchy as almost a dream, something they didn't even have to think about; the net controllers were the classic example, which didn't really care about the cgroup hierarchy and just treated it as one flat thing, so people were really confused when they tried to use those controllers.
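For what it's worth, the equivalent v2 knobs do line up: once the CPU controller is available, cpu.weight and io.weight take the same kind of value (1 to 10000, default 100), instead of cpu.shares and blkio.weight having unrelated semantics. A tiny illustrative sketch, assuming both controllers are enabled for the hypothetical cgroup from earlier and that your kernel and IO scheduler actually implement io.weight:

    import os

    cg = "/sys/fs/cgroup/workload.slice/main"   # hypothetical cgroup

    # Both files use the same weight semantics rather than two unrelated APIs.
    with open(os.path.join(cg, "cpu.weight"), "w") as f:
        f.write("200")
    with open(os.path.join(cg, "io.weight"), "w") as f:
        f.write("200")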
So the unified hierarchy helps us avoid these inconsistencies between controllers, and we apply the same rules to controllers equally; they generally cannot deviate from the set of expectations that we have.

Another very severe problem is that some things in v1 were just simply impossible. For example, when memory limits were first made, we had this file called memory.limit_in_bytes, and we went "whoopee, we have a memory limit in bytes". The problem is that it eventually became clear that this covers a very limited set of memory types, and we couldn't really add other types of memory to be accounted for in memory.limit_in_bytes, because again, it's a stable kernel API and you can't really change it. So you eventually ended up with a bunch of different memory types, each in their own file: you didn't just have memory.limit_in_bytes any more, you had memory.kmem.limit_in_bytes, memory.kmem.tcp.limit_in_bytes, one for swap, one for socket buffers, a different limit for every single type of memory. This poses an incredibly bad problem. You have two choices now: either you only set memory.limit_in_bytes and accept the fact that your application is not actually bound by that limit, because it only accounts for a small number of memory types, or you set limits on every single type of memory and you cry when you allocate one TCP buffer too many, because you're going to get OOM killed for it. I don't know about you, but when I'm writing an application I don't usually think to myself, "I guess I have a very specific number of TCP buffers in mind for this application". Generally that's not how people think. So in version two this unintuitive behavior has been replaced by more unified limits: we just have things like memory.high and memory.max, and we've tried to make them encompass all the types of memory that we possibly can. This is a trade-off between flexibility and overall usability, and from practical use, and from talking to people, we know that merging these into a global memory limit generally makes the most sense for most workloads. It means you don't get those nasty surprises like "oops, I allocated one too many socket buffers and got OOM killed", and if you really need separate limits, the proper way to do that is with a new controller: you create a new controller which does that particular type of limiting. That's what we did, for example, with the PID controller, because it was originally thought you could limit the number of PIDs by limiting the amount of kernel memory in certain places, but it turns out that's really, really hard, so we now have a PID controller which does that separately.

So if you go to facebook.com now, you will touch a web server which is running cgroup v2. We're running a cgroup v2 pool in the tens of thousands of machines, easily the largest cgroup v2 pool in the world. We're investing heavily in cgroup v2 for a bunch of reasons: like I said, my main concern is limiting the failure domain of applications and getting a better handle on how system services are behaving across Facebook, and being able to manage the resource allocations in your data center more efficiently is also a big win, especially if you have a huge number of servers. We run cgroup v2 managed with systemd; my friend Davide Cavalca
over there gave a talk yesterday about that, which, if you didn't attend, is very, very sad, but you'll be able to find the video later. We're also a huge contributor to the core of cgroup v2 and to systemd's cgroup support, and we're continuing to drive innovation here; we have a lot of open issues against systemd and a lot of development being done.

So cgroup v2 has been stable for a little while now. That doesn't mean there isn't still work to be done: the core APIs are stable, but there's still a bunch of functionality we're working on. When thinking about cgroups, most people think of three things: CPU, IO and memory. The CPU controller is very important, but unfortunately it wasn't merged until 4.15. The reason for that is that the CPU controller folks had a number of reservations about some of the things we were doing; Tejun especially has been working very hard to mitigate their concerns, which is one of the things that led to this rgroup API being made. So now we have it merged in 4.15, which isn't even a stable release yet, so eventually we will get there.

As kind of a bonus, I do want to go over one thing we're using cgroup v2 for, and one thing we want to provide as part of cgroup v2. One thing we've never really had in Linux is a measure of memory pressure. We have a bunch of related metrics, like memory usage and buffer usage, and we can also look at the number of page scans, but with these metrics alone it's hard to tell the difference between extremely efficient use of a system and overuse of a system. So one proposed measure here is to track page refaulting. The way it would essentially work is: when you continually reclaim a page and then fault it back in again, we account for that, we put it in a particular counter, and then we look at whether this page got refaulted X times in, say, 100 milliseconds or a second. That's a good measure for things like: are we exceeding our limits? Are we constantly reclaiming something because we consider it not in use, and then faulting it back in again a second later? So this is one place we're exploring as a potential measure of memory pressure.

As for future work, like I said, we currently have IO and memory accounting for page cache writebacks, but we don't have CPU accounting: if you spend CPU there, we currently can't account for it, and that's something we're working on. V2 also has a bunch of improvements in what types of IO we can account for, but one thing we still can't account for is some kinds of filesystem metadata, so if you're the kind of application that stores all of its data in extended metadata, it's probably not going to end very well for you. I also just talked about this refault metric for detecting memory pressure, and another thing we're working on is a freezer for v2, which will use semantics much more similar to SIGSTOP, instead of just freezing you where you stand and possibly never letting you come out again.

So I've talked a lot to try to sell you on cgroup v2, and hopefully you're interested in trying it out yourself. With systemd,
these are the flags that you need: you essentially need to disable the cgroup v1 controllers and also tell systemd to mount the new hierarchy. You need a kernel above 4.5 to do this; before that we do have unstable support, but I wouldn't recommend using it before 4.5. Typically having your init system do this is a good idea, but if you do want to play around, you can also mount it directly by using the filesystem type cgroup2.

So if you're interested in hearing more about control groups, come talk to me; I'm happy to go over anything I've been talking about. I think I have no time for questions, but if you want to talk about this stuff, come talk to me. And if you've used version one in the past, which you almost certainly have, and you've encountered the kinds of problems I've been going over in this talk, please do come talk to me, and let's see how cgroup v2 can work for you. Thank you.