Yeah, so I'm actually going to start a little bit early, because there's a lot of material to cover here. I just want to say a couple of things first. My name is Larry Whitman; I'm one of the engineers in the kernel and memory management group, under Perry and Denise's organization. This presentation has a lot of material in it, and it's hard to streamline it so it's really easy to follow. The second thing I was going to say is that there was another fellow, who works in the performance group, who was going to come in here and help, but he had to bail out; he had a medical situation, so he couldn't come. And my final disclaimer is that I've caught the Bruno plague, so you're in here at your own risk: you might get it and die, just like I might.

Okay, so in terms of what I'm going to cover here: I'm going to talk about the evolution of RHEL from 5 through 6, 7, and 8. We have customers that run all versions. As far as tuning is concerned, I've been at Red Hat for about 17 years, and this obviously started even earlier than RHEL 5, but the tuning has evolved quite nicely; it used to be really cumbersome. Just like in the previous presentation, when Waiman talked about the CPU cgroups, some of those interfaces are pretty strange and hard to wrap your head around, and the same thing is true for tuning the whole system. The two are converging: with cgroups v2 you'll be able to manage cgroups just like you manage the whole system. It's taking time to evolve, and it's becoming more automatic, but it actually is happening.

As far as other things I'm going to talk about: tuned is our mechanism to automate some of the tuning so you don't have to turn all kinds of knobs yourself. Some of the most tunable aspects of the system are the NUMA aspects: you can get a big performance difference between doing the right thing with NUMA and doing the wrong thing, and I'm going to talk about that. I'm going to talk about latency versus throughput, a couple of tools that we use, and huge pages. Huge pages are another big performance boost if you use them right, in the right situations; if you use them wrong, you get a big loss. Then I'm going to talk about how the kernel tuning uses some of the features that Waiman talked about in the cgroups themselves, talk a little bit about disk and I/O subsystem performance (just a little, because I'm not an expert there), and then give a little sneak peek at what we have going on in RHEL 8.

This is just the evolution from RHEL 5 to 6, 7, and 8. Like I said, in the earlier versions the features were sort of elementary: you could really only pin things to NUMA nodes by hand, and there was no automation involved at all. As we migrated from 5 to 6, 7, and 8, it became more and more automatic, to the point now where you can pretty much install a system, choose what we call a tuned profile, and it's going to be real close to what you really need; you don't have to go back and adjust all kinds of knobs. Although there are cases where, if you want optimal performance, you do need to do that, and that's what I'm going to talk about here, along with some of the features you can use.
So, in terms of latency versus throughput (I didn't come up with this slide): throughput is just getting the maximum amount of work out of the system, whereas latency is a measurement of how quickly you can respond to events, and the two are often mutually exclusive when it comes to kernel tuning. Take a write-back cache first; the page cache is a write-back cache. If you adjust the parameters one way, you get a very high-throughput system: it really wails, because it minimizes doing disk I/O. But if you run off the edge, you pay the price: you can get a really high latency spike. That's not acceptable to some customers, so we have profiles that actually make trade-offs between latency and throughput.

So, tuned. This is just a set of slides I grabbed out of another deck. You can install a default tuned profile for a database server, for example. At one time there was only a handful of profiles, but we have a fairly rich profile library now, so you can take your system, find the profile that most closely matches your system and your workload, apply it, and it's going to be real close to optimal performance. Once again, like I said, if you want the system to be very high throughput, you're going to have to trade off some latency. The tuned profiles are actually hierarchical, so you start with one and you can mix in different pieces; I'll show some examples of this toward the end of the presentation.

Here's an example of throughput versus latency. I said before that the tuning parameters that control the write-back behavior of the page cache directly control latency versus throughput, and the profiles set vm.dirty_ratio differently. The throughput-oriented setting says: don't do a whole hell of a lot until you've dirtied something like 40% of the page cache. What that does is allow other mechanisms to kick in and start flushing pages back to disk in the background, so you basically never have to block anything; the system just sort of takes care of it. The problem is, if you let all that memory get dirty and then all of a sudden throw a workload at it that needs to do a lot of disk I/O, the kernel is going to stop the processes and make them write the pages out themselves, and those processes are going to incur high latency spikes. That's undesirable for a lot of workload types, high-speed trading and so forth, anything closer to a real-time system. I'll talk about these parameters in the tuned profiles and the ones you most likely need to tweak.
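Just to make tuned concrete, here's roughly what driving it looks like from a shell. This is a minimal sketch: the profile names below ship with most RHEL releases, but the exact set varies by version, so check tuned-adm list on your own box.

    # list the tuned profiles installed on this system
    tuned-adm list
    # show which profile is currently applied
    tuned-adm active
    # switch to a throughput-oriented profile (name may vary by release)
    tuned-adm profile throughput-performance
    # see what that did to the write-back knobs discussed above
    sysctl vm.dirty_ratio vm.dirty_background_ratio

A latency-oriented profile such as latency-performance pulls the same knobs in the opposite direction.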
This next slide, which I took from one of the performance guys, shows the performance difference from picking a profile that matches the workload: you get a significant performance boost by applying the correct tuned profile versus the default. In this particular case, I don't remember exactly what the benchmark was, it was some I/O operations per second number, but you can see that picking the right one makes a big difference.

The first piece I want to talk about, which is probably the most effective tuning mechanism we have, is the NUMA subsystem. Non-Uniform Memory Access means the system is built from a set of building blocks: CPUs, RAM, some local DMA access, and links between the nodes. This ties into what Waiman was talking about: within a NUMA node, if you successfully bind the CPUs and the memory to the same node, it's going to run a lot faster than if you don't, and I'll show you how we do that.

This is just an example of something that isn't properly tuned: the placement of threads, tasks, and memory is not all in unison, and what's going to happen is that it runs at a slower rate. So we have multiple mechanisms that either relocate memory or relocate threads and processes across the CPUs to make the memory references local instead of remote.

These are tools to display the CPU and memory topology. lscpu will tell you the architecture: how many CPUs there are and how they map onto each node, along with the amount of memory in each node. These are just the rudimentary tools; there are several of them to help whoever is doing the tuning. Most of the time you do this level of tuning for some high-performance benchmark; otherwise you probably let the system do most of the work for you, especially on the later systems. If you go back to RHEL 5 or RHEL 6 it doesn't work as well, so you have to do more tuning by hand, but as the kernel has evolved it has gotten a lot better.

numactl --hardware is another one: it will tell you the nodes, the CPUs on each node, the memory sizes, and some information about where the memory is. lstopo gives a pictorial view of the same thing.

Like I said before, in earlier systems you used to have to do a lot of this manually: you had to go in and bind processes to CPUs and memory locations to minimize remote memory access. As the system has evolved from RHEL 6 to 7 to 8, it has become much more efficient: it does it automatically, and it avoids the pathological behavior we saw when we first started developing this. It was hard to eliminate a bunch of pathological conditions in which the kernel was constantly moving things around, and we've gotten a lot better at that. The picture at the top just shows what happens if, on each node, you have four processes or four virtual machines running and they're scattered all over the place.
It's not aligned properly. Then you can see at the bottom, after the system moves everything and makes sure the memory and the CPUs are local, that the first QEMU is basically running only on node one. It's not making a bunch of cross-node memory references, and that gives a big performance boost.

Techniques for placement: like I said, earlier you used to have to bind things by hand, and people still do if they're after really optimal performance. There's no replacement for a sledgehammer, so to speak: if you want to make absolutely sure the system does local memory accesses, you go and bind everything. But that's a lot of manual work, and it really isn't necessary as the system has evolved. You can also do this programmatically: sched_getaffinity, sched_setaffinity, and mbind are all system calls that let you change your code to bind things together. That's what you had to do in the earlier versions to get this stuff to work right.

Then finally, the piece down at the bottom is cpuset cgroups; some of the migration tools we have actually use cgroups. There's a user-space daemon called numad. It runs around and looks at the CPUs, the memory, and where the references are taking place, and if they're cross-node rather than local, it moves them: it actually echoes the PID into another cpuset cgroup, just like Waiman described, and that migrates the task over so its memory and CPU end up in the same node.

So these are the two mechanisms. We still support numad, although it's not as necessary now, because of what's inside the kernel. numad moves things around by monitoring those cgroups and figuring out whether it should adjust. Inside the kernel, though, we have automatic NUMA balancing in the scheduler, and it does not use cgroups: it uses the internal kernel data structures to figure out whether it should move processes around. It's a lot more efficient, it doesn't rely on user code running, and it's faster. Over time it has gotten a lot more effective; early versions had some pathological cases in which it would move things around more aggressively than it should. There's a video in here you can watch to see how it works.
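As a minimal sketch of the sledgehammer approach: assuming a multi-node box and some hypothetical ./myapp, you can pin a process and its memory to one node with numactl, and check whether the kernel's automatic NUMA balancing is on. The node number is just an example.

    # show the topology: nodes, CPUs per node, memory per node
    numactl --hardware
    # run ./myapp with its CPUs and its memory both pinned to node 0
    numactl --cpunodebind=0 --membind=0 ./myapp
    # 1 = automatic NUMA balancing enabled, 0 = disabled
    cat /proc/sys/kernel/numa_balancing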
This is just a picture of what happens if you do this correctly: automatic NUMA balancing on versus off. This is elapsed time, so obviously lower is better.

In terms of the internals of the kernel itself: the kernel maintains data structures for zones and nodes. The way it works is that node zero contains all the memory from location zero up, and it contains a few special zones. We have zones for memory below 16 megabytes and below 4 gigabytes; those are for older legacy ISA devices and for devices that can only do 32-bit DMA. The normal zone is where the majority of the memory is; most systems now have many gigabytes of RAM in a given NUMA node.

I couldn't really expand this slide, but if you could expand it to look the way it should, these two pieces toward the bottom are really small; it was just difficult to get it all on one page. As far as the kernel data structures and algorithms are concerned, each NUMA node has its own memory plane, with all of its own algorithms and daemons. This is basically the paging plane and its dynamics on each NUMA node: there is a kernel thread called kswapd whose job is to balance everything within that node. So when you run on a NUMA system, you have multiple independent memory reclaims taking place, and because of that you can run vmstat, see a lot of free memory, and notice the system reclaiming memory at the same time. The reason is that on one of the NUMA nodes the free memory might be depleted while on another it's not, and a tool like vmstat shows you the total summation of memory. That's a normal situation to run into.

I just want to talk about the interaction between some of the VM tunables. These are the ones you're going to see in the tuned profiles, the ones most commonly adjusted, and the ones you're most likely to have to adjust if you really want to hone the tuning. The ones that depend on the NUMA topology are zone_reclaim_mode, swappiness, and min_free_kbytes; the ones that are system-wide and NUMA-independent are the cache pressure, the dirty background ratio, and read-ahead. These are also the ones that, as Waiman talked about, will show up in the cgroup v2 controllers, especially the background write-back ones.

Like I said before, the tuning interface is kind of arcane: you use the /proc file system and the /sys file system, and there are files in there. One of the files is /proc/sys/vm/swappiness, and it controls how aggressively the system swaps. min_free_kbytes determines how much memory the kernel keeps free. zone_reclaim_mode, which I'll talk about, is a really big hammer that the upstream kernel has turned on and off over the last several years; I think it's disabled by default now, and I'll get to that. Swappiness, like I said, controls how aggressively the system reclaims memory from the page cache versus the anonymous memory. The default is 60; it's a percentage, just an integer from zero to one hundred.
What it controls is, when the system runs low on memory, how aggressively it tries to reclaim page cache memory versus anonymous memory. If you decrease it, like we saw those tuned profiles do by taking it way down to 10, the system will more aggressively reclaim the page cache memory. On a database system you don't want the system to swap, because it would try to write the anonymous memory back to swap space, and that's an undesirable thing to do; so you'll see the tuned profiles adjust this to a lower value. If you increase it, the system will more aggressively reclaim anonymous memory to swap. There are applications where you want that to happen, but not the majority of them. And like I said, until we get to cgroup v2, these are system-wide controls.

Once again, it's not as necessary to tune the latest systems as it was the older ones. Just to explain the most common things: the memory reclaim watermarks. There are three watermarks: pages_min, pages_low, and pages_high. When the system boots, all the memory goes on the free list, and the free list very rapidly depletes: the kernel uses all the RAM to cache every file in the world in the page cache, and the system immediately ends up running in a memory deficit mode. When you're above pages_high, which is a very small number, around 1% of memory, kswapd doesn't do anything; it just sleeps. Then, as the free list gets depleted further and drops below pages_low, every page allocation wakes up kswapd, and it runs in the background with high priority, trickles pages out, and reclaims them; once again, it typically doesn't swap, because the system is tuned not to. Then if you push it all the way down to pages_min: well, there's only one kswapd per node, and you can have thousands of user processes running, so you can overwhelm it even though it's a high-priority kernel thread. When that happens, when you get down to pages_min, every allocating process effectively becomes kswapd. The kernel just says: screw it, you can't have anything until you reclaim memory and help the system. So that's the way the whole thing is designed.

The setting of min_free_kbytes that I talked about controls this: if you increase it, that increases the number of free bytes the system keeps around, and because the bottom watermark scales all the others accordingly, pages_low and pages_high move up along with pages_min. What this does, by the way, is determine how drastic the performance cliff is when the system runs out of memory. If you set it really low, the system isn't going to keep much free memory around, which is basically what a high-throughput profile does: you'd see a high-throughput profile lower this, and then if you fall off the edge, the performance cliff is really steep. Whereas if you increase pages_min, the system keeps a whole bunch of free memory around. You don't want it to be very big, though, because keeping memory free is like removing memory from the system; in other words, it won't use that memory to cache the contents of file systems.
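A minimal way to see all of this on a live box; the paths are standard, and how many zones show up depends on your hardware:

    # current swappiness (0-100, default 60)
    cat /proc/sys/vm/swappiness
    # total free memory the kernel tries to keep around
    cat /proc/sys/vm/min_free_kbytes
    # the min/low/high watermarks, per zone, per NUMA node
    grep -E 'Node|^ +(min|low|high) ' /proc/zoneinfo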
So this is a trade-off between throughput and latency, just like I talked about before. This is just to show you what happens with min_free_kbytes: I chose a two-node system that was in our lab, cat'ed the watermarks out, and then echoed in double the size, and you can see it basically doubles all the watermarks in between. This isn't a desirable thing to do unless you're really sensitive about the performance hit you take when you fall off the edge, because it effectively removes memory. We've had customers who have gone in here and said: I bought the memory, I want it free. They'll increase it to some massive value, and then the system just runs with a fraction of the memory it should have. So we basically tell people not to mess with this thing very much at all, unless you have a really hard latency-sensitive application.

zone_reclaim_mode: there are a couple of critical decisions the kernel has to make here. We talked about NUMA nodes and cross-node memory references; when you run out of memory on a node, you basically have two choices. You can either spill over onto another node, allocate the memory there, and start doing cross-node memory references, or you can force the system to reclaim memory on that node, without impacting the other ones, so that you make local memory references instead of remote ones. It's a trade-off against program start-up time: if you tune zone_reclaim_mode to reclaim on the local memory node, it's going to impact how fast the program starts up, but once it's up and running, it's doing all local memory references, so it's going to run faster.

Like I said, this is a really arcane parameter, and it's a pretty big hammer; you can switch it one way or the other, and we're still working on refining it and getting it right. It has actually been turned on and off a couple of times over the past few years in the upstream kernel: somebody will post a patch saying, I want it on, because when the system runs out of memory I want it reclaiming local memory, and that can be a real performance problem if you're really overwhelming the system. By default, the upstream kernel and RHEL 7 have it turned off now. If you want a faster-running program and you're willing to pay the price at start-up time, this is a setting that gets changed in the tuned profiles, and you can mess with it yourself.
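Flipping it is a one-liner, run as root. As a sketch: 1 enables local reclaim, and there are additional bit values that also enable write-back and swap during that reclaim, which I'm leaving out here.

    # 0 = spill over to other nodes (the current default); 1 = reclaim locally first
    cat /proc/sys/vm/zone_reclaim_mode
    # force local reclaim: pays at start-up, wins once running
    echo 1 > /proc/sys/vm/zone_reclaim_mode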
This is just another slide I found from the performance team: when zone_reclaim_mode is turned on and the system starts reclaiming memory in the local node, you can get high latency spikes, because rather than just jumping onto another node, taking the memory there, and living with the remote memory accesses, which is what you would do if you were trying to achieve a low-latency profile, it reclaims on that node first. The program runs faster once it's going, but it takes longer to fire it up. It's a trade-off.

Once again, here's the automatic NUMA balancing; you can see there's a big difference with it enabled. By the way, the tuned profiles mess around with all of this, but if you must do it yourself, read the tuning guide and really know what you're doing. Be careful messing around with this stuff, because you can really wreck the performance. There are tuned profiles that disable the automatic balancing so it doesn't move anything: wherever the program landed when it started up, it stays for the life of the program. There are applications, database applications for example, that want it that way, because their memory footprint might be larger than a single NUMA node, and the kernel can spend a lot of time chasing its tail trying to move pages of memory around to optimize the placement. So there are tuned profiles that shut this off; I just wanted to say that.

Now, in terms of latency versus throughput: one of the parameters that controls latency is nohz_full. On a default system, a thousand times a second, or whatever we set HZ to, you incur the timer interrupt, and the timer service routine usually says: no work, go back to sleep. Maybe one or two times out of a hundred it actually has work to do, so it triggers a timer expiry or the scheduler or whatever is necessary. So what we've done is, rather than waking up a thousand times a second, when we insert something on the timer queue we figure out how far in the future we have to wake up, and then we wake up only when that time has gone by, and never interrupt the CPU with a clock interrupt in between. The downside of doing that is that the granularity of the timer is coarser: if there's some external event you need to catch, more of a real-time situation, you may not catch it in time. But the downside of ticking a thousand times a second is that every time the tick handler runs, it dumps part of the cache on the floor, and then you have to fault the whole thing back in. So with the tick, throughput is lower, but latency is also lower. This is just a picture of the tickless behavior: rather than ticking a thousand times a second when it doesn't need to, the kernel sets the timer interrupt to go off in the future only when it's necessary. But once again, the scheduler's granularity is coarser, and it's not going to do everything it potentially should.
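A minimal sketch of how you would actually turn this on. nohz_full is a boot-time kernel parameter, so it takes a reboot, and the CPU list here is just an example; you keep at least one housekeeping CPU ticking.

    # make CPUs 1-7 tickless; CPU 0 keeps the housekeeping tick
    grubby --update-kernel=ALL --args="nohz_full=1-7"
    # after rebooting, verify which CPUs are running tickless
    # (the file exists when the kernel is built with NO_HZ_FULL)
    cat /sys/devices/system/cpu/nohz_full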
Next, a little bit on performance tools; there were a couple of talks on this yesterday. The perf utility allows you to zoom in and see what the system is doing. If you have a system that, say, suddenly goes to really high system time, you can run perf on it, perf top, and it will zoom right in on what's going on. It's not necessarily useful to people unless they understand the kernel's various routines. For instance, if the system is reclaiming memory or something like that, you're going to see the kswapd routines and the memory reclaim routines way at the top, in the red zone, and that would be indicative of an issue where you need to adjust some of the memory parameters. We've documented a lot of this; we have some good tuning guides covering how to adjust the parameters and how to zoom in and look at what's going on. It's not really possible to go over too much of it here.

perf has several features. If you just run perf top, it updates like top every second and tells you the routines that are spending the most time. It's like throwing a thousand darts at the system every second and reporting which subroutine caught the highest number of them, and that can be really difficult to trace through if you're trying to understand behavior over time. So perf record records in the background, and then you can play it back and figure out what was going on for the duration. It has a fairly low overhead; I guess you folks in performance would know more about the overhead than I do. We use this all the time: you get a system that's doing some weird stuff, spin locks and so forth are big hits, and we use it to find hot locks, bad caching behavior, and so on. perf record creates a file, and then perf report opens the file and prints it out for you. It collects all the same data as perf top, but then you can go back and say: let me see what happened between seconds five and twenty, and get a distribution of where the system spent its time. Very useful.
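Here's the minimal recipe, for reference. The record run below is system-wide for ten seconds; the output file name, perf.data, is perf's default.

    # live view: which kernel and user routines are hottest right now
    perf top
    # record all CPUs, with call graphs, for 10 seconds -> ./perf.data
    perf record -a -g -- sleep 10
    # browse the recorded profile after the fact
    perf report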
This is just another one of the options that I didn't spend much time on.

The next area of tuning with a big impact is page size. The Intel architecture supports 4 KB, 2 MB, and 1 GB pages, and a single TLB entry maps a single page: 4 kilobytes, 2 megabytes, or 1 gigabyte. The bigger the page size, the fewer TLB entries you use, so this really minimizes TLB misses. The downside, though, is that if you have a large, sparse virtual address space and you touch one byte, then with 2 MB pages the kernel instantiates two megabytes right there instead of 4 KB. So depending on the application you're running and the virtual memory region you're in, you might want to enable or disable huge pages, because when the kernel allocates a huge page it zeroes the whole thing out, so the start-up time is higher, and the memory footprint is typically much larger in a sparse address space because it's using larger pages.

This is just a picture of how it works: like I said, a given TLB entry maps one page, and you can enable and disable these things on the fly. Huge pages come in a few different flavors. The conventional huge pages, the ones used for System V shared memory, are reserved up front; you have to set the system up, or tune it, to use them in the first place. Transparent huge pages, on the other hand, are used for anonymous memory, and they're also being used in the page cache; in RHEL 8 we're using more of them in the page cache. As you can see here, I wrote this really stupid program that dances through memory, and if I turn off transparent huge pages, echoing never into the transparent_hugepage enabled file in /sys, the system runs slower: it took 12 seconds instead of 7 seconds.
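The toggle itself, for reference; these are the standard /sys paths. The madvise setting, which uses THP only where an application asks for it, is the usual middle ground.

    # current policy, e.g.: [always] madvise never
    cat /sys/kernel/mm/transparent_hugepage/enabled
    # disable transparent huge pages system-wide
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    # turn them back on everywhere
    echo always > /sys/kernel/mm/transparent_hugepage/enabled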
There's a big performance difference: a speed-up of about 1.7x. The catch with that speed-up is how you set the pages up.

The 2 MB standard huge pages are used for things like System V shared memory. In this example, you echo 2000 into /proc/sys/vm/nr_hugepages, and when you do that, the system by default round-robins through all the NUMA nodes, so you get an even distribution across them. Then you run the program that uses them, you can see in /proc/meminfo that it actually used them, and you get the performance gain.

There are times when you might not want the system to round-robin: rather than scattering the pages through all the NUMA nodes, you want to bind a program to one node. You don't do that through the /proc file system; you do it through the /sys file system. You can see I echoed a thousand into /proc/sys/vm/nr_hugepages and it put 500 on each of the two nodes; then I reclaimed all of them, echoed a thousand into the node-specific nr_hugepages file instead, and when I cat the huge page counts, all 1,000 of them went to node zero. Once again, this is what's used for database performance: if you want to pin a database to a NUMA node or a set of nodes, this is how you lay it out. It's what you'll see in the tuned profiles, so you can mess around with it yourself.

You can do the same thing with one-gigabyte huge pages. In RHEL 7 and 8 it's dynamic; you can actually do it on the fly, whereas in earlier versions of RHEL, if you allocated one-gigabyte huge pages, you needed to reboot the system. That's no longer true with RHEL 7 and 8. However, if you run some big application that sucks all the memory down and then expect to be able to get one-gigabyte huge pages easily, you probably won't, because the memory will be fragmented. If you're going to do something like this, it needs to be done fairly soon after boot, before you run a bunch of applications that chew the memory up.

Here I did the same thing: I showed how to use the one-gigabyte huge pages on a per-node basis versus a system-wide basis, and how you can tell the system to allocate huge pages from a certain node. Once again, the reason I'm going over this is that it's what the tuned profiles are all made of; this is what you need to understand. And this is the performance, just to give you an idea of how much of a gain there is: with dense memory you will get a real boost, because with huge TLB entries you'll probably be able to get the entire working set of a process into the TLB. If you don't, it's going to start incurring TLB misses, which can be a pathological slowdown.

The other part I want to talk about is cgroups. I'm trying to dovetail with what Waiman said and show you how cgroups are used, how the performance enhancements use cgroups, and how you use them to bind memory and CPUs together. I had a section in here on cgroup v2, and I took it out, partly because I had too many slides to begin with and partly because Waiman was covering the same thing, and more than that.
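Backing up to the huge pages for a moment before the cgroup material, here's that allocation sequence as a concrete sketch. I'm assuming a two-node x86_64 box; hugepages-2048kB is the 2 MB size there, and the 1 GB pages live under hugepages-1048576kB in the same layout.

    # spread 1000 x 2MB pages round-robin across all nodes
    echo 1000 > /proc/sys/vm/nr_hugepages
    # give them all back, then put 1000 on node 0 only
    echo 0 > /proc/sys/vm/nr_hugepages
    echo 1000 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
    # confirm where they landed, per node
    cat /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages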
I didn't want to say something that was completely wrong, because he's the expert.

So anyway, in RHEL 6 and RHEL 7 there are these sort of arcane mechanisms for setting cgroups up: you have to mount the cgroup file system, go in there and cd down, and when you're down there you make a directory, and that duplicates everything in the directory at a deeper level. I remember when I first saw that. But it has sort of become the de facto standard, and I guess it's no worse than the rest of the system. Anyway, it's complicated, it's ugly, and it requires you to really wrap your head around how the system works.

This is just a simple example, and it's old: I had a 16 GB system in my office that has finally died, and before I threw it away I created a 2 GB, 4-CPU subset of this 16 GB, 8-CPU system. First I ran numactl --hardware to get a picture of what the system looked like. I mounted the cgroups, went down in there, and created my own test directory; I cd'ed down into it and set cpuset.mems, which is the NUMA node list. What I did was say: I want this cpuset to have NUMA node zero's memory. When I ran numactl --hardware, NUMA node zero had CPUs zero, one, two, and three, so I bound those CPUs in via cpuset.cpus, set the memory limit_in_bytes to two gigabytes, and then, to get everything running in there, echoed the PID into the tasks file. Then I ran the thing, and you can see what happens: CPUs zero, one, two, and three are pegged, a hundred percent utilized. I made sure I caused the cgroup to overcommit its memory, so you'll see it swap even though there's abundant free memory in the rest of the system.

Like I said before, binding the CPUs and memory together so you avoid cross-node memory references can be done several ways, and one of them is with cgroups; that's how the numad tool works. As you can see, if I do this correctly, setting the memory and the CPUs for NUMA node zero, then numastat will tell me how many hits I'm getting, and I get a very high hit ratio. If I do the opposite, purposely setting it up so I make cross-node memory references, I get a very high miss ratio. So this is just an illustration of how this is used.

I also measured the performance. I wrote a little program that allocates memory and dances through it, timed it, and counted the number of faults and all that stuff. You can see that if we bind it correctly, it takes 1.6 seconds; if we bind it incorrectly, it takes almost two seconds to run.
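A sketch of that cgroup v1 setup: the test directory name is just what I used, the PIDs come from whatever shell you run this in, and on a RHEL 7 box systemd normally has these hierarchies mounted under /sys/fs/cgroup already. Run as root.

    # a cpuset cgroup bound to node 0's CPUs and memory
    mkdir /sys/fs/cgroup/cpuset/test
    echo 0-3 > /sys/fs/cgroup/cpuset/test/cpuset.cpus
    echo 0   > /sys/fs/cgroup/cpuset/test/cpuset.mems
    # cap the group at 2 GB with the memory controller
    mkdir /sys/fs/cgroup/memory/test
    echo 2G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    # move the current shell (and its children) into both groups
    echo $$ > /sys/fs/cgroup/cpuset/test/tasks
    echo $$ > /sys/fs/cgroup/memory/test/tasks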
So there's a significant performance enhancement in doing that.

Waiman also talked a little bit about CPU shares. The cpu.shares part of the cgroup allows a cgroup to consume a certain amount of CPU bandwidth, but it lets it go above that if nobody else is using the CPU. The default share count is 1024, so if I echo 10 in there, that cgroup is only going to get about 1% of the CPU in the system, unless nobody else is using it; if nobody else is using it, it's allowed to go over. So like I said, it allows the group to use as much as possible, but the 1% is the minimum it's guaranteed for that cgroup. And this slide, which probably should have come before the previous one, just shows how well Oracle did with these things bound correctly versus incorrectly.

The other type of CPU cgroup control is the quota, and this is more relevant to cloud computing: if somebody buys an hour of some percentage of your system in the cloud, then even if nobody else is using the system, you don't want them to go beyond what they paid for. This is done via cpu.cfs_period_us and cpu.cfs_quota_us, which are in microseconds, just like Waiman said: the period is a hundred thousand microseconds, and the quota is disabled by default, so you can have as much CPU as you want. But if I echo a thousand into the quota in that cgroup, that's a thousand divided by a hundred thousand, which is 1%, and you can see when you run this that it allows only 1% of the total system CPU to be used by the cgroup.

That's how all of these things are used by the kernel tuning and the tools and so forth. Like I said, it takes a while to wrap your head around how this stuff works. You would do something like this in a cloud computing model; AWS probably does this kind of thing all the time. If you buy 5% of a system's CPU resources for 20 minutes, it's that cgroup quota that actually restricts you and prevents you from hogging more. If you want to be able to use more, you can use the other control, the shares, but they would charge you more for that, because you can end up consuming more computing resources.
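Both knobs side by side, on cgroup v1; the group name demo is hypothetical, and the values are the 1% examples from the slides.

    mkdir /sys/fs/cgroup/cpu/demo
    # soft minimum: ~1% of the CPU under contention (the default is 1024)
    echo 10 > /sys/fs/cgroup/cpu/demo/cpu.shares
    # hard cap: 1000us of runtime per 100000us period = 1%
    echo 1000   > /sys/fs/cgroup/cpu/demo/cpu.cfs_quota_us
    echo 100000 > /sys/fs/cgroup/cpu/demo/cpu.cfs_period_us
    # put the current shell into the group
    echo $$ > /sys/fs/cgroup/cpu/demo/tasks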
The last little piece I wanted to talk about is disk I/O in these tuned profiles. There are elevator algorithms that determine the order in which I/O is sent out to disk, and this was one of the pieces Joe was going to go over. Each tuned profile selects a different elevator algorithm for the disk, based on the kind of behavior you want: high bandwidth or low latency. This is just an example of the performance gain you get when you tune the system correctly.

Basically, the way the write-back code works: the page cache uses pages of RAM to cache file system data, and when you do a file system write operation, it just copies the contents of your user buffer into the page cache and marks another page in the page cache dirty. Then, once the dirty part of the page cache gets up to a certain level, we start spewing those pages out, based on an algorithm you choose for latency versus throughput; more accurately, the tuned profile chooses it for you.

So these are the write-back parameters, and this is what we talked about with that cgroup v2 controller: there's a dirty_background_ratio and a dirty_ratio, and there's a read-ahead parameter. I don't know whether read-ahead is part of that controller yet; it probably will be in the future. Read-ahead determines how aggressively the system reads ahead when you go out to disk, and the disk operation is obviously the most expensive part.

I want to really quickly go over these two, because they're probably the most important parameters in a tuned profile if you're messing around with disk I/O, which obviously most applications do. dirty_background_ratio and dirty_ratio work the way the free-list watermarks work. When you have little or no dirty pages in the page cache, we don't do anything. Once you go up and hit dirty_background_ratio, which currently defaults to 10%, we wake up some background flushing threads; there's a pool of them, and they flush pages out. But once again, it's a pool, and you can have many more user processes than that, so you can overwhelm them. If that happens, you'll get up to dirty_ratio, which is 20%, and at that point the kernel stops the dirtying process and says: you're going to become a writer yourself. That makes it much harder to overwhelm the system. So these two parameters really do directly control latency versus throughput: if you increase them really high, the system isn't going to do anything until there's a whole bunch of dirty pages.

Dirty page cache memory gets written back several different ways. There's also the old equivalent of the UNIX update daemon: it runs in the background every five seconds and looks for pages that are 30 seconds old. It's pretty arcane at today's performance and speeds, but it's still there.
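The knobs themselves, to make that concrete; sda is just an example device, and on RHEL 7 the elevators you'll typically see offered are noop, deadline, and cfq.

    # write-back thresholds, as a percentage of memory
    sysctl vm.dirty_background_ratio vm.dirty_ratio
    # show the available elevators; the bracketed one is active
    cat /sys/block/sda/queue/scheduler
    # switch this one disk to the deadline elevator
    echo deadline > /sys/block/sda/queue/scheduler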
So if you're just using your system normally and you're not overwhelming it, you'll stay down below dirty_background_ratio, with the update mechanism trickling pages out. If you throw a bigger workload at it, you'll go above that, and it will wake up the background writing threads; if you go even further, you hit the hard limit.

Then, the RHEL 8 kernel features you'll see. PML5, the five-level page table support, is in RHEL 8. The current Intel, or I should say x86_64, architecture has four levels: a page global directory, a page upper directory, a page middle directory, and the page tables. If you do the arithmetic, that allows 2^47 bytes of user space and 2^47 bytes of kernel space, 128 terabytes of virtual space on each side. The kernel's half is broken up further, and it includes the direct mapping window; that's where the 64-terabyte limit on physical memory comes from with the four-level page table. So now we have the five-level page table. It's off by default, because unless you either have so much memory that you need to map more than 64 terabytes, or you need more than 128 terabytes of user space, turning it on is a performance impact. I don't know how much of one; we don't know that yet. But turning it on means the TLB-miss handling code now has to walk an additional level, so the page table walk has to dereference more memory locations.

The other thing is cgroup v2; I'm not going to say anything more, because Waiman already talked about that.

Memory-mode NVDIMM support is something fairly recent. It's an option that lets you use NVDIMMs as RAM, but because they're slower than DRAM, the system's DRAM is used to cache the contents of the NVDIMM memory. We're just starting to wrap our heads around the performance consequences of something like this: it's just as fast as DRAM unless you overcommit the node's DRAM, that is, unless you overcommit the cache. Once you overcommit the cache, it's a direct-mapped physical caching scheme, and there's a performance hit. I've measured some of that and seen some numbers, but I'm not prepared to talk about numbers here, because I don't know yet whether my performance measurement methodology is correct; I just wrote a simple program to understand how this hardware works and how the kernel uses it.

Huge pages in the page cache: this allows file system pages to be mapped with huge pages.

And then kernel address space randomization. We've gone through a bunch of security nightmares over the past year or two, and there are so many ways to crack into a system; I think address space randomization is sort of an admission of not knowing how to fix all of them. If you basically shuffle the deck every time you boot the kernel, an attacker won't be able to wrap their head around where certain kernel data structures or certain subroutines lie in memory, because it's going to be different every time you boot. So that's some of the features that are in RHEL 8.
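If you're curious whether your hardware could even use five-level paging, the CPU flag to look for is la57; this is just an informational check, since actually enabling five-level paging is a boot-time decision.

    # does the CPU support 57-bit virtual addresses (5-level paging)?
    grep -q la57 /proc/cpuinfo && echo "la57: 5-level paging capable"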
These results are in performance white papers, most of them from last year. And yeah, I think somebody recorded the presentation, so it will be available. So that's it; that's all I have to say. It's hard to describe some of this stuff in 15 minutes; it's just complicated and messy, but I tried to explain how the kernel tuning mechanisms are used and how they use the various features.