too much hand-tuning. So Larry's going to go into the details of things called huge pages, and how transparent huge pages are daemons that run automatically to give you, on x86, a 2 meg page and so on. tuned I'll go through briefly. CPU affinity has been around pretty much forever in Linux — being an old hardware guy, I like to be able to control the destiny of where I get scheduled. We just heard about new schedulers, but in general most users don't want to have to manually tune, so there have been lots of improvements there. NUMA pinning has gotten huge on x86 over the past 10 years as well — non-uniform memory is part of almost everything now, and probably the next-gen laptops are going to have multiple memory nodes. And then finally, how to move interrupts around.

So our team spends a lot of time on various benchmark workloads, but Red Hat doesn't publish benchmark results, which is actually good — we just want to make sure people that use Linux have the ability to do the tuning and the optimizations they want, and let the open source world compete with actual open source code as well. So we study this stuff with databases, Java engines, low-latency applications, you know, across a variety of fields, and the tuning comes in about two different flavors. Next slide, Larry — next slide again.

So tuned was a thing we invented initially in RHEL 5, and it became a product in RHEL 6 — not a product per se, just a package that basically collected a lot of sysctl parameters together to tune things for usually two categories, either latency or throughput; I'll show you what that means in a second. In RHEL 7 we actually do turn on what's called a throughput profile by default, so a default RHEL 7 system is geared toward — we think — what the average application prefers, which is throughput. Next slide.

So here's kind of the difference between latency and throughput. Latency to me is how fast you can get from point A to point B. I was able to drive on the Autobahn recently, and it was quite a thrill because we didn't have a speed limit — as fast as your car could go. Take one person, put them in a car, get there. Whereas throughput is the other side: either more lanes on the highway, or how many packages can you put into a truck, say, and get them all from point A to point B. That usually involves some sort of aggregation at the software level, and that usually hurts latency — so you sometimes can't tune for both automatically, although the developers do give it a fair shot at giving you the best of both worlds.

And the final slide: there are the latency-tuned performance profiles on the right-hand side. Network loads tend to be one of the cases that are most sensitive. If you want latency: don't aggregate all my packets together with NAPI. That's great for filling up the wire, but if you're that first packet in, you don't want to have to wait for all the other packets to arrive before it's shipped over the wire. So the latency profiles are tuned more for the small-packet case, things like NFV and other use cases. Those are the differences, and these are all user mode — you're welcome to log in and use them, and Larry's going to go into the guts of it. So I think I'll turn it back to you, Larry — we've just got one more, you want to talk about that? Oh yes: does this matter, does it really help? Let's look at some lab data.
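For reference, switching between these profiles is a one-liner; a minimal sketch, assuming the tuned package is installed (the profile names here are the standard RHEL 7 ones):

```bash
tuned-adm list                            # show the profiles shipped on this box
tuned-adm profile throughput-performance  # the RHEL 7 server default
tuned-adm profile latency-performance     # or trade throughput for low latency
tuned-adm active                          # confirm which profile is in effect
```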
So by default — oh wait a minute, does that only help the XFS file system? Well actually, things like changing the deadline elevator on storage, bumping up read-aheads, and doing some of these tuning values to get a better out-of-the-box experience — it turns out it helps all four file systems we support and ship with our OS here. GFS is a clustered file system, so it has some extra metadata overhead, but the point is that having the tuned profile set to throughput actually helped improve performance across the board.

Okay, so what I'm going to start out with is talking about NUMA systems. One of the biggest areas where we get performance improvements — or performance disasters — is understanding NUMA: understanding how it works, the memory access times and so forth, which, by the way, the next presentation gets into in a lot of detail. What a NUMA system is, is it's made up of building blocks that are interconnected. Years ago, when they invented SMP systems, they started to realize that the more CPUs you stuck on the system, the more saturated the bus became; you got to the point where if you added CPUs you got nothing out of it, and if you added even more CPUs you slowed it down. So they started creating systems that were built of smaller computers, so to speak, interconnected with high-speed buses. The benefit is you can get several CPUs all running at very high speed at the same time; the downside is that if you're doing memory access that is not local to the node, you get performance degradation, because there's a lot of data and bus traffic going across those links. So it allows much larger systems by distributing the memory and CPUs, and local memory access is obviously what we're trying to optimize for, versus remote memory access. Most systems now are NUMA, and like Shak said, within a very short number of years, if not now, all systems will be NUMA, and there's no way of getting around it. So we have to do a really good job in the kernel and in the tuned profiles to make sure we get local memory access, and over time we've been doing an increasing amount of automation to make this automatic. Unfortunately there's no one solution that fits everything, so at this point we still do a fair amount of tuning to get optimal performance.

So the first thing I want to talk about is some tools — just a couple of very simple ones. lscpu — you can see this is the same system — shows you a four-node system; it enumerates the CPUs per node and, down at the end, describes the cache sizes and all of that. Then numactl with the flag --hardware gives you a similar picture, but it also gives you a very important piece of information called the SLIT table (system locality information table), down in the lower right, which is the relative cost of memory access, both local and remote. It's normalized to 10: if you're on node zero and you're accessing the memory on node zero, you get a factor of 10; if you go off node, the memory bandwidth slows down by a factor of over two. It's basically a picture of this. And on some systems — it might not even have this X in the middle — in order to go from node zero to node three it might have to make two hops, and if it did, the node distance from node zero to three in that table might show something like 31 or 41; it can be that much of a slowdown. And then there's a utility called lstopo that shows you the same information, but in a graphical format.
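A quick sketch of those two tools — both ship with RHEL (lscpu in util-linux, numactl in the numactl package):

```bash
lscpu | grep -i numa    # node count, plus which CPUs live on each node
numactl --hardware      # adds per-node memory sizes and the node distance (SLIT) table
```

The bottom of the numactl --hardware output is the distance matrix being described: 10 on the diagonal for local access, larger numbers for each hop off-node.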
The utility numastat was enhanced by Red Hat, and it allows you to look at a system and really understand what your memory access patterns look like — what you're getting for local versus remote memory access. If you have a system that is, say, four nodes, and you're running four processes, and you just start them all up without doing any NUMA work at all, you end up with what's over on the left here. Basically what happens is, say, process 37 runs on node zero, but its memory is distributed across multiple nodes — on the order of 30 to 35 percent of its memory is local, and the remaining 65 or 70 percent is remote — so you get a performance degradation associated with that. If you do a good job, though, and the operating set fits on a node, then either manual tuning, or our utilities, or the kernel in later kernels will force the process onto the node, and you'll get totally local memory access. The numastat utility basically shows that. In this particular case it was a four-node system running four instances of KVM, and you can see the number of megabytes distributed on each node if you did absolutely no alignment at all, versus if you either manually pinned the guests to the nodes, or let the utility (I'll talk about that in a minute) or the kernel in later kernels move everything around — then it does all local memory access and no remote. Obviously, if you overload the system — more processes and more memory demand than there is memory on the system — it's not going to look as pretty as this, but it still basically works.

So, good tips for NUMA performance. A lot of NUMA systems have a BIOS flag that allows you to turn NUMA off, and we actually have some performance results that are better with NUMA off — maybe I'll talk about that a little toward the end — but basically you don't want to turn it off, because in the hardware there's no way of disabling NUMA. The hardware might be able to do something like interleave every other cache block among every other node, but there's no way of centralizing it all in one location. Turning it off just makes the kernel boot up as if it sees one node; it won't have any knowledge of the SLIT table whatsoever and won't be able to do anything with it. And know your workloads: if you're running multiple instances of KVM on a host or something like that, understand their size, understand the topology, and make sure you're aware of it and do all the right things. Did you want to say anything about this one here? This is an example of — yeah, sure. So, what Larry's alluding to — he was comparing the scheduler of RHEL 6 and RHEL 7 — we're going to talk about an auto-NUMA scheduler that's also on by default, and again it was pushed upstream a few years ago. So how can it actually help? Again, the proof is in the output of actually running multi-stream workloads — in this case SPEC floating point, which is a whole suite of different applications that all have to run simultaneously, and it pretty much saturates the cores. The blue bars show that if you did not apply NUMA tuning, you would be leaving significant performance on the table.
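A sketch of using numastat both ways — system-wide counters and the per-process view; the qemu-kvm pattern is just an example of what you might match on a virtualization host:

```bash
numastat                   # cumulative numa_hit / numa_miss counters per node since boot
numastat -c -p qemu-kvm    # compact per-node memory breakdown for processes matching a pattern
```

If the per-process view shows a guest's memory smeared evenly across all nodes, that's the "no alignment at all" picture on the left of the slide.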
In fact, more recently — you see the gain in the red bars is, what is that, about 35 percent performance on the same system, just by getting the scheduling right with the auto-NUMA scheduler — and we'll also talk, I think, about a user-space daemon called numad. So it is real.

The next thing I wanted to talk about is the way the kernel handles NUMA information. The kernel itself keeps some of its data structures on a per-NUMA-node basis, so even the kernel is very careful here: if you run a process and it starts on a given node, a lot of the page allocator, the slab cache — a lot of the memory allocation code — is NUMA-aware. If you fork a process and it starts running on a given node, the kernel will allocate the kernel stack and a lot of the slab cache data structures and so forth there, so that when you trap into kernel mode and start doing a bunch of kernel activity, even the kernel's own work is NUMA-aware. Interrupt processing and the I/O and DMA capabilities are NUMA-aware; the memory zones are NUMA-aware; and for page reclamation there's a kswapd per NUMA node, and they run independently of each other — I'll get into some of that, because it has some strange consequences if you're not aware of it or if you do the wrong thing. And there are a lot of other NUMA-aware threads on the system, so you may need to check their status at times. If you're really trying to optimize and get a good performance result, you may have to really be aware of the way the hardware is laid out and the way the kernel and user space are distributed across it.

So the first thing is the way memory is laid out. In a very simple two-node x86_64 system, node zero is going to have three zones, just like it would on a non-NUMA system — a DMA zone, a DMA32 zone, and a Normal zone — while node one is going to have only a Normal zone. What this implies is that the kswapd that runs on node zero has to deal with three zones, where the kswapd that runs on nodes one through n minus one, where n is the number of nodes, only has to deal with the Normal zone.

The paging dynamics are also NUMA-aware and work on a per-node basis. Basically, this algorithm right here — which allocates memory from the free list and doles it out either to the page cache or to anonymous memory, and does paging and swapping and dirty writeback and so forth — happens on each node. What this implies is that if you have multiple nodes and one of them becomes exhausted while the others are not, you can have swapping going on on one node while there's no swapping on another. You'll take a utility like vmstat and you'll see a system with potentially many, many gigabytes of free memory, and yet it's swapping. If you see something like that, it's indicative of one node being exhausted while the others are not, and there are ways of working around it and dealing with it.
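Two quick ways to see this per-node structure on a live box; nothing here changes any state:

```bash
grep '^Node' /proc/zoneinfo    # which zones exist on each node (DMA/DMA32 only on node 0)
ps -e | grep kswapd            # one kswapd kernel thread per NUMA node: kswapd0, kswapd1, ...
```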
So, the next thing to talk about: when Shak talked about tuned earlier — tuned is an automated daemon that, for each profile, sets a bunch of kernel tuning parameters. I picked a half a dozen or so of them, ones you'll see in a lot of the tuned profiles, to talk about how they interact with NUMA and what we do to get optimal performance out of the system. I was going to say that all of these are documented, by the way, under Documentation/vm and Documentation/sysctl in the kernel source; they're well documented, and we've enhanced the documentation to talk about how they interact with NUMA systems. The ones I want to talk about that are dependent on NUMA are three: zone_reclaim_mode, swappiness, and min_free_kbytes — I don't know how many people in here have messed around with these tuning parameters, but those three depend on, or I should say interact with, NUMA. The independent ones, like vfs_cache_pressure and the two dirty ratios that control writeback, are system-wide; they don't really have an interaction with NUMA at all.

So the first one you'll see in the ktune — I'm sorry, the tuned — profiles is a flag called zone_reclaim_mode. This flag controls what happens when a given NUMA node runs out of memory. There are two choices: either hop onto the next node and start allocating memory there, which is going to induce cross-node memory references when the application runs, or stop and start reclaiming memory on that node — swapping, paging, dirty writeback and so forth on that node — while the other nodes have no page reclamation going on at all. This is probably the most important, most performance-sensitive flag you can run into. It's set by default at boot time, based on that SLIT table: if the kernel determines that the average cost of cross-node memory references is more expensive over the long run than reclaiming memory, it turns the flag on, so that the node does local memory reclaiming instead of hopping onto the other node. And there's a trade-off here: if you have a lot of short-running applications — you fork a process, it runs for a second or two and terminates — there's little benefit to having it swap and page and reclaim memory on the node. Whereas if you have a long-running application like a database server that's going to be up for a long time, it probably makes a lot of sense to pay the penalty at initialization time to reclaim memory and fit the thing on the node, rather than doing remote memory access forever.

You can see what it's set to by looking at /proc/sys/vm/zone_reclaim_mode, and you can change it dynamically, on the fly, without rebooting the system. It's set by default based on the NUMA factor, and along the way, as Red Hat has developed the automatic NUMA code, we have flipped the default on and off in various releases — and if you look at some of the tuned profiles, they'll flip it one way or the other depending on whether the profile is latency-sensitive or bandwidth-sensitive. So this slide basically shows the consequences of not setting it correctly — I guess this is just a bunch of benchmarks, right? Yeah, this is again the SPEC suite, with the workloads running having essentially exhausted physical memory: with zone_reclaim_mode set, the cost of essentially swapping on the same node is very painful in performance, versus the baseline of allowing it to allocate memory across the NUMA nodes.
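Checking and flipping it looks like this; a small sketch — 1 means reclaim locally before going off-node, 0 means just allocate on the next node (there are additional bit values for writeback and swap, covered in Documentation/sysctl/vm.txt):

```bash
cat /proc/sys/vm/zone_reclaim_mode       # what did the kernel pick at boot?
echo 0 > /proc/sys/vm/zone_reclaim_mode  # prefer off-node allocation over local reclaim
echo 1 > /proc/sys/vm/zone_reclaim_mode  # prefer reclaiming on the local node
```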
So yeah — has anyone experienced Linux swapping on your laptop or PC when you run out of memory? Yeah... nobody's ever experienced that, okay. Anyway, we'll keep going.

Okay, so the next parameter I wanted to talk a little about — and these are the ones you'll see a lot in the tuned profiles, on in some and off in others, or adjusted one way in one and the other way in another — is called swappiness. This parameter controls how aggressively the system reclaims page cache memory versus anonymous memory. It's an integer between 0 and 100. Lowering it causes the system to avoid swapping, and therefore to reclaim the file system cache memory — the page cache — much more aggressively than the anonymous memory; increasing it does the opposite. So if you go back to that little picture I had with the page cache and the anonymous memory: it determines who contributes to the memory that's going to be freed when the system runs short. I also wanted to say that each node applies this individually: the kswapd and the page reclaim code — even though it's a global parameter — if one node has a lot more page cache and a smaller amount of anonymous memory, and another node has the opposite, each still acts based on that one global setting. This is important to know as you look at these systems: if you see a system that's swapping while there's a lot of free memory, chances are this parameter right here is the one to go adjust — and like I said, it's set in almost all of the tuned profiles. Once again, it's an integer between 0 and 100: the lower you make it, the less aggressively it swaps; the higher you make it, the more aggressively it swaps — and when it swaps aggressively, it does not reclaim the page cache memory. So if you've got something with a big file system footprint and you don't want the file system cache reclaimed — you want it hot and staying in there all the time — you'll typically increase this, so that the system swaps and reclaims anonymous memory and leaves the page cache alone. And vice versa.
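The knob itself; a small sketch — the default of 60 is the usual RHEL value, and the two directions follow what was just described:

```bash
cat /proc/sys/vm/swappiness   # usually 60 by default
sysctl -w vm.swappiness=10    # avoid swapping: reclaim page cache first
sysctl -w vm.swappiness=80    # keep the page cache hot: swap anonymous memory instead
```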
Then the last thing I wanted to talk about here in the memory reclaim code is the free-list watermarks. Like I said, these exist on every node in the system. There are three page-reclaim watermarks: min, low, and high. When you start off life, all the memory is free, and as memory gets used, the free list gets chiseled down to high. Nothing happens until it crosses high and goes down to low; after it crosses low, the kernel wakes up the kswapd for that node, and kswapd works very, very hard at trying to free memory back up to pages-high. However, kswapd is one kernel thread per node, so you can overwhelm it. When that happens, the free list goes lower and lower until it hits min, and once it hits min, basically every process becomes kswapd — it starts running the same code. Before the kernel will let a process have a page of memory, it forces it to free a page of memory — actually it does this in batches: it forces you to free a batch of memory before it lets you have any.

These three watermarks have a really profound effect on the way the system runs when it runs out of memory — does it fall off a cliff, or does it degrade with some elegance? — and they're controlled by a tuning parameter called min_free_kbytes. What min_free_kbytes is, is this: on a NUMA system that has, say, four nodes, whatever value min_free_kbytes is set to gets divided across the nodes and distributed through each of them, and the distances between min and low, and between low and high, are scaled up accordingly. You can see this if you play around with the parameter: if you cat out min_free_kbytes, you'll see what it's set to — it's a kilobytes number, so this is about 90 megs — and basically half of it is on node one and the other half is distributed between the three zones of node zero, so the sum of the per-zone values equals the actual minimum. And the distances between min and low, and low and high, scale up accordingly: I went in here and doubled it, and you can see these values doubled, as did the distances between the watermarks.

In some of the tuned profiles you'll see min_free_kbytes being set. Increasing these values carelessly is generally a bad thing to do, because the only processes or code that can allocate memory below min are privileged kernel threads — so raising it is effectively like removing memory from your system. You don't want to crank this up to gigabytes; you want to do a little at a time, which is what the tuned profiles do. But what raising it buys you is that when the system runs out of memory, it doesn't fall off a cliff; you get a gradual degradation. So this is basically another trade-off between throughput and latency.
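You can watch that split directly; a sketch that assumes the usual /proc/zoneinfo layout on a RHEL-era kernel (the watermark values there are in pages, not kilobytes):

```bash
cat /proc/sys/vm/min_free_kbytes   # the total reserve, in KB, split across nodes and zones
# print each zone's min/low/high watermarks, tagged with its node and zone name
awk '/^Node/ {node=$0} /^ +(min|low|high) / {print node ": " $1 " " $2}' /proc/zoneinfo
```

Double min_free_kbytes and you should see every zone's three watermarks — and the gaps between them — scale up, which is exactly the demo on the slide.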
The other thing I wanted to cover here: as far as NUMA tuning is concerned, there are three different solutions — actually four. You can just not do anything about it and let the cards fall where they may. You can pin things manually. In earlier versions of Red Hat Enterprise Linux we had a user-space utility called numad, and because it was user space, it used CPU and memory cgroups to try to move processes onto the right nodes: on a four-node system it would typically create four CPU/memory cgroups and move processes between them if it noticed cross-node memory references. That's basically what this slide shows — it's a user-space program, and that's how it works. And in Red Hat Enterprise Linux 7, which is a 3.10-based kernel — RHEL 6 is a 2.6.32-based kernel with a bunch of backports from upstream — the balancing is built into the kernel itself, so the kernel is the one watching and moving stuff around. You don't have to run something like numad, with the overhead of the cgroups and all that; the kernel does it for you, on the fly. However, there are tuned profiles that shut the NUMA balancing off, because there are cases where you have very large applications that span several nodes or consume the whole system, and the kernel NUMA balancing code might sit there and fight with itself, moving processes from one node to the other. Even though there is control over how aggressively it moves things, in some cases we get the best performance by simply shutting it off and hand-pinning. Do you want to talk about this? Yeah, that's right — indeed, we are definitely working to make even the large workloads have low overhead, but auto-NUMA does have to scan pages to look for opportunities. The automatic balancing, though, gets to decide at schedule time which NUMA node to use, so dynamically it's much less overhead than doing it in user space. And you can see again some gains here — this was actually the BW-EML benchmark, an SAP application — and this particular vendor never wanted to do manual pinning, so auto-NUMA was set. Again, the bars are the performance and the line going through is the percentage gain, on the right-hand side of the chart — so you can get substantial gains. One other reason to turn some of these automatic daemons — transparent huge pages, or auto-NUMA — off is to keep determinism straight: you'll see some of our low-latency profiles disable some of these more exotic features, because you basically don't want daemons popping up in the middle of potentially low-latency applications. Yep. And just one other thing I wanted to say: this basically illustrates how much better the current automatic kernel balancing works than the — oh, that was without it, okay, sorry.

So the next thing I wanted to talk about is some of the way the CPU scheduler works and its tunables. I won't get into a lot of these, but there are tunables in the scheduler that control things like the quantum and how quickly the system will decide to move a task from one CPU to another. This becomes a lot more complicated on NUMA systems. For instance, with the fine-grained scheduler tuning: in RHEL 5 there was the equivalent of a quantum; the quantum really doesn't exist in the later kernels — it's divided into how frequently the system looks at waking processes up and how frequently it looks at putting them to sleep. I'll talk a little more about this later. Then there's something called load balancing: if all your CPUs have something running on them, or several processes each, so the load average is significantly above one, and one of the CPUs goes idle, you might initially think, oh well, just move some of the processes over and start running them there. On a NUMA system, blindly doing that can cause cross-node memory references, so instead there are parameters, adjusted by tuned, that determine how long the system is allowed to remain in that imbalance before it starts shuffling stuff around. These, once again, are well documented and adjusted a lot in the tuned profiles. There's another one called child runs first. This parameter determines, when you fork a process and there's not enough CPU available for both the parent and the child to run, who runs first — and once again this can have a fairly profound effect on a NUMA system, because the scheduler might just decide to run the child on another processor instead. Once again, I'll get into some of this in a little bit.
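The scheduler knobs just mentioned are plain sysctls; a sketch using the names on a RHEL 7 (3.10) kernel — older kernels spell some of these without the _ns suffix:

```bash
# the granularity pair that replaced the old quantum: how often tasks
# get woken up / preempted
sysctl kernel.sched_min_granularity_ns kernel.sched_wakeup_granularity_ns
# how long a task's cache is assumed warm before the load balancer migrates it
sysctl kernel.sched_migration_cost_ns
# on fork, does the child run before the parent?
sysctl kernel.sched_child_runs_first
# the kernel automatic NUMA balancing switch discussed above
sysctl kernel.numa_balancing
```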
And then the other area I want to talk a little bit about is file system and disk I/O — do you want to talk about that? Sure. Just when you thought you understood the bar charts — if you go back — this benchmark was measured in time, so lower is better: the red bars are better, lower being better here, and again the line across is the percent difference from altering the sched migration tunables.

So, just briefly, on disk I/O — a couple of slides here, and in a minute, Larry, we'll go back. I've really liked running huge applications all my life with our team, and just like we were sharing about the CPU scheduler dynamically changing quantums and such, you can now dynamically change which I/O elevators, as they call them — I/O schedulers — you use. CFQ, the completely fair queuing elevator, was mostly for interactive response times, maybe for laptops. But back in the data center — we just had a talk on deadline scheduling for CPUs, and the deadline I/O elevator has really been determined, at least on more modern storage hardware, to be one of the best, because it has the input queue and the output queue and can schedule against those deadlines while still allowing the I/O stack to do aggregation and things — and again, that's our default in RHEL 7. The noop elevator you'll see being used for really high-end SSDs, like NVMe cards and NVDIMMs and things like that. And there used to be a fourth one that added a delay; fortunately I think we no longer support that one — I would never try to tune it, because it adds a delay for every I/O. But I'll let you go, Larry.
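Elevators are chosen per block device at runtime; a sketch — sda is just an example device name, and the set of schedulers offered depends on the kernel:

```bash
cat /sys/block/sda/queue/scheduler           # the active elevator is shown in brackets,
                                             # e.g.:  noop [deadline] cfq
echo noop > /sys/block/sda/queue/scheduler   # switch on the fly; no reboot or remount
```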
Okay, so the next little section I wanted to talk about is some of the parameters you see adjusted a lot in the tuned profiles that are really not NUMA-related — even though they do have some NUMA effects — and those are the dirty writeback parameters. There are several of them, but I'm only going to talk about a couple here, the background ratio and the dirty ratio; and then there are some read-ahead parameters in the profiles too, which are system-wide and don't have a lot of NUMA effect, although there is some.

So basically, what happens when you open a file and start reading and writing it: the contents of the file are read from the file system into the page cache, so when you do file system operations you're just sitting there doing memory accesses, copying from your buffer to the page cache and vice versa. If it's an mmapped file it's not even copying — the page cache pages are simply mapped via the page tables into user space, and you're updating the page cache on the fly, setting dirty bits as you go. When you do a file system write operation, it copies the data over and sets dirty bits, and then later on in the background — it's a writeback cache — the kernel starts flushing the pages out; depending on the tuning, the more pages are dirty, the more aggressively that happens. The two parameters that control this are the dirty ratio and the dirty background ratio. If you do a sync, it writes the entire dirty contents of the page cache back to disk and you're at zero percent dirty; you start doing file system writes, pages get modified, and the percent dirty climbs. At the same time, by the way, the old equivalent of the update daemon — controlled via a couple of other parameters — writes out everything older than a certain age, on a certain interval; those parameters are in centiseconds, and the defaults work out to 30 seconds and 5 seconds. So anything in the page cache that's 30 seconds old gets written out by a different mechanism than these ratios. But over time you can cross the background ratio — the lower one, 10 percent by default, and it's changed in almost all the tuned profiles. Once you exceed that, the kernel wakes up background flusher daemons that start writing page cache pages to disk, and it works very hard to keep you down at that level; between the two thresholds it's that plus the update-daemon mechanism flushing pages out. Once again, there's a limited number of kernel daemons doing this, and even though they run on a per-node basis, the thresholds are global — they all work in unison — because even though each of those daemons writes the pages from a given node to disk, several different processes running on several different nodes can all be contributing dirty pages. So the two mechanisms are sort of independent, but after a while you can overwhelm the writeback mechanism, and just like in the page reclaim code, once you exceed the dirty ratio — the higher one, 20 percent by default — a writing process stops just dirtying the page cache and is forced to do the same work as the page-writing daemons. It actually throttles the writers — and once again, even that can be overwhelmed.

These are usually set to 10 and 20 percent by default, but the tuned profiles adjust them, and you can almost look at this as another trade-off between throughput and latency. If they're set really low and you don't allow much of the page cache to be dirty, you do much more frequent writeouts and each one is small — that decreases latency, but it also decreases throughput, because you don't get as much work done. Whereas if you crank them up really high, you get a lot more work done between iterations, but when it comes time to pay the piper you do a lot more writing at once, and you get latency spikes. So you'll see the latency-sensitive profiles pull these down and the throughput-sensitive ones push them up. And this slide just says more about how they work — probably everybody's familiar with these things anyway; they control the flushing of pages. One more thing: on very large systems — we support terabytes of physical memory — sometimes you want to set this lower than 1 percent. One percent of a 12-terabyte system is a fairly large amount of memory, and you might not want that to be the threshold, so if you need finer than 1 percent resolution you can control these in terms of bytes instead of percentages — and you'll see that in the tuned profiles too.
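The knobs, as sysctls; a sketch — the 4 GB figure is an arbitrary example of the bytes form, and note that setting a _bytes parameter zeroes out its _ratio twin:

```bash
sysctl vm.dirty_background_ratio   # background flushers wake up here (10 by default)
sysctl vm.dirty_ratio              # writers get throttled here (20 by default)
# on multi-terabyte boxes, bytes give finer-than-1% control:
sysctl -w vm.dirty_background_bytes=$((4 * 1024**3))
# the "update daemon" pair: age and interval, in centiseconds (30 s and 5 s)
sysctl vm.dirty_expire_centisecs vm.dirty_writeback_centisecs
```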
I don't think we want to talk too much more about this — we've got 10 minutes? Go ahead, okay. So the next thing I want to talk about a little bit is the page sizes in the kernel. The Intel hardware supports three page sizes — 4 KB, 2 MB, and 1 GB — and there are multiple ways of using them. By default we typically use 4 KB pages. There's a mechanism by which, via hugetlbfs, you can decide to use either 2 megabyte or 1 gigabyte pages — I'll get into that in just a minute — and there's another mechanism that will do it automatically for anonymous memory, called transparent huge pages. These get turned on and off in the tuned profiles a lot too, because there are consequences to using big pages. With a big page, each TLB entry maps — between 4 KB and 2 MB — 512 times as much information, so you get a lot more memory mapped per entry. The downside is that if you touch one byte in that 2 MB page, it instantiates a whole 2 megabytes instead of 4 kilobytes — 512 times as much — so depending on the application you're running, it may or may not be desirable to have these on; once again, the profiles turn them on and off.

So this is basically how you use standard 2 MB huge pages via hugetlbfs. You'll see a lot of scripts that do this — typically you don't do it by hand; database scripts, for instance, will do it. You echo a number into /proc/sys/vm/nr_hugepages, and then if you cat /proc/meminfo you'll see how they're used — as they actually get used, you'll see the free ones become reserved and then used. What I wanted to say is that this stuff was all designed before NUMA really came into the picture, and if you ask for 2000 pages on a two-node system, it'll go out and put a thousand on each node — that's basically what this shows: put a number into nr_hugepages and it gets distributed evenly across the nodes. Well, with NUMA — especially if you're tuning, if you're going to run a database on node zero and you want the SGA, the Oracle SGA, on node zero — then you'd go into the sysfs file system, where there are files that let you control the NUMA placement much more accurately. You can say "I want all thousand of them on node zero," and that's where they'll be.

Transparent huge pages are a mechanism by which anonymous memory uses 2 MB pages automatically. As you can see, it's on by default: you run an application, and if you shut it off — you can say "I don't want to use it" — the system time with it off versus on is almost twice as much, so you take a big performance drop by turning it off. But once again, there's a downside: if you have a very sparse memory reference pattern, the cost of instantiating those big pages can actually exceed the benefit.

And then finally, 1 GB pages. Up until recently, they had to be allocated at boot time, and if you allocated them at boot time you could never get them back. Once again, they're exactly the same as the 2 MB pages, just 512 times larger — everything scales up by that factor. In RHEL 7 and the upstream kernels, you don't have to do it at boot anymore; you can allocate them at runtime. When you allocate at runtime, there's a daemon that runs in the background that coalesces memory, moving things around to build up a bunch of very large contiguous pages — that's how the code works — so you can allocate and free them after boot. However, if your system has been up and running for a while and a lot of memory has been allocated, it might not work if you want a lot of them: the memory might be wired — the kernel might be using it — and it might not be movable. So it's better to do it earlier rather than later.
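A sketch of the mechanisms just described; the counts and the node0 path are illustrative (the per-node files live under /sys/devices/system/node/):

```bash
# classic hugetlbfs reservation: the kernel spreads these across the nodes
echo 2000 > /proc/sys/vm/nr_hugepages
grep -i huge /proc/meminfo           # watch HugePages_Free / _Rsvd change as they're used
# NUMA-aware placement: put 2 MB pages on a specific node instead
echo 1000 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
# transparent huge pages on/off (RHEL 7 path; tuned profiles flip this too)
cat /sys/kernel/mm/transparent_hugepage/enabled
```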
Then the last thing I was going to talk about, if I have time, is cgroups — this being a performance talk, it's how many slides per second Larry can do... well, we need questions — yeah, there are no questions on this stuff. Anyway, there are two different mount points: on the older RHEL 6 kernels the mount points were set up through /etc — the cgconfig setup — and on RHEL 7 they're under the sys file system. But they're basically very similar in terms of what a cgroup does, and these are the foundations of containers, which is why it's really important to understand this stuff: you can build subsets of your system using cgroups or containers.

So say, for instance — this is just a really simple thing that was easy to do and easy to see — I had a 16 gigabyte, 8-CPU system, which is obviously really old by today's standards, and I wanted to create a 2 gigabyte, 4-CPU subset of it. This is basically what you do: you go into the cgroup file system and create another cgroup down in the hierarchy, you go down there, and you set these — mems sets the NUMA node, and cpus are the CPUs you want to run on. If we went back and looked at that first example: node zero had CPUs zero to three, node one had CPUs four to seven, and so forth. Then I say I want to limit it to two gigabytes in size and run all the processes in there — and you can see that even though I fired up 110 processes, it ran all 110 of them on only half of the CPUs, and the other CPUs sat idle. And you can go even further. What I did want to say is that the way a lot of this tuning is done, especially when it comes to containers, is that you can use the memory and cpuset cgroups to control the NUMA binding. So here's one that's done correctly, so to speak — node zero with CPUs zero to three — and you run a program, and you can see that from the start you get all hits, in other words local memory accesses, and no increase in misses, which are remote memory accesses. And if I do it 100 percent wrong — I say I want to run on CPUs zero to three but with the memory on node one, which is all remote memory access — and do the same thing again, the hits don't change at all; it's all the misses that go up, and that tells you every access is remote.
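A minimal sketch of that subset demo with the RHEL 7 cgroup v1 paths — the group name test, the CPU list, and the 2 GB limit are just the values from the slide:

```bash
# carve out a 4-CPU, single-node, 2 GB corner of the machine
mkdir /sys/fs/cgroup/cpuset/test
echo 0-3 > /sys/fs/cgroup/cpuset/test/cpuset.cpus   # CPUs on node 0
echo 0   > /sys/fs/cgroup/cpuset/test/cpuset.mems   # allocate memory from node 0 only
mkdir /sys/fs/cgroup/memory/test
echo 2G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
# move the current shell in; everything it forks inherits the limits
echo $$ > /sys/fs/cgroup/cpuset/test/tasks
echo $$ > /sys/fs/cgroup/memory/test/tasks
```

Point cpuset.mems at the "wrong" node, rerun the workload, and numastat shows the hits/misses flip just described.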
We're out of time — I don't know if we want to try to scurry through some of this or answer questions; we've got eight slides. So, staying on the same thought as the scheduler discussion from before: inside the CPU cgroups there are multiple scheduler options, one of which is shares. With shares, when you create a CPU cgroup it starts out with 1024 shares, and if you only want to allow this cgroup about one percent of the CPU, you basically turn it down to 10 — 10 divided by 1024 is about one percent. Shares only bite under contention: as long as nobody else is using the CPUs on the system, the group can still use them, but once there's competition it gets throttled to its one percent. So I wrote this program called useless — all it did is a while(1) — and ran like 99 copies of it, and I wrote one called useful, which did the same thing. And you can see that as long as all those useless processes are running, the useful one gets throttled to one percent, just like we told it to. And then this is basically an example with Oracle — I forget — time; I'll go back to this.

The other type of CPU cgroup control used in these profiles a lot is the quota. With the quota it's just the opposite. There are two parameters here, period and quota; the period is set to 100,000 and the quota starts at negative one, which means it's off. So if you run a program or a set of programs that consume all the CPU, and you've told it not to throttle at all, it uses all of it regardless of whether anybody else wants the CPU. But if you say I only want one percent — you put one percent of the period into the quota — then even if there's nothing else running on the system at all, it will only use one percent of the CPU.

And then this section right here was just talking about the OOM killing that happens in cgroups. An OOM kill is what the system does when it's backed into a corner and runs out of all its resources: it basically has three choices — hang, crash, or kill the lowest-priority task — and the default choice is to kill the lowest-priority task. That can also happen inside a cgroup. So basically what I did is set this up — you can control how big the memory is in a cgroup — and I said one gigabyte of RAM and two gigabytes of swap space, then ran a program that allocates and tromps on 16 gigabytes. What's going to happen is it tries to touch all the pages: it uses the gigabyte of memory, then swaps out two gigabytes, and then the kernel finally kills the process. When that happens you can actually see it in vmstat — when the kill happens, all the memory goes back on the free list — and in the dmesg output, the difference between a process killed because of a cgroup limit and a global OOM kill is that it says "Task in /test" — /test being the name of the cgroup I created — so it tells you the reason it killed the process.
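A sketch of those three demos as cgroup v1 knobs — again the group name test is from the slides, and the memsw file assumes swap accounting is enabled on the kernel:

```bash
mkdir -p /sys/fs/cgroup/cpu/test /sys/fs/cgroup/memory/test
# shares: relative weight, only enforced under contention (default 1024)
echo 10 > /sys/fs/cgroup/cpu/test/cpu.shares             # ~1% of the default weight
# quota: a hard cap that applies even on an idle system
cat  /sys/fs/cgroup/cpu/test/cpu.cfs_period_us           # 100000 by default
echo 1000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us     # 1% of the period; -1 = off
# the OOM demo: 1 GB of RAM plus 2 GB of swap, then run something bigger
echo 1G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
echo 3G > /sys/fs/cgroup/memory/test/memory.memsw.limit_in_bytes  # RAM + swap combined
```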
Do you want to say anything about the summary here? This is a lot of information to swallow, I understand — it's a week-long class, and I'm almost not kidding. Yeah, no, I mean, this is a good summary; you can read it for yourselves, and if you do have questions, I'm sure we can take them outside the room — yes? Okay, I think we're in everybody's way. Like I said, this is a lot of information to try to show in a short period of time. This stuff is all well documented, but we tried to give you an idea, for these tuned profiles, of where we get this information and what we adjust — and there are many more than these six parameters, by the way; there are a couple dozen that we adjust all the time. So anyway, I think that's it. What's that? Yeah, he's a wild man. The session was not only full of people, it was full of information as well — and we started early. Well, that was a good job; we usually do this as a two-hour presentation, and we should have asked for two hours — with two hours we'd be able to get into examples and a lot of other stuff, but there's just not enough time. Sorry — no, no, no worries, it was a great one.