So I'm going to get started here. My name is Larry Woodman; I'm an engineer in the RHEL kernel group and have been for about 18 years now, and if there's anything you don't like in what I say, my boss is right there, so throw tomatoes at him and not me.

We've given this presentation multiple times, and I want to start by saying it's an overview. There's a lot of complexity in tuning the kernel and in all the parameters associated with it, so this is very much an overview, not a step-by-step course in kernel tuning. It tells you how we do it, and it gives you pointers to resources so you can get involved yourself as much as you want.

The performance of RHEL has evolved over the 18 years I've been here, and I'm leaving off probably the first half of that. So over the past close to ten years the performance has evolved, both in gains and in complexity. The way I'm going to go about this is to talk about how the kernel is tuned, how RHEL 8's performance has improved over RHEL 7, and how we improved it.
The first section is just going to be a bunch of graphs showing how much better RHEL 8 does than RHEL 7. I won't even go back into the RHEL 6 timeframe; it's just a contrast of RHEL 8 versus RHEL 7. RHEL 7 was 3.10-based, and — I don't know how many people in here are Red Hatters — Red Hat backports a bunch of upstream features into these kernels. We come out with a major release every two or three years, and then over the life of the product we do a minor dot release every six to nine months. In the course of doing so we maintain the kernel binary interface, the kABI: we make sure that any changes we make don't alter kernel data structures in a way that would invalidate the testing of a module you've already qualified, so it stays compatible for the entire life of the product. That's complicated, because we sometimes have to omit fixes we'd like to take and simply wait until the next major release.

We're going to talk about some disk I/O — database and I/O performance is obviously a major focus here. Then we'll talk about some of the changes in RHEL 8 in terms of hardware support and other features we've taken in: five-level page tables, NVDIMM, and some of the NUMA and huge page support RHEL 8 has over RHEL 7.

This is some of the evolution that's happened. In RHEL 5 things were pretty static: we had static huge pages, cpusets, and the tools were pretty static too — you literally built the kernel with a lot of these things in it. As it evolved from 6 to 7 to 8, everything became a lot more dynamic. We got good ways of adjusting parameters on the fly, and much better ways of picking what we call a profile. If you know you're going to run a database server or something, then out of the box you can select how you want the system to behave, and from there it's fairly easy to make small adjustments to the parameters I'm going to talk about here.

By the way, the tool for this is a tuning utility we call tuned. Basically, tuned applies a bunch of tuning parameters at boot time, and some of them can be adjusted after boot. If you look in these tuned profiles you'll see a bunch of kernel tuning parameters, and that's sort of what I'm going to talk about here. There are scheduling parameters that control how long a quantum is — obviously with a longer quantum you trade off throughput versus latency. The same is true for a lot of the other parameters: if you want a system that's very responsive, you tune it for latency; if you want one that does a huge amount of work in a given amount of time, you tune it for throughput. There are parameters for P-states and C-states, parameters to control writeback, readahead, and all this other stuff; we'll get into some of these as we go.

With tuned, we ship multiple pre-selected profiles, so you can say what a given system is going to be. To come up with some examples: a laptop or a workstation is obviously balanced between throughput and latency; there are specific profiles for guests and hosts, high-throughput profiles for database servers, and low-latency profiles for things like stock-trading applications.

The way tuned works is hierarchical. You basically select a profile — the latency-performance profile is going to be low latency, so it sets parameters such that it may compromise some throughput to get you very fast response times. Then on top of that you can build another profile. In this example the network-latency tuned profile includes the latency-performance profile, and then in addition to the tuning parameters you inherit, you can set new ones or override the ones in what you've included. It's hierarchical like I said: you start with the low-latency profile, add the network tuning parameters on top, and you can keep adding more and more layers. That's how you develop your own profiles.

The next little section I want to talk about is out-of-the-box testing. What we did — actually one of the performance engineers did it; I helped — was go through and determine how RHEL 8 behaves out of the box, and how it runs when you apply these tuned profiles. We have a bunch of graphs that show how much better RHEL 8 is, how we've evolved over RHEL 7 and previous versions. To do that we're going to talk about CPUs and memory — we support a lot of memory now, up to 48 terabytes, and we're testing even larger systems right now; we'll get into the need to expand beyond four-level page tables — plus various networks, disk I/O, and security. We have a bunch of data here on the CVEs that have come up over the last couple of years.
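The hierarchical profile layering described above is just a small config file. Here's a sketch of building one: the `include=` mechanism and the network-latency profile name are real tuned features, but "my-low-latency-net", the temp directory, and the sysctl values are my own illustration — on a real system the file would live at /etc/tuned/my-low-latency-net/tuned.conf.

```shell
# Create a custom profile that inherits from network-latency
# (which itself includes latency-performance) and layers overrides on top.
profile_dir="$(mktemp -d)/my-low-latency-net"
mkdir -p "$profile_dir"
cat > "$profile_dir/tuned.conf" <<'EOF'
[main]
# inherit everything the stock low-latency network profile sets...
include=network-latency

[sysctl]
# ...then add or override individual parameters on top
net.core.busy_read=50
net.core.busy_poll=50
EOF
cat "$profile_dir/tuned.conf"
```

Activating it on a real box would be `tuned-adm profile my-low-latency-net`, and `tuned-adm active` shows what's currently applied.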
The CVE fixes have solved the security problems, but there are frequently performance consequences to doing that. This chart shows the typical gains we've gotten in RHEL 8 over RHEL 7. For CPU stuff — STREAM, LINPACK — we get on the order of 5% gains just out of the box with the profiles we have in place. For Java — SPECjbb — and for AIM and so forth, we're talking about a 20% gain in a lot of these benchmarks, so it's substantial. For disk I/O and databases — OLTP and all that — it's again between 15 and 20 percent, which is a big gain. A lot of work has gone into network loads, especially smaller packets; we'll get into some of that, but for TCP and UDP with small packets it's not unusual to see a 25 or 30 percent gain. And finally there's the network-bypass code — DPDK, RDMA, and so forth. It's better, but since it bypasses the kernel, the kernel isn't really in the way anyway. A lot of this might be due to hardware we support in RHEL 8 that we didn't support in RHEL 7, but not all of it.

The performance team likes these charts — it almost looks like a polar-coordinate graph — of how RHEL 8 evolved as we were pulling pieces in and backporting them. The blue was early versions of RHEL 8, and as it evolved and we backported more and more of these features, you can see the performance has expanded outward even more. So basically from this point on we'll go forward and talk about some of these benchmarks, what we've done, and how we've achieved really good results.
So this here is AIM7 running on XFS, multi-user, and as you can see, the difference between RHEL 7 and 8 is a really big boost as we get to larger numbers of users. For shared throughput, as far as databases are concerned, it's better, though not as dramatic as the shared-throughput or file-server results. The file-server results are more extreme for a variety of reasons: some of that was tested with NVDIMM, which we didn't support in RHEL 7, so it's a combination of algorithmic changes in the kernel, tuning, and hardware we now support in RHEL 8 versus RHEL 7.

This slide has the exact details of why something like AIM runs so much better — it's the output of the profiler. In this particular case a lot of the algorithms have changed, so the spin locking is out of the way. There's a lot more lockless code in RHEL 8 than there was in RHEL 7, and because of that you don't spend as much time spinning on locks. There's a lot more parallelism in the kernel, and in user mode as well, just because you're not waiting on spin locks, and you can see the result is, I don't know, a 10% gain or something like that of RHEL 8 over RHEL 7.

This is the Intel advanced vector instructions, the AVX instructions. I don't know if anybody's familiar with these, but the CPUs we support include these advanced vector instructions, and you can get really big boosts — in terms of actual numbers I don't know, but the instructions allow vector processing, so CPUs using them start to rival some of the FPGA times we see. This is just from supporting this type of CPU in RHEL 8 over RHEL 7 and using it in various libraries — and the compiler we use actually generates these instructions in RHEL 8, where it didn't in RHEL 7.

Once again, this is an out-of-the-box picture of network loads with small packets — 64-byte packets. You can see we get a big boost, 20 or 30 percent, and as the packet size gets bigger it flattens out, because the optimizations were made in the kernel for smaller packet sizes. So small-packet performance in RHEL 8 over RHEL 7 is very good. The XDP results show what you get from basically taking the kernel out of the way with DPDK and so forth: really big gains of RHEL 8 over RHEL 7, because we didn't support a lot of this in RHEL 7 — it wasn't even possible.

This is another of those polar-coordinate charts the performance team likes to put together, and it shows that as we evolved RHEL 8 over time, some of these other benchmarks gained a lot of performance too.

Okay. Those were benchmarks; now I want to talk about database tuning tips, which are really real-world applications. I'll talk about both MariaDB and PostgreSQL, but I should back up and say that all databases share a fairly common set of tuning parameters. If you look in the tuned profiles there's a database tuned profile, and then there's one that includes the database profile and is specific to the database you're actually running.

For all databases, you almost always want to use huge pages. I'll talk about this in detail a little later, but when you use huge pages, we actually remove them from the page lists, so the system can't reclaim them; it's as if you booted the system without that memory and set it aside for the database to use. When you do that, you're basically telling the kernel to get out of the way, and that's what it does. It reduces TLB misses, it wires down the pages, and it prevents any swapping that could take place — the worst thing you can do on a database is swap, or sit there reclaiming memory — so it totally eliminates that.

Then there's what we call dirty background writing. We cache the contents of files in something called the page cache, and it's a write-back cache, so there are tuning parameters that determine how early or how late the system starts flushing pages back to disk. And if you exceed the capacity of the system to flush pages back, you can tune it so that the process doing the writing is forced to block and write those pages back itself. Making the system write back pages earlier produces a lower-latency result, but generally lower throughput too: if you were never going to need to write anything back, the last thing you'd want is to spend time writing it. So the throughput profiles raise these tuning parameters to higher values, and the latency profiles do the opposite — they lower them, so that when the system goes to write pages back, a lot more of the page cache is already clean and there's less work to do on each iteration, at the cost of more work overall and lower throughput.

And then there's sizing the pools. Each of these databases has its own user-level tuning as well, controlling things like how big the chunks of virtual memory allocated to it are and how big its own equivalent of the page cache is — Oracle calls it the SGA.

Just from the out-of-the-box results — for the actual numbers you'd have to look at that polar graph, so to speak — you can see that the MariaDB performance of RHEL 8 over RHEL 7 is substantial: double-digit percentages. The same is true for PostgreSQL, though Postgres isn't as extreme. Part of that comes down to architecture: PostgreSQL is a process-per-connection database, while MariaDB is thread-based. In a multi-threaded server, all the threads share one address space, so a context switch between threads doesn't need to change the address space; a process-based server may context-switch just as much, but those switches can involve address-space changes.

Okay, as far as Oracle is concerned — once again, this is going through the tuned profiles for Oracle. Once again, you make sure you use huge pages so you reduce TLB misses, and you wire the page cache.
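The dirty write-back knobs described above are ordinary sysctls, and the two profile families push them in opposite directions. A sketch — the sysctl names are the real kernel ones, but the values are illustrative rather than the shipped profile contents:

```
# Throughput-leaning: let dirty pages accumulate, flush in big batches
vm.dirty_background_ratio = 10   # background flushing starts at 10% dirty
vm.dirty_ratio = 40              # writers forced to do their own writeback at 40%

# Latency-leaning: keep the page cache mostly clean
#vm.dirty_background_ratio = 3
#vm.dirty_ratio = 10
```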
That's the Oracle page cache — the SGA — wired in memory, and you prevent swapping entirely. In the case of Oracle we also do a couple of things we don't do for other databases. We turn off what we call auto-NUMA. The auto-NUMA balancing code in the kernel trolls around and decides whether to migrate pages from one NUMA node to another so you get local memory accesses instead of remote ones. With Oracle, the SGA is generally configured to be larger than any single NUMA node, and if it's larger than any single node, you're not going to accomplish anything by moving pages around anyway. The kernel would just spend valuable system time migrating pages, only to have to move them again once the database engine starts making other remote memory accesses. So with Oracle, the tuned profiles shut off auto-NUMA.

They also turn off transparent huge pages. Transparent huge pages are a mechanism where we use the larger page sizes automatically — this is all Intel-based; we can talk about other architectures afterward, but here I'm talking about Intel. Intel supports 4 KB, 2 MB, and 1 GB page sizes, and by default the anonymous memory allocated to a process will use 2 MB pages if it can. The positive is that you touch a 2 MB page once and it instantiates the entire 2 MB of memory; the downside is exactly the same thing — it instantiates the whole 2 MB, so it consumes more memory, and when the system runs out of memory it has to start cruising through these huge pages, breaking them into small pages. There's a background daemon that does that, and it introduces latency. So to get the database to its maximum performance, we shut off transparent huge pages.

In terms of the dirty ratios I talked about before: a database can be designed either to use the operating system's page cache or to avoid it, and Oracle uses it. So to lower the latency, we lower the background ratio — which determines when the system starts flushing out dirty pages — but we raise the dirty ratio, which is the threshold at which the system stops a writing process and makes it contribute to the write-backs itself. Opening that window provides higher performance for Oracle.

NUMA pinning — there's a whole section coming up on NUMA and pinning, but probably the single biggest area where you can gain performance on any of these systems is NUMA pinning: making sure you don't do remote memory accesses. I'll talk about that in more detail in a couple of minutes.

And then the SGA — again, this is basically Oracle's page cache. You have to make sure the system's tuning parameters are in unison with the database's tuning parameters; if you don't make sure they work together, they'll fight with each other. The SGA — it stands for System Global Area — is the page cache of the database, and if it's sized larger than the memory available for the operating system's page cache, it's going to overcommit memory and the system is going to swap. So it needs to be sized based on the NUMA characteristics of the system and the amount of RAM the system has.

This is RHEL 8 versus RHEL 7 on Oracle 12 — I just grabbed one of the slides that illustrates this — and as you can see, most of the time we do better on RHEL 8 than on RHEL 7. But at 40 and 80 users on this workload, there was actually a slight win for RHEL 7 over RHEL 8, and that had to do with algorithmic changes in the kernel. If you run into something like this, it tells you the system is ripe for additional tuning: take a tuned profile and start figuring out what you can do. If you're at that 40- or 80-user load and want to improve performance, this tells you it's ripe for tuning, and I'm going to show you there are pointers — hyperlinks to papers — that tell you exactly what to focus on when you're doing this.

The next thing I want to talk about is Microsoft SQL Server. We supported it in RHEL 7, but in RHEL 8 there's a whole bunch of algorithmic changes in the kernel — changes Red Hat made, or backported from upstream — that optimize its performance. SQL Server is a completely different beast. The way Microsoft ported SQL Server to Linux is with a whole layer of software that emulates the Windows environment rather than changing the source code: the database was designed to run on Windows, and rather than redesigning it they came up with a layer of code that translates from Windows and the Windows environment to the Linux environment. We spent quite a bit of time with them — I worked with them, and other people in the group work with Microsoft — on optimizing this. There's no equivalent of things like System V shared memory in Windows, and some of these features that Linux supports are the ones that give us the optimal performance. In the case of Microsoft SQL Server — again, you'd have to go back to that little polar-coordinate graph for the exact numbers — in all the cases we've measured, the performance of SQL Server running on RHEL 8 is better than on RHEL 7. And then, let's see what else — this is just a sort of performance summary.
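The Oracle-oriented settings from a couple of slides back boil down to a small tuned-profile fragment. A sketch — the `[main]`/`[vm]`/`[sysctl]` sections and option names follow tuned's plugin conventions and the sysctl names are the real kernel knobs, but the values are illustrative, not the shipped profile contents:

```
[main]
include=throughput-performance

[vm]
# disable transparent huge pages (the background-daemon latency issue above)
transparent_hugepages=never

[sysctl]
# stop the kernel from migrating the node-spanning SGA around
kernel.numa_balancing=0
# widen the dirty window: flush early in the background,
# but stall writers only at a high threshold
vm.dirty_background_ratio=3
vm.dirty_ratio=80
```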
We talked about micro-benchmarks and databases; deeper in this presentation — I'm skipping through some slides so I can make progress in a one-hour time frame — there's more information on how to tune Java systems, SAP, SAS, all of these, and in addition there are hyperlinks that point you to papers on what to look for and how to go about it. So that was the first section: basically out-of-the-box performance, and the improvements you get using the tuned profiles we talked about earlier, for a variety of benchmarks and real-world applications.

The next section I want to talk about is NUMA. Like I said before, on NUMA systems this is probably the single biggest area of performance gain you can get in any tuning endeavor. Systems are all like this now — even your laptops are NUMA systems. The NUMA building blocks are what we call nodes: each node has several cores, shared cache, a bunch of RAM, and links that interconnect the nodes and connect to the outside world for DMA devices and so forth. How well we do at placing processes next to the memory those processes are using is the single largest factor in determining how well the system performs.

This is what a NUMA system looks like, and like I said, even laptops now have NUMA on the chip: even though you might say "oh, this only has one CPU in it," a lot of these CPUs are NUMA systems internally, with multiple memory controllers and everything in them.

There's a command, numactl, with several different options you can pass to it — there are hyperlinks in here on how to use it and how to interpret it. If you run numactl --hardware on some system — I just grabbed this one — it says it's a four-node system, each node with 64 GB of memory; a small system, 256 GB or so total. It tells you how much memory is on the system and how much is free, it tells you which CPUs each node includes, and then there's this little distance table down at the bottom, the SLIT — the System Locality Information Table — which tells you the relative memory bandwidth and latency cost between nodes. What you always want is to minimize memory latency by making sure you're executing on the same node the memory is allocated on. Another tool that's more graphical is called lstopo; it shows you the same information, but graphically.

So numactl lets you see the system, but it also lets you do pinning: you can say "I want this program's memory allocated on this node, and the CPUs it executes on to be, hopefully, the same node." You can actually do the opposite, too — force execution on one node and memory allocation on another — and if you do that, you'll get higher latencies and lower throughput.
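The pinning just described might look like the sketch below. The `--cpunodebind`/`--membind` flags are numactl's real options; the wrapper function and its fallback are my own illustration so the snippet still runs on a box without numactl installed.

```shell
# Run a command with its CPUs and its memory both pinned to NUMA node 0.
run_on_node0() {
  if command -v numactl >/dev/null 2>&1; then
    numactl --cpunodebind=0 --membind=0 "$@"
  else
    "$@"    # no numactl available: just run the command unpinned
  fi
}

run_on_node0 echo "running (pinned if possible)"

# Deliberately mismatching them -- CPUs on node 0, memory on node 1 --
# is also possible, and is how you demonstrate the remote-access penalty:
#   numactl --cpunodebind=0 --membind=1 ./my_workload
```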
There's a hyperlink later on in this that tells you how to use it and how to evaluate its output. And this is an example of what we talked about earlier with NUMA balancing — here, with it enabled. The kernel has internal algorithms to monitor where a program is running and where its memory is allocated. By default, if you run a program that allocates a bunch of memory and for some reason ends up doing remote memory accesses, the kernel will over time notice this and start to migrate the memory, or the threads, so they end up on the same node. This is an example of that: at the top you can see these guests were just scattered all over the place, and then over time this particular process migrated so that all its memory was on the same node. All the memory references are now local instead of remote, and you get optimal performance from the application. Once again, there are hyperlinks in here that tell you exactly how to monitor and manage this stuff.

As far as the inside of the kernel is concerned, I just want to talk a little about how this is implemented and why it's necessary to tune it. This is what memory basically looks like on a system. Nodes contain zones of memory, and node 0 always contains two zones the other nodes don't: the first 16 megabytes and the first 4 gigabytes of memory, for DMA purposes. These are called the DMA zone and the DMA32 zone. So the first 4 GB of memory on the system always goes to node 0, and all the other memory is scattered across the other nodes; that's always a given.

Inside the kernel there's a paging thread we call kswapd, and there's one kswapd per NUMA node. So it's almost like a little network of computers inside your system. What this means, by the way, is that you can have a program running on a given node that uses all of that node's memory — that node can actually be reclaiming memory and swapping while the other nodes are completely fine, with an abundance of memory. That's a fairly normal thing to see. And if you incorrectly pin a process to a given node while allocating its memory from another node, you'll see this: kswapd will run, migrating pages around and so forth. It's important to understand this when looking at how to analyze the system.

That's why I want to talk about some of the tuning parameters you'll see in these tuned profiles. There's a section of them that is sort of NUMA-dependent. Swappiness determines how aggressively the system reclaims anonymous memory. min_free_kbytes is how much free memory a node keeps in reserve. And zone_reclaim_mode — I'll talk about it in a minute — is a really big switch that determines, when a node runs out of memory, whether you reclaim just the memory on that node or allocate memory from the other nodes. It's a trade-off of throughput versus latency: if you have an application that's going to run for a long time, it's probably desirable to make sure all its memory is on the same node, even though it's more expensive to start up — even if that means reclaiming memory on that node rather than allocating remotely.

[You said five minutes? This goes until 12:55, right? Okay, okay — leaving time for Q&A. Just watching the clock here.]

So like I said, this is a big one: zone_reclaim_mode. By default it's set to zero on a lot of systems, and what that means is that if you exhaust the memory on the node you're on, rather than forcing kswapd to reclaim, the kernel will simply step over to another node and allocate the memory there, and then let the NUMA balancing code deal with it — it'll move stuff around later, causing higher latencies. If you set it to one, then when a node runs out of memory, that node is forced to do memory reclamation first, which can mean page cache reclaiming and swapping and all that as it goes. Like I said, the default is now zero. It used to be one, but the upstream maintainers realized that the majority of the time you want allocations to spill over onto another node and let the NUMA balancing code sort it out. Since we shut off the NUMA balancing code in a lot of these tuned profiles, we have to be very careful we don't end up putting the system in a state where it does a lot of remote memory accesses.

Okay, so that was NUMA, and what you can do to make the system allocate and perform better. The next thing I want to talk about is huge pages. This is probably the second biggest area where you can make a performance difference. Like I said, the Intel architecture supports three page sizes — 4 KB, 2 MB, and 1 GB — and each TLB entry is associated with a page size.
So if you have Form it so if I'll just make something up here if you had 512 entries In this in each entry Monet maintained 4k 4k times 512 is what two megabytes The the that would limit the size of the TLB to to two megabytes before it started doing TLB reclaiming and it's all transparent, but it slows the system down if you if the If you use two meg pages, then it's going to give you 512 times as much memory So it's going to be half a what half a gigabyte instead of Two megabytes and if you use one gig pages if you force the system to use one gig pages It'll the TLB will mean manage and maintain 512 gigs all all very positive things the downside of doing this though is Like I said, if you touch one bite in a page It'll instantiate either two megs or one gig and the system will potentially run out of memory faster and will end up with more What's the word I'm looking for you'll end up? It'll it'll basically force the system to start reclaiming and Trolling through memory in order to break it down into smaller pages if you allocate to two meg page system runs out It has to break it into smaller pages the 4k page size in order to do that And this introduces a lot of latency and this is why some of these profiles especially database tuning profiles Disable transparent huge pages, so this is just a picture of the of the Different memory sizes, and this is how they're used in the database in database tuning They use this that you'll see a lot of this tuning use the Proxys VM and our huge pages it'll that what that does is it removes a bunch of memory from the normal Paging list and sticks it on a look-aside list that only the database or the application that I'll that requested that can use and of course of doing so it prevents the system from reclaiming any of this so Doing this has a really positive side effect in that it's impossible to page a swap any of this memory It also has a really big negative side effect, and it's like pulling memory out of the box So if you 
If you haven't matched them up, that's a problem: this is where it's important to make sure the system tuning matches the database tuning. If you tune the database to use, I'll just make something up, a gigabyte of memory for the disk block cache, and you tell the system to take two gigabytes of huge pages, it's like removing a gigabyte of memory from the system. Gigabytes are small in today's world, so I should have used terabytes, I guess, but it's the same thing.

So this is a description of how to use two-meg huge pages and one-gig huge pages. In early RHEL 7 we backported some features, but you had to allocate the one-gig pages at boot time and they could never be reclaimed. In RHEL 8, and in the later dot releases of RHEL 7, the system can dynamically expand and contract the one-gig pages, which is a big plus, because you don't have to reboot the system if you want this memory back, and that used to be a pretty big problem. So this is just pictures of how to do this: you can see what it looks like when you allocate the memory and when it gets used. /proc/meminfo in the proc file system shows you where the memory is, and a lot of the hyperlinks that you're going to see in here refer to this stuff.

By the way, these standard huge pages are only used through hugetlbfs, which is a file system, a metafile system, and through System V shared memory, which is layered on top of hugetlbfs and which databases all use for the disk block cache. So the only way to use standard huge pages is through hugetlbfs and System V shared memory. Transparent huge pages, by default, are for anonymous memory, so your stack, your heap and all that stuff: if you do an anonymous mmap, or an sbrk from a
malloc, and it's two gigabytes, I'm sorry, two megabytes or greater, it'll use huge pages for that. Once again, you get all the benefits of huge pages and all the negative sides of huge pages, so if you're really concerned about latency, you want to shut this stuff off so you don't have daemons and threads running in the background coalescing pages; that introduces a lot of latency, but it improves performance, as you can see.

I wrote this stupid little program that allocated a bunch of memory and danced around in it, and you can see it took 12 seconds to run when I didn't use transparent huge pages, but when I did use transparent huge pages the time went from 12 seconds to 7 seconds. A really big performance boost, but it introduced a lot of latency. So if you're running a latency-sensitive application, like a stock trading application, which is really important, you want to sell your stock, I won't even say what company, and you're really concerned that things happen in a given amount of time without high latency spikes, you typically shut this stuff off.

So I'm just going to go another five minutes on this, unless somebody has questions; I'm going to continue to blast through it, because some of this stuff is important if you're reading the papers that we have on tuning systems.
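The two knobs discussed above look roughly like this. A minimal sketch, not a recommendation: the /proc and /sys paths are the standard Linux ones, but the page count is made up, writing to them requires root, and the right settings depend entirely on the workload:

```shell
# Reserve 1024 static 2M huge pages on the look-aside list
# (equivalently: sysctl vm.nr_hugepages=1024).
echo 1024 > /proc/sys/vm/nr_hugepages

# Check what was actually reserved -- contiguous memory may be
# scarce on a long-running system, so you may get fewer.
grep -i huge /proc/meminfo

# Shut off transparent huge pages, as the latency-sensitive and
# database profiles do ("madvise" instead limits THP to regions
# that explicitly opt in).
echo never > /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/enabled
```

The static pool and the THP setting are independent: a database can keep its shared-memory disk block cache on static huge pages while THP stays disabled for everything else.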
So this is just something we put in here for benchmarking only. After a while, once you boot the system and run different programs, most of the memory gets consumed. It gets consumed in the page cache, which is caching disk block data; it's easily reclaimable, but it's consumed. It also gets consumed in another cache called the slab cache, which is basically operating system and file system metadata: the kernel data structures, and the data structures associated with the management of data on disk, not the file system data itself.

Rather than rebooting your system to get this back, if you're root, you can echo into /proc/sys/vm/drop_caches. You can echo a 1, 2, or 3; they're bits. If you echo a 1 in there, it dumps all the page cache memory, which is going to be a substantial performance hit if you cared about the contents of that cache, but if you're trying to run benchmarks, rather than rebooting the system all the time, you can just dump all the page cache memory out. The same thing is true with the slab cache. You don't want to do this unless you're running benchmarks, but in order to expedite the tuning process, when you're messing with the tuning parameters and you want to see how something worked, rather than rebooting the system you just echo this into drop_caches and boom, the page cache and the slab cache go away and all the memory gets freed.

As far as how this stuff works: when you do read or write operations, we have a set of data structures in the kernel called the page cache. It's a write-back cache. What that means is that when you write to it, it just copies the data in there and says, okay, I'm done.
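The drop_caches procedure described above looks like this. A benchmarking-only sketch using the standard /proc path; it needs root, and a `sync` first so dirty pages get written out before you drop anything:

```shell
sync                                # flush dirty pages to disk first

echo 1 > /proc/sys/vm/drop_caches   # bit 1: drop clean page cache
echo 2 > /proc/sys/vm/drop_caches   # bit 2: drop slab (dentries, inodes)
echo 3 > /proc/sys/vm/drop_caches   # both bits: page cache and slab

free -h                             # confirm the memory came back
```

In practice you run this between benchmark iterations so each run starts from a cold cache, instead of rebooting every time.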
And then there are background threads that flush the contents of this page cache to disk on demand, and you can tune this: you can tune how aggressively the system writes these modified pages back to disk, and you can tune the thresholds. The ones that I wanted to talk about here are the dirty background ratio and the dirty ratio.

Basically, the page cache has three levels to it, so to speak. When almost none of the page cache is dirty, it doesn't do anything; it doesn't write the pages back to disk or anything. But when you exceed the dirty background ratio, which is a tunable parameter, 10% by default, then every time you go to write a page and cross that threshold, it wakes up the flush daemon, and when that flush daemon runs it flushes pages out, in an attempt to keep the dirty page list down. If you overwhelm it, by running several programs all writing at the same time, you can climb up to the dirty ratio. Once you hit the dirty ratio, every time you go to modify a page and exceed that threshold, it makes your process become a flush daemon: it stops you, and boom, you're obviously introducing a lot of latency. So this is where the tuning of this is so critical, especially in and around databases and file system benchmarks.

On to memory tuning: there's a parameter that you'll see in all of the tuned profiles called swappiness. What swappiness does is determine how aggressively the system reclaims anonymous memory versus file system page cache memory. It's not a percentage or anything.
I don't know why we put a percent sign on it; it's an integer between zero and a hundred, set to 60 by default. Basically, what happens is that when kswapd runs, because the free list gets low, it uses swappiness to determine whether it should go after page cache pages or anonymous pages. Obviously, if you're running something that's highly file system sensitive, you want to lower this to force the system to reclaim clean page cache pages instead of anonymous memory pages; anonymous memory pages always need to be written to swap space, so there's a real big negative aspect to reclaiming those.

This is referenced by one of the white papers we have, listed later in this presentation. And this is just a picture of what the vmstat output looks like; vmstat is a tool that tells you where the memory is on the system and how much time is spent in user mode, system mode, idle and so forth.
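The writeback and swappiness tunables above are all plain sysctls. A minimal sketch of inspecting and setting them; the values shown are just the defaults mentioned in the talk plus an illustrative change, not recommendations, and setting them requires root:

```shell
# Inspect the current thresholds.
sysctl vm.dirty_background_ratio   # wakes the background flusher (10 by default)
sysctl vm.dirty_ratio              # writers become flushers past this point
sysctl vm.swappiness               # 0-100 integer, 60 by default

# Illustration: prefer reclaiming clean page cache over anonymous
# memory, and start background writeback earlier.
sysctl -w vm.swappiness=10
sysctl -w vm.dirty_background_ratio=5
```

A tuned profile is essentially a named bundle of settings like these, applied together and reverted when you switch profiles.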
It shows you the difference between fibre channel storage and SSD storage, SSD storage obviously being a lot faster. You can see over there that, with fibre channel storage, the same benchmark has user time between 25 and 50 percent; there's obviously always some system time, and if you're waiting for disk IO, you're going to have idle time. But since the SSDs do almost no waiting for IO at all, you can see the user time goes way up. So this is an example of, quote, tuning that involves substituting SSDs for rotating media.

This is just an overview of the tuning that I talked about here. And I think this is the same thing I said before: these are the most common tuning parameters, the ones you're going to see in these tuned profiles, the ones the white papers describe the most; we picked basically the parameters that have the biggest result. These are the white papers: if you go to the presentation itself, these are hyperlinks to all of our tuning guides and blogs; we have tons of write-ups and descriptions and documentation of how all this stuff works. And then, basically, I think we're pretty close on time. Okay, so I guess I ran out of time.

The last thing I was going to show here, which I don't have time for, is things we introduced in RHEL 8. Five-level page table support allows you to increase both the virtual memory size and the physical memory size. This is necessary because of what we call persistent memory, NVDIMM support, and NVDIMMs are much larger than RAM, by an order of magnitude or even more. What this means is that we're going to run out of virtual and physical address space, so in order to be able to deal with these much larger chunks of memory coming up in the very near future, we need five-level page table support.
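The address-space arithmetic behind that is easy to sketch. These are the standard x86_64 figures: four-level paging gives 48-bit virtual addresses, five-level paging extends that to 57 bits:

```shell
# Four-level page tables: 48-bit virtual addresses.
echo "4 levels: $(( (1 << 48) >> 40 )) TiB of virtual address space"   # 256 TiB

# Five-level page tables: 57-bit virtual addresses.
echo "5 levels: $(( (1 << 57) >> 50 )) PiB of virtual address space"   # 128 PiB
```

Each extra level of page table multiplies the addressable range by 512, the same factor-of-512 that showed up in the huge page sizes earlier.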
This is just a picture of what the page tables actually look like. I was going to talk about persistent memory, which can be used either as storage or as RAM, but I don't think I have time to go through this; these guys are giving me the evil eye. So, any questions? If not, I'll be outside if somebody wants to chat about this stuff. I can definitely give you pointers; everything that I talked about exists in much more detail, and there isn't enough time in a one-hour presentation to do anything other than skim this stuff, so there's a lot of reading left up to the