Let's see, does this work? OK. Well, slab allocation in the Linux kernel is a bit of an esoteric subject, because most user space programmers never see much of it. On the other hand, most kernel programmers constantly have to interact with it, because they need small objects for their kernel code. So it's probably wise to first talk about what this thing is and why we have it. Fundamentally, basic memory allocation in the Linux kernel works only at page-frame granularity. You can ask the page allocator for a 4K memory block, or a higher-order multiple of that. That's the basic operation of the memory subsystem. So if you want a small object for storing some metadata, you have to go to the slab allocator, which gets larger blocks from the page allocator, hacks them into pieces, and hands the pieces out to the various subsystems that need them. And these small pieces are needed all the time — for network operations, for example, you need a small block to store the network information: the scatter-gather list of all the components that go into a network packet, and so on. So the speed of the slab allocator is frequently important for the performance of other subsystems. I often think that some of the critical paths here are key to the performance of the system as a whole; we can see significant performance increases just by changing the latencies of the hot paths in the kernel. The slab allocator also does caching, because the page allocator is a very slow beast. You don't want too many trips to the page allocator, so we cache as many objects as we can locally and make sure they are easily available to service the requests of the subsystems. So having said that, there are some basic terminology issues here. We talk about "the slab allocator" or "a slab allocator".
If it's lowercase, "slab" refers to one of the allocators — and there are actually three of them: SLAB, the original slab allocator; SLOB, the simple list-of-blocks allocator; and SLUB, the unqueued allocator. So if we talk about "a slab", it could refer to a slab allocator in general, or to the SLAB allocator specifically. Or it could refer to a slab: a slab is a page frame that is used for slab allocation. So you have to be careful about what you actually mean when you talk about a slab, especially in email communications. It has been that way forever, and nobody has figured out better terminology, so that's what you're stuck with. This is how the system components interact to get to the slab allocator. User space code makes a system call — say, a file open. So the file system gets involved. The file system needs somewhere to store the file metadata for the file handle it is opening, and therefore it asks the slab allocator for a piece of memory and stores the metadata in it. The slab allocator may in turn interact with the page allocator, because it needs to acquire a large piece of memory that it can then hack into pieces. This happens with various subsystems in various ways, and often the subsystems use different allocation calls, depending on whether they need whole page frames or small blocks for metadata. These calls are what kernel code uses to do its allocations. kmalloc() basically tells the system: I want a memory chunk of this size, now give it to me — and you can free it again with kfree(). kzalloc() allocates in the same way, but the data is zeroed. Zeroing costs time and touches the cache lines, so be careful to only use it when necessary. You can also create special memory caches with special attributes — kmalloc() and kfree() work with generic default assumptions.
You can tell the slab allocator that you want data aligned to certain boundaries, that you want it ready for I/O, and other things along those lines. Then you have to use kmem_cache_alloc(), where you specify the cache. The slab allocators are also NUMA-aware, so you can indicate that you want memory from a specific node for performance reasons. And if you have any questions, just ask the question before we get to the next topic, OK? Now a bit of history on why these three things exist and how we got here with the three allocators. Initially, in 1991, what Linus did was just use the Kernighan-and-Ritchie allocator. It's actually in the book, Kernighan and Ritchie — a very compact allocator. It was there from 1991 until the end of the 1990s, when it was no longer sufficient. So Manfred Spraul and some others got together, read the paper that Jeff Bonwick wrote about how he did a slab allocator for Solaris, and used that as a guide to create the SLAB allocator. In 2004 I began to hack around with it and expanded it a bit to do NUMA, and it became kind of a monster. I tried to tame the monster. I couldn't do it. And so I came up with SLUB and called it the unqueued allocator, because the SLAB allocator resulted in an exploding number of queues — most of the memory seemed to be vanishing into some kind of allocation queue. The SLUB approach is to cut this stuff out. So we now have three different design philosophies for these allocators in the kernel. SLOB, the one that was there first, is as compact as possible. It doesn't waste any bytes at all, which makes it extremely compact. So it was very useful in the beginning, when we had small memory sizes, and it is still useful today for very small devices. The performance is a bit of a problem, but for a uniprocessor design it is ideal. SLAB was designed by various developers to be as cache-friendly as possible.
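To make the allocation calls from a moment ago concrete, here is a hedged userspace sketch. kmalloc(), kzalloc() and kfree() are the real kernel functions (declared in <linux/slab.h>), but the stand-in bodies below, backed by malloc/calloc, and the packet_meta example structure are purely illustrative — this only shows the calling pattern, not the kernel implementation:

```c
/* Hypothetical userspace stand-ins for the kernel's kmalloc()/kzalloc()/
 * kfree(). Only the calling convention matches the kernel API; the
 * bodies are mocks over malloc/calloc for illustration. */
#include <stdlib.h>
#include <stddef.h>

typedef unsigned gfp_t;
#define GFP_KERNEL 0u          /* mock of the usual "may sleep" flag */

static inline void *kmalloc(size_t size, gfp_t flags)
{
    (void)flags;
    return malloc(size);
}

/* Same as kmalloc(), but zeroed -- zeroing costs time and touches
 * cache lines, so only use it when zeroed memory is really needed. */
static inline void *kzalloc(size_t size, gfp_t flags)
{
    (void)flags;
    return calloc(1, size);
}

static inline void kfree(const void *p)
{
    free((void *)p);
}

/* Typical usage: a subsystem allocating a small metadata object.
 * struct packet_meta is a made-up example, not a kernel structure. */
struct packet_meta { int len; int proto; };

static struct packet_meta *packet_meta_alloc(void)
{
    return kzalloc(sizeof(struct packet_meta), GFP_KERNEL);
}
```

A subsystem would call packet_meta_alloc() when it needs a fresh, zeroed descriptor and kfree() when it is done with it.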
And it was especially designed to run benchmarks very well — much like Solaris. What it tries to do is guesstimate which objects may still be hot in the CPU caches, and it keeps them in queues. This guessing causes a huge number of queues to develop where these objects are stored, and they are constantly cycled and processed — every two seconds they are expired and moved around. There's a lot of fun going on there. And I had enough of that; that's why SLUB came along. Instead, just make it as simple as possible and make sure that the instruction counts in the critical paths are very low. One of the problems that I had with SLAB was that we had constant failures where we were not able to debug the situation. There would be a message: okay, somewhere in the kernel, in the SLAB code, a function failed. But it was really due to some subsystem corrupting the SLAB data in some fashion. So what I added to SLUB is sophisticated debugging. In particular, with SLAB I was not able to enable the debugging, because it required a recompile of the kernel. And I had these huge supercomputers to run, and when they failed, they failed, and I only had a core dump to work with. So I had to wade through a core dump of gigabytes of kernel data to figure out what was going on; I could not restart these machines in a debugging mode. With SLUB, you can give a kernel command-line parameter, even on a production system that doesn't have debugging compiled in. Just specify the parameter and it will instrument all your allocations, and usually it can find, even on a production system, the values or the subsystems that caused the corruption. And moreover, it's self-healing: if it finds a corruption, it can often repair the corruption so that the kernel can continue.
So your production system will still be operating, although at somewhat reduced speed, because it constantly needs to check its operations. And I thought that was very important. The SLUB allocator is currently the default allocator, and most of the distros use SLUB. So we have a timeline here. 1991, the initial Kernighan-and-Ritchie allocator. In 1996 the SLAB allocator gets developed, and in 1999 it becomes the default. In 2003, SLOB is put in. That was because someone felt that SLAB was too much overhead, so they reintroduced a simple allocator like the one that was there before and called it SLOB. Then in 2007 I released the initial version of the SLUB allocator. And then there was a competition, because I said that SLUB is more space-efficient than SLOB. The SLOB guys got motivated and redesigned it a bit so that it could compete with SLUB. I still say that SLUB is better space-wise, but there are certain cases in which SLOB is better. So it stayed in the kernel — I threatened to throw it out, but it had to stay. Okay, then there were still some performance regressions, and so in 2011 I did the SLUB fast-path rework that made it much faster than SLAB. And then in the last two years we have had the move to extract common code from all the allocators. We have three allocators, and there is shared code, so I created a common slab framework, and we are gradually moving pieces into it, leaving only the allocator-specific data structures and definitions in the file for the respective allocator. Then in 2014 we got a new, talented maintainer, Joonsoo Kim, and he figured out he could make SLAB faster by adopting principles from SLUB. So now we have the SLUB-ification of SLAB. That increased performance significantly — but SLAB is still way off on the fast path. Some of the important people involved here: Manfred Spraul was the guy who initially did the SLAB work.
He has retired from that — he's still involved in some other issues, but no longer with the slab allocators. Matt Mackall brought SLOB in, and he's the one maintaining it. Pekka Enberg has been active in various allocators and has done a lot to improve the code. I've done the SLUB allocator and the SLAB NUMA stuff, and I'm generally dealing with the issues that arise with these allocators. David Rientjes from Google has been using the allocators extensively, tests them extensively, and has also developed his own variant. And then we have Joonsoo Kim. He just became a maintainer, and he's a very talented guy — active for three years now, contributing to the code, and most of the bright new ideas come from him now. Then we have contributors that are no longer there: some people from India who did the basic design of the SLAB NUMA code; Glauber Costa, who added a lot of the cgroup stuff that we're using today; and Nick Piggin, who has written numerous slab allocators that were never merged, and who also implemented the SLOB NUMA support. This means that even the simplest allocator can now give you memory on a specific node if you want that. Okay, so what I want to do now is talk about the basic structures of these allocators, starting with the simplest one and going to the more complicated ones. SLOB basically manages lists of free objects within the space of the free objects themselves. In order to maintain the lists of free objects, the data that is actually free is used — so you don't have any overhead; the overhead exists only while objects are free. This makes it very compact. On the other hand, this also means that if you want to allocate an object of a certain size, you need to traverse the list to find a hole of the size you need. And if the hole doesn't match that size exactly, you cause more and more fragmentation.
So if you use SLOB and you have a load that allocates and frees objects at a very rapid rate, you can fragment memory very fast, and you can run into out-of-memory situations although you still have huge amounts of memory free. So SLOB should only be used for small deployments that don't do too much with memory. The SLOB object format has basically two forms. One: you have a payload, which we may have to pad to get the alignment right. That's the object when it is allocated and in use. If it's not in use, the free space stores the size of the object and the offset of the next free object. In case the object is very small, we can just store the offset alone. That way even the smallest object can be used to store the free information for the next one — if the size is just one unit, you don't need the size field, only the offset. And that makes it possible to use the payload to store the information about the next object and create the lists that you scan. All slab allocators must work on page frames, and they must have a page-frame format; this is the way SLOB handles the page frame. In the page-frame descriptor you have the pointer to the free list in the page frame. Inside the page frame, each free hole holds its size and the offset of the next free hole. So you have the green zone as a free object, then an allocated object, another free section, an object, and so on — each free area just points to the next free area in the page frame. You can go to any page frame, look at what is available, and find a hole that fits you. Notice that the sizes are different. This causes fragmentation; on the other hand, it gives a good fit on the initial allocation, because you can fit these holes exactly to the memory sizes you need, which is different from the other allocators.
And then there's also a global descriptor table, and for SLOB, because of the fragmentation, an optimization was done to categorize allocations by size: small allocations, medium ones, and larger ones. That reduces the fragmentation effect a bit in SLOB. And then you have here the complete SLOB format, how it handles memory. The next one is SLAB. It has queues to track cache hotness. These are lists of objects, and the queues exist per node and per CPU — and also for each remote node, because if you want to do free operations to a remote node, you cannot simply take a lock on the remote node; you need to queue the objects first, aggregate them, and then do an aggregated free. So on systems with a lot of CPUs and a lot of nodes, you get an explosion in the number of queues in the system. It has very complex data structures that I'm going to show in the next two slides, and I'm not sure how deep I should go into those data structures — I just want to give you an overview here. The SLAB allocator does object-based memory policies. So if you are on a NUMA system and you want to spread objects across all nodes, it will take a different node for each object. And this means there is overhead in managing the NUMA locality for each object that you allocate. That is one of the reasons why SLAB is slow. And one of the key things here: in order to keep the data structures and the queues ready for allocation, it has to track which objects are cache-hot. That means every two seconds it assumes the objects have cooled down a bit, goes through all the queues, and expires the objects that are older. So if you have a real-time system or an HPC system, the SLAB allocator will kick you every two seconds with a large scan of its own metadata and interrupt you for a significant period of time.
We had one of the systems at NASA Ames running this thing with 4K CPUs. And the timer processing is spread out over all CPUs, right? So with a two-second interval and 4,000 CPUs, something is always going on. The MPI jobs were delayed by about 50% just because of the cache-hotness optimizations in the SLAB allocator. The other effect was that, because there was such a huge number of nodes, about 5% to 10% of the memory of the system was gone on boot-up, because all these queues were pre-allocated. And if you used the system intensively, the amount of memory lost to free objects sitting in queues increased significantly. So this was a case of the vanishing memory — it vanished into the queues — and that was not too good. That's one of the primary motivations I had for writing the SLUB allocator. The SLAB allocator has very sophisticated free-list management. Remember that in SLOB, a free object holds a pointer to the next free object; that means you have to access multiple cache lines, and we want to avoid that. So what SLAB has, at the beginning of the page frame, is a table of all the free objects. You can just take the first n entries and get the pointers to the first five free objects, for example, without touching the objects — you're only touching one cache line. That was also done for benchmarking purposes. And today we have Joonsoo's optimization, where each of these entries is a single byte, so we can index up to 64 objects from the first cache line without touching the objects. The other allocators would have to do 64 cache-line accesses. That is a key optimization. On the other hand, this also means that there's overhead in the page frame now: more memory is lost, because you have to set memory aside for the extra free lists. The objects need to be aligned at certain boundaries, so you have to add padding in various locations. And SLAB also supports coloring.
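The byte-indexed free list can be sketched like so. struct slab_page and the function names are made up for illustration, but the point matches the description above: allocation reads only the one-byte index table at the start of the page and never touches the objects themselves:

```c
/* Sketch of the SLAB freelist-index idea: a table of one-byte object
 * indices at the start of the page, so grabbing the next free object
 * touches only that one cache line. Names and layout are illustrative. */
#include <stdint.h>

#define OBJS_PER_SLAB 64     /* 64 one-byte entries fit one cache line */

struct slab_page {
    uint8_t freelist[OBJS_PER_SLAB]; /* indices of free objects */
    unsigned active;                 /* number of objects handed out */
    /* ... in the real page, the objects themselves follow ... */
};

static void slab_page_init(struct slab_page *s)
{
    for (unsigned i = 0; i < OBJS_PER_SLAB; i++)
        s->freelist[i] = (uint8_t)i;
    s->active = 0;
}

/* Allocation just reads the next byte-sized index from the table.
 * Returns the object index, or -1 when the slab is full. */
static int slab_page_alloc(struct slab_page *s)
{
    if (s->active >= OBJS_PER_SLAB)
        return -1;
    return s->freelist[s->active++];
}

/* Freeing recycles the slot LIFO, so recently used (cache-hot)
 * objects are handed out again first. */
static void slab_page_free(struct slab_page *s, uint8_t idx)
{
    s->freelist[--s->active] = idx;
}
```

Compare this with the SLOB sketch earlier: there, every step of the free-list walk touches a different part of the page; here, the whole walk stays inside the index table.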
This means that if the cache associativity is of very low quality, it can avoid placing objects always at a zero offset. And that also causes additional waste of memory. So SLAB is the most memory-inefficient allocator that we have. Then we have the SLAB object format. It's very simple: you have the payload, and if the object is not allocated, it's just free and never touched. It also supports debugging modes — if you switch on the debugging modes, you need to recompile the kernel for that. You get a red zone where you can check that the subsystem didn't write past the end of the object, it can track who was the last one to touch the object, and it has some padding at the end, as they all do, for alignment purposes. And so this is a brief look at the structure of the page frame. Every CPU has an array_cache structure that is per CPU. The array_cache structure tracks pointers to free objects that are cache-hot and could be used later. These pointers can point to free objects in multiple slab pages — there's no association here — and this means that locality of access is not preserved in SLAB. Then there's the page-frame descriptor, from which you can figure out where the objects start in the page frame and where the free list begins, the one you index to find the free objects. And it's very interesting: an object that looks allocated in the page frame may actually be free, because it is sitting on one of the queues. So it's very difficult to figure out, by looking at the page frame, whether an object is actually in use or just on some of the queues. And the queues are massive and diverse — that is one of the challenges here. So this is the full picture. We have a cache descriptor that has the cache arrays for every CPU. Then we have the per-node data, where we have per-node groups of these page frames, and the per-node data contains lists of these objects.
There is also a shared structure: if you have multiple per-CPU queues, there is a shared structure where you can free an object from the local per-CPU structure into the shared structure, from which it can come back to another CPU — because the L3 cache covers multiple CPUs, right? So it reflects that as well. We are attempting here to build an object queuing structure that mimics the cache hierarchy of the system, tracking which cache an object may be in and which object is most advantageous for us to hand out. And then there are the alien caches for remote freeing, which is the structure that grows the most with the increasing NUMA capabilities of the system. Then we have SLUB. The motivation here was: enough of the queuing. Therefore we're going to do an unqueued allocator and just cut all that stuff out. We have to have something like a queue, but this is just the linked list of the free objects within each slab page. So it's not that each CPU has a queue; each CPU has a slab page from the page allocator, and we allocate objects from within that page. And so if you do multiple allocations with the SLUB allocator, you get objects that are in the same slab page. This also means they are covered by the same TLB entry, and you minimize the amount of TLB thrashing. We also have the ability to keep more of these per-CPU pages if necessary for tuning purposes — we found out that sometimes more than one page per CPU helps, and that can be tuned. The fast paths have been written in such a way that they do not disable interrupts like SLAB does; they just use very fast per-CPU atomic operations to do their duty. And objects are not allocated based on per-object memory policies. With SLAB, if you have an interleave policy, it goes from one node to the next for each single object. If you do it with SLUB, it will exhaust one page and then take the next page from the next node, and so on. So it's page-based policies, not object-based policies.
That is a bit different and may cause a different NUMA effect if you are relying on that. SLUB also supports defragmentation on multiple levels. There was just an LWN article published about what I said on Monday, which was very inaccurate, because this defragmentation stuff has been there since 2008 and is working in production systems. So this is the current default allocator. These are the SLUB data structures. Let's talk about the object format first. You have the payload, and, the same as in SLAB, a red zone to verify that nothing wrote past the object. Then it has extensive tracking and debugging information: you can figure out when this object was last allocated, where it was allocated, which CPU it ran on, which PID was running, which NUMA node it came from, and so on. So it's very much expanded there. If the object is not allocated, then we use the payload for the pointer to the next free object — similar to SLOB. And we can also poison the object when it's free, so that we can detect any access to freed data. So now, the page frame. A free object has a pointer to the next free object, and if you look at this, you see that there are actually two free lists in operation on this page. On one hand, there is the page descriptor, which has a free list over there. That free list is protected by a lock that has to be taken in the page struct, and so this requires locking operations. It's expensive, but it allows shared access to the queue in the page. And hopefully, in the common case, different objects are on different pages, right? This is only there for the case where two objects have to be updated in the same page from different CPUs. So you have one free list here, and you have another free list that is linked to the kmem_cache structure.
So when you first dedicate a slab to be the per-CPU slab, SLUB takes the whole free list off the page frame and puts it into a local structure. And now you can use this free list without locking — it essentially allocated all objects in the page to that CPU, and the CPU can allocate one object after another without taking any locks. When the per-CPU slab has to be retired, the remaining objects have to be spliced back into the page struct. But for the time that the per-CPU association exists, you have two free lists going on. So what can happen is: while objects are being allocated through the per-CPU structure, the struct page can receive remote frees from other nodes or other CPUs at the same time, without interference. And that often happens. A processor allocates objects, then does something that shifts them to another processor, which processes the data and frees it, and the object comes back into the page's free list. Once the per-CPU free list is exhausted, the slab allocator will consult the struct page and see whether any new objects have been freed back to the page. If so, it will stick with the same page and just take the objects that were remotely freed and repossess them. That makes it very effective — you get kind of a circular flow going with neighboring CPUs. SLUB also has much more advanced diagnostic tools. People often don't know about this, but there is a slabinfo tool available in the kernel source tree. If you compile it, it can do amazing things that you may not have heard of before — maybe today is the first time you hear of it. It can query the status of slabs and objects in detail and show you how much fragmentation is occurring and what happened to your objects. It can control defragmentation and object reclaim in some fashion. It can also run verification passes over your slab caches.
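A minimal sketch of the per-CPU free-list pop and push just described — the pointer to the next free object stored in the free object itself, no locks taken. The real fast path uses this_cpu_cmpxchg_double() against struct kmem_cache_cpu; the names here are simplified stand-ins:

```c
/* Sketch of the SLUB per-CPU freelist: each free object's first word
 * points to the next free object, and the owning CPU pops objects
 * without taking any lock. Illustrative only -- the kernel's version
 * lives in mm/slub.c and is lockless via per-CPU cmpxchg. */
#include <stddef.h>

struct kmem_cache_cpu_sketch {
    void *freelist;            /* next free object in the per-CPU slab */
};

/* Pop one object: read the next-pointer stored inside the free object.
 * Returns NULL when the list is exhausted (the slow path would then
 * refill from the struct page's free list). */
static void *slub_alloc_fast(struct kmem_cache_cpu_sketch *c)
{
    void *object = c->freelist;
    if (!object)
        return NULL;
    c->freelist = *(void **)object;
    return object;
}

/* Push an object back onto the per-CPU list. */
static void slub_free_fast(struct kmem_cache_cpu_sketch *c, void *object)
{
    *(void **)object = c->freelist;
    c->freelist = object;
}
```

Because the list links live inside the free objects, a slab page full of free objects needs no external queue memory at all — that is the "unqueued" design point.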
If you suspect that something is corrupting your data, you can actually give a command to the slab allocator to scan all the objects and verify that their integrity is intact. What? Yes — there's an extensive sysfs hierarchy under /sys/kernel/slab describing each of the slab caches in detail, and the slabinfo tool interacts with that hierarchy and processes the data. The verification passes depend on how much debugging you have enabled; they use all the debugging information available to do as strict a check as possible of all the objects that are in there. You can tune the slab caches while the system is running. You can actually enable and disable diagnostics while the system is active, and you can modify the slab caches on the fly. The tool lives in the kernel tree at tools/vm/slabinfo.c — compile it and look at these various things. I'll show some of the output now. This is the basic slabinfo output. You can see the slab caches, how many objects are in them, what the sizes are, how much space we have, and where these objects are: are they in the per-CPU cache, are there partial slabs? You can see that there's a lot of fragmentation, and then it calculates the fragmentation overhead being created right now and the efficiency of the allocator. And notice that some of the names here are kind of abstract. The allocator merges multiple slab caches that have the same characteristics into a single one. That increases the cache hotness of the objects and avoids overhead that would otherwise exist in the system. And that's why they have these ugly names. And then the totals — this shows you what's going on in the system. This system has 112 slab caches, and 189 caches were aliased into them; almost every second cache was merged into another one because it was the same as an existing one.
And you can see how much is lost here — over 8.5 MB simply empty because of fragmentation. And there are other statistics here as well. Aliasing: this shows you which slab caches had the same characteristics and how the system merged them together. Some sizes are just popular — the 64-byte cache here has a huge number of users; various subsystems create caches of that size, and they all got merged into one. This merging is only active when you are in a production configuration. If you enable any debugging mode, it's disabled. And you can also say slab_nomerge on the kernel command line, and it won't do it. But for production configurations, this gives you the fastest performance and the most efficient memory use. Then there is the enabling of runtime debugging. I think this is key for production support: if you have a strange message from the slab allocator, it's usually not the slab allocator that's wrong. Usually something has gone wrong with the metadata because a subsystem has written over the boundaries of an object. The debug support is compiled in by default in all distros. People often don't realize that, and I get a message: "oh, your allocator is failing." And I say, okay, reboot with this option — and then I hear silence, because the option gives enough information to debug the problem, and it becomes obvious what the problem is. So there's a slub_debug kernel parameter that you can specify on the kernel command line. If you do that, it enables all the debugging facilities and does the utmost possible to figure out what went wrong with your system. Do that, and you will have an excellent report on what's going on. You can also get more sophisticated and only do certain things. Say you only want to check certain caches, or you only want certain checks — then you can specify additional parameters.
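As described, the debug facilities are driven entirely from the kernel command line. A few example invocations, with the flag letters as given in the kernel's SLUB documentation — F sanity checks, Z red zoning, P poisoning, U user tracking; the kmalloc-64 cache name is just an example of restricting debugging to one cache:

```
slub_debug                # enable all debug facilities for all caches
slub_debug=FZ             # only sanity checks and red zoning
slub_debug=P,kmalloc-64   # poisoning, restricted to a single cache
slab_nomerge              # disable the merging of similar caches
```

These take effect at boot, with no recompile — which is exactly the point made above about debugging production systems.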
Say you only specify F — that's just sanity checking, which is cheap and often done. But otherwise I have rarely seen a reason to work with these fine-grained options at all, except for one case: a broken system where you know debugging fixes it, because debugging also enables resiliency and will repair the situations that it can repair. If you want that but still need reasonable performance, you may want to disable as much of the debugging as possible while still keeping the system intact — and with these options, you can do that. You can then tune the system so that it still works, although it's broken, and it will keep working until you are given a fixed kernel. And this is a sample error report after you enable debugging. It means somebody freed an object that was already free. And it tells you where it was allocated — this is RT64 PCI, so some kind of network driver did that. It then shows you the object, what the content of the object is, gives you a backtrace of where this occurred, and then it fixes it. The fix in this double-free case is: we don't perform the free, we just ignore the request. This is pretty safe. If something is corrupted, it will try to restore as much as possible. Now, some comparison of the various allocators. SLOB is usually the most compact. SLAB queuing can get intensive — it grows essentially exponentially with the size of the machine. In SLUB you have data reduction through the aliasing of slabs, and SLUB reduces the cache footprint. So there's a lot going on there. I just booted three systems and looked at how much memory is being used. SLOB doesn't even export the numbers here, but it uses only about 300 kilobytes of data, whereas the others have much more data in play. And you can see that the unreclaimable data is pretty high in SLAB. This is because of its many data structures — it's actually about four megabytes more.
So in performance comparisons: SLOB is slow, SLUB is fast. For benchmarking, SLAB is fast in terms of cycles, and because SLAB can track the L2/L3 cache state better, in some situations it is faster — but in general, SLUB is faster. I did some synthetic tests of the slab fast-path and slow-path performance, and you see that the cycle counts of SLUB are the smallest — about 50% faster than SLAB. And especially in the concurrent case you have significant benefits, because the locking doesn't exist in the same way it does in SLAB. And hackbench, which is usually a good test to see how a slab allocator impacts general system performance — you see it's generally faster. I couldn't run SLOB because it failed: hackbench causes a lot of allocation and freeing, and SLOB didn't survive that. It probably ran out of memory, or something in the NUMA handling failed. Remote freeing is something really important, because often a subsystem allocates descriptors and then hands the job off to some other subsystem running on some other node, and it comes back to you later. That is also where SLUB shines. Yeah, I'm almost done. So, the future roadmap. We're still continuing to extract the code that is common between the allocators, and I wish we would do something more on the defragmentation front. The SLAB fast path now also relies on per-CPU operations — Joonsoo copied the SLUB approach over to SLAB — and we just did some SLUB fast-path cleanup; Joonsoo did that too. The preempt enable/disable was moved under a config option for better preemption performance. The RT kernel is very much dependent on the SLUB allocator; they got a 40% improvement just by switching to the SLUB allocator because of the efficiency of the fast path. How much time do I have left? Yeah, okay. So, any questions? Questions? We have time for a few questions. Everybody's happy and satisfied — that's good.
All right, thank you for coming.