Okay, hi everyone. My name is Vlastimil and I work for SUSE, in the Labs department, in one of the kernel teams, and in the upstream memory management kernel community. We just had the LSF/MM conference earlier this week in this center, and some of the things I will be saying will just report on the outcome of the discussions we had there. I'm going to be talking about the slab allocators in the kernel, because a few years ago I became one of the maintainers of this subsystem and also handle the Git tree, and the best way to maintain something is to reduce its size to make it more maintainable. So that's my current project you'll be hearing about.

Just to get everyone on the same slab page: what are the kernel slab allocators? They are the equivalent of what you have in libc, like the malloc and free calls. You use them if you want to allocate something small, like a string or a C structure, and you either know the size in advance or you are given it by somebody else. The aim is these smaller objects that are smaller than a page, which is usually four kilobytes. Of course the API supports even larger sizes than that, but then it just offloads them to the page allocator, because that's more effective. So that's kmalloc and kfree, the direct equivalent of malloc and free. Then it turns out that many objects are allocated in large numbers but are the same kind of object, like network sockets, VMAs, dentries or inodes. It makes sense to create special caches for them: you can manage them more effectively and also account them more effectively, and you can debug them without debugging all other kinds of objects. For that there are also the kmem_cache_create, kmem_cache_alloc and kmem_cache_free calls as part of the slab allocator API.

What you usually want from a good slab allocator is for it to have a low memory overhead, so it shouldn't occupy much more memory than the sum of the objects that are actually allocated and used by the users of the allocator.
On the other hand, you want a low CPU overhead, and not to wait on shared locks that would induce latency. It turns out that these two objectives go against one another: to get good scalability, you have to pay some extra memory overhead, and that's one of the differences between the allocators we have today. You also want nice debugging features, because the allocator usually doesn't have bugs, but if somebody misuses it, like allocating an object and then freeing it twice, it can corrupt the slab allocator's internal structures as well. Then it looks like a bug in the allocator, but in fact it's a problem of the usage. So you want features that will help you debug these cases in a nice way, and also detect buffer overruns and things like that. That's what the poisoning and red zoning features are for.

To understand how we got into today's state where we have three allocators, as I will explain, I dug a bit into the history of the kernel that even predates the Git tree. Luckily there are archives that compile all releases into something that's also a Git tree, but not connected to the current one. One issue with them is that the individual commits are not as fine-grained and documented as today; it's usually one commit per release, with some comments provided for the whole release, which I guess were written by Linus, but in retrospect. So for example, the very first implementation was lib/malloc.c, very early, in 1991, by Ted Ts'o, which was also when he started contributing, according to the commit log. It says it was the basic implementation of the equivalent of malloc and free. It made the design decision that to free an object, you also had to pass its size, not just the starting pointer, because it didn't keep the sizes internally. So there was this free_s() call that took the size as a parameter, and it took years of trying to get rid of it. I'm glad we are not there anymore.
Then something called kmalloc.c was implemented two years later, which made the design decision that the size was prepended before the allocated object, so free no longer needed those two parameters. In 1997, we got one of the allocators that was still there until today: SLAB. It implemented something that was documented in an academic paper, which, as I was saying, made the observation that if you have many copies of the same type of object, it makes sense to manage them together in a cache created just for this object, because then you can manage them more efficiently. Everything is the same size, so you no longer need to track each allocation's size. And you can do more fancy stuff, like having a constructor that saves you from redoing the initialization: on allocation, if the object you are getting was already cleaned up on free and looks like a new object, you can skip this cleanup and re-initialization. That's when the kmem_cache_alloc, kmem_cache_free and kmem_cache_create calls were created. The first users were actually vm_area_struct, which is part of memory management, and struct sock for networking. There were also caches created for the generic kmalloc and kfree calls, but they were not used yet; that was still handled by the previous allocator in kmalloc.c. Just a few months later this was unified and the old kmalloc.c was deleted. Since then, if you do a kmalloc allocation where you give just a size, it will pick the one of those caches that has the closest size. There may be some fragmentation overhead, but it's easier than doing it another way. There was a bit more evolution since then that I guess is not so important; at some point NUMA awareness was added and has been there ever since. What's important is that in 2006 another allocator was added, SLOB, which was again more similar to the original ones: it supported just kmalloc and kfree and prepended the size to each object.
The use case was that some systems are too tiny to handle the overhead of the better allocator, so for them we do the simple thing that groups everything together. You pay with the worst scalability, but it probably doesn't matter on a small device which has just one CPU anyway. And finally, in 2007, we got the third implementation that's still there, from Christoph Lameter: SLUB. It does something very similar to the SLAB allocator, but with some different performance characteristics that I will explain briefly, and it also developed the best kind of debugging features that I mentioned. That was, I think, July 2007, and in October 2007 the new SLUB allocator, with the U, was already made the default. According to the commit description, the reasoning was: there are some reports that it is already the default, but that's not true, so let's make it the default. I'm not sure that was the best way to do such a change, but that's how it happened. We try to be much more careful today and do it after proper evaluation. For example, the SUSE kernel switched only several years ago, after we did some performance evaluation and decided it was feasible. Because in the beginning the SLUB allocator wasn't a clear win in every situation, even though that was its goal; workloads differ, so it's not an easy comparison. And because there was no clear winner, there were also some efforts in the past to create something so universal that it would be better than each of them separately. But that kind of ended when Linus said he doesn't want YASA, which means "yet another slab allocator". My work today is just a continuation of that, because I'm trying to get us back to a single one. Also, some features that were unique to one of them were ported to the other one, which creates just more code churn, and that's one of the reasons why I don't consider the situation ideal.
So the summary is that today, or actually not today but until last week, we had three allocators: SLAB, the first one that implements the separate caches; SLOB, which was intended for the smallest devices and made some sacrifices in performance, and which wouldn't be such a burden to keep, except that it prevents some nice API changes I will describe soon; and SLUB, which is the default, is somehow the best, and I hope is the one we stick to.

Just to make you a bit more informed, what are the main differences between SLAB and SLUB, even though they basically do the same thing? It's about how they handle caching of objects that are not yet used. For both, the main unit they work with is a slab page, which is taken from the page allocator. It can be four kilobytes or larger, and it's divided into equally sized chunks that are the allocated objects. One difference is that SLUB uses the free objects' content as the place to put a linked list that keeps track of where the free objects are on the slab page, whereas SLAB, with the A, has a separate array that does basically the same thing. That is one difference, but I think it's not the most important one. Then, when you have these basic units, the slab pages that you divide into objects and give out to the callers, you need to manage the slab pages themselves, which is again done similarly in these two allocators. You consider each NUMA node separately, because usually you want to allocate memory on the local node close to your CPU, so it makes sense to split it. So basically there are lists of these slab pages to make them manageable. The slab pages that are completely full are tracked separately, because you don't want to check them for free objects anymore when you allocate.
Most of them will be partially full, so they will be on the partial list, and if they are fully free you may cache some of them on an extra free list, until you realize you have too many and you should actually free the pages back to the page allocator. And there's always a spinlock that protects these lists. That's the basic scheme, and it wouldn't scale on modern multicore CPUs without extra stuff, because that spinlock would have to be taken for each allocation and free to protect the lists and the objects there. SLAB, with the A, solves this by caching the objects in a per-CPU cache. If you do an allocation and the array cache is empty, it will allocate multiple objects in a batch, and then the following allocations can very easily just take them from the per-CPU array. If you're freeing, you free just by putting the pointer into the array, until it becomes full, and then you free a whole batch back to the actual slab pages, which means you amortize the cost of the locking. There are also some other types of arrays that are needed to work well with NUMA, but those are not so important details; this is the main idea, the per-CPU arrays.

When SLUB, the one with the U, was created, one of the stated reasons was that these arrays occupy too much memory, so SLAB has a large memory overhead. Christoph Lameter instead did something else: he put a slab page into the exclusive use of each CPU, and only if you cannot use it, only if it becomes full, you go to the shared list and take a new one. Then you don't need to cache anything outside of the slab pages; all the objects are always tracked in the slab pages.
Also, to avoid some expensive locking, the way SLUB takes objects from this per-CPU slab is with a cmpxchg_double instruction, using an extra transaction ID value in the kmem_cache_cpu structure. One problem with this is that when you are allocating, you have this per-CPU slab which might have some free objects, and you can just take one from its freelist. But if a CPU is freeing an object, it's not guaranteed that the object actually belongs to the slab page that's private to that CPU. So the freeing is not cached as well as it would be with a simple array: you have to use these atomic cmpxchg_double instructions to free an object that belongs to another CPU, so you don't corrupt its freelist. That's why allocation is very fast, it's local. Freeing is very fast if it's on the same CPU, even though there's some cost to the atomic operation; but if you are freeing to another CPU, or to a slab that's on a list, it can suddenly become more expensive, because the cache line of the kmem_cache_cpu will probably sit on another CPU, so the coherency protocol has to do some work, or you might end up freeing to the list. And it's really common that allocations and frees happen on different CPUs, so that's why the SLUB allocator is not always a clear win.

Q: You said the slab pages are dedicated, but you're also saying that different CPUs can allocate and free from the same slab page, right? From the same slab, sorry.

A: They are dedicated for allocations, but other CPUs can free into them.

Q: So once those other CPUs free into it, the freelist of that CPU is pointing to the same slab, right?

A: No, that's not done this way, because that would be even more expensive, if you had to take over the whole freelist.
Q: Okay, so I guess I'm confused then. I thought the freelist pointer is per CPU, right? It's one pointer that points to one freed object on the slab, to the list. So you can have a slab with multiple freelists that belong to different CPUs, right?

A: Actually, the CPU that is allocating kind of privatizes the original freelist of the slab page, so there are no more free objects visible there. When other CPUs start freeing into that slab, they construct a new freelist that is part of the slab page.

Q: So strictly speaking it's possible the slab is shared; it's not really dedicated to one CPU, it can be shared when they free together?

A: They are dedicated for the allocation, and only until you exhaust the original freelist; then you have to do something else. It's quite complicated, and I didn't want to explain every detail, because I wouldn't be able to talk about anything else. But just to get the idea: one allocator uses the per-CPU caches, and the other one some rather complex things that avoid any extra arrays, but that has its own downsides as well. As I said, when SLUB was introduced, the argument was that without these arrays the memory overhead would be smaller. But it turns out that these days there are many CPU cores, so they all have slab pages private to them, and suddenly SLUB is the one that occupies more memory. As I learned, that's one of the issues for some of the people who were sticking to SLAB until now. So those were the main differences; of course there are many others.

Now, why don't I want to have multiple implementations? Because it's many extra lines of code that have to be maintained. Because the allocators are quite similar, we actually have a common layer that unifies some of the common allocation paths, but it means the common code sits in another C file than the actual implementation, so
there's an extra call from one C file to another that could have been inlined if there was not this common layer (or maybe, if you have link-time optimization, this is not a problem). The other option would be to duplicate the code in both of them, and that's not great either. And because we have these three implementations, there are features that somehow involve the slab allocators, and the implementers then have to decide: do I reimplement it for each of the implementations, or do I not care? For example, memcg kmem support was implemented in both of the main allocators, and KASAN and KFENCE also, which just meant extra work. Some features, like PREEMPT_RT, chose just the SLUB allocator, and that probably was much simpler.

The problem with SLOB is that even if we say, okay, this is the special-use-case allocator, it's very small, we don't need to reimplement all these new features for it, even then it blocks a useful API change that was requested at some point by the XFS developers, for example: allowing objects that are allocated from a specific kmem cache to be freed with the common kfree function, which is mainly intended for kmalloc. The SLAB and SLUB allocators don't mind if kfree is used also on objects from the special caches. The reason this is a problem with SLOB is, again, that it's the simplest allocator, which packs all the objects together in one page, because the ultimate goal is to have as little memory overhead as possible, and it tracks free objects in some kind of lists, but they have multiple sizes. When a block is allocated via some specific cache, and those are emulated on top of SLOB, they don't really use separate pages, then when we free the object through the kmem_cache_free function, we get the pointer to the cache, so we know how large the object is and we can free it. But for kfree, for kmalloc objects, we have to
again prepend the size, so we don't repeat the old mistake of needing the size parameter for freeing. And if we allowed kfree to be used on kmem_cache_alloc objects with SLOB, it would mean that the header with the size is not there for these objects, so it would probably corrupt the allocator.

Q: I'm wondering why they couldn't store the size in the free payload itself, because I know there are implementations that can do that. You need to only point to the first free object, and the payload already has the size; you could store the size somewhere in the payload and then go to the next one. That way you wouldn't need the header. But you deleted this already, right?

A: Yes. Of course there are ways to reimplement SLOB so it wouldn't have this problem. If we did it the very straightforward way, okay, so we prepend the header to all the objects, then the problem would be the extra memory overhead, which is even amplified by the fact that we have to guarantee some alignment, because potentially anything you allocate from kmalloc can be used in DMA, and then you have to maintain the DMA alignment, which on some architectures can be 128 bytes. That was attempted, but the result was that everything was suddenly larger, and if the goal of the allocator is to be as small as possible, that's not really great. So yes, as Joel said, there would be smarter ways to reimplement it, but at that point the question was: why don't we just delete SLOB? Does anyone still care about the tiny systems?

When this was proposed seriously last November, after the Plumbers conference, I did some research and I thought, yeah, it seems like nobody uses SLOB anymore, because we don't have these devices with just a few megabytes. I checked what OpenWrt does for the routers, which used to be one of the small devices, and it
turned out they switched to SLUB already, and their target devices have 128 megabytes, which is just fine with SLUB. What I didn't realize is that there might be defconfigs in the tree that say this kind of architecture or device should be built with SLOB. Somebody pointed that out, and yes, there were some of them. For most of them the answer was: this is fine, we will just switch to SLUB. Except one board with eight megabytes of memory, where the guy actually tried it and said: this was booting before with SLOB, but now I'm running out of memory. So I was thinking how to solve this, and it seemed the easiest way would be to remove some of the nice things from SLUB, all the per-CPU caching and the debugging support, which means I know it will not scale anymore, but that doesn't matter for such a tiny device. So I created a CONFIG_SLUB_TINY option, which just modifies how SLUB is compiled; it's not another implementation, it's just a feature of the SLUB allocator. And this was enough to solve the regression. So in 6.2, SLUB_TINY was introduced and SLOB was deprecated, and because no issues were reported since then, I went ahead and removed SLOB in the current RC. So 6.4-rc1 already has SLOB removed; I even installed it on this laptop, so this is already presented without SLOB. What this means is that we can now do this kfree and kfree_rcu freeing on kmem_cache_alloc objects, which wasn't possible before if you compiled with SLOB. So that's nice.

But I would like to go one step further and remove also the SLAB allocator, because, as I explained, it's not that different from SLUB, it just has fewer features. So, will anybody care about that one? This was also attempted a few times in the past, and one of the complainers was always David Rientjes from Google, who said: we have this degradation of netperf when we try to switch from SLAB to SLUB. At some point I also objected, because we were still using SLAB at SUSE, but it was not a hard objection; we
did evaluate that we can switch. Even in 2021 David still had the same reply, but things changed: SLUB gained a bulk allocation and freeing API, or actually both of the allocators did, again duplicated work, and this was enough to make the network-intensive workloads happy, so they are fine with using SLUB. So David had somebody at Google do some measurements, which he posted shortly before the conference, and it was mostly noise; you couldn't say there's a strong difference. But what came out was that there's, I think, around 30% more memory overhead with SLUB. Even though it's not so much in absolute numbers, because usually your memory is not mostly used by kernel objects, it's a concern that we should look into; but he said it's not going to block the removal right now. So we had this session at LSF/MM, and nobody objected there, so I'm going to propose it on the mailing list and see if anyone else who wasn't present in the room still has objections. Separately from that, we are going to look into what we can do about the memory overhead, and if somebody finds a regression that looks valid, we can look into how to change SLUB to accommodate it, because that's better than having two separate implementations.

My hope is also that once there's a single allocator, we can try to think of more improvements to the API, because we no longer have to care about three implementations. It turns out, for example, that there might be a use case to reintroduce some of the object caches that SLAB has, but maybe not for all caches, only for users that need them. That could be either for performance reasons, or because there are users that would like to allocate in NMI context, which is something that's not possible today with SLUB. For example, the BPF guys implemented their own allocator for that, and it would be great if the MM part could provide everything, so people don't have to
re-implement their own. Another use case that came up would be pre-allocations for the maple tree: sometimes the operations need to pre-allocate some nodes in case the tree gets more complex during the operation, and maybe we can do that in a smarter way than just taking objects from the allocator and giving them back. If they were somehow still part of the allocator, that would be more effective.

Q: I thought there was already a way to do that with kmem_cache, like to pre-allocate and then it gets it from... it was called kmem pool or something, something with the word pool in it.

A: You mean mempool? They exist, but I don't think they work when there can be multiple users. They can give you a guarantee that you don't run out of the objects until you finish your thing, but I'm not sure they work well with multiple such allocations in parallel, because there's nothing like the banker's algorithm that would say: I cannot give this one any more objects, because then I could run out of them. From this part, the outcome of LSF/MM was: yes, we should investigate how to do this. Actually, Joel also brought up another use case for this kind of caching: kfree_rcu, the variant that might sleep and doesn't work with an embedded rcu_head, has to create lists of objects that can only be freed after the RCU grace period passes, and they currently implement that separately, by allocating pages that serve as arrays of pointers. Again, if the slab allocator could do that itself, it would perhaps be more effective. So that's all from me, thank you, and now to any questions; there's an extra mic.

Q: Is CONFIG_SLUB_TINY something you would consider removing later on, or do you think that's a permanent fixture?

A: It's something we can consider, but I don't consider it urgent, because it doesn't really add any maintenance overhead. It just puts some pieces of code behind an ifdef, and those pieces of code are still there anyway, because they are also used
in the debugging caches, where you cannot use the per-CPU caching.

Q: I heard a lot about performance improvements, but we have another problem with slabs, with system stability. When we run our production, sometimes we see the slab size growing pretty fast, getting close to the limit, and then we get very nervous: we don't know whether it will keep growing or slow down, whether it's a bug or by design. My question is, how much can we trust that slab manages the memory correctly? And second, is there any accounting or tracing mechanism so we know who is allocating the memory, so we could do something like terminate the big consumer to keep the system stable?

A: The case where caches just grow is usually the fault of the user of the caches, not a problem of the allocator. One way you can verify that is to look at /proc/slabinfo and check whether there's a large difference between used objects and allocated objects. If it's not large, then the memory usage comes from the user. Then, for that cache, you can enable one of the debug config options that will track the calling stacks of the allocations and frees. You can only do that at boot time, not later; but if you have a suspicion of such a workload, you can reboot the machine with this and then observe it, and the overhead is not critical. Then you can fix that code. But finding and killing the offending process, that's probably not possible.

Q: The reason why we removed SLOB and then SLAB is because development has been focused on SLUB. If the higher memory usage is caused by the per-CPU caches, why not just turn SLUB into something like SLAB, and remove SLAB?

A: That's one possibility. If the overhead is really due to the design, we can adjust the design to be more like SLAB, at least for caches where we know that the
problem is happening. But I don't think we should switch the design completely to be like SLAB. Or maybe, if somebody tries that and the evaluation shows it's a clear winner and no other workload would regress, then sure.

Q: If it's okay due to time, a question related to the CONFIG_SLUB_TINY option: it was put in to replace CONFIG_SLOB, which was meant for normal-size systems, right? So it's just to fix a regression?

A: CONFIG_SLUB_TINY was really there to fix the regression for a non-MMU device with very low memory.

Q: Are there plans to move CONFIG_SLUB_TINY and other optional pieces, like maybe no caching at all for more deterministic systems, under a kernel-wide flag for embedded, or put it under no-MMU, so that it never shows up as a config option for, say, a server compile?

A: It's possible, but then maybe somebody would complain, so this is the most generic way. It's an extra config option, and Linus hates those, but the default is n. If the default was y, that would be a hard no.

Q: Okay, we still have time. With SLUB_TINY, can we still use all the sanitizers, like kmemleak and KASAN?

A: No, I think it's not compatible with those.

Q: So if we enable TINY, we cannot debug most memory issues?

A: I think KASAN itself has such a high memory overhead that you probably wouldn't be able to run it on the kind of system that needs a tiny slab allocator anyway.

Q: You said that at SUSE you switched from SLAB to SLUB and didn't see any significant regressions; what was the set of tests that was run?

A: That was done by Mel Gorman using his mmtests. I don't remember the exact configs, but I guess it was some representative set of the stuff we know the customers are usually running; I would have to check with him.

Q: I was thinking, and this might not work for many reasons, but about the problem with the per-CPU slab thing: you mentioned that
you can free objects on other slabs, and your freelist then points into those slabs. So in theory, if you had a long enough freelist pointing into other slabs, you wouldn't strictly need the per-CPU one, because you already have a freelist, right? You can start allocating from that. So maybe there are ways, something like what you did with the tiny SLUB, but using the common slabs and not having the per-CPU one. Maybe we can have something dynamic, where if we don't need the per-CPU one anymore, we can release those free objects and return them to the system.

A: There are definitely many things that can be tried, and I guess it would be best to discuss them offline, because right now we cannot imagine how the result would look. But thanks for the suggestions. Am I blocking somebody from trying on their laptop? When does the next session start?

Q: I have one question. CXL DRAM is going to be a different kind of system memory, with different performance characteristics. I'd like to know your opinion: does the slab allocator need to know whether it's allocating from near or far memory?

A: I don't think slab will become concerned with CXL memory, because it depends how it ends up represented in the kernel. If it becomes just another memory node which has a ZONE_NORMAL, then yes, the kernel objects could be allocated there and slab would just treat it as normal memory. But if it's ZONE_DEVICE only, then kernel allocations wouldn't even touch such a device, so slab would be out of the question on that memory. Until that is resolved, it would be too early to think about what slab should do differently; I guess we would have to try and see if there are any issues with the current implementation.

Okay, if there are no more questions: thank you all, and thanks for coming.