I'm Vlastimil, and I maintain the slab allocators. The SLOB allocator had been deprecated, and in 6.4-rc1, when it came out, SLOB was finally removed. If anyone needs to use the kernel on some small system, they have the alternative in the SLUB_TINY configuration option. And why this is relevant is that it means we can now call kfree() on objects that were allocated with kmem_cache_alloc() as well as with kmalloc(), which is something the XFS people were interested in at some point. It also means that kfree_rcu() and kvfree_rcu() should work as well. And there should be no downside, except, after Joel's session, it means that if we implement something for the kfree_rcu() handling use case, it will have to be supported by all caches and not just the kmalloc caches, because somebody might call kfree_rcu() on some non-kmalloc objects, so we'll just have to keep that in mind. Another possible surprise is that somebody might try to use the tracepoints, which are separate for kfree() and kmem_cache_free(), and if they don't enable both, they might miss some free that went through the other API. So if you are tracing slab allocations, you should keep that in mind, but hopefully that's not critical, user-space-breaking behavior. So with SLOB out, that's fine, but that was the allocator I'm able to pronounce well. There are still SLAB and SLUB, and I never know how to pronounce them properly, so the next step is that I would like to remove SLAB as well. Besides that reason, the other reason is, of course, that it's another thousand lines of code to maintain. And sometimes we find out that some feature of it, like the debugging, has been broken for years, because nobody uses it, so we can have unknown breakage. Another thing is that for the two allocators we maintain the common layer, which is another bunch of lines.
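To illustrate why a single allocator can accept kfree() on objects from any cache, here is a minimal userspace Python model (an illustrative sketch, not kernel code): a single free routine looks up the cache that owns an object, much as SLUB derives the cache from the object's slab page, so one free entry point serves both the kmem_cache_alloc() and kmalloc() APIs.

```python
# Minimal userspace model (not real kernel code) of the point above: the
# allocator can map any object back to the cache that owns it, so kfree()
# works on kmem_cache_alloc() objects as well as kmalloc() ones.

class KmemCache:
    def __init__(self, name, size):
        self.name = name
        self.size = size

class Allocator:
    def __init__(self):
        # Stand-in for the slab-page metadata SLUB consults on free.
        self.owner = {}  # object id -> owning cache

    def kmem_cache_alloc(self, cache):
        obj = bytearray(cache.size)
        self.owner[id(obj)] = cache
        return obj

    def kmalloc(self, size):
        # Generic allocations come from internal "kmalloc-<size>" caches.
        return self.kmem_cache_alloc(KmemCache(f"kmalloc-{size}", size))

    def kfree(self, obj):
        # One free path for both APIs: find the owning cache, then release.
        cache = self.owner.pop(id(obj))
        return cache.name  # returned here only for illustration

a = Allocator()
inode_cache = KmemCache("inode_cache", 640)
obj = a.kmem_cache_alloc(inode_cache)
buf = a.kmalloc(64)
print(a.kfree(obj))  # frees a kmem_cache_alloc() object
print(a.kfree(buf))  # frees a kmalloc() object via the same entry point
```

The same lookup is also what makes kfree_rcu() on arbitrary cache objects feasible: the deferred free does not need to be told which cache the object came from.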
And with this common layer, it's always a trade-off between duplicating code that can then be nicely inlined, or having the common layer in another C file, which then has to call into the implementation-specific file, and those calls cannot be inlined without link-time optimization. Then there are features that either had to be implemented for both allocators or were implemented for just one of them. For example, the memcg support was duplicated work, because it was done in both. For PREEMPT_RT, only SLUB is supported. And the last thing is that, once we have a single allocator, it's much easier to think about and implement more API improvements, like the one Joel was talking about, and I will show some more later. It's also not common in the kernel to have multiple implementations of the same thing. It made some sense for SLOB, which had a very specific use case and was sufficiently small, but it was still blocking some API progress. The only other example I can think of is the zswap backing stores, of which we also have multiple, but those are pretty small, self-contained things, while kmalloc() is something everybody uses. So this is not the first time SLAB has been proposed for removal; I found at least three cases when it was discussed on the linux-mm mailing list. And usually the main reason for rejection was that there are workloads that would regress when switching from SLAB to SLUB. So I guess the question is: are there still objections today? And since the Google guys were among those who always objected, I guess it's a question for them. Yeah, thank you. From Google's perspective, I think that with the introduction of per-CPU partial slabs, SLUB has come a long way toward addressing the point from 2012 about the 10% performance degradation, at least for TCP_RR. And what my colleague Binder did, in the last bullet point, is that he has recently compared these results and posted them.
I think performance can go in either direction; whether SLAB or SLUB is better depends on the microbenchmark and on some well-known open-source workloads. The big thing he is calling out is the metadata overhead, or rather, because SLUB uses much larger page orders and has the per-CPU partial slabs to win back a lot of that TCP_RR performance, he finds that on some workloads we see up to a 300% increase in the amount of memory that is set aside by SLUB. So I don't think there's an objection to removing SLAB for the cleanup. I think a lot of folks, because SLUB has become the default, are perhaps unaware of the much larger memory footprint of SLUB, and now that we have our attention focused on it, we can make some incremental progress in that direction as well. So I wouldn't have an objection, from Google's perspective, to removing SLAB, but I'm happy for others to talk about it as well if anybody else has a thought. Just for background, we were using SLAB for a long time in SUSE kernels, and when we were evaluating the move to SLUB, Mel did quite a lot of benchmarking and found out exactly the same thing: it can be some plus for one workload, some minus for another, but essentially on par. So back then, our main reason to move to SLUB was debuggability, because with SLAB you had to recompile and hope for a reproduction, while with SLUB you can enable everything at runtime, and that's a huge advantage. That was the main decision point for us to move away from SLAB. So those two main reasons from 2019 are gone. Unless there is somebody very married to SLAB, I guess we should just go ahead, remove it, and fix the problems that David has mentioned incrementally. When I was at Intel, I was working with a team that did TPC benchmarking for an unnamed commercial database, and SLAB was, at the time, better performing. That same unnamed commercial company has since done its TPC benchmarking with SLUB enabled.
So clearly the rationale for using SLAB for TPC benchmarking must have gone away for that to have happened. I wasn't at that company when it made the choice, and I don't know what the decision-making process was, but clearly it was made. But does somebody online want to comment? Yeah, this is Greg, also from Google. I just had a question. I agree with what everybody is saying, but it's sort of a wash depending on which direction you look at it. In terms of the memory overhead, though, are these structural things, or are these the kind of known matters that we can chip away at? Because it feels like everybody in the room is getting more and more sensitive to the cost of memory spend, so maybe the memory overhead is even more critical than the performance parity. Yeah, I was actually surprised that the memory overhead was brought up as the main objection this time. I think the results Binder posted showed something like 30% more, or something like that. But it's 30%; is it so large in absolute numbers as well? I mean, yeah, the benchmarks Binder ran were a complete workup, and we're thankful for it. On some workloads, yes. Just in boot overhead, on a regular off-the-shelf machine, I forget whether it was Skylake or Cascade Lake, he was seeing that just after boot SLUB was taking 110 megabytes more than SLAB. But it depends on the benchmark, and it scales with cores, of course, because of the per-CPU partial slabs. That is how we addressed the TCP round-robin problem to get that performance parity back. But at the same time, with very large core counts, 160 or 192, it depends on how many slab caches you have. And that's why we're excited about the third bullet point here, which is the per-object kmem accounting that we now have, where we can share those kmem caches together.
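The cache sharing enabled by per-object accounting can be made concrete with a toy userspace model (illustrative only; names and numbers are invented): with per-cache accounting, each memcg needed its own clone of every kmem cache, while per-object accounting keeps one shared cache and records each object's owning memcg.

```python
# Toy model of why per-object accounting lets memcgs share kmem caches:
# ownership is recorded per object, so one cache can serve all cgroups,
# instead of cloning every cache per memcg as the old scheme did.

def caches_with_per_cache_accounting(nr_caches, nr_memcgs):
    # Old scheme: each memcg gets its own clone of every kmem cache.
    return nr_caches * nr_memcgs

class SharedCache:
    def __init__(self, name):
        self.name = name
        self.charged_to = {}  # object id -> memcg: the per-object record

    def alloc(self, memcg):
        obj = object()
        self.charged_to[id(obj)] = memcg  # charge this one object to memcg
        return obj

# With 100 caches and 50 memcgs, the old scheme needs 5000 cache clones...
print(caches_with_per_cache_accounting(100, 50))  # -> 5000
# ...while per-object accounting keeps just the 100 shared caches.
cache = SharedCache("inode_cache")
a = cache.alloc("memcg-A")
b = cache.alloc("memcg-B")
print(cache.charged_to[id(a)], cache.charged_to[id(b)])  # two owners, one cache
```

Fewer caches also means fewer per-CPU partial slab lists overall, which is why this helps the memory-footprint problem discussed above.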
And as a result, I think we are willing to fine-tune SLUB to use not only smaller-order pages for its slab caches, but also to shrink the number of per-CPU partial slabs depending on the core count, so that it can be minimized, even if it doesn't reach the same level as SLAB. I agree that it's a structural problem, because, as you said, SLUB achieves its performance by having many per-CPU caches. And it's actually interesting, because I think that when Christoph Lameter introduced SLUB, one of his points was that SLAB caches too many objects in its array caches. But as time went on and the number of cores increased, we got more cached slabs, because they are per-core, and the default algorithm that selects the slab order also takes the number of CPUs into account, which makes the problem even worse. So if we had to deal with this, we would have to achieve the performance by something other than having multiple per-CPU slabs, which probably means going back to something like the array caches. I just wanted to mention that, as far as I understand, the array cache takes up space inside the slab page itself. It can, but the problem was not the array itself but how many allocated objects it caches. Since the array is missing in the SLUB allocator, wouldn't you be able to store more objects per slab, because the space that went to the array can now be used for objects? But yeah, I need to look into the per-CPU partial slab thing. Yeah. Yeah, for the per-CPU partial slabs, the documentation in the kernel tree is actually wrong; it's not just a read-only value, you can write to it. As his analysis shows, you can shorten that list; it's under the control of root. One thing, from at least Google Cloud's perspective, about why we really, really want to move to SLUB is that it solves our CPU jitter problem.
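As a concrete illustration of that tunable, per-CPU partial caching can be inspected, and shrunk by root, through sysfs. A hedged sketch (the path follows the standard SLUB sysfs layout; the cache name here is just an example, and the helper degrades gracefully where SLUB sysfs isn't present):

```python
# Sketch: inspect (and, as root, shrink) SLUB's per-CPU partial slab tunable
# via /sys/kernel/slab/<cache>/cpu_partial. Assumes the standard SLUB sysfs
# layout; returns None where it isn't available (container, SLAB kernel, ...).
from pathlib import Path

def read_cpu_partial(cache="kmalloc-64"):
    p = Path("/sys/kernel/slab") / cache / "cpu_partial"
    try:
        return int(p.read_text())
    except (OSError, ValueError):
        return None  # no SLUB sysfs here

def write_cpu_partial(value, cache="kmalloc-64"):
    # Root only; writing 0 effectively disables per-CPU partial caching
    # for that cache, trading some performance for memory.
    p = Path("/sys/kernel/slab") / cache / "cpu_partial"
    p.write_text(str(value))

print("cpu_partial:", read_cpu_partial())
```
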
I say "we", but I'm talking on behalf of the Google Cloud customer here: if we sell a customer a set of reserved cores, then, because SLAB by design does its cache reaping every two seconds on every one of those cores, you see these little hiccups every two seconds on the reserved cores, and customers don't get exactly what they're paying for. SLUB doesn't have that downside, and it's one of the reasons that we at Google, at least in Google Cloud, definitely want to move toward using SLUB, even with its increased memory footprint. Okay, so I guess I have one question. Earlier it was mentioned that SLUB uses more metadata, but it sounds like it's not really the metadata that causes the memory overhead difference, but the caching. Yeah, it is more slab pages being tied to CPUs, so you have free space in those pages even though nobody has allocated from them yet. The next question is whether the per-CPU caching still makes sense as we increase the number of cores. When we have 16 or 32 cores, maybe it makes sense, but once we go to hundreds, maybe there shouldn't be just a linear increase; there should be some slower growth in the caches. Some per-CPU caching has to be there, otherwise it wouldn't scale well, but right now it caches whole slab pages, which contain both used and unused objects; if it were just an array of unused objects, it could hopefully be more effective. I understand the concurrency reasons for the per-CPU caching; I'm just saying that it sounds to me like if we grow it proportionally to the number of CPUs, we end up caching more memory than is actually needed. Maybe I'm wrong. So you mean the heuristic that... So I'm not saying to have smaller caches as a principle; I'm saying there could be some kind of tree or hierarchy of caches that the CPUs can refill from again. The races and scalability issues should be resolved, but having per-CPU caches is not the only solution.
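The memory argument above can be made concrete with a toy model (all numbers are hypothetical, not measured kernel data): caching whole partial slab pages pins the entire page however few objects on it are free, while an array cache pins only the free objects it actually holds.

```python
# Toy model (hypothetical numbers): memory pinned per CPU by caching whole
# partial slab pages vs caching an array of free objects.

PAGE_SIZE = 4096
OBJ_SIZE = 64

def pinned_by_partial_slabs(nr_cpus, slabs_per_cpu, free_per_slab):
    # A cached partial slab pins the whole page, no matter how few of its
    # objects are actually free (free_per_slab doesn't enter the formula).
    return nr_cpus * slabs_per_cpu * PAGE_SIZE

def pinned_by_array_cache(nr_cpus, entries_per_cpu):
    # An array cache pins only the free objects it actually holds.
    return nr_cpus * entries_per_cpu * OBJ_SIZE

cpus = 192
# Say each CPU caches 4 partial slabs holding only 8 free objects each...
page_model = pinned_by_partial_slabs(cpus, slabs_per_cpu=4, free_per_slab=8)
# ...versus an array cache holding the same 32 free objects per CPU.
array_model = pinned_by_array_cache(cpus, entries_per_cpu=4 * 8)
print(page_model, array_model)  # the page model pins far more memory
```

The gap widens with core count, which is the linear-scaling concern raised above.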
I'll just say that I think SLUB's original design is based on returning cache-hot objects on every CPU. Yeah, I didn't understand your point, because the objects that are cached per CPU actually move between CPUs. If another CPU needs an object, it's not going to sit in the same cache; the mechanism is designed to transfer objects over and then transfer them back, so there's that sort of thing going on. I don't know what happens under memory pressure, maybe the per-CPU caches are shrunk or something, I don't know. But I know that they do travel as needed. They travel, but obviously there is a high memory overhead with SLUB on high-CPU-count systems. Yeah, but it would be fair to argue that with more CPUs you've got more memory, so you care less. It doesn't scale so rapidly that it would become a bigger problem, or at least that's my perception. So I'm not really sure whether that's a big problem. Is it, for anybody? It is a big problem for Google. Yeah, but the question is how it can be solved while keeping the same scalability. Maybe that's the one idea you proposed, and it would have to be tried. So, nobody here seems to be objecting to the removal, so I guess the next step would be to propose it again on the mailing list and see if there are other people objecting, and ideally they would come with a use case or some benchmark that's regressing, and really regressing because of the different design of the allocators, and not just some random effect of a different object layout affecting caches, which is always a problem in these kinds of benchmarks. And hopefully the motivation would be some actual workload that the benchmark reproduces; we shouldn't hold back just for the sake of some microbenchmark. And if an objection with such a concrete benchmark is found, we can discuss how to change SLUB to accommodate it.
There have also been some efforts in the past to merge the best of both allocators. These were abandoned, I think, not because some fundamental issue was found; from what I was reading in the archives, it always just died because there was no sustained effort, and things have probably changed since then, so the ideas would have to be adapted anyway. So if we can assume that SLAB can be removed, we can consider how to improve the remaining allocator further, since improvements then don't have to be implemented more than once. We already discussed that some different kind of object caching should make sense, and I think there are many use cases for that, for example better performance or, hopefully, less memory overhead. We can also consider whether we want to support allocation in contexts like NMI using these per-CPU caches, but with no guarantee of success, because if the cache is empty you cannot do much else in such a context. That would hopefully allow us to remove the BPF allocator, which was created just for this use case, and it would be nice if this was provided by the slab allocator itself. Then some use cases would need guarantees, like the maple tree node preallocations, so hopefully this kind of caching could accommodate that use case as well. The question is whether there would be a cache per kmem_cache and CPU, or one for each user that says "I want a cache of this size"; we can discuss that. And the point is that many people reinvent the wheel in the kernel to accommodate things that the memory management subsystem doesn't support itself, and it would be great if we could do something about that, because these non-MM implementations often don't do things the way they should; for example, they lack shrinkers to prevent premature OOM, which I think is a problem with the BPF allocator. If we did the implementation in MM, we would do everything properly, because we are more familiar with MM. So if there are any other ideas, I'm open. One problem we have, and arguably this should be
fixed in the caller and not in the slab allocator, is dcache poisoning: on a system that has a huge amount of memory and is under no memory pressure, the dcache will essentially grow without bound, because you can create as many negative dentries as you want just by looking up files that don't exist. We rely on the shrinker being run, and of course you eventually do run the shrinker, and it turns out there are 11 billion objects, and the shrinker now takes forever to run, and something in there is O(n^2), and it's just an absolute nightmare. And Oracle aren't the only people hitting this. We don't hit it on all systems, but we have some very specific systems that do; most of our systems are under plenty of memory pressure, thank you, but some of them are not. And we're not the only people who have come across this; if you look for "dcache poisoning" or something like that, you'll find a number of reports over the years. So I don't know if this needs to be fixed in slab or arguably in the dcache as well, but it's something to consider. I guess it's somehow specific to the use case, but the allocator could provide some building blocks on top of which a solution could be built, and if there is just a single allocator, it's simpler to do that. Any other questions or ideas? I'm over time anyway, right? OK, so thank you, and wish me luck.