I just want to see that we can integrate his work somehow into the slab allocator subsystem, so that we can offer these services to all subsystems of the kernel, not just do it for the network stack. I think it needs to be generalized, and that's one of the things I want to talk about. There's also some other stuff that's unrelated, so I've titled this differently than it appears on the program. I wanted to talk generally about slab allocator developments and the issues that we're having right now. So that covers the stuff that Jesper had. I'm the maintainer of the slab allocators in the kernel, but I also work for a trading company, and we make money by being faster than the other companies. So the work that Jesper is doing is dear to our hearts, and we will do everything we can to support him, even if his path ends up faster than my allocator. I don't mind; as long as the network stack gets faster, I'm fine. So in order to support that, I've proposed a batch interface for the slab allocators. I want to discuss that a bit, and talk a bit about how we can integrate it into the network layer. The batch interface then also requires an implementation in the major allocators, SLUB and SLAB, and I have some ideas on how this could be done, and actually have a partial implementation for one of them. Then, moving beyond that, there have recently been some fast path improvements in the SLUB allocator; I want to talk about those as well, and then maybe talk a bit about SLUB defragmentation if we have time. Again? How long have we been at that now? Well, since 2007 or something. Yeah, right, so let's see. We have the same problem: you've solved it at one layer, and I have my stuff and I'm doing it at another layer, and at some point this all needs to work together, right? I can't hear you. I can heckle now. Okay, good. Sorry, can I stop you for a second? Okay. Sorry, go for it. So is that better? Yes? All right. Okay.
So this is the short of it. This is the batch API that I've proposed to extend the slab allocators. So instead of a kmem_cache_free, you have a kmem_cache_free_array. It's actually the simplest operation: you just pass it the cache and the number of elements you want to free in an array, and then it goes off and frees all those things. Pretty simple. The allocation is also pretty straightforward, I think: you have the cache you want to allocate from, the allocation flags, how many you want, and the pointer to the array of objects where the addresses should be stored. And then we just added a flags argument today, because yesterday we talked about some of the semantics that we wanted from this operation, and we added some additional semantics that we're going to discuss on the next slide. The patch that I proposed contains a fallback implementation for all of these things, so if we merge the stuff, we would have the functionality, but the allocators are not optimized yet. Then we get into the fine details of how we actually make use of these things, and how we squeeze the maximum performance for batch allocations out of the allocators. So, the bulk alloc modes that we wanted to add here. We added a flags argument at the very end of kmem_cache_alloc_array, and in order to understand it, you have to see how these allocators work. They usually reserve objects per processor, so if a processor needs a new slab object, it can take it from the local queue and doesn't need to do any locking. So usually all allocators keep a number of these objects cached for the allocations that are needed. And if that CPU queue is empty, then the allocator needs to go to a memory pool that is node-specific (remember, we are on a NUMA system), and all allocators have lists of slab pages where some objects are still free. So it can search through the slab pages that have free objects, and can pick up some of these and allocate from those.
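For concreteness, the two entry points described above might look roughly like this. This is a userspace sketch with malloc/free standing in for the allocator internals; the exact names and argument order are assumptions based on the talk, not the final kernel API.

```c
/* Sketch of the proposed batch API, as described in the talk.
 * Names and exact signatures are assumptions, not merged kernel code. */
#include <stddef.h>
#include <stdlib.h>

struct kmem_cache { size_t object_size; };

/* Free 'nr' objects at once. */
static void kmem_cache_free_array(struct kmem_cache *s, size_t nr, void **objs)
{
    (void)s;
    for (size_t i = 0; i < nr; i++)
        free(objs[i]);          /* stand-in for the allocator's free path */
}

/* Allocate up to 'nr' objects; returns how many were actually provided.
 * 'flags' selects the bulk alloc mode discussed on the next slide. */
static size_t kmem_cache_alloc_array(struct kmem_cache *s, unsigned gfp,
                                     size_t nr, void **objs, unsigned flags)
{
    (void)gfp; (void)flags;
    size_t i;
    for (i = 0; i < nr; i++) {
        objs[i] = malloc(s->object_size);   /* stand-in allocation */
        if (!objs[i])
            break;              /* partial result: fewer than requested */
    }
    return i;
}
```

Note that the alloc side returns a count rather than a simple success flag, matching the semantics debated later in the session, where a caller may get fewer objects than requested.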
If that fails, the allocator usually goes to the page allocator, allocates a new page frame, hacks that page frame to pieces, and dishes out new objects. Each of these paths has different performance implications, and that's why we want to control them in the batch alloc function call. So if you say slab array alloc local, the idea is: okay, we want to just extract the CPU-local objects, because we want maximum performance and we don't want to take any locks. The batch alloc will give you everything that's available there, and if there's not enough available, it won't give you any more objects. So that could probably be useful in a fast path where you need to get some objects in the fastest way possible. Then slab array alloc partial: this is basically going to the list of slab pages on a node that still have objects free, and taking from there. This means you leave the CPU cache of objects alone; there may be other allocations going on from other subsystems on that CPU that you don't want to impact. If you do the first one, it would drain all the per-CPU objects, and the next alloc would have to go through the page allocator or some other mechanism to get new slab pages and new objects, because they are all gone. If you don't want to impact that, you can do slab array alloc partial: you just drain the partial pages on each node. And this is good because it preserves the local objects, and it's also defrag friendly, because all the holes that you have in slab pages are being filled up. A nice side effect. Then the last one is slab array alloc new: you want to bypass this whole thing, take new page frames from the page allocator, and serve the request that way. The advantage with slab array alloc new is that the allocator does not need to construct a free list like it does right now.
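The three modes can be caricatured like this. The enum names and the fake_cache bookkeeping are invented for illustration, but the rounding behavior of the "new" mode matches the objects-per-page discussion that follows.

```c
/* Userspace sketch of the three bulk-alloc modes discussed above.
 * The flag names and the fallback logic are assumptions for illustration. */
enum slab_array_mode {
    SLAB_ARRAY_ALLOC_LOCAL,   /* drain only the per-CPU queue, no locks    */
    SLAB_ARRAY_ALLOC_PARTIAL, /* fill from partially used slab pages       */
    SLAB_ARRAY_ALLOC_NEW,     /* whole new page frames from page allocator */
};

struct fake_cache {
    int cpu_queue;        /* objects sitting in the per-CPU queue    */
    int partial_objects;  /* free objects on per-node partial pages  */
    int objs_per_page;    /* objects that fit in one new page frame  */
};

/* Returns how many objects the given mode can deliver for a request of 'want'. */
static int bulk_alloc(struct fake_cache *c, enum slab_array_mode mode, int want)
{
    switch (mode) {
    case SLAB_ARRAY_ALLOC_LOCAL:
        /* give whatever is CPU-local, never more */
        return want < c->cpu_queue ? want : c->cpu_queue;
    case SLAB_ARRAY_ALLOC_PARTIAL:
        /* fill holes in partial pages; preserves the per-CPU queue */
        return want < c->partial_objects ? want : c->partial_objects;
    case SLAB_ARRAY_ALLOC_NEW:
        /* only whole new pages: round down to a multiple of objs_per_page */
        return (want / c->objs_per_page) * c->objs_per_page;
    }
    return 0;
}
```

With 17 objects per page and a request of 64, the "new" mode hands back 3 full pages' worth, 51 objects, which is exactly the "17 times 3" exchange in the Q&A below.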
But right now, if SLAB or SLUB contacts the page allocator and gets a page frame, it needs to create a data structure to track the allocated and free objects. That's a lot of overhead. If you go directly to the page allocator for a bulk request, you can skip that step and immediately construct an array of pointers to the objects in the page frame, and serve that to the application. The freelist metadata of the slab page doesn't need to be initialized, or at least the initialization effort can be reduced. This is actually the fastest allocation mode if you want to do a bulk alloc. Let's say you need 5,000 objects: this is certainly the fastest way to do that. What would you do with, for example, a page you haven't used completely? What would you do with the tail part of the page? Would you then go and put those objects onto a free list somewhere? What we're saying right now is that slab array alloc local would just stop with the local objects if there are no more, and slab array alloc new would just take full pages. If there's a partial page, it wouldn't be used; it would just serve the objects that fit in complete pages. Okay, so if you had, like, 17 objects per page and you asked for 64, you wouldn't get 64 back, you'd get 17 times 4. 17 times 3, actually. Yeah. Okay. Well, we haven't settled the exact semantics; we only discussed this last night. I'm just thinking about it, because I can see uses for this in various file system paths, but we'd need a guarantee of the exact number of objects that get returned. Okay. Maybe we need to add an option to give you the exact number of objects. Then it could fall back from one mode to the other: if there are only a few left, take the partials instead of going for full pages. Yeah, that would work. Yeah. And do you see any use case for the first one? I've had a difficult time thinking about that. Would there actually be a fast path case for that?
Do you even need objects in a fast path like that? I suspect that, from my perspective, you'd want at least one hot object, which is going to be the first one that you reference and actually use, but not necessarily the rest of them. Okay. What about a use case where you have a tree and you want to expand a node? You would really like to allocate two objects, because you want to split the node into two. So you could ask for two elements, only two elements, because that's what expanding the tree needs. Well, this is a bulk alloc mode; I think if you ask for two objects, it might be better to use two kmallocs. I don't think it would be amortized at that level. Yeah, I'm not sure. I think we're looking at ten or more before this makes any sense. Yeah, my use case is a bit larger, right? Okay. More like 64. Yeah, 64, that's what we're using, the numbers from before. Yeah. But I wouldn't need a guarantee; my system would be happy just to get something fast, and not necessarily the exact number I'm asking for. Okay. So then, the first ideas on how to do the bulk alloc modes in SLUB. This is a draft of it, as was discussed. The problem with SLUB is that it has this free list where one free object links to the next free object, and this means that if you traverse the free list, you're touching the first cache line of each object. So that's a bit of an issue. In every case where you deal with the partial list or the CPU-specific free list, you have to traverse the list, and you touch a cache line for every object. If you take the page directly from the page allocator, then you don't need to initialize this stuff and you don't need to touch all the objects, so this would inherently be the fastest mode available. One performance optimization you can do for the partial list case: right now, you take a single page off the partial list under the lock when you do a kmem_cache_alloc, and then it drops the lock again.
Now you can take the lock once, take as many pages as you want off the partial list, and process them, so the lock acquisition and release of the per-node lock is no longer per page. But the traversal of the linked lists is still there; yes, you take a cache miss for each object. Unless you go directly to the page allocator and get a new page frame, and avoid all this mess. I have some draft code that I haven't even run yet, but I think this is possible to do, and I also need to do some of the modifications that we discussed last night. And if you look at the structures here, this is an attempt to draw all the metadata structures for SLUB in one slide, so it may be a bit complicated. If you look at the object format, you see that in one case, when it's green, the object is in use and you have a payload there; and if the object is free, you have the free pointer pointing to the next free object. So that's the object format. And if you look at the page frame content here, you see how the free lists work: from the page frame descriptor, the freelist pointer points to the first free object, which points to the next one, which points to the next one. At some point you get a NULL pointer, and that's the end of the free list. If you want to extract the objects, you have to walk the free list, follow all the free pointers, and record the pointers to the objects. If it's the free list hanging off the page frame, then to access that free list you need to take a lock. There's the other free list, hanging off the per-CPU structure, where you don't have to take a lock; that's the fast path. It's the same linked scheme, and there can actually be multiple free lists, two free lists of objects in one slab page, because there can be local allocations and local frees, and concurrent remote frees, and the allocator handles all of that at the same time in order to give you optimum performance.
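A minimal userspace model of that pointer-chasing extraction follows. Each step dereferences the free object itself, which is exactly the per-object cache-line touch being discussed; the types and helper are invented for illustration, not SLUB code.

```c
/* The pointer-chasing free list described above, in miniature: each free
 * object's first word points at the next free object. Userspace sketch. */
#include <stddef.h>

union object {
    union object *next;  /* valid while the object is free  */
    char payload[64];    /* valid while the object is in use */
};

/* Link n objects into a free list ending in NULL; returns the head. */
static union object *build_freelist(union object *arr, int n)
{
    for (int i = 0; i < n - 1; i++)
        arr[i].next = &arr[i + 1];
    arr[n - 1].next = NULL;
    return &arr[0];
}

/* Pop up to 'want' objects off a linked free list into 'out'.
 * Returns the number extracted; *listp is advanced past them. */
static int extract_from_freelist(union object **listp, int want,
                                 union object **out)
{
    int n = 0;
    while (n < want && *listp) {
        out[n++] = *listp;
        *listp = (*listp)->next;   /* one cache-line touch per object */
    }
    return n;
}
```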
And so, if we want to do a local object alloc, we would traverse this portion here to extract all the per-CPU objects; if we do it with the partial list, we walk the partial list up there, go to the page descriptors, take the locks, and extract the objects that way. So these are the first two operation modes that we had in the flags before. Any questions on that? In SLAB this is easier, because SLAB has a table of free objects at the beginning of the slab page; it doesn't have the chained list from object to object. So the free list can already be traversed in a cache-friendly way. And there are already arrays of object pointers in the per-CPU and in the alien queues, so you could just copy those pointer arrays over into the bigger array and zap them. So SLAB may actually be more friendly to this approach. And I have the same kind of diagram for SLAB here. I think I've omitted some things, because it gets complicated; the details of the free list are not here. The object format is different in SLAB: you have the payload, and if the object is free, the object itself is not touched, because at the beginning of the page frame you have the free list, a table of all the free objects. That usually fits mostly into one cache line, and it's optimized so that we have one byte per object in the page frame. So it's very compact and can be traversed easily, and we should be able to construct a table of pointers in a rapid way; but the basic principle is the same. If you go for the local allocation, you traverse the array cache that's processor-specific and extract those entries first, and otherwise you have the alien caches and the per-node caches where you can also extract objects. Any questions on this? Or is this just too much detail? Then, any more questions on the whole bulk alloc business? Are you satisfied? It's nothing new. I really liked it, because I wasn't into the memory area before.
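The SLAB-style table can be modeled the same way. Note the extraction loop only reads the compact index table at the head of the page, not the objects themselves, which is why it is cache friendly; the structure layout here is a made-up miniature, not SLAB's actual page layout.

```c
/* SLAB's scheme in miniature: a compact table of free-object indices at the
 * head of the page, one byte per object, so building a pointer array is a
 * sequential walk instead of pointer chasing. Illustrative sketch only. */
#include <stdint.h>

#define OBJ_SIZE 64
#define NOBJ     16

struct fake_slab_page {
    uint8_t freelist[NOBJ];  /* indices of free objects, one byte each */
    int     nr_free;
    char    objects[NOBJ][OBJ_SIZE];
};

/* Mark all objects free. */
static void init_page(struct fake_slab_page *p)
{
    for (int i = 0; i < NOBJ; i++)
        p->freelist[i] = (uint8_t)i;
    p->nr_free = NOBJ;
}

/* Convert up to 'want' free-object indices into object pointers.
 * Only the index table is touched, never the free objects. */
static int extract_from_table(struct fake_slab_page *p, int want, void **out)
{
    int n = 0;
    while (n < want && p->nr_free > 0)
        out[n++] = p->objects[p->freelist[--p->nr_free]];
    return n;
}
```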
I had actually studied these two slides in detail to figure out how it works. So we have no disagreement on how to proceed on this one. Of course you would like the API to go in, and I would like to test the different implementations to figure out how this works out over time. Yeah, I think we first need to get the API in, so I need to have your ack on that one. Hang on, Paul has a question. On the last slide, is that referenced by the array cache for some reason, or does it just need a pointer there? The arrow should be one down; the arrow should be pointing to the array cache there, and the array cache is basically a list of pointers. Yeah, I've got that. The green free object on the far right, between the object and the padding, should that have a pointer to it, or is something else going on there? Should that have a pointer from an entry of something, or is that just...? No. So the array cache can have pointers to multiple slab pages, so there might be another array cache that references that one. SLUB has locality in that sense: its per-CPU queue only covers the objects of one slab page, and therefore it ensures that all objects allocated are local, and TLB misses don't occur as frequently. SLAB disperses the stuff, and doesn't preserve locality in that sense. Any other questions or comments on this? Then I guess we can come up with a patch to at least get the generic infrastructure in there, and then hopefully we can get that merged by Andrew, and then we can start working on the optimizations for the two allocators. Okay. Then there were some recent fast path improvements in SLUB. Basically, the problem that we had is that CONFIG_PREEMPT requires a preempt enable and disable in the fast path. A while back, the RT folks switched to the SLUB allocator and saw 40% improvements in performance, because all the interrupt disabling, the local IRQ save and restore, these heavy operations, are not used by SLUB; and they actually had per-CPU locking going on in the old RT version before they switched to SLUB.
So they saw a huge increase in performance, and that was undercut by the fact that we needed a preempt disable and enable, two instructions that touch per-CPU data, in the fast path. So we have been trying to find ways to avoid that, to restore the CONFIG_PREEMPT performance. I came up with a complex scheme involving reconstructing the page struct from the freelist pointer, which was an invasive operation on the SLUB cache, and Joonsoo Kim found a way to just do a retry there in the fast path. That has been merged; it will eventually go through Andrew via the -next tree, and it's going to be in 3.20. So then the issue with CONFIG_PREEMPT is gone, and the real-time performance has been restored to the way it was before. Maybe one thing I could do, I wanted to show the last slide here, to understand this a little bit better: let's talk about the fast path architecture. One of the reasons that I wanted to do SLUB was to rework the complex nature of the SLAB fast path, because I couldn't get that one much lower than it already was: it always needed an interrupt disable, and the touching of numerous cache lines, and I wanted to avoid that.
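To make the walkthrough that follows concrete, here is a userspace caricature of that lockless, transaction-ID-guarded fast path. The `cmpxchg_double` helper is a plain, single-threaded stand-in for the kernel's `this_cpu_cmpxchg_double()`, since the point here is the speculative-read-then-retry logic, not the atomicity; all names are illustrative, not the real SLUB code.

```c
/* Sketch of a TID-guarded lockless allocation fast path. */
#include <stddef.h>

struct obj { struct obj *next; };

struct kmem_cache_cpu {
    struct obj *freelist;   /* head of the per-CPU free list            */
    unsigned long tid;      /* bumped on every state change / CPU switch */
};

/* Stand-in for this_cpu_cmpxchg_double(): succeeds only if both words
 * still hold their expected values, then installs the new pair. */
static int cmpxchg_double(struct kmem_cache_cpu *c,
                          struct obj *old_fl, unsigned long old_tid,
                          struct obj *new_fl, unsigned long new_tid)
{
    if (c->freelist != old_fl || c->tid != old_tid)
        return 0;
    c->freelist = new_fl;
    c->tid = new_tid;
    return 1;
}

static struct obj *fastpath_alloc(struct kmem_cache_cpu *c)
{
    struct obj *object;
    unsigned long tid;

    do {
        tid = c->tid;           /* speculative reads, no locks taken */
        object = c->freelist;
        if (!object)
            return NULL;        /* queue empty: would take the slow path */
        /* commit: swap in the next object and bump the TID, or retry */
    } while (!cmpxchg_double(c, object, tid, object->next, tid + 1));

    return object;
}
```

If an interrupt or a CPU migration changed the freelist or the TID between the speculative reads and the commit, the compare-exchange fails and the loop retries, which is the whole trick.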
So the SLUB fast path, instead of disabling interrupts, does speculative operations, and then uses a per-CPU compare-exchange operation to effect the state change on the per-CPU queue. There is only one state change on the CPU queue, and therefore this single state change is not subject to being interrupted by hardware interrupts and such. So the operation of the fast path is safe even in the face of interrupts, because it's a single instruction. This kind of approach cuts the number of cycles spent in the fast path: compared to SLAB, we have roughly only half the cycles in the fast path in SLUB. In order for these speculative operations to be safe, we need to ensure that they all come from the same CPU, so we retry the operation if we figure out that the CPU was changed, and for that purpose we need to have the preempt disable and enable in there. You can look at the code here; I stripped this down to the minimum necessary, so this is the current broken code, with the preempt enable and disable removed for brevity's sake. What this thing does is: c is the pointer to the current per-CPU structure, so it takes the offset of the per-CPU structure from the kmem_cache and calculates which is my per-CPU structure. Then it reads a transaction ID. The transaction ID is used to ensure that we stay on the right processor and that nothing happens in between. So it takes its transaction ID, determines the object that we would be getting, which is the head of the free list, and figures out which page this object is coming from. Then we check whether this operation can be successful. If we don't have an object there, then the queue is empty, and this means we have to go to the slow path; we can't use the fast path at all. The other criterion is: if the page is not from the correct NUMA node, and we actually want memory from a different node, then we also can't do this; we need to go to the slow path to convince the allocator to switch to a different node first, because our queue is filled from the wrong NUMA node. If those criteria are not met, then we can use the fast path: we figure out what the next object would be, following this object, and now we try to do a cmpxchg that exchanges the current free list head with the next object, and also increments the transaction ID to the next one, and it has to match on that transaction ID. If there is a mismatch, meaning the free list was changed or the transaction ID was changed, then we have to redo this whole operation. So this fast path has no interrupt disable, it has no locks, and it has no atomic operations in the bus-locking sense: the per-CPU cmpxchg is a cmpxchg without LOCK semantics. Doesn't the double compare exchange cost a bit more than a single one? Yes, the double compare exchange costs a bit more than the normal compare exchange, but again, it is lockless. It does not synchronize between CPUs; it is only used to synchronize with interrupts on the local CPU. If an interrupt occurred, and an allocation occurred in it, then you would have to retry the operation, because the cmpxchg would fail. Or if preemption rescheduled you on a different CPU, then the cmpxchg would also fail, because the transaction ID counter would have been incremented, and the transaction ID is unique per processor; so if the transaction ID was fetched from the wrong processor, the operation is also going to be retried. So, can anybody spot the bug here? Ah, I tried to strip this down to the minimum; I had to hand-write some pieces, this was never compiled, and I had to fit it on one page. It would be three pages if I showed you the whole thing. Okay, so what was found is that, between the determination of the per-CPU pointer and the grabbing of the TID, the processor could change if preemption is enabled, and then you are getting the TID from the wrong CPU and operating on the per-CPU structure of another processor, which is pretty bad. Preemption could hit between those two statements. And the trick that Joonsoo did is to just put another retry loop in there, to fetch this again and verify that the CPU hasn't changed, and that is only active for the CONFIG_PREEMPT case. So there is minimal impact on the non-preempt case, and there is no preempt enable and disable anymore in the fast path. Yeah, so this is the best we could come up with so far; if you have a better one, maybe we can optimize the fast path to use memory barriers instead. It is difficult to make this faster; the optimization of avoiding the preempt enable and disable saves us 2 or 3 nanoseconds, so we are operating on a really small time scale. But again, this is one of the most critical operations in the kernel, because it is used by all subsystems; anything that you do at this level has a huge impact, like the improvement the real-time folks saw just by switching allocators. And you can see this code is long, lengthy stuff as well. If I were sure that I won't be called from interrupt context, it would be much easier to do, but I can't have that assurance; I have been begging for that for years. Maybe we can do it at some point. So, any more questions on that? Any comments, Paul? Okay, I'll ask a question. I'm assuming the retry you added for the preemption case is in the else clause; why would it be in the else clause? Well, you mean the else clause of the unlikely up there? Yeah, okay. And then do you disable preemption in the else clause itself, to prevent getting preempted again and the CPU changing immediately after you did the check? I can't understand you. So, I can't see the code, but presumably, between the else and the if-unlikely, the CPU can change again. Do you have a check to see if you are still on the same CPU? No; where would you put it? The check is the TID itself; the TID is unique to each CPU, so if you had changed CPUs, the TID check would catch it. I'm trying to figure out the fix; I'm just trying to ask a question about the fixed version, but I can't see the fixed version, so I'm guessing, and I think I will defer until I see it. The fix is just retrieving the CPU pointer again: he retrieves the TID before the CPU pointer construction, and then this goes on, and he verifies that the TID is the same. Okay, I'm just curious what happens between that verification and actually doing something, if you get preempted again, or whether he prevents that somehow. I will have to look at the code. Okay, it's in -next; it's not upstream yet, so if you look at -next, you can see the code. Okay. So the other thing I have is the slab defragmentation stuff. I have given this talk before, at LinuxCon in Düsseldorf, but I ran into time constraints there, and I think we have lots of time here, so if there is nothing else, let's talk about this. We did some patch sets on this back in 2007, when we both were at SGI, and I actually worked a couple of years on a patch set that was finally rejected, because I didn't have enough knowledge about the dentry and inode handling, and I couldn't implement the pieces that I wanted at that level. The SLUB allocator was constructed explicitly with the idea in mind that at some point we would need to do defragmentation, and although I ran into some conceptual issues, I still have a patch set around that would enable defragmentation approaches in the SLUB allocator; it just hasn't been merged yet. This work is very similar to what we did with page migration. Page migration was also initially dismissed: it can't be done, it's impossible. Now, for the defragmentation, last time I wrote this up at the kernel summit, I got the same "oh, this is impossible, you can never do this." It is possible; at least the code that I have is working. It doesn't do the right thing with the inode and dentry subsystems yet, but it defragments pages. And how we got there with page migration was by
an iterative approach. Nobody believed that we could move pages between NUMA nodes. What we did first is: we evicted the page from memory onto swap, and then we swapped it back in, and everybody believed in the integrity of the swap subsystem, so that patch went in, and so we could slowly move pages back and forth. Then we cut out the swap subsystem, and suddenly we were able to move pages directly between one NUMA node and the other. That's how we got there, and that's how we dealt with the disbelief. And I think we can do something similar here: we can do this slowly, first as a type of reclaim, and then gradually move into an area where we can actually migrate an object directly from one slab page to another in order to facilitate defragmentation. So the fundamental problem is that, over time, as you operate the system, you will have slab pages that can contain, let's say, 15 objects, but as long as even one object in that slab page is in use, you cannot free the page at all. In extreme cases, which Dave has also constructed, you have a situation where a huge number of slabs each have just one object allocated, and there's a lot of empty memory sitting around that you can't use; suddenly your memory vanishes. Inodes and dentries are key to that, and this is pretty bad. What you do today is drop the caches: you throw away data, you try to evict all these things and free as much as possible, and then you reload the dentry and inode data from disk, and that may enable you to recover some of the memory. So this is the usual architecture of the partial lists of the slab allocators: you have a list of page structs for all the slab pages on which some objects are still free, shown in green. Typically, as you allocate new objects, these holes are filled up; so if you allocate three objects here, you fill up the top page and get rid of its page struct on the partial list. And if you would free all the
blue ones, you could free the whole page, right? And the partial list overhead is one of the major factors in the performance of the slab allocators, because you need to take locks to manipulate the partial list, and you need traversals to find pages that have objects you can allocate. It's critical to bulk alloc as well. So one optimization that I've done in the SLUB allocator is sorting the partial list according to the number of free objects, so that the pages with only a few free objects come first. If you do that, then every allocation can potentially take a page off the partial list, and the pages at the very end of the partial list will stay there longer, so the chance is better that their objects get freed and the whole page can be given back to the page allocator. This is very trivial, and it works within the existing framework. To do it, you issue a kmem_cache_shrink on the SLUB caches, and the kernel actually ships a slabinfo tool that I think barely anybody knows about: if you run the slabinfo tool with the -s option, it sorts all the partial lists in the system in this way, and if you then continue operating the system, it will hopefully reduce the number of partial pages in the system significantly. Any questions on that? Okay, so this avoids the whole migration thing; this is what we can do today with what is there.
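The sort itself is as simple as it sounds: order the partial pages so the ones with the fewest free objects come first. A userspace sketch with qsort over invented page descriptors:

```c
/* Sort partial pages so nearly-full pages (few free objects) come first.
 * 'struct partial_page' is a made-up stand-in for the real page struct. */
#include <stddef.h>
#include <stdlib.h>

struct partial_page { int nr_free; };

static int by_fewest_free(const void *a, const void *b)
{
    const struct partial_page *pa = a, *pb = b;
    return pa->nr_free - pb->nr_free;
}

static void sort_partial_list(struct partial_page *pages, size_t n)
{
    /* nearly-full pages first: allocations empty them off the list quickly;
     * nearly-empty pages last: their few live objects get a chance to be
     * freed, so the whole page can go back to the page allocator */
    qsort(pages, n, sizeof(*pages), by_fewest_free);
}
```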
So now we have a different ordering: the partial pages with the fewest free objects come first, and the ones with lots of free objects come last. Then I found another way to do this in the SLUB allocator: I can do defragmentation through off-node allocation. Let's say one node has a huge partial list, and the other one constantly gets allocated from; in that case, the large partial list of the first node just stays there. So what I'm doing here is: if the caller does not indicate that the memory should come from a particular node, then once in a while you go to a different node, just to allocate something on the other node, and if you do that frequently enough, the partial pages will vanish from the other node. And there is a parameter with which you can control how often that should happen. So this is also one way to control fragmentation in an indirect fashion, without dealing with it explicitly. And this works best with the sorting of the partial list: if you sort the partial list, then the pages with few free objects are on top, and any operation that just takes one object from the other node can cause a complete page to be removed from the partial list. How will that work with memcg?
oh don't ask me my opinion on that one hahaha hahaha hahaha I have never looked at that how that would work hahaha but this is basically me trying to do whatever I can within the existing framework way without getting too invasive we are getting more invasive as we proceed here so the next one is deframmentation by eviction um this is a rejected slab slab patch set that I did in 2009 um so basically what you do is you allow callbacks for each of the slab caches so if the system finds that we have a slab page that has can take 50-15 objects but only has one then we can ask the subsystem could you get rid of this thing and if the subsystem says okay I've done so then we have a 4k or a page frame free that's the fundamental idea and in order to do that we have two callbacks first get this means the system establishes a reliable reference to the object because otherwise while we do the processing the object can vanish we have the same thing with page migration so with that we ensure that the object doesn't vanish and then we call the kick function which means the subsystem can now investigate the object and see if it can be evicted from memory and so if the subsystem finds it can do that then it's fine and then it can take the page frame out and it's sort of optimistic so the callback can refuse to do this and then the slab page will not be evicted so this allows you to start this whole process in a very limited way you can just deal with the simplest cases in the callback you don't have to get complicated so you can check okay the object just has one reference left okay let's get rid of it you can then do limited checks and you can gradually increase the complexity of the function that does the kicking out of the object so this is kind of a soft way into this whole thing so in order to do this also the slab then is isolating slab pages it's comparable to what page migration does for the moving a page first you have to isolate the page from the LIU it does a similar 
thing actually the slab page it avoids that any allocation operations can occur on the slab and therefore since there cannot be any new allocations if you can then remove all the existing objects then the slab page can truly be freed any questions on the eviction process do you think that's possible so one of the things here is that a lot of the slab cases where you'd want to be doing this on dentury cases, onode cases, various things like that they already have an LIU it's keeping track of things that have no references so the act of actually taking a reference on it will actually tend to remove it from the LIU or actually when you drop the reference it can change its position in the LIU so from this perspective what we're looking at here is that the slab allocator has kind of a different method of determining what needs to be freed which would then give us two different callbacks memory reclaim callbacks into whatever subsystem it is that has to deal with these issues in different ways it's been five years since I last looked at this stuff I kind of think that we're probably better to look towards integrating the actual slab reclaim into the slab case itself the slab reclaim is basically outside of the slab allocators it's part of the the slab reclaim is also outside of the page reclaim it sits off to the side so I much prefer to see a solution that brings the the slab reclaim more into line rather than adding a second method that is completely different and requires object references and completely ignores LIUs and so on I'd much prefer to see that there's a method the defragmentation method that uses reclaim kind of aligns with the existing reclaim or the existing reclaim is aligned with the method of reclaim that's needed for defragmentation so we have one specific method of doing expose low level slab partial list details to the slab reclaim function and then you have some kind of function that maybe sorts the partial list and gives you the slab pages with at 
least objects first. At the moment, what happens is that the existing shrinker reclaim basically asks the subsystem, or asks the cache, how many objects are freeable, and then says, okay, reclaim this many, and then the subsystem decides which objects can be reclaimed. What we're doing here is that it's not the subsystem that decides which objects can be reclaimed, but the cache itself: it proposes objects to be reclaimed. It's the opposite situation; instead of the subsystem deciding what gets reclaimed, it's the infrastructure that decides what needs reclaiming. Those two approaches need to be somewhat more closely aligned into one set of infrastructure, as opposed to fighting against each other. Okay, so we need to have some function that traverses the partial list. Basically, we need a function for the slab reclaim where we can traverse the partial list, possibly combined with moving the LRU information into the slab caches themselves, so that there may be another list: rather than doing LRU on a per-object basis, we reference the pages the objects live in, rather than the objects themselves. The infrastructure can then walk the partial lists in LRU order. You're now asking the slab allocators to have metadata information about the object itself. What I'm saying is that we currently track information on a per-object basis, whereas for memory reclaim purposes what we really are trying to free are pages, not objects. So from a defragmentation point of view we want to select the pages that have the most free objects on them, but from a full memory reclaim point of view we want to first try to reclaim the oldest objects, the least used, so that we retain the working set in memory. So there are two different triggers, but the only difference between them is the order in which we walk through the page frames in the partial list. So, for example, if we do defragmentation, you can then sort all the partial lists to find the ones that are most likely
to be defragmentation candidates, versus the normal case, which keeps them in LRU order. What we could also do is give you the number of objects that are in use in the slab page: if you give me an object pointer, I tell you how many objects are allocated in that page and how many are free, and then you can make a decision based on that. We've still then got to go and work out whether they're referenceable, whether they're on the LRU, and stuff like that. And you can work that out, because this is subsystem-specific: you know what the contents of the objects are and you can work with them. Yeah, I'm trying to get around the fact that we've got two different methods of tracking this information and then two different methods of reclaiming it. The problem is we can't really avoid these two methods, because I need these lists to figure out where the slab pages are from which you can still allocate, and you need something like the age of the object in order to do reclaim. So the one Christoph's talking about is basically page-based; mine is object-based. With Christoph's one, it would be perfectly valid for the subsystem to do another allocation inside this reclaim, copy the object over, adjust all its pointers, and it hasn't freed any memory, right? It's exchanged one object for another, and that's a good thing to do in terms of his system; it's not a good thing to do in terms of the reclaim that you're talking about. So they're actually achieving two different goals. Not exactly, because you can't replace an object that has references. Well, this is the point: the subsystem can. The subsystem knows where the pointers are, that's the key thing; that's where the pointer is coming from. The subsystem has a chance of knowing that here's an object the slab allocator would really like to free; I, as the subsystem, can say I know how to find all the pointers to this thing, I can be a good citizen and allocate another
object, copy the contents over, adjust all the pointers, and then say to the slab allocator: here, you can have this back, I don't need it anymore. Right, so that's actually a different thing from the reclaim that you're talking about, where the intent is to actually reduce the number of objects that the subsystem is managing. So we do have two different goals in mind here. There are, but the method of isolating the object, whether we then reallocate it somewhere else or free it, is exactly the same. That depends on the subsystem. The main difference is the subsequent operation once we decide whether we can free it. The goals are different: I don't want to reclaim objects, I want to compact my slabs. I understand that. The mechanism is the same; you're trying to add another method for doing something that is very similar to something we already have a method for. It's similar, but it's also different, because my method works with pointers to page frames, not objects. Yes, and you're not listening to me: when it comes to reclaim... I don't want to reclaim. Well, the two slab caches that we really care about in this case are the ones that take up most of the memory on a system, and they're the ones that I'm thinking of here. Yeah, but I have to think about all the slabs in the system. No, you don't; in this case you only have to think about the slabs that you can do a get and kick on. Only the slab caches that implement those callbacks are the ones you'd care about, and basically the only ones we've implemented are the ones that can actually do reverse lookups to find all the pointers and replace them, which we can't do with inodes and dentries because of all of the intricacies with things like hashing and RCU freeing. We had the same problem with page migration, and we simply said: in the case where we can't determine all the pointers, we leave the object alone; if we can determine all the pointers, then we can account
for all the pointers, then we move the object, and that makes this possible; and subsequently we got more and more page types that became movable. I don't think it's quite that easy. Pages are actually much more constrained in their use and structure than the various slab caches. I mean, we have inodes that are referenced by multiple indexes, and there are multiple different types of lockless lookups that can reference them, and we can't do an atomic swap of all of the pointers to them. So you can't just take a reference to a page frame either; that was the same argument I was making against the page migration stuff. You need to start somewhere, this is what I'm getting at. But you started by swapping stuff out; you can't do that with slab caches. You can with some dentries. Pardon? You evict them, reclaim them. And then we come back to the point that we already have a method of reclaiming. Yes, but the goal is not to do a reclaim; the reclaim is only an intermediate measure to make this possible, to do the shifting of the objects; it's not the ultimate goal. We're going around in circles because you're not listening to me. Alright, so this is a description of how this works: you lock the page, you take your references, and you call the kick method in the subsystem. And then, ultimately, what I would like to get to is movable objects. This is required at a different level: if you have a fixed object address, then you cannot avoid fragmentation. So subsystems really need the ability to move as many objects as possible; that would be very beneficial. We are getting into a situation where we have more and more issues with fragmentation, because we have different page sizes: the 4K page size, the 2 MB page size, and the 1 GB page size. Systems get bigger and bigger, there will be a mixture of different page sizes, and this problem will get more and more intense. So I think at some point there should be more pressure to get this done
and make sure that most objects are movable. If that is the case, we can do more advanced defragmentation and we can avoid the issues that we have today. So we can move the huge pages, but the largest chunk of un-migratable memory at this point is the inode and dentry caches, and I think at some point we need to figure out some way to approach this. Maybe we need to talk more about this, but I wish we would find some way at least to get this started, at least in a very limited way. Okay, you can do an echo into drop_caches and they are gone. Okay, so if you want a brute-force way to drop everything, that is the way to do it: just reclaim. The kick is then implemented as "just kick it out". Okay, that was the last one. Any more questions, any other slab matters that you want to ask me about? Otherwise I'm done. Okay, thank you very much.