Alright, hi everyone. My name is Suren Baghdasaryan and I'll be presenting with Kent Overstreet, who is calling in. Kent, can you hear us? Hello. Alright. Okay, we can hear you too. So we will talk about memory allocation profiling today, and Mike, I promise I'll write documentation for it if it goes anywhere. I'll try to go quickly over the slides and then we can have a discussion.

So why are we doing this? We want to account for all kernel allocations, including early allocations and allocations from modules. Usage examples could be memory leak detection, monitoring of usage, identifying regressions, and possibly others. The requirements we have are, first, very low overhead, so that we can enable this in production — because this is not a debug feature, we want to be able to keep it always on — and second, to provide enough information to be useful. We want the information to be actionable. It's hard to satisfy both, and that's why we came up with a two-level solution. The first level gives you always-on, high-level visibility of what's happening, but with very low detail. The second is a detailed view where you can say, "oh, this allocation looks suspicious, let me dig into it," enable context capture for that particular location, and get much more information, like call stacks for each allocation, PID, timestamp, and so on.

Implementation-wise, we came up with this code tagging framework. Code tagging is a mechanism to inject a structure which identifies a code location. Depending on your application — and memory profiling is one such application — you can attach an additional field or structure to each code tag. In this case we attach a counter, which counts how many bytes were allocated from this particular code location. There are other applications; in the original RFC we showed dynamic fault injection, latency tracking, and improved error codes. So this gives us a low-overhead solution, and we can attach custom logic and data depending on the application.

This is the most intrusive part, which is why we are highlighting it here. How does it work? We have to wrap the calls we want to instrument with this macro — basically a hook. What that hook does is: first, it creates a code tag (the first line here) in a dedicated alloc_tags ELF section; then it stores that code tag in the additional task_struct field that we created; it calls the allocator; and then it restores the previously set field in the task_struct. This way we can support nesting. Inside the allocator there are calls to increment and decrement the alloc tag: during allocation it increments the code tag saved in the task_struct, and it also stores a reference to that code tag in the page extension or the slab object extension, depending on which kind of object it is, so that when we free the object, we know which code tag needs to be decremented.

So, overhead. Thanks to this mechanism the performance overhead is fairly low: 36% for slab allocations and 26% for page allocations. That sounds high, but those paths are very highly optimized and very quick, so even adding per-CPU increment and decrement operations results in a visible percentage overhead. If we compare with other mechanisms, though, like memcg, we are 10 times faster for slab allocations and 5 times faster for page allocations.
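To make the mechanism concrete, here is a minimal userspace sketch of the hook just described. The names (alloc_hooks, my_kmalloc, current_tag) are illustrative rather than the exact kernel API, and a thread-local pointer stands in for the task_struct field and per-CPU counters:

    #include <stdio.h>
    #include <stdlib.h>

    /* One of these is created per instrumented call site and collected
     * in a dedicated ELF section, like the kernel's alloc_tags section. */
    struct alloc_tag {
            const char *file;
            int line;
            unsigned long bytes;    /* kernel: a per-CPU counter */
    };

    /* Stand-in for the extra field added to task_struct. */
    static __thread struct alloc_tag *current_tag;

    /* The hook: create a static tag for this call site, then save and
     * restore the previously set tag so nested instrumented calls work. */
    #define alloc_hooks(call) ({                                          \
            static struct alloc_tag _tag                                  \
                    __attribute__((section("alloc_tags"))) =              \
                            { __FILE__, __LINE__, 0 };                    \
            struct alloc_tag *_prev = current_tag;                        \
            current_tag = &_tag;                                          \
            __typeof__(call) _ret = (call);                               \
            current_tag = _prev;                                          \
            _ret;                                                         \
    })

    /* The allocator charges the tag set by the caller; the real code
     * also stores a tag reference alongside the object so the free
     * path can decrement the right counter. */
    static void *my_kmalloc(size_t size)
    {
            if (current_tag)
                    current_tag->bytes += size;
            return malloc(size);
    }
    #define my_kmalloc(size) alloc_hooks(my_kmalloc(size))

    /* Linker-provided bounds of the tag section (GNU ld convention). */
    extern struct alloc_tag __start_alloc_tags[], __stop_alloc_tags[];

    int main(void)
    {
            void *a = my_kmalloc(128);
            void *b = my_kmalloc(64);
            for (struct alloc_tag *t = __start_alloc_tags;
                 t < __stop_alloc_tags; t++)
                    printf("%s:%d allocated %lu bytes\n",
                           t->file, t->line, t->bytes);
            free(a);
            free(b);
            return 0;
    }

The self-referential #define is what lets every existing my_kmalloc() call site pick up its own tag without being edited.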
And memory overhead — I just did a rough estimate. It will depend on the number of CPUs, because we use per-CPU counters for every code tag, and of course on the number of allocation sites, because every allocation site is instrumented. A rough estimate for an 8 GB, 8-core Android device, assuming 10,000 allocation sites — which is an overestimate; on my Fedora machine I saw around 4,800 allocation sites, so 10,000 is pretty high — comes to about 27 MB of overhead, which is 0.3% of total memory. I'm sure we can improve this; we just haven't spent much time optimizing the size of these structures.

So that's pretty much an overview of what we are doing, and I would like to open it up for discussion. What we are interested in is feedback on general usefulness: if you had this mechanism today, would you use it? How would you use it? We would love to hear any additional use cases we might have missed. We would also like to talk about concerns — maintenance cost, runtime overhead, memory overhead, maybe others. And I know there are some other great ideas too; we would like to talk about those as well. Now, is it on? Okay.

So Michal came to me on Monday and asked whether it's possible to use static calls to do this. I guess the issue, which I've read about on the mailing lists, is that doing this — if you go back to the code — means adding a bunch of macros around all the interfaces. We have to make sure that every interface has this macro, and every time you add a new interface, you have to put these things in. And what happens is this code gets injected into all the call sites; every single call site has this code embedded in it. There are a few issues. Cache may be one of them: whether it's enabled or disabled, this code is executed; you can't turn it on or off, and it will bloat the instruction cache and so on.

So when Michal asked whether there's a way to do this with static calls — if you're not familiar with static calls, tracing uses them; it's basically a mechanism that does runtime patching to make a call site call something different. A lot of things use it. KVM, for example, boots up, sees "oh, this is AMD," and patches all the call sites into direct calls to the AMD code; on Intel it patches in the Intel code. It's basically a way to avoid indirect calls.

When I was thinking about this, I realized there's a way of doing it with objtool. Objtool runs at build time, and it could read the call locations: using the relocation tables in the ELF file, we can find every single place that calls, let's say, kmalloc. Then we could create a trampoline for every single call. You'd get a section with a little bit of code, and the trampoline would do the added work, call kmalloc, and come back. The call to kmalloc is replaced with a jump: it jumps to the trampoline, does the work, calls kmalloc, kmalloc returns, and then it jumps back. And inside the trampoline you can put your tag — you can hard-code the tag and hard-code everything else.
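Roughly, the transformation being described looks like this — an illustrative, x86-flavored sketch, not actual generated code:

    original call site                 after build-time patching
    ------------------                 -------------------------
          call  kmalloc                      jmp   trampoline_1234
    next: ...                          next: ...

    trampoline_1234:          ; one trampoline per call site, own section
          <bump the hard-coded tag for this site>
          call  kmalloc
          jmp   next          ; return target is hard-coded, no indirection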
It's almost identical to what this is, except it's going to be a little more work in that you have to make the trampolines. And there's an added benefit: you could technically turn it on and off at runtime. So if you don't want this on, you could disable it by patching it all out. And yes.

I'd like to jump in. I'd like to see a more concrete proposal on the list, but my reaction when I heard this was that it takes one set of somewhat magic things — macros — and replaces them with a mechanism that's even more magic and black-box. We do this all the time; this is how ftrace works. I know. This is how static calls work, this is how static branches work, and it's optimized. And there's also the fact that it can be turned on and off. And the one thing that's great about this — I really like it, and I think the memory management people like it — is that they don't need to worry about it.

It's not going to make the maintenance burden go away, and there are advantages to having the actual source code annotations. Okay, wait, wait — how would it not take away the maintenance burden? Steve, Steve, can you not talk over me so much? Can you not try to take over our presentation? It would not. There's real value in the source code annotations; it's not just about which line the allocation is on. It should be the choice of the programmer working on the code which function in the call stack is the one that gets instrumented. For example, in file-system code, fs/inode.c has a thin wrapper around an allocation; with annotations it's a two-line source-code change to move the accounting to the correct location. And I think the hooks being documented in the code is itself a feature. This isn't a bad thing or a maintenance burden; this is a good thing.

If we could jump to the next slide — the last slide — there's more to code tagging than this. The alloc_hooks macro isn't just for accounting. We've got better fault injection in the pipeline, and better fault injection, speaking as a file-system developer, is something we really need for better code coverage in our testing, for a whole host of reasons — and right now the fault-injection capabilities we've got are not ergonomic. With this alloc_hooks macro, it's a two-line change to make every single memory allocation in the kernel a fault-injection point, which means we can write tests that trivially iterate through every memory allocation: inject a fault, run a test, verify the fault was hit. So there's additional value here. And your proposal — you guys just came up with it what, last night? I encourage you to explore it on the list. Sure.
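A sketch of how the same hook doubles as a fault-injection point, as Kent describes — the inject_fault field and the idea of a harness flipping it one site at a time are hypothetical illustrations, not the actual patch set:

    #include <stddef.h>

    struct alloc_tag {
            const char *file;
            int line;
            unsigned long bytes;
            int inject_fault;   /* set by a test harness, one site at a time */
    };

    /* Same wrapper as before, but now every instrumented allocation is
     * also a fault-injection point: a harness enables inject_fault on a
     * site, runs the test, and verifies the error path was exercised. */
    #define alloc_hooks(call) ({                                          \
            static struct alloc_tag _tag                                  \
                    __attribute__((section("alloc_tags"))) =              \
                            { __FILE__, __LINE__, 0, 0 };                 \
            __typeof__(call) _ret;                                        \
            if (_tag.inject_fault)                                        \
                    _ret = NULL;        /* simulate allocation failure */ \
            else                                                          \
                    _ret = (call);                                        \
            _ret;                                                         \
    })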
Okay, so I'll tell you this: first of all, everything you said can still be done in the trampoline, and the added benefit is that we could extend it. In fact, I would love to have this as a generic feature that you can attach to any call site to do analysis — not just mm, but, like you said, file systems, scheduling, anything. If we could make this generic, then, like I said, you don't have to write macros all over the place to add it. What's wrong with macros? Because you then have to actually write the macro around every function. It goes in a header file once and you're done. Yeah, but —

Okay, then anyone could say "hey, I just want to instrument this," a lot more easily. Okay, Kent, please don't talk over me now. The point I'm trying to make is you could then even have a user interface. Like I said, this could be enabled at runtime, which the macros cannot do. I could say "I want to instrument these functions," and if I have the analysis already in place, I can enable it at runtime and disable it at runtime, with no overhead when it's off. There's no i-cache pressure; it only happens when you have it enabled. We could even hook BPF into this and do your fault injection that way.

It needs to be enabled at boot time. There's no use switching it on and off at runtime, because then your counters are wrong. And we've got the boot switch already. If you just care about deltas — well, we could do a reset then. But that's all I'm saying: you can push back where you want, but this came from some of the memory management people who asked me about this, and I said here's a solution that I think would work. I understand. So that's all I have to say. Thank you. Let's open it up for other comments and ideas.

I mean, the 36% is a big overhead, right? So if you can turn it on and off, that seems like an absolute must-have feature. Yeah, we do have a kill switch right now — at runtime as well. I admit that my latest shiny new toy is bpftrace, so I'm very much persuaded by that whole mindset, and that's where Steve is taking us. Yeah. But I mean, the argument that you have this thing that's slow but gives you critical information you must have, except that when you don't want to look at it you need to run fast — it's a pretty good argument.

It's inherently not possible to enable accounting at runtime and then have counters that mean anything. The memory for the counters can be allocated — why is it inherently not possible? Because you miss some of the allocations, and then frees come in and we never saw those allocations. It's basically the same problem as allocations you didn't catch at boot time: whatever happened while it was disabled, you lost, right? So it's inherently not going to be correct anymore. Not perfect, yes. I understand that you won't have the allocations from when the feature was disabled, but that's a reasonable assumption: counters are not incremented while the feature is disabled, and when you enable it, your counters start from zero and you get statistics from the time you enabled it. I guess it just becomes less useful.

One other thing: you can boot up with it enabled, do all your counting, and then, when you're done with it — okay, I got the accounting — you can shut it off. You don't have to reboot; you can just shut it off. Turning it back on, yes, that will be an issue. So having it on at boot, then being able to disable it without rebooting and having full 100% speed again — that we can do with our approach. No you can't, because this is still i-cache, this is still in the code path. It adds two instructions to the allocation path; it's still going to have those instructions.
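The kill switch under discussion is the usual static-branch pattern; in kernel-style code it looks roughly like this (the key and helper names here are illustrative, though the static-key API itself is real):

    #include <linux/jump_label.h>
    #include <linux/percpu.h>
    #include <linux/types.h>

    struct alloc_tag {
            u64 __percpu *counter;  /* per-CPU byte count for this site */
    };

    static DEFINE_STATIC_KEY_FALSE(mem_profiling_key);

    /* When the key is false this compiles down to a patched no-op, so
     * the disabled cost is a few bytes of i-cache, not a taken branch. */
    static inline void alloc_tag_add(struct alloc_tag *tag, size_t bytes)
    {
            if (static_branch_unlikely(&mem_profiling_key))
                    this_cpu_add(*tag->counter, bytes);
    }

    /* Runtime kill switch, e.g. from a sysctl handler:
     *   static_branch_enable(&mem_profiling_key);
     *   static_branch_disable(&mem_profiling_key);
     */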
The other thing the approach I'm proposing gives you: let's say you want to extend it, maybe do a little more than just counting, maybe something else. Let me finish. Say you want to do something besides accounting — say you have another extension, maybe you want to put BPF programs on there. If you have support in the kernel for multiple trampolines, you can do the things that have to be on all the time so you don't miss anything, but if you want to do some other type of profiling, you can change all the call sites to point to the other trampoline.

Let me just slow down a little bit; I'd like to respond to something said earlier — you keep dumping a lot at once and I can't keep track of it all. What was I going to say? You're talking about overhead. We have to weigh the pros and cons of both approaches, and the con of your approach is the trampoline: the trampoline is much higher overhead than our approach when it's actually in use. Significantly higher. Wait, when it's in use, you said? Yeah. Have you done the measurement? What about the extra function call? No, there is no function call — it's a jump. It's a direct jump to the trampoline, the trampoline does the call, and then it's a direct jump back, because there's a single trampoline for every single call site, so you hard-code where you return.

We can do that with static keys. We do that with static keys. But no, you're going to have instructions in there. The fact is that in the code, where all the call sites are, it's compiled with the call; you just replace the call with a jump. But if that's all you're doing, the size of that when it's disabled is going to be bigger than the move instructions we add just for stashing the pointer to the code tag. That's all we're adding. Wait, what? I don't understand that.

For a trampoline you have to have space in the code to insert your jump instruction later, right? The trampoline will be in its own section, but to get to it you have to be able to insert a jump instruction. The call itself is replaced with a jump: at build time we replace the calls with jumps to the trampoline. Each trampoline basically emulates what you did, but instead of having the call there, we have a jump to the code and then a jump back. GCC actually compiles things that way: if you wrap something in an unlikely() if-statement, it will jump down out of line and jump back, something like that. It's basically just moving this code out of line, that's all it's doing. Instead of doing the call from the call site, you do the call from the trampoline, and you replace the call with a jump to the trampoline.

This is getting really far off into the weeds. I just want to add something that has about 80 percent overlap here. I was playing around with bpftrace on a live system that was already up and running, which I wasn't supposed to reboot or anything. And the big thing that really hit me was all the static functions — I don't really mean static, I mean inlined functions in the kernel, which effectively turns out to be the same thing.
So if you can grab a file and recompile it so the functions aren't static, then all of a sudden they magically become available to bpftrace. And so I started poking around: what's the best way to address this? The current way is, well, you'd add tracepoints. So when I saw this on the screen, I kind of lit up and thought, oh my gosh, if they're going to put all this in the code in some form, then all of a sudden a whole bunch more functions are going to light up and be visible to bpftrace, because somebody's going to go in and say, I need this visible, I need that visible. How this fits in: if you take this approach and then combine it with Steven's approach, then I get what I want as well — you can do all of the above, because these annotation sites double as additional traceable sites, some of which were static functions before and were invisible.

I'm not at all following how that's relevant to what we're doing, to be honest. The overlap I see is that this is all about saying: I have a running kernel and I need to know more about it. That's precisely what bpftrace is trying to do, and bpftrace is falling short, and this helps with that. If you merge the two, you get a nice extra little boost: you're doing the same thing, but you're enhancing both tracing and this. Maybe I'm not familiar enough with the problems you're trying to solve with bpftrace.

Oh, okay. Well, I had a kernel hang: the code goes in and it doesn't come out, and everything's running, and the question is, well, where? You can start off with SysRq, and it tells you what the backtrace is, and then you say, well, that's nice, but whoever wrote this code has many, many layers of functions nested all the way down, and you'll never be able to figure it out. It's kind of like saying, okay, where's the memory? Oh well, it's in kmalloc, right? You need to know what actually happened. So what I ended up doing was deleting the static keyword, rebuilding, and re-running, and then bpftrace lit up and showed me where it was. So then I immediately said, hey guys, can I change your kernel driver? And they said no.

Sounds like something wholly BPF-specific. It's not something that I personally run into; looking at backtraces with the standard kernel backtrace functionality, static versus non-static is fine. That's why I piped up, just to point out that it can happen, and BPF is very good at helping solve that kind of thing. This implementation is also really fantastic, and I think, again, combining them is going to lead to goodness.

Steve kind of answered my question — I was concerned about kernel code size. The size of the kernel is important, and people turn on config options to make the kernel smaller and so on, so whatever the approach, we should consider kernel code size. Yeah, but basically what I said was that both approaches are the same in code size. Yeah, code-size-wise they are the same. And there is a config option to disable it completely, to compile it out, so people who cannot tolerate the increased size can just leave it disabled.

Any other questions, comments, ideas? Okay, so my one comment is that whatever approach is going to be used, please make it possible to free the memory for the counters. It looks like you use page_ext, right? And that's not freeable.
So if it were possible to free it, or something. I think it's freeable — page extensions have page_ext ops, which have the need() callback. They do, but I think that's only called at boot. Yeah, yeah, you need to enable it at boot; you'd need to disable it at boot too. I'll think about it, but I don't see a direct way to do it right now. So, like, you boot a system, look at the stats, and then you disable the feature and free the memory — that would be quite useful, I think.

Another thing we could do to free memory, which we're not doing yet, is that those ELF sections could be freed the same way init sections get freed after boot. Yeah, those sections are not that big, though, if you look at the breakdown. If we did that, though, then essentially all the overhead would be gone when it's switched off. The biggest consumers are the page extensions and slab extensions, basically. Yeah, but those won't get allocated if it's disabled at boot. I think the question being asked is about doing it at runtime: once you don't need it anymore, you want to disable it so the memory isn't wasted. And another thing: if we do the approach I described with the trampolines, you could actually free it — you could boot up with it all running and then free it. Yeah, that's what I just said. Well, no, I'm talking about the actual injection sites. Oh, you mean the sections. Yeah, they are not that big — they're something like 900K and 312K. The bulk of this comes from the back pointers, basically, because you have to have a back pointer for every page and for every slab allocation. Oh, I see.

Yeah, like we said, page_ext cannot be freed today, but I'm not aware of anything fundamental preventing us from implementing the freeing. Sure — and document that. Mike thinks it's not that simple. If it gets allocated late enough to be allocated with vmalloc, you can free it; but if it's allocated at boot with memblock, we don't really have something that frees memblock allocations afterwards. We do that with hugetlb — we unreserve it. So there is no common infrastructure. Hugetlb does it, I think CMA doesn't, so hugetlb does it in a way, but there is no common API; you'd need to generalize it, probably. But is there anything preventing creating that API and the common infrastructure? Then in the future we could free it. And then of course the next question: you can free page_ext if you are the only user of page_ext, but if there are others, then you can't.

I think this is not really important at this stage, because, I mean, it would be really nice to have it freeable, but that can come later. As long as you have the feature as an opt-in, so that you only enable it when you really want to use it — fine, that's not a problem. And regardless of what kind of solution it will be, you need some metadata to track whatever you need. So I think that discussion is not the most important one at this stage. It's probably much more important to decide whether we really want to hard-code what we are tracking or have something more dynamic.

Is it boot-time opt-in right now? The overhead is all boot-time opt-in: there is a config option to compile it out, and there is a boot-time option to enable and disable it. Also, another note on page_ext: we are considering moving it from page_ext into struct page itself, because using page_ext actually has quite a bit more memory overhead than having it in struct page.
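For scale, a back-of-envelope breakdown of where the earlier ~27 MB estimate could go, using the numbers mentioned in the room (illustrative arithmetic only, assuming 4 KB pages and 8-byte pointers and counters):

    page_ext back references:  8 GB / 4 KB = ~2M pages x 8 bytes    ≈ 16 MB
    slab object extensions:    one pointer per live slab object     ≈ several MB
    per-CPU counters:          10,000 sites x 8 CPUs x 8 bytes      ≈ 0.6 MB
    tag/section data:          the ~900K and ~312K mentioned above  ≈ 1.2 MB

which lands in the 25-30 MB range, about 0.3% of an 8 GB device.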
Is it a separate allocation that's not otherwise done? No, no, no. I'm just repeating what the general consensus in the room is: no, extending struct page is a hard stop. Please, no. Where is the overhead coming from? I recently fixed a problem in page_ext where the flags field, which was always built into page_ext, is now removed if no user needs it. So you should have no overhead if you don't use the flags. I think Kent is talking about the runtime overhead of the page_ext lookup — the performance overhead. Okay, not the memory overhead; possibly I was wrong about that. Anyway, it's not something I'm particularly attached to either way; that's a discussion that's been ongoing on the list. I guess if you extend struct page it can't be a boot-time option; it would just be a compile-time option then. That's what we were thinking, as an option.

The use case that I'm interested in hearing about is a gradual memory leak. The case I run into a lot is that I'll find a subset of machines on a new kernel that are leaking memory slowly. And I don't really care about totals or a comprehensive view of the system. I want roughly zero memory or CPU overhead until I notice something's wrong, and then I turn on some piece of magic — maybe this — that I can use to get traces of never-freed allocations after that point. Right. And I'm willing to accept some leakage. I don't know if this offers that capability or not, but that would be cute. Right.

The idea is that you enable this minimal information, which tells you just: in this file, at this location, this much memory is allocated. So you collect that data — you can have a userspace daemon that collects it and sees that there was a gradual increase. And once you decide that — yeah, can I jump in here too? In this patch set we also added additional information to the show_mem report: it grabs the top 10 allocations and adds them to the show_mem report. So if you've had that leak, you'll see it there. I've already had this in the bcachefs tree, and it's shockingly useful. Sure.

But I assume these are all turn-it-on-at-boot-and-wait-for-trouble modes. I'm thinking I don't want to turn it on until the counters that are already in the kernel show I have trouble. So you want to have it off until — there are plenty of counters today that tell me everything's fine, and then every once in a while I'll see, oh, the slab counters are growing out of control. Gee, where are they being allocated from? Let's turn on Kent and Suren's thing and get more tracing, more visibility for those call sites, which I can't identify. So, no, you have to have the accounting on from the start in order for it to be able to tell you anything. Okay, well, then I guess my dream and this aren't going to line up. Yeah, I mean, we all wish for zero-overhead visibility. But I guess if this could be turned on dynamically after the fact, I could have it.

If I can interrupt there: there is something already, it's not hypothetical. There's BCC, which is, you know, compiled BPF tools, and somebody's already written one called memleak. It does that. And of course, since it's tracing, you can turn it on when you need it and then turn it off when you're done. It will give you a backtrace of whatever it sees leaking, starting from when you turned it on, and it's already in the BCC distribution.
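For reference, a typical invocation of the tool being described — run without -p, BCC's memleak attaches to the kernel allocators; the path shown is the common BCC install location, which varies by distribution:

    # print outstanding (not-yet-freed) kernel allocations with their
    # stack traces every 5 seconds
    sudo /usr/share/bcc/tools/memleak 5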
Would you enable it in production? Yeah, I mean, if you've got a problem, you would turn it on at that point. You need to identify that you have a problem first, right. The use case I heard here was somebody saying: okay, I noticed I have a problem, now I want to turn on the tools. This addresses that — which brings me back, sorry, to my interest in doing something like this to add more points that are visible to the BPF system. Your annotations here would add more points. But I think the kernel as it stands, plus the memleak tool written in BPF that's already there, will solve the case I just heard. It will solve the case if you can reproduce the leak. Well, that's what he just said, right: yeah, I can reproduce the leak, I just don't know which allocation site is leaking. Oh, okay. Yeah, that's an easier problem; in that case you don't need it to be enabled from the get-go. Well, the problem that I'm trying to solve is when there is a leak somewhere in the field: I receive a bug report that there is a leak, and I cannot reproduce it because it happens once in a while, there are some preconditions which I don't know, and I don't have enough information to track it down. All right. Yeah, sorry, we're out of time. Thank you very much, everyone, and we'll continue the discussion. Thank you all.