How's that? Problems hearing me? Maybe you've got it now. You guys hear me? Yes. Cool. OK, the OOM report. This is on 5.17. We do a big order-5 allocation, it fails, and we get this. And I don't have enough information to debug this. Maybe someone else can parse it better than I can, but it isn't terribly useful to me. One interesting thing here is that we're doing an order-5 allocation from the 9p code; usually it looks like it's from the reader code, but I'm not seeing that here. And there aren't any big directories in my test setup, so why the heck is it doing an order-5 allocation?

What this tells me is that we've got places in the kernel where we're using an excessive amount of memory, and no one knows, because there's no good way to see who's allocating what today. Profiling memory usage is kind of a nightmare in the kernel. It's a little better in user space, where you've got things like tcmalloc, but we don't even have anything like that. The page_owner stuff is not easy to use, and it's way too expensive to enable most of the time. That's a whole other thing I'd like to see improved, and maybe we'll talk about that later.

But for right now, what can we do with the OOM report? I made a start on this. Here's an example of an OOM report from my shrinker_to_text branch. Previously, in the OOM report path, we'd only sometimes get information on slabs, and when we did, we'd get all of them, which would be like two or three pages. I don't need most of that, so I changed it to just dump the top ten, sorted. Of course, we still need an actual name for that one; I think that might be a bug in my code, though. And look at this: now we've got shrinkers. Hopefully, since this is the MM room, people here are aware that shrinkers are actually kind of important for debugging OOMs, because they're needed for memory reclaim.
Yet until now, we haven't had any good way of seeing shrinker information, not at runtime and not in the OOM report. Thanks to Roman, we're going to be getting this at runtime via debugfs. And my hope is that, via the pretty-printer work I'm doing, debugfs and this report will show exactly the same information on each shrinker. And we'll get a callback so that, instead of just printing the very basic information that vmscan has, every shrinker can define that callback to print out its own internal information that's relevant to what objects can be reclaimed. If your cache is completely dirty, of course you're not going to be able to reclaim, and we have no way to see that today. I didn't get you a good example, but I defined a to_text method for the bcachefs shrinker that does show how many objects are dirty; unfortunately, I can't show that to you here.

And this is only a start. I don't intend to do all this work myself, but I'm noticing that this code hasn't changed since Johannes wrote it; back in 2006 is when it looks like the show_mem report was added. No one's touching this stuff, and I think that's the problem that needs to be fixed: we need better organization and better tools, so that we can just have better log messages, error messages, and reporting. That's the end goal of the printbuf stuff I was talking about in my bcachefs talk yesterday.

Matthew, should we talk about our crazy pretty printers? Yeah, sure. So right now, let me pull this one up. Yes, printbuf is going in a new direction. I was looking through vsnprintf.c, and it turns out we've got lots of pretty printers in there that I didn't even know existed, because they're all used via this crazy percent-sign syntax that's not discoverable. You can't use cscope to see where the code you're calling is.
And all these pretty printers are in this dumping ground that no one looks at except when they go to add something new. Pretty printers should be defined with the code that they're printing, not off in this dumping ground. Well, that's one way of looking at it; the counterargument is that the dumping ground gives you one place to look to see what pretty printers exist. Matthew's idea was this: let's see if I have it here. Here's the test code. We're going to have "%(", which represents a function call. You pass the pretty printer, that is, the function itself, to printf, along with the subsequent arguments to the pretty printer. So we have a printf test function here using the old calling convention, which does some output. And now, instead of specifying something like %pB, which I think represents bdev name, you'll just be able to say call bdev_name and then pass it the argument. So we'll be able to have a million pretty printers and define them wherever we want.

My experience in bcachefs has been that getting into the habit of writing a lot of pretty printers pays off: good begets good. Your log messages get infinitely better, and it becomes a lot easier to write a new pretty printer when you've got pretty printers for all your little base types. I think life is going to get a lot easier and our log messages are going to get better. And all that stuff in vsnprintf.c we can move to where those types are defined, and we'll be able to find that code in cscope.

In the meantime, what else can we do with the show_mem report? Well, we could rate-limit it. Right now it sometimes gets spewed like 20 times in the space of a second; I'm not sure that's terribly useful. show_mem is the report we print when we hit OOM, and it's seen by a lot of people, including end users who want to understand why their system ran out of memory.
And I don't think it's terribly useful to them either. So maybe we could introduce some better organization. At the top level, we could divide it up into kernel memory and user memory. Sometimes you run out of memory just because one user task has gone out to lunch, and if we dumped the top five or ten tasks, sorted by memory usage, I bet a lot of end users would find that useful. Page allocator fragmentation could be useful too. I was getting into some talks with Dave Chinner, and I think there's a bunch more stuff with shrinkers that could be reported.

Actually, if you look back in the git log, we used to print the top tasks, and it was removed for some reason that I don't remember. Yeah, if you want to dump all the tasks, or at least some of them, that's not really an easy thing to do, because you need to do some task-list walking, and it can get pretty expensive, especially if you have hundreds of thousands of tasks running, which can very easily be the case. So that's one thing. Another thing I wanted to mention slightly earlier: if we rate-limit how often we're generating that report, that will help somewhat. Maybe, yes. I can just tell you that we do that on the OOM killer report, and that's really expensive, and it can flood your log buffer to the point that essentially not even the whole OOM report gets there. Partial information can be useful, but usually not all that much, because if you don't have a complete picture, you can draw completely wrong conclusions from what you're seeing. It might not just be the top consumers that are the problem, but the sheer number of them, at least in my experience.
Also, one thing I wanted to mention earlier is that what should be in the OOM report really depends on the out-of-memory situation you're facing. If you're failing a GFP_NOWAIT allocation, that's a completely different thing, and shrinkers are likely not the primary information you're looking for, nor any reclaim statistics. If you're failing a high-order allocation, shrinkers are probably not the most important thing you should be seeing either, because it's usually compaction that's failing, and knowing why compaction is failing is very likely what you're interested in. That depends on fragmentation; it could be compaction, or we could just not have memory. Yeah. We don't really print information on fragmentation right now, so adding that would get us going in the right direction. Yeah, we've been discussing that on the mailing list already, but what I was trying to point out to you is that the allocation failure context matters, and doing unconditional stuff there is probably not going to be very helpful in 99% of situations. What we currently have in show_mem is mostly a compromise of what might be useful, and it usually is; there are cases when it's not. I'm not sure whether that's really something that will help in general. Well, anything is going to be a compromise, but I think we can probably do better than what we're doing now.

I'm looking at just below your shrinkers, where it says add tracking to see how many slab objects are allocated versus how many are in use. I'm thinking: do we already have something like a watchdog timer? Basically, once we see a certain level where memory is starting to get constrained, and maybe we don't have much that we can reclaim, we could say, oh, you know. There's nothing that looks like a watchdog in the shrinker code or in vmscan. That might be a good thing to add.
Yeah, so that way, once we see that we're getting to a point where we're not reclaiming much and we're running out of memory, heading toward an OOM, maybe we trigger tracing or something, to ask: where's the cause of the OOM? If we could see what's happening beforehand, because right now it's post mortem: we hit the OOM and then we try to figure out what brought us to this point, and it's almost impossible. But if we had a level that says, oh, we should start taking a look at what's happening, what's being allocated, who's the culprit, that might be more useful. Yeah, that would be awesome, except in most cases that build-up can be really slow. It's mostly the boiling-frog kind of problem: you're doing fine until that very last drop that tips you over the edge, and that's usually the victim rather than the culprit of the whole build-up that has happened until that moment.

Another point I want to make is that debugging OOMs sucks even in the simple, easy case. It sucks when I'm just running xfstests in a single virtual machine in my optimized developer setup. If I can't debug OOMs there, then we've got bugs that just aren't getting fixed, and those bugs are also showing up in these big, complicated scenarios like you're talking about. Let's make sure we can at least debug the easy scenarios; there will still be extra bugs that show up in the harder, bigger, more complicated scenarios, but let's get the easy stuff first. Yeah, I would love to see the report be more useful. For example, I myself have to post-process what the OOM report is telling me, because you just see too many numbers, and just to wrap your head around them you need to do some basic calculations.
For example, what I usually do is check the proportion of LRU versus slab memory, and if I can see that 90% of the memory is just consumed by user space, then it's very likely to be a user-space problem. Having that called out explicitly would be really nice. I think that would be an improvement for somebody who doesn't already have that post-processing wired in. Maybe some of the stuff you're calculating in your post-processing could just be in the show_mem report. Yeah, maybe. I think we can argue about the thresholds for when you start reporting this or that; that's usually much easier to do from user space, when you see the whole thing and just run awk or whatever you use for processing a lot of numbers. But I would also like our show_mem report to default to being more uniform and less conditional. And just like there was no reason before to be dumping all the slabs in two pages of output, if we do a bunch of useful summaries instead of dumping all of some things, I think that'll be more useful. There's always the corner case where you say, I wish I had this. Yeah. But I think what could be useful is if it at least leads with what's most pertinent and drills down from there.

Another thing along the same line as what Michal said: we list total memory, we list free memory, and then we list a bunch of known consumers, but some consumers aren't known. You have to add up everything that's known to figure out, oh, there's a large gap; somebody's doing page allocations and not reporting them. So if we can start with the summaries, and then say here, you might also be interested in this, I think that could go a long way. If we rate-limited it, I think it's not a big deal to have a more verbose dump.
And for OOMs it's fine, because they're not that frequent; but we have machines that hit page allocation failures, because the allocations don't qualify for OOM, either because they're higher-order or they're NOWAIT or something like that. And then you have machines that are just in a loop failing allocations, and it's just dumping the same thing over and over, which is kind of useless. Yeah, so rate limiting is a really nice thing in theory, except it doesn't work here, because rate limiting can only work when the critical section you're rate-limiting is negligible in how long it takes. And printk can be pretty slow if you're writing to very slow serial consoles or whatnot. We have actually seen OOM reports essentially grind the machine to a halt just by dumping that information, because it takes ages, and rate limiting cannot really help you, because the whole thing takes much longer than the interval you're rate-limiting to. So if we ever want to do something about that, then we really have to come up with a way to... This will have to be some custom rate limiting, not the printk_ratelimited thing; that won't work for this. Yeah, it doesn't. So that would be something that would need to be done as well.

What I'm trying to say is that just dumping more information with the current implementation is not all that easy; there is quite a lot of work to be done. And too much information can be tricky on its own, because then you don't know where to look. So I completely agree with having some high-level summaries initially, to just tell you that 80% of your memory is not accounted for, so you should probably look at the networking subsystem, because that's the usual suspect, or something else. But you won't find out much more, or you'll have a hard time doing so, because it's not accounted. For that, I don't think this exists now.
As part of the folio stuff, Matthew's been figuring out what the actual type hierarchy of our different page allocations is. We should probably be tracking memory usage by that actual type hierarchy, and I'd love to be printing that out. I was just wondering how somebody can allocate memory without us knowing they allocated memory. Well, there's just nothing there: you call alloc_pages and it doesn't account anything. For the ones we do account, you allocate pages and then you call code to tell vmstat, hey, I allocated n pages. If you don't do that, we just don't know. So should we not allow that? There are a lot of call sites you would have to update.

Let me talk about my plan for solving that problem. Right now we don't have any way of tracking memory usage by call site that's remotely efficient enough to be enabled at runtime in production, and it is possible to do this. There's a trick that the dynamic debug code uses that I think everyone should know about. It replaces your pr_debug calls with a macro that defines a static struct that it puts in a special ELF section, and then, when the kernel starts up in its init code, or when you load a module, it walks that ELF section, treating it as an array, and adds every single pr_debug call site to debugfs. So now you can grep through debugfs and turn individual pr_debug calls on and off at runtime. Super cool. Imagine.

Don't we have tracing? Why don't we use tracing for that? I mean, we do have trace hooks in the page allocator, and that's the usual way that we, at least when we have customer problems that are reproducible, deal with it: we just enable tracing. I want something that's always on; I can't enable tracing for every single memory allocation all the time. I want to just be able to look in debugfs and see, by file and line number, that this allocation call site has X megabytes allocated, and pipe it through sort. We can do this, always enabled, on every server.
Tracepoints are compiled in; you just have to flip them on. Yeah, but you can't have those tracepoints for memory allocations enabled all the time. And then user space has to be the one to match up the allocation and free trace events, and that's expensive in terms of memory usage in user space. You want to match up allocations with the corresponding frees. Everyone says tracing, and tracing is great, but it's not the solution to every problem. One thing you could do to get some benefit out of it would be to use BPF and track a map by call site. Yeah, and that uses tracing. It does.

One thing with doing this, especially once you start involving user space, is that user space will trigger more allocations, so you have to be careful that, while you're tracing and recording what's going on, you don't recurse into allocations. We tried doing this with BPF, but it gets hairy really quickly, because you have process context freeing memory, and it's being interrupted by an interrupt that also frees memory, and then it's a question of who gets access to the data structure, to the BPF map, and we start losing events. So yeah, we've tried to track the outstanding memory allocations in just a BPF map, and that quickly fell apart.

So wait, what's the mapping? You're trying to match the allocation to the free, by the address? Yeah. I've used synthetic events. A synthetic event is a way of attaching to two trace points and mapping them by a field that's common between them to create a third event, and then you can make a histogram or whatever you want off of it. And this works through interrupt context or whatever; it pre-allocates everything before you start. Obviously it has a pool, and if it runs out, it runs out. But we'd want to look for the ones that aren't. What do you mean?
Well, because he's tracing allocate and free, we'd want the allocations that haven't been freed. Right, and that means user space has to remember the trace events for every allocation that has not yet been freed. That's too much memory overhead. I want something that can be on all the time, so that it's actually there, already collecting information, when I run into a problem or something I want to look at.

For driver allocations, such an always-on tool would be very useful too, because there are many drivers we work with from third parties, and sometimes we don't even have sources for them. At least being able to pinpoint who is doing that would be very useful. We do have the devm API, but not many drivers use it. Well, I don't want to point fingers, but yeah.

That's another point I wanted to make about adding to or changing OOM reports: I know that at Google, at least, we have scripts which parse those and extract what's abnormal. So when we're changing them, we need to be careful not to break those. Or if we are breaking them, well, at least they'd better be backward compatible so that existing scripts don't break. Adding is fine; I mean, we don't give any guarantees that additional information is not going to break those, but it would be great if we can figure out a way not to affect them. I don't really care about breaking Google-internal scripts, I'm sorry. I'm guessing not only Google does that, right? I know, but I've never been aware that our printk messages were part of the user-space ABI contract. No, they're not. This stuff is consumed by humans first and foremost, and it should be useful to humans. If it's not useful to humans because some people don't want to update their scripts, that's a problem.

So I missed how the dynamic debug approach fixes your problem. Yeah, I got interrupted.
Once we have this trick of a macro that defines a static struct in a special ELF section, we can use it to wrap all our kmalloc, alloc_pages, et cetera calls, and then pass the address of that struct into kmalloc or whatever. Then we can remember, for each allocation, the call site it was allocated from, so that free can decrement the right counter. This could be plumbed through SLUB and the page allocator, and they already have provisions for tracking per-allocation information, for things like page_owner. So we just need to repurpose that stuff, and getting some help from the people who are familiar with that code would be great.

The page_owner stuff has been really useful when the problem's reproducible; it's just too heavy to run in production. I think the reason it's so expensive is that it needs a lot of memory to store the full stack traces, plus it gathers a stack trace on every single allocation, and that's usually a lot more than you need. Yeah, exactly. And this goes back to the drivers, right? If you have a general direction in which to look, even if it's just a file and a line, or if you just know the module it's coming from. I think the perfect would be the enemy of the good there. Yeah, having a rough direction to look in would already be really useful. My main goal is something cheap enough to be enabled by default, so that more people are looking at this stuff, because, like I showed you with the 9p allocation failure, if we're doing stuff that stupid, it means we're just not looking at our memory allocations. Okay, so yeah, we need to move on to the next topic.