Okay, so: getting full user-space stack traces. User-space stack traces are used for profiling, things like perf. When you want to profile your applications to make those nice flame graphs Brendan Gregg always shows off, you need full stack traces. Ftrace also uses them: when you trace system calls, you want to know where in the call stack they happened. BPF can use them too; in fact, you could probably even add a BPF filter that says "trace only when a certain function appears in the stack trace." So, lots of useful applications. Now, acquiring user-space stack traces from within the kernel is where it gets interesting. Right now it requires frame pointers, meaning the applications have to be built with frame pointers so the kernel knows where the next stack frame is. When you enter the kernel from user space, you have the registers, so you can look at the frame pointer to find the current frame, which points you to the next frame, and so on. When frame pointers are not enabled, perf's DWARF mode (work by Jiri, I believe) does a huge copy of the stack and post-processes it later with the DWARF unwinder: every time you take a profiling sample, it records something like 16K of stack into the ring buffer, which can consume quite a lot of ring buffer space. The problem with frame pointers is overhead. LWN had a nice article when Fedora enabled frame pointers, and I'll talk more about that, but frame pointers require setup code in each function, which means more instructions to execute.
It also requires another register to keep track of the current frame pointer, putting more pressure on the registers available. As I mentioned, there was a good LWN article about this: when Fedora enabled frame pointers, there were complaints that the build slowed by 2.4%, the Blender test case by 2%, and some Python programs by up to 10%. I didn't run these tests myself; I just read LWN. So, how do stack frames work? Here's the basic idea: you have the stack pointer, and you have the frame pointer, which points to the previous stack frame. From there you can get to the next frame, and the next, which shows you all the return addresses, and you get a nice little output. Inside the kernel we have this thing called the ORC unwinder, which stands for Oops Rewind Capability, but we all know they just picked that name to go with DWARF and ELF. It's much simpler than the DWARF format. Remember how many people wrote assembly code inside the kernel where you had to put those CFI annotations all over the place, and how many people got that wrong? I was notorious for that. I did a lot of the assembly code and tried in vain to get those annotations right, and I would constantly get complaints that they were totally broken. To me it was black magic; I had no idea how those CFI annotations worked. In 4.14, live kernel patching required accurate stack traces. The reason is that there are certain cases where you don't want to patch the kernel if a function is actually in use; there could be race conditions.
So when live kernel patching went in, I believe it did a stop_machine or something similar, and then checked every single task, doing a stack unwind of each one to make sure none of the running functions was something about to be patched, which would cause a crash. The whole point of live kernel patching is that the system keeps running and you can update it without rebooting. If live kernel patching crashed when you enabled it, that kind of defeats the purpose, and people would be upset at you. So I remember when this came up, we said we need really accurate stack traces. Josh — I'm not going to pronounce his last name because I know I'll mispronounce it; it's French, it starts with a P [Josh Poimboeuf] — developed the ORC unwinder to solve this. The tool that does this is objtool, in the kernel's tools directory; it generates the ORC unwind sections at compile time. The way it works is with two tables. One table is the ORC unwind data: information about where the return address is relative to the frame, and a few other little things. The other table is just instruction-pointer addresses, as ranges. To explain how this works: say your stack pointer points someplace in memory, with your return address and variables there, and your instruction pointer is someplace in a function, and you want to do a stack unwind. The ORC unwinder looks in the IP table, which is sorted; it's basically a sorted list of ranges — not every single IP address, just the start of each range. So with a binary search you can easily find the entry for your instruction pointer.
Once you find that, the index into the ORC IP table is the same index into the ORC unwind table. So once you know the index, you can find the entry with the additional information. One of those things is the stack pointer offset: take the current stack pointer, add the offset, and that gives you the location of the return address. That gives you the next instruction pointer, and you start the whole process over again. (I described this slightly backwards on the slide, but the idea is: find the entry, apply the stack pointer offset, get the return address into the next function, and out of that you get a nice stack trace.) Which brings us to SFrame. SFrame is based on the ORC unwinder, same concept, but this time for user space. What's nice about SFrame is that if this is implemented, when you compile with SFrame, the kernel can do a full user stack trace without frame pointers enabled. We're trading disk space for speed: none of the performance penalties of frame pointers, but more data stored to hold these tables. The section lives in the ELF file and contains the two tables. One thing we do require is that it be compiled in. Ideally I'd love for this to be on almost by default, or at least created whenever you use the -g option or something like that, so basically everyone has it. Once we get this done, it will be possible to read this from the kernel, and perf, ftrace, and BPF can all benefit. Now, my idea for the solution: the data that needs to be read is located in a file, and I have to be able to read it, and obviously perf triggers a lot of the time from NMI context. So I'm thinking we do it from the ptrace path. We also have to find a way to handle the offsets.
Because what it gives you is an IP address, and the problem with an IP address is that it's relocatable. Yes, you get an address in virtual address space, but how do we map that back to something useful? If I do an ftrace stack trace and get a bunch of addresses, I want to resolve those addresses to actual functions, or at least the file offsets they came from, so I can find out where in user space they were. Otherwise it's kind of useless. One thing to note: we already have this information in the kernel. If you cat /proc/PID/maps, it shows you the addresses where everything is mapped — which files are mapped into the address space. So the way this would work: you're in user space, an NMI triggers, say, and drops down into perf. Perf says, I want a stack trace. Currently it reads the stack directly from NMI context, with a "don't page fault" option: if the access would fault, it does nothing and just reports that you faulted. So if the stack happens not to be in memory for some reason — memory pressure swapped it out to disk, or you hit an unmapped section — too bad, it just won't trace that part. You drop the sample and continue. I want to change this, and I've been talking to Jiri about it: instead of doing the full stack copy, it will ask for an SFrame unwind. Then on the way out — this could be returning from NMI context, interrupt context, anywhere — just before we exit back to user space, there's a flag that can be set inside the task struct that says: is there work to do? If it's not set, we go right back. It's a fast path; we don't want this to add anything.
It's a very simple bit check, and it goes right back to user space if there's no work to do. But if there is work to do, we go into the ptrace path. The reason it's called the ptrace path: ptrace, pt_regs, and everything else we know there is based off the ptrace code that GDB uses to trace other applications. In this ptrace path, ftrace does stuff, perf does stuff — this is where all the tracing happens. What's great about the ptrace path is that it's normal process context: we can sleep, we can map in pages, we can do whatever we want there. So we're guaranteed we can bring pages back in under memory pressure or whatever, and we can get the full stack trace without losing it because something wasn't mapped in. So perf, ftrace, or BPF could register with the SFrame infrastructure, which says: okay, I'll get the stack trace for you and then hand it off, so perf can record the chain. This will require changes in the perf user-space tool. The reason is that you may be profiling user space and enter a system call that takes a long time, and you get, say, three or four NMI hits while inside that system call. Each of those hits will record just the kernel stack trace, tagged with a flag that says a user-space stack is to be added on top of it. The ring buffer records the kernel stack with that tag and keeps going, and then on the way out, the user stack is actually recorded once.
When it goes back to user space, the perf tool will see these kernel stack traces without the user-space stack, but with a flag saying one is coming, so it will have to hold that information until it finds the user stack record, and then it can re-append the user-space stack trace to each sample. Yes, Mike? [Mike] Steve, I was wondering, to make perf work out of the box, is it possible to reserve some space below the kernel stack? [Steve] Okay, actually there's a second benefit of this approach: smaller perf data. Why are we wasting space in the ring buffer when it's just duplicated? Once you're in the kernel, the user-space stack is not going to change; you can wait and record it whenever you want. So the idea is twofold: one, fix it so we guarantee we get the stack; two, save ring buffer space. Ring buffer space is prime real estate; it costs a lot to use it. So when it's done, it goes back to user space. The other thing I have is relocatable addresses. This is — oh yes? [Audience] A very quick question about a stack that's been swapped out. Do we want to know that the stack was swapped out at the moment it was profiled? [Steve] I don't think perf wants to know if the stack was swapped out. We have trace events that can tell us about that. But if it is swapped out, we actually have to map it back in. Yes, that may happen, and that's another issue we have to think about with this whole approach: the work I want to do is much more likely to require paging memory in at this point, whereas at the old point we were just copying the full stack.
It's highly unlikely that the stack itself is swapped out, because the stack is used constantly, so in the LRU algorithms it will always be among the most recently used. So the idea here is this: when we ask for a stack and go into the ptrace path — and this is where I need input and discussion, maybe offline; if you'd like to be involved, please email me — we are going to do that ORC-unwind-style lookup. The problem is that the table is an ELF section. Now, we've talked about get_user_pages(). Honestly, what I would love is for the SFrame section to be a loadable address, and this is where I don't know exactly how it works. In user space, from what I understand, when you execute something, the kernel records where everything is mapped but doesn't populate the page tables. When you execute code whose page table entry doesn't exist, you take a page fault into the kernel; the kernel looks at the VMA and says, hey, this is code from this file, finds the file, maps the page in at that location, and returns to user space, which keeps going until it runs off the page, and then it happens again. Maybe there are optimizations that pull in more than one page, who knows. I want the same behavior for SFrame: the SFrame section isn't mapped until it's used. Yes, it will cause a Heisenberg effect where the tracing slows down the application, but once the section is mapped in, it's seldom likely to be evicted, so there's only an initial hit. If you're profiling, you might see a little blip at the start, and maybe we could write some code to let it warm up for a bit. I mean, it sounds like such a rare event that the stack gets swapped out. [Audience] But that's not what we're worried about.
[Audience] We probably want to know about that. I'm not saying don't fall back — if it's swapped out, that's fine, take the slow path but get the stack. I'm just saying we should give this information to the user, so they know the trace slowed down here. [Steve] So if that happened, we could record it. Okay, that's actually a good point, something we could add to the infrastructure — to the perf infrastructure or whatnot: if we do take a page fault, record it and time it, and send that information up to perf, so user space can say, oh, this little blip here is due to our tracing, and account for it. That's a very good point, John; please mark that down, because I'm not going to remember it. Hey, it's recorded, right? [Audience] Is this going to get rid of that awful patch adding build IDs to every struct file? [Steve] The build ID — oh wait, there was a patch that wanted to add a build ID to every single struct file? [Audience] Please tell me what you're working on will get rid of that. [Steve] Which struct file? [Audience] Every single struct file. They wanted to add a pointer to store a build ID. [Steve] Yeah, I saw that; I never understood the purpose of it. Was it from Google? [Audience] It was Jiri, also. [Steve] Do you want to respond to that? [Jiri] Sure. I guess the reason for the build ID in the file object might be a bigger discussion; we covered it in that email thread. I'm not sure whether this could actually help with that in some way. The problem is that we would like to have the build ID information available in contexts where it's really hard to get. In NMI context, we can try to read it, but it's not reliable.
[Jiri] So the idea was to read it at mmap time, when the VMA is loaded, and store it in the struct file where we can access it later. I'm not sure whether, with this table, that information will be available at some point. The point of the build ID work is to have it prepared ahead of time, so it's available in contexts where it's really hard to get. [Steve] Was the build ID for something to do with mapping, knowing which...? [Jiri] The build ID is a unique ID of the binary, so later on you can identify the matching debug info for the application you were running. [Steve] Would you please give Mike the mic? [Mike] Thank you, Steve. So, if Steve builds this infrastructure to go through the ptrace path, you can access the build ID from there, right? It's not NMI context, it's normal kernel context, and you can do pretty much whatever you'd like. [Steve] Okay, do you want to continue this conversation, or shall I continue with SFrame? I'll go on with SFrame. So, my dream for this point right here is that the SFrame section in the ELF file is treated almost like the normal text and data sections, so that — and I don't know how this would work in the kernel; I know there's get_user_pages() — I could do something like user space does with code: this is data I want to touch, please map it in. [Matthew] I think you're making this too complicated. You don't necessarily need get_user_pages(). You just want to access some bytes which are in a file somewhere. [Steve] Yes. [Matthew] You can just ask the page cache: please read these bytes for me, and it will do that. It never gets mapped into user space at all; we don't play with the page tables. You just get those bytes out of the page cache.
[Steve] Okay, that's actually pretty good, but there are some tricks coming up. Actually, I might ping you for more information about this. [Matthew] I'm more than happy to help. [Steve] Thank you, because that's the whole idea: I have to do a binary search in a file that could be huge. I plan on having Chrome built with this, and that's going to be a several-megabyte SFrame section to binary search through for these lookups — and access will be sparse; it's not going to pull in everything, so I don't want the full section locked in memory. By the way, I want to let you know why Indu is up here: she implemented the SFrame generation on the toolchain side, in GCC and binutils. She's completed that — I was so excited to see it; I've already compiled with it. She even has a proof of concept on the kernel side; she sent that, and it does things, but it was just to see if it works. And she designed the SFrame format itself, too. So she's done a lot; much appreciated. Well, it's not going to replace objtool. Actually, technically, we could make the kernel use it. Instead of having objtool do its own analysis — since it's a big build step anyway, if the compiler is already producing SFrame, then... [Indu] Yeah, so I think the kernel has these two things which will be a problem: inline assembly and handwritten assembly. The good thing is that if you write the CFI directives correctly for those stubs, you will get correct SFrame information generated for them.
[Indu] But as far as I understand, there are non-standard stack usages in those — in the sense that there's a specific register being used to stash the stack pointer — and that's not something SFrame can represent. SFrame doesn't let you say "I designate this register as the stack pointer" per ABI; it just assumes the unwinder knows what the stack pointer register is for that ABI. So, cutting it short: no. For kernel space, we'll have to see whether unwinding through those portions of the kernel that use the stack in a non-standard way is really important. If not, then yes, SFrame could be used. If it is important, then SFrame in its current form won't cover those parts, and even if we do want to support that, I think it would bloat the format. But we can talk more, depending on how important that part of the kernel is. [Steve] It could maybe simplify objtool. We could build the SFrame section, and objtool could just look at the SFrame data instead of doing its own analysis: hey, SFrame, I'll use this data to create my tables — and then throw it out, because we don't need to keep it. So yes, I need that so we can do the quick search. Question? [Audience] Yes, Steve, I like Matthew's idea of calling the address-space operations to find the page in the page cache. Another plan B, if we don't want to complicate the ptrace path: on the first NMI, obviously you're getting the kernel stack, but you need the user stack. You could store in the perf ring buffer itself that you need information about this file, and then the perf tool could call back into the kernel and load the SFrame data.
[Steve] How's that going to help ftrace? [Audience] Oh yeah, ftrace. I thought you were dealing with perf and... [Steve] No — ftrace, BPF, anyone else using it. [Audience] You could make trace-cmd... [Steve] But how do echo and cat do that? [Audience] Busybox is gonna... [Steve] Yes, this needs to work for busybox. [Audience] I was wondering: you said there could be megabytes and you're doing a binary search. If we did that through calls to some page cache API, wouldn't it be much slower than just going through the page tables? [Matthew] Oh, it should be a lot quicker. And the advantage of calling into the page cache is that you can control it directly. You can say, hey, please do readahead for these things. So you could submit not just one I/O: if you're doing a binary search, you could say, I'm going to want these three pages, please start I/O on all three of them — basically prefetching a level ahead of where you want to go. There are lots of interesting options once you're playing those kinds of games. [Steve] Okay, and since we got the five-minute warning, I'm going to try to go a little faster. There are a lot of things here. This is the first thing: if we get this in, that's great, but then we have the file offset issue. So, where main is — and these are real numbers; I actually wrote a little program and ran it — the file offset of main in the text section is 1139. And the program itself printed the runtime address of main. I ran it repeatedly, and because the binary is relocatable, I got a different runtime address every single time. The 1139 stayed the same, but everything else changed.
[Steve] That crazy runtime number is what's going to be in the stack trace. How do I convert that back to main? [Audience] You subtract the VMA's start from it, adjusted by the VM offset, and you get the answer you want. [Steve] Oh, do you? [Audience] Yeah, I can get you that; it's a line of code. [Steve] Awesome, that's actually great. Indu, do you want to go over this? [Indu] Yeah, I had two quick comments; I think the second can be left out in the interest of time. The first is that this format is unaligned on disk, and I know the kernel is very particular about unaligned accesses. There are three main components of the SFrame section: the SFrame header, the function descriptor entries, and the frame row entries. (Oh, we flipped to the next slide — going back.) The SFrame header is no problem, because the members of that structure are naturally aligned. But the function descriptor entries are 17 bytes each, so as you walk through memory, at some point the accesses will be unaligned. All of these accesses can be managed: you can write the API so that when it's doing an unaligned access it goes byte by byte, and still reads normally when things are naturally aligned, and so on. So it's possible to handle in the APIs, but I did want to point out that it will look something like this: you will have memcpys and weird-looking casts and such, because there's no alternative. If you go the route of requiring alignment, I think it would blow up the format, because we made many compactness-related optimizations — if a value is just a small number like 16, I encode it in a single byte. So it is natural for this format, and in its best interest, to be unaligned.
[Indu] So I think it should remain like that, and we plan to handle all of this in the APIs we provide. That's it, I think. [Steve] Yeah, I think we can handle it; she just wanted to make sure we pointed that out. Here are some things we'll have to look at in the future, but like I said, if I could just get this first part in, that would be awesome. We need a way to handle dlopen(): we do have a file there, but we may need a system call to tell the kernel, here's the SFrame for this file. And JIT — how do we handle JIT? That's going to be very interesting, because we have to find a way to say: here's SFrame data for this generated code that has no file backing it. One thing I'm wondering about is maybe passing the file names into the trace buffer, but I don't think you really care so much about that. Here's another thing: could we actually get the symbol names somehow? If we have the ELF file with a symbol table, could we do some logic that says, hey, I have this address, can I look it up in the symbol table — or DWARF, or maybe another extension to SFrame, maybe on your side — that says, by the way, here are also the function names? Yes, someone in the back? [Audience] If I understand correctly, you already have to parse the ELF format to even identify the SFrame section in the file before doing the binary search, so you could identify the other sections too, right? [Steve] Well, yeah, but that's done at load time of the ELF file: binfmt_elf parses it, so we just have to add the SFrame handling there. That's another thing: we need to add that code to the binfmt_elf loader. [Audience] It will be in the map info of glibc.
[Audience] Basically, glibc will load a shared object or whatever, and internally it keeps a linked list of map descriptions. So what you need is for the kernel to have access to that map. [Steve] Yeah, that's true. [Audience] So you don't have to parse there. [Steve] But we'd have to define a format — this is something where the kernel and glibc have no defined specification, so there would have to be a specification we follow so we don't break user space. One thing I don't think we mentioned about the SFrame format Indu designed is that it has versioning, so if we want to add to it, we can bump the SFrame version number and let the kernel know: oh, this is a different version. So we can change the SFrame format if we need to in the future. That basically covers pretty much everything. Any other questions? And yes, Matthew, I'm looking forward to working with you; we really, really want this in. [Audience] Actually, talking about the map info: the SFrame information lives in its own segment, and the segment is typically in a shared object, in a library — is that a problem? [Steve] At least it's shared among processes — actually that's better, I guess. If the loader has discarded it, well, I can't work magic; I can't bring it back from the dead. So I'm looking forward to this. Hopefully anyone who wants to help out will; we're going to try to keep track of it, and that's why we're here. Thank you very much for your time and for listening, and thank you, Indu. [Indu] Thank you.