Welcome, everyone. This talk today is on SFrame, the Simple Frame stack trace format. My name is Indu Bhagat. I am a member of the Linux toolchains team at Oracle.

So today we're going to talk about stack traces, and many things about stack traces in general. What is a stack trace? What are the current methods of generating stack traces? What are the challenges of each of these methods? What is the SFrame format, then, and what impact can it make? Specifically, I want to steer at some point towards how SFrame can help for the use case of fast and low overhead stack tracing. We'll also see how to generate stack trace information, and hopefully I can also show you how easy it is to stack trace using the SFrame stack trace format. We'll also talk about some current and future work that we have planned. And if there is anything that jumps out to you or that you would like to contribute to, we can talk at that time too.

So a stack trace is simply a list of function calls that are currently active in the thread, right? And we use stack traces for a variety of use cases. We need them for profiling, tracing, debugging tools and more. A useful stack trace, like so, will show you the call chain IPs and some symbols associated with them. Now, this process of symbolization is what makes it human readable, and it may bring in not just function or symbol names, but also file names, line numbers and so on. But the thing I wanted to point out here is that in today's talk, we won't get into the aspects of symbolization. We only talk about call chain IPs and how to generate them using SFrame.

So what are the methods of generating stack traces? Well, the short answer is it depends. A debugger will do it differently than a profiler because they're just different use cases.
They have their own restrictions on the overhead that they can sustain. The list might be larger than this, but I think this mainly covers the various methods that we have so far for stack tracing. And for the context of the talk today, I think it's useful if we get into three of these methods: the frame pointer method, EH frame, and then application-specific formats. This is just so that I can set up enough context to then talk about the SFrame format.

So the first method is the frame pointer method, right? This is perhaps one of the oldest methods for stack tracing. In this method, the ABI reserves a register and the compiler generates additional instructions with which the program will save the top of the stack as you enter the function, and also restore it as you exit the function. For example, in the case of x86, the register is RBP, and you will see additional instructions in your program, the prologue and epilogue, with which the program saves and restores this register. So there is an overhead with this method, in terms of these additional instructions and also in terms of the compiler having to keep this register for nothing else but stack walking. So there is a performance overhead to this method, and this has been one of the biggest complaints about the frame pointer-based method. On the positive side, however, this method is really beautiful. It just works, and it's so simple and easy to understand. So there are many good things about it, but the cons are mainly that there is a performance impact, and the second thing, and I don't know if it's so much of a problem in practice, it seems like it's not, is that it is not fully asynchronous. There are a few instructions in your prologue and epilogue where, if you try to stack walk back from them, it's not going to be fully precise.
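The frame pointer walk described above can be sketched in a few lines. This is only a toy model, not taken from the talk: the stack is simulated as a Python dictionary, the addresses are made up, and `walk_frame_pointers` is a hypothetical helper name. It assumes the usual x86-64 layout after `push %rbp; mov %rsp, %rbp`, where the saved caller frame pointer sits at the address in RBP and the return address sits one word above it.

```python
from typing import Dict, List

WORD = 8  # bytes per stack slot on a 64-bit target


def walk_frame_pointers(memory: Dict[int, int], pc: int, fp: int,
                        max_depth: int = 64) -> List[int]:
    """Collect call-chain IPs by following the chain of saved frame pointers.

    Assumed layout: memory[fp] holds the caller's frame pointer and
    memory[fp + WORD] holds the return address. A saved frame pointer
    of 0 ends the chain.
    """
    chain = [pc]
    while fp != 0 and len(chain) < max_depth:
        return_address = memory.get(fp + WORD)
        if return_address is None:
            break
        chain.append(return_address)
        fp = memory.get(fp, 0)
    return chain


# Hypothetical stack for main -> foo -> bar (all addresses are invented).
stack = {
    0x7000: 0x7020, 0x7008: 0x401200,  # bar's frame: foo's FP, return into foo
    0x7020: 0x7040, 0x7028: 0x401100,  # foo's frame: main's FP, return into main
    0x7040: 0x0,    0x7048: 0x401000,  # main's frame: end of chain
}

# Sampled in the body of bar, after the prologue has run.
full = walk_frame_pointers(stack, pc=0x401300, fp=0x7000)

# Sampled at bar's first instruction, before `push %rbp`: the register
# still holds foo's frame pointer, so the return into foo is missed.
early = walk_frame_pointers(stack, pc=0x401300, fp=0x7020)
```

In this model `full` contains the return addresses into foo, main and the startup code, while `early` skips the return into foo. That is the imprecision in the prologue that the talk refers to as the method not being fully asynchronous.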
Moving on, the second method that is also widely used, in some cases, is the EH frame-based method. Now, EH frame is a DWARF-based method, and it's not just a stack trace method, it's an unwind method. By that, what I mean is that unwinding means stack tracing and also being able to restore the registers at each of these points. So EH frame is really more than just a stack trace format, and this information is present in the binaries as the .eh_frame and .eh_frame_hdr sections. The format itself is quite compact, it's versatile, it works across languages and platforms, and it works well in practice. There is good library support for it: if you're a user application and you want to use EH frame-based stack walking, you can pull in these libraries. The information is out of band, so you don't have to compile your applications with -fno-omit-frame-pointer.

But the biggest concern with the EH frame-based method has been that the stack tracer itself is slow and complex. Why so? In the EH frame-based method, what is encoded in these sections is basically the instructions to recover the stack offsets. Some of these opcodes are simple, some complex, but you need to implement a stack machine where you can execute these opcodes, figure out the stack offsets and unwind using that. So this whole aspect of having to execute opcodes is what makes unwinding slow at the time of stack tracing. There are other aspects of DWARF that make it slower too, but by and large, this is the biggest complaint with EH frame-based stack tracing: it is slow, and the unwinder itself is becoming more complex for applications to handle. And this is one of the reasons why it's not the preferred method even in the case of kernel stack unwinding.
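To give a feel for why executing opcodes costs time, here is a toy model of replaying CFI-style instructions to find the CFA rule for a PC. It is a sketch under strong assumptions, not a DWARF parser: the opcode names are borrowed from the DWARF call frame instructions (`DW_CFA_def_cfa`, `DW_CFA_advance_loc` and so on), but they are represented as plain tuples, register save rules and DWARF expressions are left out, and `cfa_rule_at` is a hypothetical name.

```python
from typing import List, Tuple

# Each op is a tuple: ("def_cfa", reg, off), ("def_cfa_offset", off),
# ("def_cfa_register", reg) or ("advance_loc", delta).
Op = Tuple


def cfa_rule_at(program: List[Op], target_offset: int) -> Tuple[str, int]:
    """Replay the opcodes from the function start until the row that
    covers target_offset is reached, then return (register, offset)."""
    location = 0
    reg, off = "", 0
    for op in program:
        kind = op[0]
        if kind == "advance_loc":
            if location + op[1] > target_offset:
                break  # the next row starts after the PC we care about
            location += op[1]
        elif kind == "def_cfa":
            reg, off = op[1], op[2]
        elif kind == "def_cfa_offset":
            off = op[1]
        elif kind == "def_cfa_register":
            reg = op[1]
        else:
            raise ValueError(f"unsupported opcode: {kind}")
    return reg, off


# Rules for a typical x86-64 prologue: push %rbp (1 byte), mov %rsp,%rbp (3 bytes).
program = [
    ("def_cfa", "rsp", 8),
    ("advance_loc", 1),
    ("def_cfa_offset", 16),
    ("advance_loc", 3),
    ("def_cfa_register", "rbp"),
]

at_entry = cfa_rule_at(program, 0)      # before push %rbp
after_push = cfa_rule_at(program, 1)    # after push %rbp
in_body = cfa_rule_at(program, 4)       # after mov %rsp,%rbp
```

Even in this reduced form, every lookup replays the program from the start of the function. A real unwinder additionally has to decode the byte encoding and handle the more complex opcodes.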
There is, yeah, this Fedora 37 proposal, which gathered a lot of attention from various folks in the community, and it's interesting to read given what we have talked about so far on frame pointers and EH frame. There is some discussion there too.

Moving on, there are also solutions out there which are application-specific formats, right? And it's not just one. There is one that we know of, which is the kernel's ORC stack trace format. There are others out there, not open source, but you can see that there are applications fulfilling their needs by using similar methods. What they have done is basically come up with their own stack trace formats. And if you look across these applications, one of the common reasons why they have had to come up with their own application-specific format is that they consider EH frame-based stack tracing too slow and complex for their use case. The problem with this solution is that now, across applications, you have these unwind formats which are internal to the application, you know, ad hoc formats, and they're not generated by the toolchain. So if you need to port your application to multiple platforms, you're now stuck with something that you cannot port. It becomes difficult to maintain, right?

So now, connecting all these methods together, there is a bifurcation in the space. And we need to ask why this is happening, right? So take a step back and let's look again. The frame pointer method was good. Stack walking with it was low overhead, but there is a performance impact on the program. The EH frame-based method was good. It was versatile, and we did not need to reserve any specific hardware register for stack walking purposes. But the unwinder itself came out to be complex, with high resource requirements.
And it wasn't the preferred method of stack tracing, especially for profilers and tracing purposes. Then come all these application-specific formats. The good thing about them is that these are fast formats. They are designed specifically for the application. The information is out of band, so to speak. But the problem here is that this is not generated by the toolchain, right? There is effort needed to reverse engineer and maintain these tools over time and port them across platforms.

So this raises the question: is there a set of requirements which, if we fulfill them, lets us cater to the needs of fast and low overhead stack tracing? The short answer is yes. And if we connect everything that we have talked about till now, the requirements are these four. Support for asynchronous stack tracing; asynchronous just means that given any PC, you will be able to stack walk. The stack tracing itself should be low overhead. The stack tracer should not be very complex. And it needs to be generated by the toolchain, right?

So now I pull SFrame into the context. SFrame has been designed with these requirements in mind, and we have defined and implemented the SFrame stack trace format in GNU Binutils 2.40. The spec is available in the release documentation. SFrame stands for Simple Frame stack trace format. It encodes the minimal necessary information that you need for stack tracing, which is: given any PC, what is the CFA, and where are the frame pointer and return address? That's all you need to stack trace, and that is all that is encoded in the SFrame format.

Now, the current version is version one. The format is defined for two ABIs at this time: AMD64 and AAPCS64, the AArch64 ABI. The format has support for encoding PLT entries as well. And on the AArch64 side, there is support to also encode pointer authentication-related constructs.
So if the return address is mangled with some signature bits, you have constructs in the format that allow you to encode whether or not the return address is mangled, and if mangled, whether it's key A, key B, and so on. The information appears in a new ELF section called .sframe, and the section itself appears in a segment called PT_GNU_SFRAME. All that you need to do to generate SFrame is use assembler 2.40 or, in future, any later assembler, and pass --gsframe to the assembler. The new linker will do the right thing, as in, if it sees .sframe sections in the input files, it'll combine them and generate one .sframe section in the final linked artifact. The other binary utilities, readelf and objdump, also have support, so you can specify --sframe and get a human-readable text output.

Yes. Oh, okay, yeah. Sorry if I said it the other way around. So the question here was: does the linker already support linking SFrame sections? The answer is yes. The support is in GNU Binutils 2.40 and the linker just does the right thing there. It combines them without any additional arguments, yes. Thank you.

So now let's start to get a little bit into how the information is laid out in the SFrame section. It's really simple. There is no complexity to it. First, from a high-level view, there are only three components to the SFrame section: the SFrame header, the SFrame function descriptor entries and the SFrame frame row entries. The SFrame header contains a magic number, a version number and some offsets. These two offsets that you see on the right side, the arrows, point to the two subsections in the SFrame section. The first subsection of interest is the SFrame function descriptor entries. The second subsection of interest is the SFrame frame row entries.
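The three-part layout described above (a header with a magic number, a version and two offsets that locate the FDE and FRE subsections) can be sketched as follows. The field list, widths and the rule that offsets are counted from the end of the header are simplifying assumptions made for illustration, and the names `Header`, `parse_header` and `subsections` are hypothetical. The SFrame specification shipped with Binutils is the authority on the real layout.

```python
import struct
from typing import NamedTuple, Tuple

# Simplified, assumed header layout (little-endian, no padding):
# magic (u16), version (u8), flags (u8), number of FDEs (u32),
# FDE subsection offset (u32), FRE subsection offset (u32).
HEADER = struct.Struct("<HBBIII")
MAGIC = 0xDEE2  # magic value as I understand it from the spec; verify before relying on it


class Header(NamedTuple):
    magic: int
    version: int
    flags: int
    num_fdes: int
    fde_offset: int
    fre_offset: int


def parse_header(section: bytes) -> Header:
    """Decode the header at the start of the section and check the magic."""
    header = Header(*HEADER.unpack_from(section, 0))
    if header.magic != MAGIC:
        raise ValueError("not an SFrame-like section")
    return header


def subsections(section: bytes, header: Header) -> Tuple[bytes, bytes]:
    """Slice out the FDE and FRE subsections using the two header offsets.

    Assumes both offsets are relative to the end of the header.
    """
    base = HEADER.size
    fdes = section[base + header.fde_offset: base + header.fre_offset]
    fres = section[base + header.fre_offset:]
    return fdes, fres


# A made-up section: header, 8 bytes of FDE data, 5 bytes of FRE data.
section = HEADER.pack(MAGIC, 1, 0, 2, 0, 8) + b"F" * 8 + b"R" * 5
header = parse_header(section)
fde_bytes, fre_bytes = subsections(section, header)
```

The point of the sketch is only that all access goes through offsets stored in the header, which is what keeps a reader of the section simple.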
We will get into what these mean, but the one takeaway that I would like from this slide is that all of these FDEs (FDE is the function descriptor entry) are together in one subsection, and all the FREs go together in another subsection here. The FDEs are fixed-length entries. These are basically some information per function. If you keep these entries fixed length, that allows you to do a quick binary search and find the relevant FDE for a PC. So that's the reason the FDEs are fixed-length entries. The FREs, then, for size optimizations, we kept variable length. So the only thing to take away here is that information access in SFrame is via offsets, the FDEs are clubbed together and the FREs are clubbed together.

To represent stack trace information for a function, you need one FDE, the function descriptor entry, and N frame row entries. The number of frame row entries depends on your function: how it has been laid out and how it uses the stack.

So now let's go in further and see what an FDE is. What information does it contain? Again, very simple. It contains the function start PC. What is the function size in bytes? What kind of code block is it representing: is it a regular code block or is it a PLT entry? The reason being that the start PC should be considered differently if it is a PLT entry. It's more like a pattern, a masking thing, for PLT entries, so the unwinder needs to handle them differently from regular code blocks. Then there is an offset to the FREs. So remember this diagram here. The FREs for that function appear later in the section at a specific offset, and the FDE tells you what that offset is so that you can easily retrieve them. The last thing is the number and type of FREs, and we will talk about the type of FREs right now. So the FRE is basically the backbone. It's the backbone of the SFrame stack trace section.
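Why fixed-length FDEs make lookup cheap can be shown with a small sketch. The record layout here (start address, size, FRE offset, FRE count, each 4 bytes) is an assumption chosen for illustration and leaves out the code-block type and FRE type; `fde_at` and `find_fde` are hypothetical names, not libsframe functions.

```python
import struct
from typing import Optional, Tuple

# Assumed fixed-length record: start address, size in bytes,
# offset of the first FRE, number of FREs (all u32, little-endian).
FDE = struct.Struct("<IIII")


def fde_at(table: bytes, index: int) -> Tuple[int, int, int, int]:
    """Fixed-length entries mean entry i lives at byte offset i * FDE.size."""
    return FDE.unpack_from(table, index * FDE.size)


def find_fde(table: bytes, pc: int) -> Optional[Tuple[int, int, int, int]]:
    """Binary search over FDEs sorted by start address."""
    lo, hi = 0, len(table) // FDE.size
    while lo < hi:
        mid = (lo + hi) // 2
        start, size, fre_offset, num_fres = fde_at(table, mid)
        if pc < start:
            hi = mid
        elif pc >= start + size:
            lo = mid + 1
        else:
            return start, size, fre_offset, num_fres
    return None  # pc is not covered by any function


# Three made-up functions, already sorted by start address.
table = b"".join(
    FDE.pack(start, size, fre_offset, num_fres)
    for start, size, fre_offset, num_fres in [
        (0x1000, 0x40, 0, 3),
        (0x1040, 0x20, 12, 2),
        (0x1100, 0x80, 20, 4),
    ]
)
hit = find_fde(table, 0x1050)
miss = find_fde(table, 0x1080)  # falls in the gap between functions
```

Because every entry has the same size, the search can jump straight to the middle entry without scanning, which would not be possible with variable-length records.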
And all it tells you is, given a PC, what are the stack offsets to recover the CFA, FP and RA? And that's about it. So an FRE contains the start IP offset. If you know the start of the function, all you have to encode is at what offset into the function this stack trace information becomes valid. As you can see, function sizes differ, so you don't always need as many bytes to encode the start IP offset. This is one space-saving optimization: in the SFrame format you have three different kinds of FRE representations, and you can use one over the other depending on the size of that function. And the last thing is that the number and size of the stack frame offsets are encoded right there, towards the trailing part of the FRE.

So at this point we have talked about what information is contained in SFrame and what it looks like. Now I thought for the audience here it might be better to again take higher-level ground and ask: what is it about this format that makes it so effective? We talked about some features, but let's tie it all together from the perspective of what makes it effective, right? The first thing in the right direction, I think, is that it is generated by the toolchain, and that helps a lot in the sense that this format can be easily generated for your favorite applications. The second thing is that, since it has been designed with those requirements in mind, it is effective. So let's take three key features that we can discuss to bring home the point that this format is effective for the use case of fast and low overhead stack tracing. The first is that the FDEs are sorted. They're sorted on the start PC of the function. Now, most stack trace formats will do that, and they must do that because you want to look up information in a quick way.
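The space saving from the three FRE representations mentioned above can be illustrated with a short sketch. It assumes the three variants store the start IP offset in 1, 2 or 4 bytes and that the width is picked from the function size. The exact selection rule used by the assembler is not stated in the talk, and the helper names are hypothetical.

```python
import struct
from typing import List

WIDTH_FORMATS = {1: "<B", 2: "<H", 4: "<I"}


def start_offset_width(function_size: int) -> int:
    """Smallest of 1, 2 or 4 bytes able to hold any offset inside the function."""
    largest_offset = max(function_size - 1, 0)
    for width in (1, 2, 4):
        if largest_offset < (1 << (8 * width)):
            return width
    raise ValueError("function too large for a 4-byte start offset")


def encode_start_offsets(function_size: int, offsets: List[int]) -> bytes:
    """Encode each row's start offset with the width chosen for the function."""
    fmt = WIDTH_FORMATS[start_offset_width(function_size)]
    return b"".join(struct.pack(fmt, offset) for offset in offsets)


# The same three rows cost 3 bytes in a small function and 6 in a larger one.
small = encode_start_offsets(40, [0, 1, 4])
large = encode_start_offsets(5000, [0, 1, 4])
```

Since most functions are small, most rows get the 1-byte form in this model, which is the kind of saving the variable-length FREs are after.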
So this allows them to do binary search, and for similar reasoning it's also done in SFrame. Since the FDEs are sorted on start PC, it helps you quickly find the stack trace data for the PC, and the FDEs hold the offset to where the corresponding FREs start. So that's one thing: it allows fast lookup of information.

The second is the FREs, right? The stack offsets are encoded directly in the format. There is no business of executing opcodes to then generate the stack traces. For those of you who know ORC, this is something that ORC also does: the stack offsets are encoded directly. It's just that in the SFrame format there are some space optimizations done here, which make it a more compact representation of the stack offsets. But that's the second feature that gives it its effectiveness: since the stack offsets are directly encoded, you do not need to execute any opcodes when you are stack walking.

The third is also important. I think compactness is important, and there are a number of space-saving optimizations that we have done in the representation of the format. There are two knobs that have been used. One is that functions can be of varied sizes. The second is that functions may use the stack differently: some functions may use more space on the stack, some may use less, of course. So the offsets will need to be either large or small, and depending on the function you can choose the size of the stack offsets that you need to encode.

So taking all three things then... well, before we do that, this is one graph that I have. We need to take it with a grain of salt. What this graph shows you is the ratio of the SFrame section to the EH frame sections. There are two sections, .eh_frame and .eh_frame_hdr. This is for x86 and it's only a limited set. It's not really very representative.
These are just some binary utilities from GNU Binutils. And this graph shows the ratio of the SFrame to the EH frame sections. On x86 it's coming out to somewhere around 80%. Now, EH frame caters to a different use case. It has more information in it, so it's not really a fair comparison to say that SFrame is 80% of EH frame. These are two different formats, and they cater to two different use cases. But I thought it might be good to satisfy your curiosity if you were thinking about how large these sections are. Hopefully this gives you some perspective.

So at this point we have talked about a few key reasons why SFrame can be effective. We're switching gears now, and we will start talking about the usability aspect of SFrame. There is a library called libsframe. This is the SFrame format library, and it is shipped with Binutils 2.40. It has APIs to read and write SFrame data. Now, it's been written with BFD and ld in mind, which is why it has APIs to write SFrame data. I don't see why a stack tracer would need APIs to write SFrame data, but just to cover it all, there are APIs there to read and write SFrame data. There are also some APIs for stack tracing. When you're stack tracing, you would like to find out, given a PC, what is the FRE? And given an FRE, what are the stack offsets? Those APIs do exist in this library.

Going deeper into the libsframe APIs, I would like to assert again that this is flavored for the BFD and ld use case, and it is a young library. So at this point, I don't think it's fair to make any guarantees of a stable API, but we'll work towards that. In sframe-api.h you'll find the user-facing data structures and the API details. At this point, I think one more clarification is valuable.
SFrame sections are unaligned on disk, right? libsframe internally handles the access so that there are no unaligned accesses, but the format itself is unaligned on disk. There are read, write, and find APIs. If you want to look them up easily, they are prefixed with the same strings, sframe_decode and sframe_encode. And then for finding, there is sframe_find_fre and the FRE get-offset functions.

So this is some code that can maybe give some perspective on how easy it is to stack walk. This shows you how to do one stack walk step. Given a PC, you figure out where the SFrame section was loaded, you know, you get some offset, and with that offset you find the FRE. So this is the find FRE call. Once you have a handle on the FRE, all you need to do is get the CFA offset and figure out the CFA, then get the RA offset and figure out the return address, then get the FP offset and get the FP. And that's it. You're set up for the next iteration. If you do this iteratively, the return addresses you get are the PCs in your call chain. So it is simple and easy to use. We have tested on both AArch64 and x86-64 at this time.

For future work, there is some work lined up. On the assembler side: the way the GNU assembler generates SFrame data is that it uses the CFI directives generated by the compiler, and if you specify the --gsframe flag to the assembler, it will generate the .sframe section. Now, there is one CFI directive that is being skipped at this time, which is .cfi_escape. The compiler generates it and, along with it, specifies an expression; it's used in EH frame. So it's an expression to figure out what the stack offsets will be.
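The single stack-walk step described above (find the FRE for the PC, compute the CFA, then read the RA and FP) can be sketched as a loop. This is a model under stated assumptions, not the code from the slide and not the libsframe API: `Row`, `Function`, `find_row` and `walk` are hypothetical names, memory is a dictionary with invented addresses, and the rows follow x86-64 conventions where the return address sits at CFA - 8.

```python
from bisect import bisect_right
from typing import Dict, List, NamedTuple, Optional


class Row(NamedTuple):
    start_offset: int          # offset into the function where this row becomes valid
    cfa_base: str              # "sp" or "fp"
    cfa_offset: int            # CFA = base register + cfa_offset
    ra_offset: int             # return address is stored at CFA + ra_offset
    fp_offset: Optional[int]   # saved FP is stored at CFA + fp_offset, if saved


class Function(NamedTuple):
    start: int
    size: int
    rows: List[Row]


def find_row(functions: List[Function], pc: int) -> Optional[Row]:
    """Find the function covering pc, then the last row starting at or before pc."""
    i = bisect_right([f.start for f in functions], pc) - 1
    if i < 0 or pc >= functions[i].start + functions[i].size:
        return None
    rows = functions[i].rows
    j = bisect_right([r.start_offset for r in rows], pc - functions[i].start) - 1
    return rows[j] if j >= 0 else None


def walk(functions: List[Function], memory: Dict[int, int],
         pc: int, sp: int, fp: int, max_depth: int = 64) -> List[int]:
    chain = [pc]
    while len(chain) < max_depth:
        row = find_row(functions, pc)
        if row is None:
            break
        cfa = (sp if row.cfa_base == "sp" else fp) + row.cfa_offset
        return_address = memory.get(cfa + row.ra_offset)
        if return_address is None:
            break
        if row.fp_offset is not None:
            fp = memory[cfa + row.fp_offset]
        pc, sp = return_address, cfa  # set up the next iteration
        chain.append(pc)
    return chain


# Rows for `push %rbp; mov %rsp,%rbp` style functions (sorted by start address).
prologue_rows = [Row(0, "sp", 8, -8, None), Row(1, "sp", 16, -8, -16),
                 Row(4, "fp", 16, -8, -16)]
functions = [Function(0x1000, 0x40, prologue_rows),   # main
             Function(0x1100, 0x40, prologue_rows)]   # foo
memory = {0x7FF0: 0x1020, 0x8010: 0x0, 0x8018: 0x500}

# Sampled at foo's very first instruction, before its `push %rbp`.
chain = walk(functions, memory, pc=0x1100, sp=0x7FF0, fp=0x8010)
```

Here the walk starts before foo's prologue has run and still recovers the return into main (0x1020) and then the return into the startup code (0x500), because the row for offset 0 says the CFA is SP + 8. A production unwinder would also have to validate addresses and typically looks up the return address minus one, which this sketch omits.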
Now, the SFrame machinery at this time hasn't been coded up to work on CFI escapes, but I will add that; that's one of the things to do. This means that it is not fully asynchronous because of this. That being said, the number of times the compiler generates a CFI escape is really limited. So to be fair, I would say it's not fully asynchronous, but it should be really, really close. On the assembler side, we should also add some more regression tests. There's ongoing work on that. And by and large, what we plan to do in the next few months, and for much of this year, is to take on more feedback from the community on how and where they see use cases for the SFrame format. If you have some user applications that you would like to use the SFrame format for, please let us know if there can be any collaboration there.

For the Linux kernel, we have started some discussions on how it can be used in the kernel as well. The discussions are ongoing. We did a proof of concept. There are issues with that POC, but what has come out is that there seems to be a general consensus that it will be a good thing to try in the kernel for user space. It looks viable so far, and we would move ahead in this direction. The general idea is to do it generically so that it can be used by many of the components in the kernel, like perf or ftrace and so on, DTrace maybe, and BPF hopefully.

I think we're nearing the end of the presentation. Yeah, so that was quick. We talked today about stack tracing, what the methods of stack tracing are, and what the challenges with each of these methods are. We talked about the SFrame format, what it is and the impact it can make. If you have questions, now or later, get in touch with us at the binutils mailing list. And yeah, that's all I have. It went by fast.

So what applications are actually using the SFrame format today?

Today, well, we have started looking into the user space.
We're doing user space stack walking in the kernel, so that will be one application of it. Other than that, actively, nothing yet, but we would like more tracers and more profilers to make use of it. The use case is really fast, low overhead stack tracing, and in theory it'll be beneficial to all the stack tracers, all the profilers and tracers out there.

So are you working on converting some user space applications to use it now, or are you just getting the word out so other applications will start using it?

We started working on user space stack walking for the kernel, but the intent is also to go ahead and find more of these applications where SFrame can be directly used, yes. At this time, the short answer is that we started looking at the kernel, but the others are something for us to think about, yes.

So I don't know a lot about stack tracing, but I do use tools that obviously benefit from it. I do a lot of multi-threaded parallel programming, and I'm curious whether SFrame or other approaches are better or worse at doing stack traces across threads, like not just within a thread: this is what this thread did, and it came out of this frame in this other thread, or things like that.

I think it's orthogonal to which of these formats you use. Even if it is multi-threaded, it should just work. Depending on how you handle the stack across processes or threads, you just have to make sure that there is no collision there, but format-wise this should just work, yeah.

Okay, and I also had one clarifying question. You used the term asynchronous a few times in your talk, and early in the talk you talked about it being a problem with the first approach, I think. It was never clear to me what asynchronous meant in this context. Could you clarify that?

Yes, so asynchronous means that given any PC, you're able to stack walk precisely, right?
This is generally a problem with stack traces: even if they're imprecise, sometimes they're considered good enough for an application, sometimes not. But when I'm talking about asynchronous, I mean that given any PC, I want to be able to generate a stack trace precisely. In the case of frame pointers, what the compiler has done for you is add a few instructions in the prologue and an instruction in the epilogue, in the case of x86, right? The first instruction, I believe, saves the RBP on the stack. So you'll have a push to the stack and then, or is it the other way around? But either way, the RBP has not been saved yet. Yeah, so you first save and then update the RBP. So when you are in those instructions, the RBP is not yet correct, right? If you stack walk from that PC, you will miss some part of your stack trace. That's what I meant by asynchronous. The frame pointer-based approach is not asynchronous because the request to stack walk is asynchronous: it may come in at any point, and if your application is at that PC, then you will have an imprecise stack trace. Does that answer it?

Yes. Thank you very much.

Okay, thanks.

I had a similar question, probably like a follow-up. So SFrame, I understand, is kind of like a tracing format, but was there a similar tracing format prior to this? I'm just not familiar with the current situation around tracing. Was it that every tracing application uses its own format and you want to unify it?

So, tracing applications. Well, if I look back now and gather my thoughts, there are two components to your question. Are there other formats apart from SFrame? The short answer is yes. For example, if you look at EH frame, it can be used for stack tracing.
Although it's an unwind format. Unwinding means that as you're stack walking, you are also restoring the register state, right? But you can use EH frame for just plain vanilla stack walking. So it is also a format for stack walking. Now, there are also application-specific formats: ORC exists, which solves the problem for kernel space stack walking, right? And now SFrame. So yes, there are, and there are more formats which I have not talked about. There are more formats than SFrame that try to solve the stack walking problem, but SFrame is the only one at this time that seems to be doing it while more or less fulfilling the requirements, right? We talked about those requirements: what do you need to do fast, low overhead stack tracing? So I think that's the first part of your question, and the second was, could you repeat?

Oh, my second question is: for each of those formats, is it being used by multiple tracing tools already? Or is it that each tracing tool uses its own format and you want to have a new format to unify that?

Each tracing tool is mostly free to use its own format, but I think in practice many of them are still using frame pointers or maybe even EH frame. I'm not sure. There are just a number of tracing tools out there, and if you're pulling in libunwind or libdw, you're using EH frame in those cases. It really depends on the tracer. I don't know if tracers out there can detect which one an application supports. I would rather believe that you specify it while using the tracer: just use EH frame, or just use frame pointers because I know the application has been compiled with frame pointers enabled.

I saw you mentioned that SFrame is designed for x86-64, and I guess you also mentioned Arm, right?
I may be wrong, but if I recall correctly, on RISC-V the function ABI is a little different from x86. Have you considered whether, if you apply this to RISC-V, you will have the same solution, or will the solution be a little different?

RISC-V, I'm not really sure of the ABI, but AArch64 is different from this. Well, AArch64 also has a link register. The picture here is for x86, but AArch64 has a different ABI, yes. It has both a frame pointer and a link register, right? In the AArch64 ABI, the call instruction will save the return address in the link register. But even then the problem remains that as you go through these instructions, you don't know whether to use the LR or whether to use the FP. So there is still some reliance on additional information somewhere. It could be EH frame, it could be SFrame. You still need to rely on it per PC: is it the LR that I need, or is it the FP that I need? Right, that's the AArch64 ABI. For RISC-V, I'm sorry, I don't know too much about that ABI at this time.

I recall it's a little different than x86. But I guess maybe the basic idea is similar: you need the PC first and the frame pointer.

So long as the return address is at a defined location, either on the stack or in a fixed register, right? In AArch64, it's a fixed register. It is LR. If it is not in LR, it's stashed away on the stack. In x86, the return address is always on the stack, right? So it's a defined location. For these kinds of ABIs, you can use SFrame. But if there is an ABI out there where the return address can be in any of the registers, then the SFrame format cannot be used as is, because in the SFrame format you don't encode the register number to say that this is the register that will be used for the return address. So for those other ABIs, you will need format extensions.
At this time, it works for AMD64 and AArch64, and it can also be made to work for other ABIs where the return address is, more or less, in a defined location or a fixed register. It may still need some extensions, as in, we'll need to take a look at the ABI and make sure everything still works. But I don't expect a major revamp for those sorts of ABIs. For the other ABIs, where the return address can be anywhere, I think there will be problems with this format, because then it just bloats up. Now you have to, all of a sudden, be able to restore any register, right? Right now I have three offsets: CFA, FP, and RA. If the return address can be in any register, then it really bloats up the format. And I don't know if we really want to get there with this format.

I have a question myself. ORC is one of those application-specific formats, right? Could you clarify the relationship between SFrame and ORC? And in particular, why will the kernel need something like SFrame to unwind userland stacks, and why could it not use its own application-specific format for that?

So ORC is application-specific, yes, and it has been designed for kernel stack traces. Now, the kernel has not just C code, but also inline assembly and handwritten assembly. There are some constructs in there which are, I would say, non-standard stack usage, but ORC still has a representation for all of that. So it can represent everything that the kernel is currently doing. And for user space, can you use it? I think there'll be issues. As far as I remember, there is a struct where the function start address can only be 16 bits or so. So it will need changes; it cannot be used out of the box for user space. You could do versioning of the format and move to that, but I haven't looked very closely. This is one of the things that I remember from having looked at it last time. Yeah, I mean, in theory, ORC is doing similar things.
The stack offsets are directly encoded. It wants to do fast stack tracing. There are two sections, and I believe they also do some sorting of these IPs so you can look up information quickly. So it's similar principles, but it will need changes for you to use it in user space.

Okay, so basically SFrame is not meant to replace ORC in the kernel, but to complement it, to unwind userland stacks.

For userland stacks, yes.

Well, thank you. Thank you.