 I'm Christos from Greece. Can you hear me? Yeah, thank you. I'm going to talk about DTrace, more specifically a subsystem I created for tracing arbitrary instructions in the kernel. Now, how many of you have experience with DTrace? OK, so I don't think I should spend much time explaining what it is. For those who don't have much experience: DTrace is a dynamic tracing framework, so basically you can ask the kernel questions about its state in real time. It originated in Solaris in 2005 and was later ported to FreeBSD. Some keywords in DTrace: a provider is basically a kernel module that does a specific tracing task, like tracing the entry point of a function or a syscall. And a probe is a specific point of instrumentation; it could be an instruction or a syscall, as I said. DTrace scripts are written in a language called D, which you're probably familiar with, a sort of mix of C and awk. If you want to read more about it, because DTrace is huge, you can go to this website; it has lots of cool examples. So, how many of you have experience with FBT? OK. FBT traces the entry and return points of a function. Here I have an example: we want to trace the entry point of malloc(). Every time malloc() is entered, DTrace prints out a message that we hit the entry point, and I also print which program entered malloc(). The thing with FBT is that it cannot trace inline functions, and that's something quite useful to have. So for this I created a new provider called kinst, which stands for kernel instruction tracing. It was inspired by FBT, but in contrast to FBT, which traces at most two instructions in a function, the entry and return instructions, kinst can trace any instruction. This is useful because inline functions are not easy: you need to be able to trace any instruction in a function in order to put probes on inline functions.
So being able to trace arbitrary instructions gives you more fine-grained tracing. If you know where an if statement starts in the assembly, you can put a probe there, and every time the if statement is hit, DTrace prints out a nice message. You can do the same for loops, branches, or any instruction, really. The catch is that it requires good C-to-assembly translation skills; I haven't had time to create higher-level tools that translate C lines to assembly instructions, so that's a project for the future, maybe. kinst is available for AMD64, ARM, and RISC-V, and below is where you can find the code base. These are a few examples of the three kinds of syntax you can use with kinst. You can give it just a function and leave the probe field empty, which means we want to trace every instruction in that function. In this example we trace all instructions in amd64_syscall(), and it says it matched 458 probes, which is more or less all of its instructions. You can also specify a particular instruction, but to find which instruction you want to trace, you'll probably want to open the function's disassembly in GDB to find the exact offset. You see this +0, +1, +4: you give kinst a function and an offset, and it traces the instruction at that offset. You can also print registers. And then we have inline function tracing, whose syntax is similar to FBT: you give it an inline function and an entry or return keyword in the probe field. Here it traces the critical_enter() function, which is inline, and you see it matches 130 probes, meaning 130 inline copies found in the kernel. So you see spinlock_enter() and basically every function that critical_enter() is inlined into. This talk is going to be more about high-level ideas and not really very low-level technical stuff. So I want to talk about three things.
First, how instructions are instrumented, because it's quite different from FBT; second, what obstacles I found in creating the different ports, the AMD64, RISC-V, and ARM ports; and third, how inline function tracing works, because I guess that's probably the most interesting part. As is the case with most DTrace providers, kinst uses a device file, /dev/dtrace/kinst. When you type a dtrace command or script, the probe information is sent to that device file, and libdtrace passes it on to kinst. kinst then disassembles the function we requested. Say we want to trace the first instruction of malloc(): kinst finds the linker information for malloc(), disassembles it, finds which instruction is the first one, and puts a probe there. The original instruction is overwritten with a breakpoint, and we save the original instruction, say a push %rbp, to a buffer so we can restore it when we close DTrace. When the CPU hits the breakpoint, we enter the trap handler, as always, and through the various DTrace hooks we end up in the kinst trap handler, which actually traces the instruction. As for the second-to-last point on the slide, I'm not going to talk about it now; maybe you'll see why later. This is essentially the flow chart of how kinst works. Up here we have the dtrace command; it goes to libdtrace, we send an ioctl to kinst, and kinst creates the probe. It overwrites the original instruction with a breakpoint so we can trace it. In this diagram, say this is a function, it could be malloc() for example, and it has a bunch of instructions, and we want to trace this one. kinst replaces it with a breakpoint, and each time the CPU hits that breakpoint, we enter the breakpoint handler, then DTrace, then kinst. kinst does its work and we continue execution.
And we could also have another breakpoint a few instructions down the road, and we do the same thing. So, how exactly we trace an instruction is quite different from FBT. The way FBT traces an instruction is that it does the same overwriting with a breakpoint, but when we hit the breakpoint, FBT emulates the instruction. Because FBT traces only the entry and return points, it's not very hard to emulate those few instructions. With kinst, though, we might have thousands of instructions, and across three architectures that's really thousands of instructions; emulating every single one is something I wouldn't want to do, and it's also very easy to get wrong. So to trace an instruction, kinst most of the time uses a trampoline: a block of memory somewhere else that we transfer execution to, and we execute the original instruction there. I guess most of you are familiar with trampolines, yeah? I have slides about that. His question was: say an instruction addresses some place in memory relative to the PC, the program counter. If we copy that instruction somewhere else in memory, the trampoline, that offset is no longer correct. So there has to be some modification, and that also creates a problem with how we get back from the trampoline: we enter kinst, we transfer execution to a block of memory somewhere else, and now we have to return back, otherwise the kernel will crash. Some technical information, maybe for those who care. A trampoline, as I said, is an executable block of memory, and kinst manages what I call trampoline chunks. Each chunk is one page in size, there's basically a list of chunks, and each chunk is logically divided into smaller trampolines. So each chunk has, say, 32 trampolines, and each trampoline holds the original instruction.
And there are functions here to allocate the chunk memory via vm_map_find(), with execute permission obviously, and to free it via vm_map_remove(). This is the layout of a chunk: we have the chunk, and inside it the trampolines, and each trampoline contains the instruction, obviously. Depending on the architecture, what follows differs; this layout will not remain exactly like that. Depending on the architecture, after the instruction you might execute a breakpoint or a far jump; I'll talk later about why. On AMD64, trampolines are per-thread and per-CPU, so they're essentially rewritten every time we execute a probe. Say we have two probes, one for the first instruction of malloc() and one for the second. The first time we hit the breakpoint, we fetch a trampoline, copy the instruction there, transfer execution, do all kinds of stuff, and return back. Then we do the same for the second probe. This is not really a smart approach, and it has proven to be quite buggy when running VMs. ARM and RISC-V instead use one dedicated trampoline per probe. This is the control flow on AMD64, which, as I said, is to be deprecated. We have the instruction, we hit the breakpoint, and as I said earlier, we go through the trap handlers. What kinst does is: if we have a PC-relative instruction, we first have to modify the offset. If there was a mov that referenced memory relative to the %rip register, we have to re-encode that offset, and then we copy the instruction to the trampoline. We set %rip to the trampoline, execute it, and return back, basically doing a far jump to the next instruction. And for ARM and RISC-V, which is the mechanism AMD64 is going to have soon, this is the control flow: we enter kinst and we decide, OK, has this probe already fired? If it hasn't fired yet, this is the "no" case: we save the state, basically the interrupt state and any registers we might care about, and we disable interrupts as well.
We set the PC to the trampoline and transfer execution there, and, as you see, we execute a breakpoint after the original instruction. That breakpoint takes us back to the trap handler, and we re-enter kinst, because the trap was caused by kinst. Now this probe has fired, so we take the other case, where we restore the state and interrupts and continue execution. This seems more complicated, but it's not. And synchronization: synchronization is done through a per-CPU state structure, basically, where we save state; each CPU can be tracing one instruction at a time. So if we're tracing an instruction, we disable interrupts, save the current state, execute the traced instruction, and restore the state. A few problems with all this. AMD64 has a complicated ISA, as most of you know, so parsing instructions, and you do need to do some parsing to some extent, is quite tedious and very error-prone. PC-relative instructions, as I said, have to be re-encoded every time we copy them to the trampoline. Call instructions have to be emulated, and emulated in assembly, because you have to reserve space on the stack to push the return address, and that's not possible in C. There's a file and label on the slide if you want to see how that's done. On ARM and RISC-V, you cannot really re-encode offsets relative to the trampoline, because the trampoline may live quite far from the original instruction, so you'd need more than one instruction to encode such a large offset. Since that's not possible, we sadly have to emulate those instructions. For RISC-V I don't remember off the top of my head which instructions those are; the branch instructions, for example, you have to emulate. But they're relatively easy to do. And some functions and instructions are not safe to trace at all.
So we cannot, for example, trace atomic instructions, or in general instructions that are very, very low-level; we cannot mess with those. The man page lists all the instructions and functions that are not safe to trace. So, inline function tracing. This is the syntax; there's a return missing on this slide. All inline tracing is basically done in libdtrace. What I mean by that is that libdtrace uses DWARF. Does any of you have experience with DWARF? Well, good. And ELF. DWARF is a debugging standard used by compilers and GDB, and it contains all sorts of information: file lines, locations for variables, functions, function references, everything. But parsing it is very, very slow: tracing an inline function can take up to 10 or 20 seconds on hardware, or a minute in a VM. So if anyone wants to make libdwarf faster, that would be great. When we enter inline tracing mode, which basically means calling kinst with an entry or return keyword instead of an offset, libdtrace starts parsing all loaded kernel modules to see: is this function really an inline one? If it is, libdtrace finds the boundaries of each inline copy, basically the exact addresses where it starts and ends, and through a few calculations we get the exact offset and the functions that the inline function is being called from. libdtrace then creates normal kinst probes. Say critical_enter() is inlined into malloc() at offset two: libdtrace will find that and create a kinst probe at malloc() offset two. That saves a lot of kernel computation that would be quite painful to do. And as I said, this is done for each kernel module, so it's slow. And if the function is not inline, we could, for example, ask kinst to trace malloc(), and malloc() is not an inline function.
So in that case, libdtrace just converts the probe to an FBT one, to avoid duplicating code. This also works with nested inline functions, or at least as far as I have tested it does. And I've written a few articles on my website if you want to read more about the technical details of how this is done. So basically, say this is a script where we want to trace an inline function: this is the initial script, we gave it the name of the function, and we want to trace the entry point. libdtrace converts that to this, which is something kinst can understand. And if it's not an inline function, as I said, it just converts it to FBT. So DWARF, as I said, is something you probably don't want to have to learn. DWARF works by saving debugging information in a tree. We have a compilation unit at the top of the tree, say a file, and the children are its functions, their variables, and all those things. Each entry in the tree has certain attributes: as I said, its location, name, the file it's declared in, and it might also have an inlined attribute. So libdtrace parses the tree, and if it sees that an entry matches both the name and the inlined attribute, we know that, OK, we have an inline function here. Then it starts searching for the individual copies. Copies are stored in different entries, and they point back to the declaration entry's offset; they also include the lower and upper boundaries. So DWARF provides all of that; it's just very, very slow to parse. And here is an example of what an entry looks like for an inline function. This is a declaration entry, so it could come from a header file, basically, and you see here we have the inline attribute. libdtrace sees that, and later down the road it might find a copy.
So we know that this copy points to this declaration, because the abstract origin attribute here happens to be the same as the offset of the declaration entry, so we know these two match. And here we have the low PC, which is the beginning address, and the offset for the end address; with those two we basically have the boundaries. I know this sounds quite complicated; I certainly got lost. So, how to calculate the specific offsets: as you've seen here, we have the entry offset of the inline copy, and we add the size to the low PC to get the return offset. Read it from the slide; it's worded better there. So libdtrace now knows the entry and return addresses of the inline copy, but that doesn't really tell us which function this copy lives in. And, as I said, kinst needs to know the parent function, let's say, in order to create the probes: we know the offset, but we don't know which function it's being called from. To find that, we need to scan the ELF symbol tables, which is also somewhat involved. So we scan the table: ELF stores, for each symbol, its start address and size, and since we know the inline copy's entry and return addresses, we check: are those addresses inside that symbol? So I wrote this very smart-looking formula: if the upper and lower boundaries of the inline copy fall inside an ELF symbol, that is probably the function we're looking for. And then we can get its name, because we're already parsing that symbol, and we also get the entry and return offsets using those calculations, which are simple. And that was it. Thanks to Mark Johnston, obviously, for helping me, and to Mitchell for helping with the RISC-V port. Do you have any questions? Has any of you used kinst?
No? So it's still quite experimental and low-level, and I'd love to hear feedback if any of you cares about using it. So, any questions? Yeah, this one. Yeah, so the problem is that, you mean execute, what? Oh, he asked why we need two traps: one for the original trap from the instruction, and why we need the trap from the trampoline as well. Is that what you're asking? So, on ARM and RISC-V it's not possible to encode a far jump in a single instruction, so we have to find some way to break out of the trampoline and return to the original code. Having another breakpoint in the trampoline to get back into kinst, restore the state, and manually go to the next instruction seems to be the best way I've found so far. No, we first need to execute the trampoline: we transfer execution there and execute, say, the push instruction; then we execute the breakpoint, we re-enter kinst, and since we know that the probe has now been traced, we just restore the state and exit. Right, yeah, I mean, it could happen, but it would make the trampolines way bigger. That's a solution I haven't really thought through; I'd need to test whether it could potentially break things, for example if you encode the far jumps inside the trampoline. I'm not sure. I'm not sure I understand, can you repeat? So, can you show me after the talk how this could be done? I'll show you the code of how it works. Right, okay. Yeah, I'm not really aware of that, but having just a second breakpoint seemed simpler. Yeah, I'm not the most qualified to answer. Wait, no, I think I skipped that before. Yeah, so, chunks live above the kernel base. Oh, that's how DTrace works; I mean, all providers do that, FBT does the same thing. Yeah, more questions? Okay, thank you.