Hello. Thank you for joining this talk. My name is Kris Van Hees. I work for Oracle in the language and tools department of Linux engineering, and I'm going to be talking about DTrace, specifically DTrace on Linux, and the main focus is the new development that we're currently doing, which is leveraging the power of BPF and other kernel features that are there for tracing. To give a little overview of the presentation, I'm going to start with a very short history of DTrace on Linux, primarily because from previous experiences we found that people are not really aware of how long DTrace has existed on Linux, so we're going to add a bit of historical context here. Then I will give a short overview of DTrace on Linux, the overall design, focusing primarily on the existing version, or the older version I would say, so that we can work towards the new development, which is DTrace based on BPF and other tracing facilities. I will be highlighting some of the work that has gone into that, how we're approaching the implementation, and some of the different needs that we found. Then I'll get to the implementation details, not all of them, because that would be a very lengthy talk, but I definitely want to highlight some things that are really special in how we're doing this. Then I will move on to lessons learned, and I will probably be mentioning those throughout the presentation as well, things we found as we were doing this work, but I want to highlight some that really stood out and can also help people who, like us, are working with DTrace or with BPF primarily. And finally I want to highlight some limitations we have found, and some unsolved mysteries, unsolved problems that we're still working on, where it remains to be seen whether those will be things that we may have to submit a kernel patch for, or whether we find other workarounds. 
So moving on to looking at DTrace on Linux for a moment, historically speaking. The project originally started in 2010, and it really started with our management essentially asking us to explore the possibility of bringing DTrace to Linux. I was given the task to do this exploratory work, and I came back with the recommendation that I didn't think this was feasible. Two weeks later we were told: we have to do it anyway. So we had to figure out how this could be done. We were able to release the first version in 2011, based on Linux 2.6 kernels, and it was a challenge because there were quite a few features that we wanted to implement for which we didn't have support in the kernel. Development continued on from there, alongside newer Linux kernels being released, and as the slide shows, through the time period of 2013, '15, '18, '19 we have been progressively active in developing DTrace. At this point the implementation is available, and we have been both backporting and forward porting the patches, so we make patches available for these versions of what we call the legacy version of DTrace on Linux, even for the most recent kernels being released, because we want to keep supporting that effort for people to be able to use DTrace. Now, in 2018 we sat together as a team and determined that the use of a very invasive kernel patch in order to implement DTrace was suboptimal, because a lot of people don't want to apply large patch sets just to get a feature like that. So we started looking at whether we could do this using existing features in the kernel, and redesign the user space component so that DTrace itself would be just user space. We have been working on that since 2018. 
And so this year we were able to release a new version of DTrace on Linux, which is based on BPF and other Linux kernel tracing facilities. It currently still requires a few patches to be applied to the kernel, but they're not specific to DTrace, and they're definitely not invasive; we'll get to that. Looking at DTrace as a whole, and this is before the redesign based on BPF, it is built around two big components: a kernel space producer and a user space consumer. All the tracing data is generated at the kernel level, and the user space consumer just takes the data out of buffers and presents it. It also has a component for compiling the tracing programs and loading them into the kernel, but that is all done completely at the user space level; the actual tracing work is done in the kernel. That involves some kernel support functions, very core kernel support that we had to add to facilitate various features that DTrace has. We added hooks to the kernel that are very specific to DTrace, and then there's a set of kernel modules that implement the bulk of the producer code. The only reason I show this slide, a rough count of how many lines of code those components comprise, is to give a sense that it's split about 50-50. Although the kernel does most of the work, the consumer contains a compiler, and that consumes the bulk of the code. It still shows that we had quite a task ahead of us if we wanted to essentially get rid of the kernel components. So this is a bit of an overview slide that I've been using for years, pretty much from the beginning of this work, that tells you what all sits in DTrace, based on the original design. We have the user space component at the top, which obviously lives in user space, and that has some of the trace buffer management. 
And then everything else is in the kernel: the execution engine, the helper functions, and a processor that handles probe actions. In a way there is activity going both from top to bottom and from bottom to top. When you start tracing, you invoke a command in user space and information is sent down to the kernel. Once tracing starts, all tracing is based on probes that fire at some point in the kernel, and there they initiate execution. A probe fires, then there is a provider that handles the set of probes that have the same characteristics, and they all funnel into this probe action processor, a single function in the DTrace kernel module that unifies probes as a generic probe, handles the execution in the execution engine, and writes data to the buffer. That's the overview, and largely that same design is still present in the current version based on BPF and kernel tracing facilities, but it's no longer DTrace code, so some functions have changed there. Looking towards doing DTrace on BPF and other tracing facilities, what we found in 2018, when we started exploring this, is that a lot of the features that exist in the kernel for tracing have matured a lot. I don't want to imply that when they were designed they were unstable or not mature in their design, but a lot of them did not work together as well; there was not that much unity between the different implementations. Now we have kprobes, uprobes, tracepoints and raw perf events, and they can all be made available as perf events. So you have a single interface that gives access to all of these. There are different mechanisms for how you enable them, but ultimately you can treat them in a more generic way, which was one of the underlying needs we had. And the fact that it's already in the kernel means we didn't have to add code for that. 
BPF exists as an in-kernel execution engine. Now, there can be debate on whether it's supposed to be interpreted as a very generic execution engine for pretty much anything, or whether it's for a more specific purpose, but for us it did what we needed it to do. It can be used to execute BPF programs that are attached to perf events, which is great: when the perf event fires, it executes the BPF program. That program can create data and write it to a perf event ring buffer, and then we can pull it out of there from user space. So it does what we need done, and that whole portion of the previous slide that is kernel level more or less exists in the standard kernel anyway, so we can make use of that. A very important part of the design is that we are providing DTrace. We're not providing some kind of tracing wrapper around BPF. The core of the functionality, the core of what defines what we are offering in terms of tracing, is DTrace: DTrace as it is documented, DTrace the way people know it. That was a very big underlying concept: we can't change how DTrace works. We're using BPF, but BPF is not the center point of how tracing is going to be. And that's an important distinction to make, because we want people who are familiar with DTrace to be able to expect that it's still going to work the way they're used to. So, very quickly, the design philosophy. I used to have a more entertaining slide for this, but I found that it takes away from what is really important. The biggest assumption we made starting out was that we can do everything in user space. That was after a few false starts of trying to get features into the kernel that we knew we were going to need, and we were correctly put in our place about the fact that that's not the way we should go about things: we should first establish that there is a real need. 
So basically, we should first try another way. And that's what we're going to do: we're going to assume we can do everything in user space, and then we'll see where that goes. The second assumption was that there is no impact on performance or stability. This is the point where, in previous talks, I pointed out that you look at those two assumptions and you start crying, because there's no way this is going to work. But we were trying to be overly positive about it, so that we could evaluate how things were going to go. So we take those assumptions, however crazy they may be, and we start the re-implementation of the DTrace user space, reusing as much of the code as we can, because that helps us guarantee the functionality is the same, while making sure that we work with the new features. Then we perform accuracy tests: it needs to work the way it is supposed to work. Stability: we obviously can't impact the system; tracing should be as noninvasive as possible. And then performance tests: doing tracing is always going to affect the system somehow, but it should be minimal, to the extent we can make it so. Then we evaluate the findings. That is where we start weeping, because you're inevitably going to see some effects from the fact that we're doing the implementation differently. So that's where we have to evaluate, adjust our implementation, go back, and reiterate the same process over and over again, trying to perfect the implementation. 
If we eventually find that, with everything we can come up with thrown at it, we cannot maintain accuracy, stability and performance, then we get to the point of needing to look outside of this pure user space aspect, and evaluate whether there are improvements we can recommend for the kernel implementation of some features, or perhaps additional features to be added to the kernel, that would be for the benefit of any tracer that uses BPF and the other kernel facilities. The goal was really to be able to, at a minimum, collect enough information that we can make a well-reasoned case, with evidence, that there are limitations that we hope to be able to resolve at the kernel level. So that's really for when all else fails. Looking at the implementation details: BPF and DTrace are very different things, and I shouldn't have a slide comparing them, because they do different things. BPF runs code in a virtual sense; it's an execution engine. DTrace is a tracing tool. The reason I compare the two is that there are some underlying assumptions, especially where BPF is used for tracing, that are quite different from what DTrace expects, and that has been one of our biggest pain points. In BPF you have a variety of program types, and if a program is of a certain type, then some functionality can be used and other functionality cannot. That's very important, because if probes can have different program types, I can't write one program and expect it to work for different types of probes, which is something we can do in DTrace. On top of that, because each probe type has its own program type, it also has its own BPF context it runs with; that is the information that is given to the BPF program when it starts execution, and it differs depending on probe type. 
So I don't have a generic, abstract context, so to speak, that represents all the probe types. That's an issue, because again DTrace expects something different there. And you can only attach one BPF program per probe. Now, that's not exactly correct; there are ways you could get around that, but none of them are pretty, and there is still a limit on how many there can be, a hard limit no matter what, so whatever tricks you use, you're still going to hit limits there. Now, on the DTrace side, we work with a single generic program type: all D programs (D is the language in which they're written, a high-level language) are of one type, one program type. We only have one generic concept of what the probe context is, and all the probes somehow map into that. That's why we have the providers, as I showed on the overview slide: providers group probes together by type, but they expose a generic probe context that is the same for all of them; they just might fill in different information. And we can have many clauses per probe, and a single clause can be attached to different probe types. So there's a bit of an incompatibility here, so to speak, between the two. If we look deeper into the implementation details, look at a D clause, which simply says: I have one or more probe specifications, and I attach some code to them; that code will be executed when one of those probes fires, and they can be probes of different types. So what we do is take the D clause and generate a BPF function from it, and as the function prototype on the slide shows, it gets a dt_dctx context passed to it; that's the DTrace context. 
Now, at the moment this code is generated, we presume we only have one program type and one probe context; how we go from a probe at the BPF level, with its different context, to this, I'll address on the next slide. When the actual tracing code is compiled, we compile it as if it's generic, with no knowledge whatsoever about the probe. It operates on a generic probe context and it can be used for multiple probes, of multiple types; we compile the clause once, and then it can be part of a BPF program for any variety and any number of probes. The way that works is that once we have these functions and we're ready to attach programs to probes, we generate a BPF trampoline program. That is an actual BPF program that we will be loading into the kernel, and that one is specific to the probe, with a certain program type, and it accepts the specific BPF context for that program type. It sets up the generic probe context based on that information, doing the work that in the past the providers would do, and then it calls each clause function, every clause in turn, passing that generic context. This is where we move from doing things the BPF way to doing things the DTrace way. All these clause functions are written in a generic sense and do whatever manipulation they need to do, but without knowledge of what the original program context was. Then we perform any necessary cleanup, so that after the program finishes we're back in a clean state. And I should have mentioned on the previous slide: every clause function is written as if it is the only clause function. It gets passed the probe context, it runs, and at the end of it we should be back in the same state as when we were originally called, except that some data was generated and put in a buffer. So we can have any number of clauses called one after the other, and they don't affect each other except 
in cases where the code is specifically written for that, through the use of global variables, thread-local storage variables, things like that. Before this whole rewrite, or redesign, started, the unit of compilation (and here I'm comparing DTrace before and DTrace now, just to put things in context) was an action. That's not the easiest thing to describe: a clause might contain multiple actions, and actions are usually things that generate some kind of data. Say I'm tracing a write system call: I could write out each of the arguments that was passed to the system call. If I do them independently, each of those data items is its own action; if I do it as, say, a printf, and include all of them in there, that printf is one action. So when looking at a clause, it's not that easy to see the boundaries between the different actions, but that is what actually happens: the compiler breaks the clauses up into actions, and each action is compiled independently. It generates DIF code, which is the intermediate format code, a bytecode, and that gets loaded into the kernel together with some metadata that associates a set of actions with a particular clause, and that in turn with a particular probe. So we're dealing with different units, and when a probe fires, a sequence of actions is executed, one after the other, not a series of clauses, but the effect is the same. And there's a kernel component that provides variable management, like I said: global variables, local variables, TLS variables, all that, and some support functions. 
Now, in the current world, where we no longer have these large kernel components specifically for DTrace, the unit of compilation is the actual clause, so the whole clause gets compiled into a BPF function; obviously we generate BPF code. And because these clauses are compiled once and then used for different probes, which might be of different probe types, we actually had to introduce a linker. It takes the generated trampoline program, which is written in BPF and has function calls in it, plus relocation records that tell us which functions need to be linked in. So we have a linker process that pulls in all the BPF functions that are necessary for this particular program, because right now a BPF program has to be self-contained. It links it all together and that gets loaded into the kernel. So where before we were loading individual actions, we are now loading multiple clauses in one single program, and that is done per probe. Of course, variable management and support functions now have to be implemented in BPF code itself. That is partly done by code that the compiler generates inline whenever we make use of a variable, which means we have to do some memory management as well, and it also uses BPF functions: actual code written in C that is compiled into BPF code ahead of time; the linker pulls those in and links them into the BPF program to provide that functionality. So that brings us to the lessons we've learned in this whole endeavor, and that's been quite a bit. The BPF verifier: I am bluntly honest in referring to it as your enemy and your friend, and I'll go for the friendly part first. Whenever you load a BPF program into the kernel, it is analyzed by the BPF verifier, and the verifier will reject your program if there is any unsafe operation in there. And it's picky. 
It really is there to guard against unsafe BPF programs being executed in the kernel. It is your enemy in the sense that even the least operation that in your mind should be okay, because you know what you're doing, the BPF verifier can still say no to. If I, for instance, know that a register has never been used, or even better, if I know that because of the implementation of BPF a register is initialized to zero, that doesn't mean I can just start adding to it. An add operation involves reading the value, adding to it, and writing it back, and the verifier will say: well, you never stored any value in this register, so I'm not going to let you read from it. So it has taken a while to get the compiler to generate BPF code that the BPF verifier would accept, because I would keep finding these things that I thought were safe, and the BPF verifier reminded me that actually they're not, because they might be safe right now, but down the road they might not be, and so I had to tweak things. It's very important that the verifier exists, but it can be a little painful to work with. The output from the verifier is also quite obscure. It tells you all the different instructions in the program if it fails the load, and it tells you where it fails, but the syntax is not familiar compared to what you typically would get from, for instance, a disassembler. Fortunately, DTrace has its own disassembler, so we changed the disassembler to handle BPF instructions, obviously, since that's what we're compiling to, and you can tell DTrace to give you a disassembly dump of the compiled functions, or of the complete, final linked program right before it gets loaded into the kernel. That can help with finding out where things go wrong, and hopefully we do a good enough job testing the compiler that in general use you're not going to run into issues. 
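As a concrete illustration of the register-initialization rule, consider the two instruction sequences below (paraphrased BPF assembly, not exact verifier syntax or a complete program): the verifier tracks each register's state, and a read from a register it has never seen written is rejected, even when that register would in practice hold zero at runtime.

```
    ; rejected: r3 was never written, so the verifier refuses the
    ; add, which reads r3, even if r3 "would be" zero at runtime
    r3 += 1

    ; accepted: the compiler must emit an explicit initialization
    ; before the first read of the register
    r3 = 0
    r3 += 1
```

This is exactly the kind of pattern the DTrace code generator has to be careful to always emit.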
The D code should at all times be compiled into valid BPF code. We found that the LLVM/Clang-based BPF compiler has some peculiarities; for instance, you must compile with optimization level two, which is not necessarily an issue, but we found, as we were writing some of the support functions that we want linked in, that LLVM and Clang were too specific in what they expected and in the kind of object code they generated. Fortunately, another team at Oracle has added BPF support to GCC and to binutils, so we can now just compile our C code into BPF using GCC and use binutils to generate a nice ELF object that we can use as a library to link code from. So we worked around that problem of LLVM being too specific, and being able to use a more generic compiler has been a great benefit. And finally, we found out, and of course we knew this from the beginning, but it's always good to still highlight it, that when you're making use of existing features, you're impacted by their limitations. I'm not saying this to suggest that the tracing facilities in the kernel are somehow deficient; everything has limitations, and it's not a perfect fit, and that is what I would like to highlight here primarily. We are still running into cases where certain tracing facility features were implemented with a specific need in mind, and that need might not match completely what DTrace expects, and we need to work out whether we can work around that by using them in a different way, or by using different features, or whether we have to open a conversation with kernel developers on what can be done about these limitations, especially because our use cases may resemble other use cases that either already exist or may pop up with other tracing tools, so it can be to everybody's benefit to work through some of these limitations. That's always a very important thing that we keep in mind as we work. 
So, on to the limitations and unsolved mysteries section. One thing that has come up at all the conferences where we've either attended or presented on DTrace using BPF is that many people have a desire for BPF code sharing at the kernel level. What I mean is being able to have certain BPF functions loaded into the kernel almost as a library, preferably something close to dynamic link library support, because, for instance, all the support functions we have in DTrace to handle global variables, to handle TLS variables, string functionality, things like that, are going to be used by all these different BPF programs attached to their individual probes, and right now they have to be statically linked into the BPF program for every single one of those probes. When you start tracing a very large number of probes, the bloat of this extra code can become significant. So that is definitely something that is missing right now. It's a limitation that hopefully a solution will be found for, and I know people have been working towards that, so we look forward to it, and we definitely want to contribute to that effort, because it's to our benefit as well. We also found, and this is an embarrassing one for me personally, that there is no signed modulo instruction in BPF; I had simply assumed there was, which is not that big of a deal. 
We had to work around it, but it highlighted, and the reason I put it in as an example is that it highlights, that certain assumptions have been made in the past, and I still probably make them on a daily or weekly basis: I expect something to be there which is not. That is not criticism of BPF; BPF was designed the way it was, and we have the needs that we have. It would be nice if any processor you write code for had the instructions you want, but you need to work with the instruction set that's available. So it's one of those things we need to be aware of, and we need to make sure that our implementations of these missing instructions are as lightweight as possible. Likewise, memory and string functions do not exist; there is some limited support through helpers, but in tracing tools we are going to be working with strings and with memory blocks. We might want to know what file name was passed to an open syscall, so we need to be able to copy that; we might want to strip off any path elements and just have the base file name; we might want to copy a memory block, for instance if we're tracing network functionality, can we copy parts of the packet data into an output buffer? 
These are all things where we need to figure out how to best do them within the limitations that BPF presents, and those limitations are there for a reason, which brings us to the next one: there are no loops in BPF right now. I have to add that there are also no loops in D, so in a way that is not a limitation, but loops would be very convenient for implementing things like string functions. Since we don't have, for instance, a helper that looks for a character in a string, with a loop that would be very easy to implement; well, we can't, so we work around that with loop unrolling and implementations like that, but loops would be very useful. And looking towards the future: by using BPF, DTrace is actually able to implement some features that we couldn't do before. For instance, conditional statements inside a clause are not possible right now in D, but that would be great to add, and BPF obviously allows it because it supports conditionals. If BPF at some point provides support for loops, that would be a very nice feature to add to the D language as well, so there is definitely a connection point there. Then, standard DTrace SDT probes: that is a work in progress. There are obviously a significant number of probes already available in the Linux kernel, the various tracepoints, and DTrace has for very many years, pretty much from the very beginning, established a standard set of statically defined tracing probes in the kernel, and the various implementations on different operating systems have provided those. So it would be ideal if we can still provide those in a version based on BPF, which means we're going to use a combination of what is in the kernel right now, and even have some code that translates them, not by changing the kernel implementation, but by having some BPF code that will present them to the D clauses as their equivalent SDT 
probes. That can be done; it is just a matter of generating that generic probe context. Now, there are going to be some cases, and we have identified a few, where there is no equivalent probe currently in the kernel, and so we have been evaluating which ones would need to be added in order to satisfy the set of standard SDT probes. That will be something where we will be in communication with the kernel development and tracing communities to see what the best way forward is. The implementation of dynamic variables and associative arrays is also currently an unsolved topic, mainly because it involves a kind of memory management; before, there was support for that in the DTrace kernel module, and now we have to implement it somehow on top of BPF. So that is still a bit of an unknown, with lots of question marks about how we can do this, and do it in an efficient manner, because as always one of the underlying principles is that the performance impact needs to be limited. The ERROR probe, a probe that fires in DTrace when something goes wrong during the execution of probes, is currently not implemented yet, because there is no way to trigger it. It would mean that we should be able to trigger a probe firing from within BPF code, and that is obviously a bit of an issue, because you're dealing with re-entrancy of the BPF execution engine, and that's a whole other thing we don't want to get into. So we're looking at different ways to implement that, to still have the functionality DTrace requires, but again within the context of BPF and what it provides. One of the real unsolved mysteries here is that the ERROR probe typically tells you which instruction caused the error, and there is no access to a program counter within a 
BPF program; you can't see which instruction you're executing, you can't access that from a register. So we're still working on different scenarios for how we can resolve this, or, for lack of a better term, fake it, so that we get the correct information, but again within the context of what BPF provides. And then scalability. That's one that right now we've put a little bit on the back burner, because obviously accuracy of the implementation is more important, and stability is more important, but performance versus scalability is where there are still a lot of unknowns. A lot of other tracing tools will trace some probes; we have been working with use cases where we're enabling thousands of probes, and then suddenly the impact of, for instance, a BPF program taking a full page of locked memory in the kernel, or the BPF maps that store data being allocated in chunks of whole pages, can really start hurting you on the system, especially if you run on a system that has, for instance, 64-kilobyte pages instead of 4-kilobyte pages. This is something we still need to explore, and it's going to put the tracing facilities in the kernel really to the test: can they support this more massive onslaught of probing without impacting the system too much? So, where can you find this? The source code lives on GitHub, in our dtrace-utils repository, specifically the 2.0-branch-dev branch, and I need to highlight that, because of how our development took place, we also have a 2.0 branch, which is the one from which our internal releases are built. The actual active development is on the branch on this slide; that is where everything is pushed after patches have been reviewed and accepted; all development happens there. The 2.0 branch is, I guess, the stable released code, which is usually less 
interesting, because it's going to be several weeks behind the development branch. So if you want to check out what we're doing, this branch specifically will give you everything you want to see, and things have always been tested before they go in there, but that doesn't mean it's guaranteed to be stable; it's pre-release, obviously.

We also have a mailing list, dtrace-devel at oss.oracle.com, which is for any discussion of DTrace. Right now, I would say 99 percent of the traffic is discussing the version of DTrace on Linux based on BPF and other kernel tracing facilities, with occasionally something popping up about what I would call the legacy version, based on the more invasive kernel patch. Patches are posted to this list, and this is where we would welcome any dialogue about DTrace and further development, because the goal really is to do this in a way where we can make DTrace available to the wider community, both as a tool and to welcome people's input and contributions towards making DTrace better and bringing all the power that it has, in full, to Linux. The current version, because the design is still under active development, still poses limitations, and that is to be expected; we want to build up from there as much as we can and do additional development.

So I would very much encourage anyone to check out the code. There is information in the branch concerning some kernel patches that are required right now for the best operation. To highlight them: one provides CTF information, which is basically type information from the kernel, so that you can actually access data types that exist in the kernel, and you can get access to addresses based on symbolic information. There is a patch that makes it possible to wait on tasks, to be able to support user-space tracing, although the actual tracing portions of that are currently still in development.
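Coming back for a moment to the scalability concern I mentioned earlier, here is a back-of-envelope sketch of why page granularity matters at that scale. The probe count and the pages-per-probe figure below are made-up illustration values, not measurements from DTrace: the point is only that when every per-probe allocation is rounded up to whole locked pages, the page size multiplies directly into the total.

```python
# Back-of-envelope estimate of locked kernel memory when each enabled
# probe costs a number of whole pages (BPF program text plus map data).
# All numbers are illustrative assumptions, not measured values.

def locked_mib(num_probes: int, pages_per_probe: int, page_size: int) -> float:
    """Total locked memory in MiB, with per-probe allocations page-rounded."""
    return num_probes * pages_per_probe * page_size / (1024 * 1024)

probes = 8000   # a "thousands of probes" enabling
pages = 2       # assume: one page of program text, one page of map storage

for page_size in (4096, 65536):   # 4 KiB vs 64 KiB pages
    print(f"{page_size // 1024:>2} KiB pages: "
          f"{locked_mib(probes, pages, page_size):7.1f} MiB locked")
```

Same probe count, sixteen times the footprint, purely from the page size: allocations that are negligible when you enable a handful of probes start to dominate when you enable thousands.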
And then there is a very small patch to kallsyms, which essentially associates kernel symbols with the specific module they exist in, whether that module is built into the kernel or not; that just makes it easier to organize the probes. So all of that is in there. The tree gives you instructions on how to build this, and it should build on pretty much any major distribution right now. There are some dependencies; again, they are mentioned in the tree. And in that sense we would also welcome any reports on our mailing list of issues found when trying to compile this on a particular distribution, because the goal really is that this should be a user-space tool for tracing that is not tied to a specific distribution. So we definitely welcome anything there in terms of contributions.

I very much appreciate your attention during this talk. I always find it strange doing these talks virtually; I always like the interaction we have with the audience at presentations. But I hope this gave a bit of an overview of the work that has been done on DTrace. Again, this is still very active development, and we hope that we are able to find solutions for the outstanding problems that still exist, and that all of those will be a benefit to the overall tracing community. Because the underlying goal, by being able to use the tracing facilities that exist in the kernel, is that we want to become one of the tracing tools that makes use of them, and that can therefore provide use cases, provide a way to exercise those facilities, and highlight areas where further development can take place and improvements can be made. And so we can all work together, be it on the kernel side or on the tracing tool side, towards improving the ability to trace the Linux operating system, both in the kernel and in user space. So thank you for your attention.