Hello, I'll be doing the session about tracers, so we are now moving from strace and ftrace to DTrace. Who here knows what DTrace is? Have you seen it, or does anyone here have experience with DTrace from a non-Linux environment? Cool. Let me briefly introduce DTrace. It was first built in 2005, originally for Sun's Solaris. It came out of the need for performance analysis and observability of the system, so that we could support the customers of Sun's service organization. Today it is 2018 and DTrace has been ported to various operating systems. It is on Apple's macOS; it is of course on the illumos forks of Solaris; FreeBSD adopted DTrace; and we are working on Linux — it is a work in progress, but it has been ported to Linux in general. I have even seen an attempt to port DTrace to Windows. So why do we need yet another Linux tracer? The idea behind DTrace is that it is not just a Linux tracer. My goal is really to add Linux to the family of operating systems DTrace has been ported to, not to build yet another Linux-only tracer. A strength of DTrace: it can trace the kernel and it can trace user space. Tracing happens directly inside the kernel at probe time; we will talk about that later. DTrace has its own scripting language, called D. It is closely based on the C programming language, so it is quite simple, and at the same time very powerful. I will show that in the examples. And because it was created back in 2005, we have been using it for over a decade to troubleshoot the production machines of our customers. Another benefit is that there are already many books and documents about DTrace. So it is not a new technology; it is only new to Linux. I think one of the key differentiators of DTrace is that you have everything you need: there is no need for external toolchains or compilers, and it does not produce a kernel module that you would then have to load.
So it was designed to be usable in environments where you, for example, cannot load additional kernel modules, so you cannot load arbitrary things into the kernel. So, that's basically it. Let me briefly describe the basics of how DTrace works. From your perspective as a user, the workflow is similar to eBPF and the BCC kind of tools; well, that is because this is the predecessor of those tools. You write a tracing script in the D language. The script is compiled into bytecode in the DTrace instruction format. We have a virtual machine in the kernel that executes this bytecode. Before we execute anything, we verify that the bytecode is safe, because DTrace was meant to be run on production machines. Part of the protection is restrictions on the language side: for example, you cannot use loops, so you cannot create something that stalls a CPU for too long. And we also verify the produced bytecode itself, because somebody else could send something down to the kernel, so we check it is safe to use. If everything is fine, we can enable the probes and start executing the actions, recording data into the tracing buffers; our user-space component extracts the data from the kernel and processes it further, similar to the other tracers you have seen. So, the key component of the DTrace architecture is a probe. Do not confuse it with a kprobe or uprobe: a probe in the DTrace world is just an event. It is identified by a unique tuple which consists of a provider name, a module name, a function name, and the name of the probe; we will get to this later. So it is quite abstract. Then there is something called a provider. A provider is responsible for creating probes in your system. It knows the internal logic, the implementation of how the probe should be fired and what it means, because a probe does not necessarily have to be tied to an instruction stream or anything like that.
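To make that tuple concrete, a minimal D clause might look like the following sketch (the probe and symbol names here are illustrative, not taken from the talk):

```d
/* provider:module:function:name -- the four-part probe tuple */
fbt:vmlinux:vfs_read:entry
/execname == "top"/          /* optional predicate, evaluated at probe firing */
{
        @reads[pid] = count();   /* action block, runs in kernel context */
}
```

The predicate is evaluated right at the firing point inside the kernel, which is the filtering property described above: rejected firings record nothing into the tracing buffers.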
A consumer is the other part of the architecture: it extracts the data, processes it in user space, and sends your script down. And then there is the framework. The framework is the core component that basically glues everything together. The framework is responsible for managing the probes; if you have multiple consumers, it basically subscribes the consumers to the probes and sends the data through. And yeah, that's it. To give you an idea of what a provider is: for example, there is a dtrace provider, which is just the framework itself. It has three simple probes; I will show you later what they are. We have a profile provider, which is able to create timer probes across CPUs. We have an SDT provider, which stands for statically defined tracing; you probably know it as tracepoints in the Linux kernel, where you as a developer statically introduce some point of interest into the code. Function boundary tracing (FBT) is similar to kprobes: it basically scans the kernel modules for instrumentable symbols and builds probes out of them. The fasttrap provider is now in the works; it allows you to do the same thing in user space. At the moment we support static instrumentation of user-space binaries, but we don't have an equivalent of FBT on Linux yet. And everything is about syscalls, so we also have a syscall provider that allows you to observe and instrument syscalls in the operating system. So how does it look from a binary or deployment point of view? You need a kernel with DTrace support, because we need some architecture-specific bits directly in the kernel, and you need a set of modules. One module, called dtrace, is the framework itself, and then each provider is usually a separate kernel module which can be loaded and adds its flavor of probes to the system.
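As a sketch of how different providers surface in a script, clauses from several providers can coexist in one program (the probe names below are typical DTrace spellings, meant as an illustration rather than the exact probes shown in the talk):

```d
/* dtrace provider: the framework's own BEGIN probe */
dtrace:::BEGIN      { printf("tracing...\n"); }

/* syscall provider: count read(2) entries per process */
syscall::read:entry { @calls[execname] = count(); }

/* profile provider: a synthetic once-per-second timer probe */
profile:::tick-1sec { printa(@calls); }
```

Each clause is independent; the framework enables all matching probes and routes the firings through the same consumer.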
Then we have libdtrace, which is the core of the user space; it is responsible for compiling your scripts and talking to the kernel part: running the stuff, extracting the buffers, parsing the records, handling them. Then there is libproc, which is basically a procfs wrapper. It is a porting layer for DTrace: it adds a simplified interface on top of procfs so it can be ported to other platforms. And then we have a Compact Type Format (CTF) helper library; I will get back to this later. The last thing is the dtrace command itself that you use. It is the consumer, but most of the logic happens in libdtrace. Thanks to this architecture, you can technically remove the dtrace command and write your own consumer using the library APIs directly, or integrate DTrace into your application. On Solaris we used to have a Java wrapper, so you could basically use the library from Java, but you should be able to get DTrace integrated into other languages like Python or Go — basically whatever can bind to the C language from your language. It's not easy, though; I've tried Go and it's not so easy.
Finally, we have a project page which is open source, so if you want to see the sources or the state of the project — we try to keep it updated, it's not always the latest greatest — that is where it lives for now. Now let's move to the more interesting part, which is how you can actually use DTrace and what it looks like. Given that there is not much time to talk about the architecture in detail, I tried to prepare a set of examples so we can see how you can use it in real life. The first thing you need to know is what the language looks like. It was designed to be simple, so if you know awk, you will feel at home: you basically provide a list of probes, identified by their tuples. It can be more than one probe, because you can share action blocks between multiple probe firings; it's up to you. Then we have a predicate. A predicate is optional and allows you to write a D expression; if the expression evaluates to true, the action block is executed in the kernel, and if the expression is false, we just skip it. So it serves to let you quickly filter out stuff directly at the probe firing point. Then there are the actions. You will see some of them, but basically you can work with variables and call things that look like C functions, and that's it. Your script is executed from top to bottom, and you can have multiple action blocks for the same probe; they will be executed from top to bottom as well. Now, to explain the things you can use: we have something we call a subroutine. You can think of it as a function returning a value. A subroutine is used if you, for example, need to convert data or copy something from user space to kernel space or the other way around; it can, for example, return you a pointer to the destination in kernel space for the traced data. Because it has a return value, it can be used in a predicate to make some decision. An action, on the other hand, lives only in the action block. It is a statement, so it doesn't have a return value, and it allows you to record something
into the buffers, or basically to perform some effect on the system. The actions can be executed in two places: straight in probe context, where we basically disable scheduling and interrupts so they happen on the spot, or later in user space. In that case the action is not in sync with the probe firing, but for some actions it does not make sense to do them in the kernel, just to save CPU cycles. Also, some of the actions we support can be destructive to the operating system you are observing, because we have actions like panic, for example, if you intentionally want to kill the host. A destructive action requires the use of an explicit argument. The idea is that we try to be as safe as possible, so you avoid killing your system by some random accident, but we want you to explicitly say you know what you are doing before you kill the system. It doesn't necessarily mean it has to kill the system: with these actions you can, for example, introduce delays at random places in the kernel, so you can mess up the scheduling, interrupts, whatever. So there are some hard limits, and all those actions are blocked from use by default. Now we have basically got through the language description. Well, if you try any new language, the goal is to write a hello world, so this is a hello world in DTrace. I am using the dtrace provider, so you can see the full description of the probes: the provider is dtrace; it doesn't have a module and it doesn't have a function. These probes are kind of special, because the BEGIN probe always fires first, and END always fires last before we terminate tracing, so you can do some init and cleanup there. The ERROR probe fires only when an action block you execute causes some error: for example, if you touch memory that would cause a page fault or something like that, DTrace will silently ignore it and replace it with, say, a loaded value of 0, but it will fire the ERROR probe for you so you can introduce your own handling. If you don't, you will see something like: there was an error on action number this, DIF instruction number that, and so on.
Yeah, you can see in the example we use just the BEGIN probe, and it is not fully qualified, because you don't have to fully qualify it if DTrace is able to identify the probe uniquely — there is no other BEGIN probe in any provider, so you can just write BEGIN. And it has just two actions: the first one prints something, the second one terminates the tracing with a return value. When I was talking about some actions happening in user space, printf is a good example, because printf takes a formatting string and an optional set of arguments, like you know from the C language. If we tried to assemble the final string in the kernel, we would just waste CPU cycles; so what we do is store only the arguments in the kernel and pass them through to user space, and when we are in user space we do the final string formatting and the print. It just balances the work between kernel and user space. One great feature of DTrace is that the framework knows about every probe available to you, and you can list them. You can think of FBT as similar to kprobes; we support only entry and return points of a function, we don't support arbitrary instrumentation, but technically we could add, for example, a kprobe provider and use kprobes underneath. The great thing is that you can list the probes, so when you come to some random system you know nothing about, you can at least, as a first step of debugging, take a look at what is available for you to use. Ah, part of my slide is missing. You can do more: you don't have to list all the probes, you can provide just a partial specification, so you can list just the subset of probes that matches what you are looking for, and you can look at a more detailed description of a probe. As you can see, it's just some kernel identifier, provider name, module, function, and somewhere here is the probe name. In addition — because on Solaris it is usually a lot about stability and backwards compatibility — you have some stability
attributes for every probe. This is interesting if you want to implement a script and don't want that script to get broken, like, the next day when we change the kernel: we basically give you a description of how stable the probe itself is, whether it will still exist tomorrow or not. You can see "private" for the identifier names, which means it is our private stuff and we can change it whenever we want; but for the probe arguments you can see it is "evolving". There is a document about what the possible stability levels are, but basically it tells you: we will probably change this in the future, but not today or on a daily basis, so you can base your script on top of it. And the last interesting thing is that for some probes we can tell you the type of the arguments a probe will get, so you know what structure or type you can access from the script itself. One of the most powerful things in DTrace is aggregations. Aggregations happen directly in the kernel, and they are represented in the D language as a special variable which can use multiple keys. A key can be anything: it can be a stack, it can be a string, it can be a number — whatever we can make a hash out of. And then there is an aggregating function; there is a limited set of such functions, from simple counting to something more complex, which I will also show later. What it does: whenever we record such keys, we have a sort of hash table in the kernel and we apply the aggregating function on top of it, for every CPU, and then we merge the per-CPU aggregations together. So it is quite effective if you need to see some counts, because it happens right there; it doesn't need post-processing, you don't have to send a huge stream of data up for further post-processing later. In addition to aggregating variables, of course, we have some built-in variables which are accessible to you in the D language itself. It's probably better to see the documentation, but just to give you an idea
what they are: there is a curthread variable, which always holds a pointer to the thread, or task, running on the CPU at the moment the probe fires. curpsinfo is the equivalent, but for the process. We have timestamp, which is a CLOCK_MONOTONIC kind of time, so you can order events based on time. We have walltimestamp, which is real wall-clock time. And each probe can have some arguments, arg0 through arg9; everything is exposed as variables. These variables are not typed. If I go a few slides back, you will see there is a slight difference when I list the probe: there is an args[] array, so if we know the types, we expose the args[] variable, indexed by argument number, and you can access the argument directly as a typed value. If we don't know the types, you have the built-in arg variables, which are just 64-bit integers. That's it; now we have enhanced our D language skills, so we can look at a more advanced example. In DTrace you don't necessarily need to write a long script, so I based this on examples of so-called one-liners. Whatever you see here, just imagine there is "dtrace -n" in front and this is passed as a string argument; it is just a one-liner. You are missing some part of it; I will explain. Basically, what it does: we use a profile probe which fires at 997 hertz on each CPU. It's important not to use a round number, so you don't fight with the scheduler — typically 10 milliseconds is the scheduling tick, and that's why the number looks weird. The probe, if you look it up in the documentation, has two arguments, which are program counters: argument 0 is the kernel program counter, argument 1 is the user-space program counter. So whenever the timer fires, you know, based on which of them is non-zero, whether you are in user-space or kernel-space context. And then I have an aggregating variable. The variable is anonymous, the name is just missing, which simply means DTrace will print it at the end, because it is a very
simple script, so I don't need a variable name. I'm using two keys: one key is the process ID and the second key is the executable name. The part you are missing is that it equals count(), so the aggregating function just bumps a counter each time we see the same key-value pair. If you run dtrace this way, you will see it enabled one probe, and the output will be: this is the pid, this is the exec name, and somewhere here are the counts. It is sorted from smaller to higher, so I have 51 for the top process. So basically, in this example we created a user-space profiler with just one line that monitors every CPU and counts every program counter seen in user space. We know it's top because my VM backstage was running just top and nothing else. If we just swap arg1 to arg0, we change it into a kernel profiler, because if there is no explicit test on arg0, it is an implicit test of whether the value of the argument is zero or not. So now we are testing whether the kernel program counter is set to something, and if it is, we aggregate. And remember I was talking about subroutines: func() is a subroutine which takes a pointer and gives you back the string name of a symbol. Again, if you run dtrace, you will see the output has exactly the same shape; it gives you the symbol names and the counts. The most often sampled function in the Linux kernel was native_safe_halt, because most of the virtual CPUs were idle. One last thing: the profile provider is kind of dynamic, so there is no pre-existing probe like 997 hertz. If you ask a provider for some probe, the profile provider will synthetically create it when you request it, and if the probe already exists it gets reused, because the framework deduplicates the existing ones. This is just one interesting detail: not everything must be present from the moment the provider module is loaded; probes can come to life dynamically. Now let's talk a bit about the variables of the D language. We have three scopes of variables: there is a
global scope, which is kind of obvious, but you have to realize that every action block for any probe can fire multiple times at the same time, because you have more than one CPU. There is no synchronization, because synchronization means mutexes or locks, and we are in the kernel holding off interrupts and the scheduler; so global variables are prone to race conditions. Then we have the self scope, which means a variable is bound to a kernel thread: whenever you write something, you can extract the variable back in a completely different probe, because it is bound to the kernel thread. And then there are clause-local variables, per probe firing, which are interesting if you have multiple action blocks for the same probe: you can fill in something local in the first clause and then use it subsequently, to shorten the processing time. Storage works like in a compiler: if a variable is statically allocated, we have memory reserved for it; if it is dynamic, it is basically allocated by your first store, and whenever you set the value of a dynamic variable to zero, it gets garbage collected and freed. If you run out of memory or leak the dynamic variables, you will see errors like "dynamic variable drops" and so on. We have support for arrays, but an array, even when indexed by an integer, is always associative, because it is based on this dynamic variable mechanism. To get quickly to the types: D is based on C. We are not using DWARF, because debug info is not always installed on a production machine and it can be big; DTrace historically is based on CTF. CTF is much smaller, because at kernel build time we deduplicate the DWARF data into a compact CTF description. We put it in the kernel modules, or have an archive next to the kernel which is shipped in the rpm. This way, type information — at least a subset reasonable for DTrace — is always available on a production machine. So if I come back to the profile provider and again just change the aggregation, you can see I'm using the built-in variable, which is curthread,
which has a type, because we have a type system, and I can dereference it, which means I am basically picking an internal member of the task_struct from the Linux kernel and aggregating by name. So it is a profiler that shows you the names of the threads that have been seen on a CPU. Given the time constraints: dtrace has so many options that we probably ran out of the uppercase alphabet. We have -x, so as with a compiler you can provide key=value pairs and change a lot of things. An interesting one is the buffer switching policy. By default we have two buffers, one being recorded into and one being extracted. The ring policy basically overwrites the buffer again and again, and you can, for example, extract it from a crash dump: you can let the machine die, load the dump, and look at the last few things before the disaster. On Linux this would need changes to the crash tool and others, so it is not there on Linux yet. Lazy attach addresses a race-condition problem: normally, if a probe doesn't exist, you get an error; with the lazy attach option, we compile the script and run it anyway, and when the probe appears in the system, we enable it and start processing your stuff. Speculative tracing: sometimes you need to record, for example, function arguments, and decide to keep the results only when the function returns an error, which is quite hard for a tracer to solve. In DTrace, speculative tracing allows you to speculatively store something aside into a different set of buffers and later decide whether you commit or discard the values, depending, for example, on the return value: you can commit if it is an error, and then we copy over the temporary buffers and the data appears to the consumer in user space. Yes — given the time, quickly: work on DTrace is not only about the tracer, because there are well-known statically defined probes across many operating systems. For example, you can see this is a probe from the lockstat provider for the Linux kernel, for every adaptive mutex acquire. So what it does
— this one-liner collects all the stacks and counts them, so this is the stack that was the hottest, the one that most often successfully acquired a mutex while I was running this one-liner. The best thing is that it works on macOS and Solaris too, because they add the SDT providers in the same way. That is another part of our work: to really add Linux to this family of operating systems. Here is another equivalent, but this time we are using a tcp provider: for every TCP receive event in the kernel, we aggregate. Now you can see there is a typed argument; I don't have the details here, but it allows you to get the source IP address. And you can see we use a different aggregating function now: it is quantize, which basically creates power-of-two sized buckets, takes its argument, and increments the count in whichever bucket it fits. At the time I had just done a parallel SSH to my machine, so there was some slight traffic, and you can see I have one bucket between 1 kilobyte and 2 kilobytes, but most packets were like 32 to 64 bytes. So you don't get an exact size, but you see a histogram, a distribution of the values; it is very useful for various I/O stuff. And this is printed for each address in the aggregation, so if I had multiple parallel connections you would see different source IPs, each with a different histogram inside. Now a different example, typical of a support case: someone says, hey, something's going on with my machine, tell me what it is. You don't know the machine, you don't know the environment. What DTrace is very good at is narrowing your viewing field, so you know where to focus. Again, I'm still using one-liners; I haven't written a single complex script. Well, yeah, this one is a bit risky, but basically this script instruments every kernel function entry point, and I use a built-in variable: as I said, the probe is a tuple, and we have built-in variables which give you access to the parts of the tuple of the probe definition, which
is quite effective, because each time the probe fires you know its name, and here we aggregate by the module where the probe resides. So I just run it for a while, terminate, and boom: surprisingly, vmlinux is the biggest candidate — you can't see it, but it's like 700,000 times that something from vmlinux has been hit. So the next step is: OK, what is the kernel doing? We just restrict the FBT probes to the kernel itself and start aggregating by function. Yeah, you can use wildcards, so you can write something like vmlinux plus an asterisk — everything is just a wildcard — and this matches every entry symbol from the kernel itself, and we aggregate by the function name, which is exactly the part we left out of the probe description. What you get: you find out which is the hottest function called in the kernel. Well, this is something that has changed recently on my side, and this one will always be the hottest here, because we are looking at the kernel itself and the machine was idle. So let's take a look at strcmp, because someone was calling strcmp like 12,000 times. Your first reaction is usually: why?
Why are you calling strcmp so often? As you can see, we again narrow down the FBT description and start aggregating based on stacks. This is just the hottest stack, and you can see we are doing some security context comparison, which is apparently based on strcmp. Most importantly, if you go to the bottom of the stack, it is called from the open syscall, so it is the result of user-space activity. At this point you need to know who the hell in user space is doing so many syscalls. So I switch providers, use the open probe, and aggregate based on process ID and exec name. Wow, it's dtrace itself, and it did it 128 times. Interesting. So what next? Keep the syscall, but restrict it to dtrace itself and aggregate based on the first argument of the open syscall. Because the argument lives in user space, we need to use copyinstr() to get it into the kernel. This is one way you can do it; because we have narrowed the field down, you can pick your own tool at this point, since you probably know better than I do how to debug your stuff. And you will see most of the time it was opening something like this: in dtrace we have a call which checks how many online CPUs you have on a system, and that is implemented by iterating over this kind of path, checking, for some reason, for 128 CPUs on my 8-CPU VM. If this were implemented in user space, we could follow with user-space stacks directly from this example in DTrace and just switch over to user-space tracing. So, briefly, since we are running out of time: now the script is more complex. On every open syscall a thread makes, we set a per-thread variable to 1; when we return from the open syscall, we set it to 0; and we trace every kernel function for every thread that has this value set, even though the probe doesn't have an action. So basically we have bracketed the execution between the start and the end of a syscall, and if we use -F, dtrace will change the printing style to show you how you entered and returned from the
probe. We start with the open syscall entry, we end with the open syscall return, and in between is basically everything that has been called — the code flow. This is another nice feature if you want to see how some part of the code executes in the kernel. The last thing is completely different. Let's say you have some telemetry system, you want to see fancy graphs and so on: how many reads and writes do gcc and ld issue during a kernel compilation? This is kind of a limitation of the language, so it is a hack, it is not nice: you have to write everything four times, for read and write and for each process. But the important thing is that every second we create a metric name like dtrace.gcc.write, use the amount of writes we have accumulated, add a timestamp at the end of it, and just clear the aggregation so the count starts from 0 again. So we get one-second-based numbers out of this, and because this happens to be the format of the Graphite database — there is a carbon parser which understands exactly this — I just run dtrace with my ad-hoc created metrics, pass it via a pipe to netcat, and send it off to the database. And from somewhere else: here is when I started building the kernel, and here is where the link time happened, and because the kernel is like 300 megabytes, and in our case three rounds of linking, it's like here — but you can't see it, I can't show it on this computer. OK, a quick one for questions. Are there probes for Java virtual machines as well? Not yet, because they are tied to user space. Well, if there is a USDT provider, then yes; and I think on Sun's Java there is an option which forces the JVM to generate SDT-instrumented code. I'm not so sure about OpenJDK, but I think perf and BCC and other tools do something about it, so I bet there will be. OK: how do you debug your D code? Well, there is an option for the dtrace command which will dump the disassembly in the DIF language, which you can look at. I guess it can be quite tricky to write your D code; technically, the only thing you
cannot trace with DTrace is DTrace itself, and it is always a pain, so usually we need to bridge it from the sources. Question: if you access a structure that does not exist, what happens, how is it reported? Well, basically we do similar things to what kprobes do: if you access something invalid, we have hooks in the general protection fault and page fault handlers; this is the kernel layer of DTrace. Actually, right now I am working on NMIs on x86, because there it just ends up with random corruption of registers and you need to find something clever — no DTrace there yet. I have a good one: some platforms like macOS don't have strace; is there any ready DTrace script that can be used to do something similar? You mean on macOS? Well, on macOS I think the answer is Xcode, because Xcode uses a lot of DTrace for iPhone development and for the performance measurements of your application; I don't think there is a library of scripts ready for you. — I just wonder if there is some simple script that just logs system calls and puts timestamps on all of them. — What you need is to use the syscall provider, enable everything, and then output printf with the probe function and some timestamps. — I would have to write that? No one has actually done that? — It's four lines of code; people just write it. — I thought it would be something simple. — There used to be something called the DTraceToolkit, which was a set of scripts, and I'm not sure if there is such an example in it, but the author of the DTraceToolkit has since moved on to eBPF and BCC. — I don't really understand the code. — (laughter) Thanks, good talk.
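As an aside, the "four lines of code" answer from the Q&A might be sketched roughly like this (an editor's reconstruction, not shown in the talk; the output format is arbitrary):

```d
/* Print a timestamp, process name, and syscall name for every syscall entry */
syscall:::entry
{
        printf("%d %s %s\n", walltimestamp, execname, probefunc);
}
```

Run with something like `dtrace -n 'syscall:::entry { ... }'`; on a busy machine this fires for every syscall on the system, so it produces a lot of output.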