OK, it's working. My name is Daniel Thompson and I am here today to talk about BPF. I put my colleague Leo Yan up on the screen there for the simple reason that he wrote nearly half, possibly more than half, the slides. So he gets a name check at the beginning, but he can't be with us today. That's not moving. There we go. So, this is talking about what is increasingly being called just BPF, as the "extended" gets forgotten. [inaudible] the original Berkeley Packet Filter, which lets you filter packets, TCP and the like, inside the kernel, which is how tools like tcpdump avoid copying every packet out to user space. eBPF keeps that idea but it's been re-encoded: it has more opcodes, it can hook into different parts of the kernel way beyond its original scope as a packet filter, and it comes both with fast helper function calls to get at data inside the kernel, and a rich set of data structures. Particularly when you're using kprobes, it's BPF programs that allow you to get data in and out of the kernel efficiently and effectively. We're here today to talk about using eBPF for kernel debugging. [inaudible] Jon Corbet [inaudible]. There are good reasons to want to run your own code inside the kernel, for performance and for visibility into what it's doing, but arbitrary code in the kernel can do arbitrary damage. The point of running that code inside the BPF virtual machine is that it becomes safe to put your programs on kernel code paths and to get your data back out again.
[inaudible] The usual workflow is that you write your program, compile it to BPF bytecode, and load it. LLVM and clang is actually providing us the compiler for that; to the very best of my knowledge that support hasn't landed in GCC at this point. But other tools can exist. You can generate code directly, you can hand-code assembly if you are so inclined. The key thing is the BPF program will then be loaded into the virtual machine with sanity checking, and it's not even sanity checking, it's full verification, to make sure that it's safe. And it's attached to designated code paths in the kernel. In the case of debugging, you're usually attaching it to kprobes. And then finally, you've got these maps to get stuff out. So think of a map a bit like a Python dictionary. There are other ways you can use the maps, but just to have your mind set up on how the data is coming out: the program inside the kernel will be throwing stuff into a dictionary and then you're retrieving it from user space to display it to somebody. Just a slight thought about tracing. I mean, if you went to Steven Rostedt's talk yesterday, what you can do with tracing is mammoth. Tracing is really awesome for embedded systems particularly. But the key thing you can do with eBPF is aggregate results. So you can gather statistics, performance information, and you can predigest it and present it to yourself, if you're writing the probes yourself, in a condensed manner, without having to go through that intermediate parsing of the trace information coming out. You can also do tracing with eBPF, just as you can with other things, if it's the tool that suits.
So with that kind of setup, we're now going to actually dive down inside how it works. And the reason I'm going to do this is because any debugging tool in the embedded space is competing with somebody who can just recompile the kernel and add some printks. And so it's important to understand the internals so you can decide when you want to recompile the kernel and when you want to bring out a different tool. So it is important for us, more than for some groups, to know exactly what the tool is doing, because then you can decide when you want to bring it out. So BPF is an instruction set architecture. It's a virtual machine. It's a fairly simple RISC-like instruction set. It has some unusual properties: its choice of register size, ten general-purpose registers. I strongly suspect it's x86-derived, because x86-64 has its particular set of registers, with the x86 originals and the eight new ones. It also has an additional register that's read-only to look at the stack. And there's also a program size limitation. So we only have 4,096 opcodes to drag the information we need out of the kernel and copy it into our maps. But you can load more than one program and have them running simultaneously. The instruction set encoding is fairly wide, because that's much easier for software to parse. We don't have to go into all the details, but you can see it's got a very large 32-bit immediate value sitting there on every opcode. It's got an offset, I think it's 16 bits. It's got two different source and destination registers. They're not as tightly packed as they could be, and then you've got an eight-bit opcode, which does expand into particular classes, but I think the reality is that you tend to use a 256-entry lookup table at that point. So essentially the opcodes are encoded to make it comfortable for software to play with. It is very much intended to be handled by software. That's a move. There we go. In most of the, in a lot of the architectures, there is now a just-in-time compiler.
The opcodes were designed to have an easy and quick mapping, particularly to RISC machines. In the case of ARM64, it's very close to a one-to-one mapping for many, many of the instructions. So I've put up a quick map of the links between the eBPF registers, their ABI role in eBPF, and the register that's allocated on ARM64. The one to particularly look at is probably the second line, where you'll notice that the arguments that are passed are literally mapped to the same registers. So by and large, by the time you've gone through the JIT process, the code that's coming out is already essentially conforming to the ABI and is already able to call into C code with zero overhead. The verifier is what brings a lot of the safety properties. You know, it is able to look at the program and verify that it's safe to load into your kernel and will do no harm. And that alone makes it excellent for debugging, because if you've ever added your printks or anything else, as soon as you start adding complex things to your code, the first thing you start doing is debugging your debugging code. So being able to load that and knowing it to be safe is a very good start in making sure that your instrumentation doesn't interfere with your normal operation. So they're loaded from user space, and they run in kernel space, verified. The loader will check that you have licensed your code as GPL, and for kprobes, which can be attached to arbitrary instructions in certain circumstances, it will also check that you have claimed the correct kernel version you're trying to attach to. It's not a complete check because, of course, different configs will have radically different binary structures, but nevertheless, if you're loading kprobes, which are kind of tailored to the kernel you're running on, you need to provide a kernel version number. You can make efficient function calls, but they are whitelisted, for obvious reasons.
In the case of a kprobes program, it's particularly the map functions that you can access very easily. There were a couple of others. You can see in the source code, if you go to each of the different BPF program loaders, there's a call out to check that the whitelist is appropriate. So the whitelist that you have for networking is not the same as the whitelist you have for a kprobes program. And then there's the control flow graph. You can think of this as prohibiting loops, in order that you know that your program finishes in a relatively short, finite amount of time. You're not allowed to have loops inside your eBPF code. So some compilers will help you by loop unrolling; in other cases, the code generator will simply prohibit looping and you have to hand-code it. But that does give you the property that you can be pretty confident when you load it: it will obviously perturb the timing, but the risks to your program are pretty small. It will actually also simulate every instruction and observe the changes in the behaviour of the registers and the stack, which we'll come to in a second. So here you can see the negative numbers there. That's basically saying don't jump backwards. That's going to fail the verifier. So although you can express it in the opcodes, the verifier will prevent it from being loaded into the kernel. And then we're tracking the registers and the stack. And this again is all part of allowing that fast interface to C, because after a call to a kernel function, the R1 to R5 BPF registers, which are the arguments, will potentially have been used as scratch space by the C code, by the time it's come through the ABI and been mapped down to the ABI registers used by your native architecture. So the verifier will, at the end of a call, basically mark those registers inaccessible, and it will then do simulated execution going through.
And if somebody attempts to read a register that is undefined, it will actually fail validation. And so that's a further step to prevent exfiltration. So obviously kprobes are designed to exfiltrate information, but particularly when you run network code and in other places, it's also able to prevent you exfiltrating things that you have no right to know. And similarly, the load and store instructions that act upon the stack, via that read-only stack pointer, will also be checked to make sure that you're reading within your own stack, et cetera, et cetera. So you can see a couple of programs there that deliberately do something wrong. The first one is simply a program that reads a register: it's not allowed to move R2 into R0 because you can't read R2. The second one, the one that's the stack check, is trying to read a piece of stack that it shouldn't, and again, that will be failed by the verifier. So the final thing that comes in is maps. Maps are data structures. They're data structures with a fast, relatively convenient C interface and with a syscall interface. So the syscall interface: everything with BPF goes through this one bpf system call. If you're trying to keep up with anything I say later in the talk, type man bpf into your modern distribution and you will see a description of what those BPF system calls can do. So you can create maps. That will give you a file descriptor. You can then use that file descriptor to call the other map functions, to look up elements, insert elements, delete elements, whatever you need to do. So you can pre-fill information and send it into the kernel. It's much more common to drag information out of the kernel, but you are able to send information in as well if it needs to be guarded in any way. And the way maps work is actually quite cute, because in your user space program you copy the FD, the file descriptor, into one of the immediates of your opcodes.
So you can see here at the top, what will be on your left, there's that red FD, which is where we have inserted a user-space file descriptor directly into a constant within the opcode, and then we've left a blank slot afterwards. And then what will happen is, during the loading process, that file descriptor will be looked up, checking that it belongs to the process which is attempting to load the hook, and then it will be transferred into a regular kernel address, because post-verification we can't change the code, so there's no real risk to that. So it will load that address in. I believe it will then mark the registers unreadable except when you're passing them to a call. And so, effectively, that's like linking, essentially, and the linkage process is novel: it maps file descriptors, and maps the enumerated list of functions you're allowed to call, to their genuine addresses. So by the time you're actually running your code, even if you're interpreting it, all the addresses and the look-ups have been done for you. And that's very, very important for kprobes, because they will not be running in the process that attempted to load them. They'll be running all over the shop. So they couldn't possibly manage to look up a file descriptor at that point; it all gets looked up ahead of time. That's the wrong button, sorry. OK, so by the time we get to the end of the talk it's going to start to look a little bit like magic. What you can do with this is quite stunning. So we're going to start off by deliberately writing a program in assembler. I shan't pretend this is a good idea for any purpose other than trying to dispel the idea that there's any magic happening here. I still couldn't fit it all on one slide unless we used some helper functions. So you will see a few functions, actually, of these: bpf_map_lookup_elem and a couple of others like the program load. You can see roughly what they're doing.
I mean, they're all sitting there, taking their arguments and condensing them into a structure or some other manner to allow you to make the direct system call. So they are fairly thin veneers around the system call. You can type man bpf if you don't believe me, while we're talking, to double-check. But this is our assembler program. I was assuming we were going to have good projectors, and I think we do, so I hope that at least some of you can read that. I've highlighted some of the most important bits. So at the top we're creating a map, because we want to extract some information from the kernel. And then we've got an array that contains all our opcodes, which are simply populated by macros. So one of the hash includes that you can't see off the top of the slide is pulling in a Linux header file, and that contains all these macros that allow us to write out the opcodes into structures, which is all very convenient. I've highlighted the one in the middle, which is the one where we've got that pseudo-op to load a file descriptor, with the slot allocated after it for an extra immediate. What this code is actually going to do, I believe, and I had to read it in the hotel this morning, is count the number of times you call sys_read. So then you've got the Linux version code that I've mentioned: because it's a kprobe program, it will be checked against the Linux version code. If you're loading a Berkeley Packet Filter program, it will not be version checked, because version checking that would cause all sorts of crazy ABI problems. The idea is that because we're debugging, we've got some idea of what we're working on, and they're going to check that. And then at the bottom here, we've got the program to get the output out again, which is calling into the map, dragging out the value and presenting it to the user. And then there's this little function in the middle that I didn't put in a box, attach_kprobe, which we've blown up on the right.
That's how we managed to actually attach the program to the kprobe. So you can see at the top, we've actually shelled out with a system() call, just to make it more compact, because it's quicker than trying to write this in C. But ultimately, it just goes out to kprobe_events and creates the kprobe for us, using the language that, if you were at Steven Rostedt's talk yesterday, is the same language used to describe your probe point. So we get the ID back once we've created the event, and we can then attach things to it. It is as simple as that. So as soon as you've managed to create the kprobe event, you can then attach code to it. You can also attach to static tracepoints, and different tools give you different providers that allow you to attach to things. So we've now learnt what it can do, we've learnt how it works, so we've decided we're going to deploy it. There are now tools that we can use. So we've done it. This is now all adding sugar. But some of these are very high-calorie tools, should we say. So we're going to start off looking at the kernel examples. Why aren't you moving? Kernel examples. These are quite good, because these are exactly the same form as I presented right at the beginning. These are the ones that have the two C files, one with GCC, or clang actually, if you want, building the user-space code. But you're certainly using clang to compile the BPF program. They've all been named with their kernel portion and their user-space portion. But what they have is a slightly richer library coming in at this point. So because we're compiling the BPF program with a tool like clang, we have access to things like magic ELF sections and so on and so forth, which the loader can interpret. So when we look at the actual code, on the left-hand side you've got the code that runs in the kernel. But in addition to the probes themselves, you can see the GPL license, which allows us to load it. We've got the Linux version.
At the top you've got this SEC("maps"). That allows you to enumerate the maps you want to create, and then the loader can parse the list of maps that you've put in your BPF program and create them automatically. So it's all the stuff that most kernel programmers have seen for creating lists in the linker to save you having to type everything out twice. The same sort of tricks are essentially being used here to add value to the loader program and to cut down the boilerplate in your user-space code. And you can see that coming out on the right. So this program in particular was tracing leaks: it attaches a return probe to the allocation function to capture the pointer that was allocated and uses that as the key in a map, and I think it records the time as the value. So it records when something's been allocated, and then later it can remove it from the map when it observes it being freed. So the hook to remove it from the map isn't included on this slide, for space, but it's there as well. And then on the right-hand side, you can see you've got our walk over the map to print out the results. So when we've finished running, we can walk over the map and display the objects that we lost. I think that's it for the sample tools. There's lots of samples in there. As with any of these examples I've put up, you'll find many, many more examples in similar places. So there's lots and lots of ideas about how to use these tools effectively to be found in these various repositories. OK, so we now move on to ply. So ply is one of my personal favourites, I have to admit, because it's totally stripped down and simple. It is not as powerful as many of the tools, but it's a bit like using awk: you can see the syntax is a little bit awk-like. Just like using awk, when awk is the right tool, it's so quick to write something. So it's based on this awk-like mini-language. I would imagine it to have been influenced by DTrace. A lot of the tools in this area are.
It's attached by this thing called a provider, which we'll decompose in a minute. This particular program is basically counting the processes, while it was running, that returned an error code. And it's counting them, and it's actually presenting them to us in numeric order, so the one at the end is the highest count. Ply is unusual. We're going to look at a lot of awk-like mini-languages in a minute, but ply is unusual because it doesn't have the concept of begin and end. It just walks through the code, works out what maps you've created in your code, and then it automatically walks through the maps at the end and dumps their contents to the screen. The idea is that if you put something in a map, you'll see it. So you don't have to write any code at all to visualise the data when you've finished. Which is a blessing and a curse. It's awesome because it's automatic. If it doesn't present it to you in the most attractive way that you would like, tough. OK, so let's decompose it just a little bit. I'll take them on two slides at once. So this is looking at the provider in a bit more detail. So they're called providers in ply. There's a number of them, but trace is the one we're using here. It's going to look at the system calls. It's then saying what system call we want to attach to: sys_exit. And it's actually got a condition. And the important thing here is that when you do kprobe events, there are all sorts of filters and clever things you can add to your kprobe event before it actually triggers. So you can actually have preconditions in the tracing system before you actually get into your eBPF program. And this is using that trace precondition as a predicate to say I only want to run my BPF program when sys_exit returns an error. But then we get to the main bit of the code. So that '@': if there are any Perl programmers here, I won't ask you to raise your hands, for obvious reasons.
But if you're a Perl programmer, you'll recognise this as being like the anonymous variable. So by saying '@' with nothing after it, we're creating an anonymous map. You can also name them as well; you can create multiple maps in your programs if you have the need to do so. But if you're writing a simple program, you do tend to just use that single anonymous map. So it's storing comm, which is the name of the program, and it's then counting it. So there are different ways to aggregate data, but there are two built in. There's this count, which simply bumps the count for each thing. Quantize is quite interesting as well: it allows you to visualise distributions. So one of the common examples in ply looks at the distribution of the size of the reads you've been performing, which can help you identify programs that are doing something stupid across the whole of the system. I have to say, this ldd output is the reason to love ply if you're an embedded developer. You'll recognise that if you're familiar with ldd. It does not have any dependencies that are not the C library. That's a big deal, right? So if you're trying to deploy it onto a different system, A, it's amenable to cross-compiling, and B, when you load it, you haven't got to copy any libraries over; no dependencies. It's easy to cross-compile because you don't have to bring in any extra libraries. So that's one particular reason to love it. You've got your native build there. You do need to point it at the kernel directory of the kernel you're running with, because it gets the Linux version code from there. So if you run it without that kernel directory option on your host distro, what you'll probably find is that it picks up the 4.9 headers that your distro uses as a baseline while it's running on a bugfix 4.18 kernel or something like that, so it won't then load the programs for you. So you use that kernel directory option to make sure you've got your Linux version code correct inside ply.
It also means that when you upgrade your kernel, you have to recompile ply. I do want to actually look into that, but I haven't had time to check if there's a sneaky way around it; I mean, at least a command-line option to allow it to override the built-in version number would be convenient. OK, so we're now moving to more and more highly calorific tools. This is probably the biggest gun in the eBPF world: BCC. It's a program that mashes together clang and scripting languages so that you can write most of your code in a scripting language and you just write your probe point in C. And the code for it looks a little bit like this. So this is, I think, the Python version. It's tracking task switches. So it's got this BPF program at the top. BPF is an object in Python; you give it the argument of the C code you'd like it to build, and it will then arrange for that to be compiled and loaded, and other bits and pieces brought in. It's using this BPF_HASH macro, which is similar to the SEC("maps") version that I showed you before. It's just adding up the calories a bit and creating a more convenient way to declare your BPF hash. But that also means that the loader can still automatically create the hash for you, ready for your program to start, with the hashes visible via Python. So by the time you come out, all the data that was in that hash has been transferred into a thing that duck-types like a Python dictionary. And that means you can add much more high-level tooling. We'll come to examples at the end, where you can see the end point when you've combined the power of BPF to gather information and a scripting language to digest it for you. There is the slight downside that it is much more difficult to build BCC than ply, and that's because it depends on Clang. I believe it interacts with Clang the binary, which means there are ways you can make it harder to build. But yes, it requires a lot more code.
Clang's packaging is a little bit inconsistent between distros, so you get a few interesting problems cropping up from time to time. There's the guide that Leo put together; it's in the slides for you to find. I'm not going to talk through it, but if you're really struggling to get it to build, there's that. Or I think I've got a link to it later in the slides. There is also, did I put this in? So, BCC on embedded systems. As a native build, it has lots of dependencies. It's actually quite intensive; if you have to compile LLVM yourself, that's a lot of CPU cycles and a lot of disk space, which you might not have. So I tend to cross-build it ahead of time on a more powerful machine of the same architecture as my actual target. But there is this thing that was written by Joel Fernandes. He did a cross-build version of BCC, a sort of cross-compiled BCC, and it started to gain attention. He's now more commonly effectively using a chroot or container-like system. So if you want to run this big calorific BCC tool on a modern phone, on an Android system, you actually just chroot in, run it with all its dependencies intact, and then come out again. So that's there at the bottom if anybody wants to investigate it. OK, so SystemTap is one I need to talk about, because SystemTap has been in this space for a long time. It is again a very awk-like language where you attach a probe to something and you do things with it. So this is an example of a probe. So what happens is the SystemTap guys have long had a kernel module backend, where they do the safety checks essentially in their language parser, and then they generate C, compile the C, load it as a kernel module, attach it and run it. And they've added a second backend, a BPF backend. And you can do some really cool things with it, in particular because SystemTap has access to the DWARF information, which means you can refer to arguments by name.
So you notice I've trapped sys_read; I'm printing the file descriptor and the size of the read, and that file descriptor and that count are the names of the arguments in the function. So you could also pick out variables from the things going on. For BPF purposes, though, I've called it revenge of the verifier. The eBPF verifier is more aggressive than the SystemTap language. So SystemTap has lots of features to try and assure the safety of what it's loading, but the verifier is actually more aggressive. So the language permits looping, but 3.2 didn't implement loop unrolling. So if you try to put a loop in, you can express it syntactically, but the compiler will give you an error. If you try to unroll it yourself, naively at least, you can sometimes hit the opcode limit. And the other thing is that some of the sort of superpowers in SystemTap are being able to print all the arguments or all the locals, which is super powerful: it's diving through the DWARF, it's giving you access essentially to every bit of information it can find to present to you. And that causes verification failures. So SystemTap will compile that, but the verifier will then refuse to load it. None of this is a surprise, at least not to the SystemTap developers, because that runtime is early-stage development and currently lacks support for a number of key features. They're aware of it. They've put it out there to play with. It has some awesome potential, but whenever I was using it I got frustrated because of the library code: there are lots of great examples of what SystemTap can do and neat things you can find out, and when you run them, one or other of those three things I've listed in the limitations rears its head and you can't actually run the program. So you go straight back to the default backend. So I don't want to put anybody off SystemTap. It is awesome, but SystemTap with BPF is still a little while away.
And then, oh, two days before I had to give my slides to the Linux Foundation so they could present them to everybody, a new tool was announced, and I felt I'd be selling you short if I didn't tell you about it. So I haven't really been able to run this very much yet, but Brendan Gregg, who is the maintainer of BCC, has blogged about bpftrace, which was written by Alastair Robertson. He's super excited by the tool. He's blogged about it; it's a great blog post. The essence of it, the sort of superpower it has, you can see there: it's hash-included Linux kernel headers, and then when it's attached to vfs_open, it's chasing down the name through structures, through multiple levels of structures, to give you additional information about the thing you're trying to open. And that's tremendously powerful. It's not quite as powerful as having full DWARF, perhaps, because you can't access arguments in the middle of functions, but at the same time the richness with which you can look through structures is just unparalleled, and it's a natural way to express it. This is a picture of its internal structure. You don't have to absorb all of this, and I have to say I've copied these pictures from elsewhere, because it was only two days before my slides were due in. I have credited where the pictures came from, but these are straight copies from their internal documentation. It's directly linked to Clang. So instead of calling out to Clang to compile something, it's actually using Clang's library interfaces to gain access to a C-like parser to handle all those hash-include files. Which is awesome. It means bpftrace has some serious superpowers. But because things like Clang are inconsistently packaged in some of the distros, it also means you're probably going to have to compile Clang yourself.
So I was able to compile it very easily on my laptop, and I'm still trying to build it on my Debian Stretch Arm system, which keeps giving me link errors whenever I get to the end of it. So I can't tell you that much about it, other than the fact that it has awesome potential and it's brand new, and it's definitely one to look at in the future, but I can't give you much more than that right now. OK, so I've got ten minutes left. I'm just going to go through some of the examples that, like I say, if you'd seen these at the beginning, would have been magic. So this one's taken from samples/bpf inside the Linux kernel tree. I'm actually not going to show you the source, because it's quite long and it doesn't present very well. The nature of the samples inside the kernel directory, because they're both written in C, is that they tend to be quite verbose, they're quite big, and it's hard for me to get them on a projector for you. But this is essentially going to look at the various power states that the CPU has occupied during a test case. So, C-states and P-states, I hope I get this the right way round: one is the power state the core has gone to, so that's a P-state; the C-state is the cluster state. There's a lot of power management at work, but what we care about is the residency, the time you spend in each one.
So if you've an idle workload that you're testing, you expect to see heavy residency at, I hope, the lower end of the scale; whichever end it ends up being, I'm not a power management engineer. You expect to see heavy residency at one end of the scale; if you're running a flat-out workload, you expect to see heavy residency at the other end; and if you're running a mixed, time-driven, real-time workload, you expect to see it ticking between different states where it can manage to save power. Power engineers spend their lives running workloads and trying to push things in one direction or the other; ironically it's not always the same direction, because sometimes you want performance and sometimes you want power. So these kinds of metrics are really important. But it really underlines that idea that we're moving beyond trace with these programs: because you can aggregate the information in maps, you don't have to parse trace files to generate it, there's no big intermediate set of processing to do, and you can visualise this kind of information in real time. Now, this particular example is so important to power management engineers that they've actually got the idlestat tool to do most of this already. But I hope you can imagine other things in your working life where that ability to aggregate the states being moved between is helpful, along with the fact that you can build these tools relatively cheaply. idlestat was the blood, sweat and tears of quite a lot of work, if I remember rightly, whereas you can construct these relatively cheaply, which is a great step forward.

OK, so this is the one that, more than any other, I'd view as being magic. It's using that awk-like ply language. If you use awk, you'll know that you spend more time writing one-liners than anything else, and ply has a library of one-liners in its documentation. This is a variant of one I saw there; I did modify it slightly. So just imagine that you're running a system and it's really hammering a library function, preferably not one like memcpy, because memcpy has a different execution time depending on the size of its arguments. So let's assume it's a function that's getting hammered, has a similar execution time whenever it's called, and has come up in my flat profile. I could try doing a nested profile, but that's hard to deploy because the tools are difficult to get hold of: it's easy to do a perf top, but trying to dive down deeper requires me to have more stuff installed on my target. What I can do instead is attach this kprobe. I attached it to kmem_cache_alloc_node, because I knew that would get hit a lot while I was running the test; you can attach it to whatever function you like. The t5 means run for five seconds and, at the end of it, tell me the stack trace that was most frequently used to reach the function. So we've got a dictionary, and the key going into the dictionary is the stack. That means whenever you get another use of the same stack you can increment the count, and you can calculate who has called this function most frequently. And this is why we went through the internals at the beginning, because you can actually see how all this works: gathering a stack is something Linux can do trivially, it's all internal. There are some special cases with the maps, in that a stack is itself a map that you insert into a map, but that's kind of immaterial: it can be an index into your dictionary. And that gives you this immensely powerful one-liner that you can load with a tool that only depends on the C library. It's not a one-trick pony; I can do other things. This is another technique using the stack: where I showed you earlier a C program that was trying to track leaks in kmem_cache_alloc_node, this is a ply program to track those same leaks. You can see this one fits on the screen, and I haven't had to kick out one of the probes. There is a slight hack going on, though, which is that you cannot gather the stack from a kretprobe in ply.
I suspect this is because ply is unaware of the DWARF information and therefore can't work out where the stack got put; but nevertheless, on my platform, you can't read the stack from a kretprobe. What you can do is read it in a normal kprobe attached to the beginning of the function, and store it under a magic nil key in your dictionary. Then, when you hit your allocation in your kretprobe, you can take the return value, which is the pointer to the thing you've just allocated, as your key, and store the stack as your value. And then, if you want to be tidy, you can clean up the zero entry so it doesn't show up in ply's output at the end. Likewise, when the free happens, you can look up the first argument, because that's where, in this case, kmem_cache_free is passed the pointer you're freeing, and you can write the special value into your dictionary, and that will clear out the key. And so if you run it, as shown, I've run it for a second with tracking on, and I can see my map come out. Normally it would be empty: if there are no leaks, you don't see anything; if there are leaks, you will see them come out. I mean, kmem_cache_alloc_node is very frequently used, sufficiently frequently that if you just randomly run this program, then one time out of ten you'll see a couple of allocations that haven't managed to be freed within the window. Obviously, if you're doing a time-based window onto your leaks, you have to remember there might be some things in flight at the point you do the summary; but nevertheless, this will show you the stack trace of leaks of whatever resource you like. The key thing is, you don't have to use this to track memory leaks, because memory allocators quite often have lots of tools to help you find leaks; but if you've got other types of resource with alloc and free functions, you can also track leaks of those very quickly with these kinds of experimental probes.

And so, the final thing I'm going to fit in; the timer says that's excellent. This one is using BCC, and BCC, like I say, is that marriage of scripting languages and BPF. This is showing you how to use a tool that Brendan Gregg has included in the BCC distribution, called tools/trace. On this occasion it's able to take as an argument the stub clk_set_rate function, with the "rate:2" you can see there, which is eventually all going to flow into a kprobe event. The tool is doing all the argument parsing for you and generating the piece of code on the fly, because it's just a piece of text, and scripting languages are good at manipulating text; you can generate the BPF program on the fly in Python, hand it to the compiler, and get it built. Now, I don't want to pretend that tools/trace is in any sense a simple piece of Python code: it's quite a few hundred lines. And if you get more advanced, on to things like (there we go, there we are) argdist, you're getting even bigger chunks of code. But, as I mentioned earlier, you've got this ability to quantise information, and argdist is another way to look at the distribution of an argument to a function. So it's using -I to say you'll find the information in this header file; it's then saying go and look at this function, it has this prototype, and I would like to look at the rate argument, which is the second argument; and it's then gathering together and presenting to you the distribution of that argument. Now, this one is inside one of the power management drivers on a HiKey. clk_set_rate is where the actual setting of the clock is realised: it will send a message to the microcontroller saying change the clock to this speed. So on the Linux side of the universe, this is the last and most detailed view of what power state you're in at any one time, and you can see the residency. In this case it's flicking between the fastest speed and the lowest speed, so it's going from idle to full power for certain workloads. Again, if Leo is doing his power management work, he might well be trying to get it to
occupy the middle ground. But that's just a sample of the types of tools you can see. I would strongly recommend that, if people play with any of these tools, they go and look at what other people have done, because this is a learn-by-example field and there are a lot of examples out there. Quite often you'll even find that the example is exactly what you need; alongside argdist and trace.py there are lots of other tools like that in the BCC distribution.

So, a quick summary before we say goodbye to each other. The hand-rolled ways to work with BPF: pure asm really only has hack value and educational value; I can't think of any practical reason to use it other than to show people there's no magic happening. Pure C: again, you can see right inside it, even at that level all the magic is available for you to see, and there are great examples in the kernel that you can learn from. I don't really like writing my own probes in those languages, just because of the time it consumes. So then we move on to the awk-like set: these are all the tools that attach to a probe and do something. Ply: the key point is easy deployment, especially on embedded systems. It is less powerful than all the others; you have to remember this is the price of that convenience. There are things it can't do: it can't parse the C headers, it can't do DWARF. SystemTap: DWARF parsing is properly awesome, and when you have DWARF parsing compiled in with everything else it's a seriously, seriously powerful technique, but it's not BPF at the moment. If you really want to make it easy to write your probes, bpftrace has the #include superpowers. And BCC I would describe as a tool for tool makers, because you can dynamically generate your C code (which you are going to compile with Clang), you can dynamically parse arguments, you can take all sorts of things and really tailor it to a solution. The point is that if you want to package up a really complex BPF-type probing system, you can provide it in BCC, and you can then say "run my magic tool" and all the stuff will flow out very naturally. So BCC is a truly impressive way to build tools; it is more powerful than anything else on display here, but it costs time to write in a way that the awk-like languages don't. So when you can express it in an awk-like language, the awk-like language is probably the one to use. And, in the top corner: everything is awesome. It really is a lot of fun writing some of these bits and pieces, so many, many thanks to all the people who have worked to make it awesome, because I am more the enthusiastic user than the heavy contributor.

So, yeah, thank you all for your attention. If you have any questions in the future, that's an email address that will go to both myself and Leo, who contributed to the slides. Just say you were at ELC; we will help non-members. If you just say "I was at Daniel's talk at ELC and I didn't understand this bit", or whatever else, we'll be happy to help. I am at the limit of my time, so I'm probably going to have to take questions in the hall if anybody wants to talk to me. I haven't seen the next presenter arrive, but at the same time I don't want to overrun, because it's unfair to other people. So thank you all very much for your attention.