Our next speaker is Brendan Blanco with in-kernel low-latency tracing and networking. Hello everyone. I'm Brendan and I'm here representing the IO Visor Project, which is a new Linux Foundation collaborative project. It's about six months old, so this is my first time at this conference and it's good to be here. What I'm going to talk about today: first, a little bit about the motivation of the project. Then I'm going to introduce you to something called eBPF, which is a feature in the Linux kernel. I'm going to talk about the BCC toolkit, which is one component of the project, and I'll also show how Clang and LLVM are used by BCC in interesting ways. We'll do a few demos for networking and tracing, and I'll leave time for questions. While we're going through this, even though it's a fairly technical talk with a fair bit of content, if you have any questions, do feel free to speak up as we go, because I want to make sure everyone understands; this is meant to be educational. A little bit of bureaucracy: these are some of the founding members of the IO Visor Project. I work for PLUMgrid, which is one of the core contributors, and I spend most of my time contributing code. So IO Visor actually goes back a long way. In my work at PLUMgrid, I've been spending time building network applications for SDN, for cloud, for data centers. That requires writing infrastructure applications. It's not the same as writing a web application or something that mostly talks to databases; it's a bit low level. As we were building some of these applications, we went out looking for toolkits we could use to extend that functionality, and we came up short. So as we built our product, we ended up building out that SDK ourselves, and now we're contributing some of that toolkit back. For some of the things we want to build, you need to extend the Linux kernel; you need to add functionality. That can be hard for someone who knows networking or storage or other high-level concepts from a different discipline, because writing code for the Linux kernel has a barrier to entry. You have to write a kernel module or build your own kernel, and there are a lot of rules you have to follow to build into the kernel safely. How do you live in both worlds? That's the question we were asking ourselves, and we think there should be a better way. So let me give an analogy to Node.js. I don't actually code much in JavaScript, and you can like or dislike that particular language, but it has some interesting points. Writing multi-threaded applications is hard. That's always been true, and it will probably be true for a long time. If developers have a problem they want to solve, but their toolkit is built toward how the computer works rather than how the developer thinks, you're going to have friction. Node.js has a different model than what's typical, and we've seen applications built on Node.js take off because the syntax, the event-driven framework, models the thought process of the developer rather than the computer itself.
At the same time, because there are smart people building that language and tooling, you don't have to sacrifice the things you'd want when writing a server application. Node.js uses the V8 engine, which is pretty good at translating JavaScript, a high-level, fluid language, into machine code, and it does that on the fly, which is impressive. Combine all that with a repository of modules other people have written, which you can put together in interesting ways, and you get a nice velocity for the web applications you're trying to build. That ends up building a community and moving the ball forward. So what would you need to do the same thing for infrastructure applications? Well, the environment is completely different. Maybe you're getting data from A to B very quickly, or you're getting data from your database onto your disk, and it needs to be done in a certain way, so you have high performance requirements to meet. If you're writing for the Linux kernel, you can't crash it. The servers we're writing for should ideally be up for years, and you want these things to be reliable. And if you're developing quickly, you can't reboot the system every time you have a new piece of code to try, so you'd like in-place upgrades. You'd also like debug tools: visibility into the infrastructure apps you're writing. Especially nice would be a programming language abstraction that matches the problem you're trying to solve. C is the language of the kernel, and that will probably always be the case, but C isn't necessarily the best language if you're writing for networking. Packets, for instance, have a very well-defined structure, and a lot of thought has gone into how a networking-specific language would look. There are people developing the P4 language, which is a way to write switches or network device implementations in a high-level language, and they have some smart people working on compiling that into other implementations; we actually have some collaboration with them. So we want to build toward the kernel, but you don't want a custom kernel; you want this to be something that's upstreamed. Doing this with kernel modules works, but depending on who the customer is, maybe you don't trust the person that's shipping the kernel module, and if you could do this without turning on some painful flags in your kernel config, that would certainly be nice. And like we said, you don't want to reboot your system. Not all of these are hard requirements, but they're all nice to have. We can work in an infrastructure that lacks some of them, and for some of the problems we're trying to solve there are already solutions that may or may not satisfy these restrictions. So, looking at that problem statement, the IO Visor Project is the engine and the tools and the community, all put together, trying to enable people to write applications for infrastructure, for data centers, for moving data around in their systems. So let me show something in action, just to whet the appetite: one tiny application that I've written in the toolkit that we have, which we call BCC, the BPF Compiler Collection.
This tool you can apt-get install or compile yourself. It starts with a Python interface, and we're going to run this program, which is going to attach this little C snippet. Every time a new process on this laptop is spawned (and I'm going to run this on my laptop, because it's a live demo, not a simulation), it's going to run this C program. It hasn't happened yet. Here, I spawn a new terminal... and: hello world. Now I have something else I want to write, so let's change the program. So that's our hello world for infrastructure applications. There's an infrastructure the kernel provides called trace_printk: printks from the kernel that are written to a trace buffer, and the syntax for these is documented. If I'm not mistaken: we obviously have the process, then the CPU ID that the particular printk is coming from; these are a series of flags that represent the context at the time of the print; here's your timestamp since the system was booted; and here's your message. This infrastructure is not something that IO Visor provides. It's coming from the Linux kernel, and there are a whole bunch of other pieces of infrastructure that use the printk, kprobe, and tracing infrastructure in different ways, so we're leveraging that. So let's dive in a little bit. BPF is one of the building blocks of this tool, and we call these BPF programs. In a visual way you can see the different components here. There's the user space component that we saw, a Python application where you're driving what's happening in the system, and there's a system call interface between that user space component and what's in the kernel. Inside the kernel there are different hook points where you can attach to a particular event, and you can attach your function to those events using various pieces of the kernel infrastructure. The functions that you run are those C programs we saw; the hello world was a single-function example. What I haven't shown yet, but will a little later, is that you also get access to a set of tables, hash tables or arrays, that the program can write data into and read data from, and that user space can write data into and read data from too: a clean API between user space and kernel space to get data back and forth in a programmatic way. BPF is a pretty old technology. It started in the early nineties in other OSes and came to Linux soon after, primarily used for capturing packets: looking inside the data of a packet and deciding whether you want to capture it. The problem with doing that in user space is that, while you could do whatever you want to analyze the packet, you have to copy the data, so it can be slow; hence this in-kernel infrastructure for looking into the data. Starting in kernel 3.18 we started to upstream some extensions that I'll go into later. We call it a BPF program, and a lot of the documentation in the kernel calls it a program, but it's not really a program: there's no process ID. It's really just an event handler, a small piece of code that gets executed. Think of it like a little scripting language inside the kernel: there's an instruction set, and an interpreter that takes those instructions and runs them for you. For more details there's a man page if you have a recent system.
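For reference, a minimal BCC hello world along the lines of that demo looks roughly like this. This is a sketch, not the exact demo code; the kernel symbol for the clone syscall varies by kernel version, which is why BCC's get_syscall_fnname helper is used here:

```python
#!/usr/bin/env python
# A minimal BCC "hello world": run a tiny C program in the kernel
# every time a process calls clone(), printing via trace_printk.
from bcc import BPF

prog = """
int hello(void *ctx) {
    bpf_trace_printk("Hello, World!\\n");
    return 0;
}
"""

b = BPF(text=prog)                      # Clang/LLVM compile C to BPF here
event = b.get_syscall_fnname("clone")   # e.g. sys_clone on older kernels
b.attach_kprobe(event=event, fn_name="hello")
b.trace_print()                         # reads the kernel trace buffer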
So the instruction set. Originally it had just two registers and no stack; you could load data from a packet and give a return code. Very simple. Some of the extensions we added to really enable new functionality were to expand the number of registers, add a stack, add conditionals, and add the ability to call functions. That comes with a big caveat: you can't call just any function. There's a very, very restricted set of helper functions you can call. And there's access to data structures: maps, meaning hash tables and arrays, where you can do lookups, updates, and deletes. Some examples of the helper functions available would be things to work with packets and things to work with kernel memory; we'll show details of that later. The events you can attach programs to are a limited set, with more coming in the future. Kprobes and uprobes are kernel functionality for effectively setting a breakpoint arbitrarily in the kernel: if you know an interesting memory location of a function, you can set a kprobe there to print something. That's the original infrastructure. It was extended so you can run a BPF function, say, whenever another kernel function is executed, and it does that every time. So that's one way you can attach BPF programs. Socket filters are the original use case: with tap or raw sockets, you can capture data. A recent one, which is pretty interesting but which I haven't used myself, is packet fanout. Suppose you have a protocol with multiple streams, something like SPDY, HTTP/2, or QUIC, that needs per-packet load balancing to different sockets, with applications listening depending on what's coming over that connection. Seccomp is an interesting use case: for every system call that's made, you can choose to run a program to determine whether a process is allowed to make that system call. This is a pretty powerful security feature; Chrome, I think, is one example user, running a BPF program to take a logical decision to lock down different tabs in your browser. The ones I'm most interested in are TC filters and actions, for packets coming in or out of an interface. With this, every time a packet comes into a network device, you can choose to run a program, and you can modify the packet. You can drop it or allow it based on your own criteria; you can forward it to a different network device than the one it was originally destined for. So you can actually implement new behavior inside one of these filters and add new functionality based on that. So we have all this power, and now the question you should ask me is: why should I trust you, or trust these programs? And the answer is: well, you shouldn't. You should trust the Linux kernel to check them for you. When I take one of these programs I've written and load it into the kernel, I have to tell it what type of program this is. If I'm writing one of the kprobe type of filters, I'll say this is a kprobe function. I say: kernel, here, take my BPF instructions, this set of instructions, and load it. And the kernel is not going to trust anything. It's going to look at all the pieces of that array of instructions I've given it and make sure it's valid to run this program.
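As a concrete illustration of the map API just mentioned, here is a hedged sketch of a BCC program that counts clone() calls per process in a hash table; the table and function names are made up for the example:

```python
#!/usr/bin/env python
# Sketch: count clone() calls per PID in a BPF hash map,
# then read the same map from user space.
from time import sleep
from bcc import BPF

prog = """
BPF_HASH(counts, u32, u64);            // map: PID -> event count

int count_clone(void *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);             // lookup-or-init, then ++
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="count_clone")
sleep(5)
for pid, count in b["counts"].items():  # same map, read from user space
    print("pid %d: %d clones" % (pid.value, count.value))
```

The same table is visible to both sides: the C program updates it in the kernel, and the Python side iterates it through the file-descriptor-based map API.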
So if the program makes a function call using one of these helper functions, say to modify packet data, it has to make sense in context: you can't modify packet data in a kprobe, because a kprobe is related to kernel functions, not packets. That wouldn't make any sense. Then the kernel takes the set of instructions and verifies that it's a valid program. What does that mean? A valid program is something that will execute in the kernel safely. It shouldn't have any loops, because a loop could be an infinite loop: I've injected code into the kernel that's consuming an entire CPU, and it's going to run forever. It shouldn't have any illegal instructions, so the kernel verifies that the encoding of the program is exactly right. And it verifies that any memory the program accesses, reading or writing, is only memory that belongs to the program: on the stack, or handed to it by a valid helper function. So you can't dereference a null pointer from one of these programs. That's an easy one to demonstrate. Let's take my hello world and just add a null dereference; that should be interesting, right? Nope. Here we can see the BPF instructions that the kernel received, and you can see it parsed them into its data structure and checked that it's not accessing any bad memory; in fact there is an invalid memory access here, so it went no further. And once the program is verified, the kernel takes those instructions and, if you're on the right architecture, a JIT will compile them to your native instructions. So the programs you're loading will run at full speed if you're on x86, ARM64, or s390. Who here knows what s390 is? Okay, not bad. It's IBM's big honking machine; I actually saw one at LinuxCon a couple of months back. So, just to revisit: these are BPF programs, and you can do lots of things. We're looking at just the C subset of the programs we're loading, but for the workflow itself, we're building up a toolchain that takes your program, which you should be able to write in a high-level language like we said, and runs it in the kernel. We have a workflow that leverages LLVM to take a C program, or really any program you can get into LLVM; we actually have some examples where a custom language generates the LLVM intermediate representation, and you can also write C. The Clang front end has a BPF output option. But there are some quirks about the programs you can run in the kernel. Take the trace_printk we had: we passed the printk format string as one of the arguments, with our hello world in it. In C, when you declare such a string, it goes into a global section. But a BPF program doesn't have a global section; it doesn't really have any sections. It's not an ELF file; it's just a series of instructions that you're giving to the kernel. So you have to somehow trick the program into using a string that's on the stack, and there are other such restrictions that we can show. Taking that first one as an example: we want to convince Clang that the string is on the stack, and to do that we go through a process where the BCC library we've written takes the C program through this workflow to create valid BPF instructions. If you've never worked with Clang or LLVM, there are actually some really cool things you can do, and I'd urge you to go try some of them. For instance, there's an interactive C++ demo you can do.
You can write C++ on the fly and, in the same process, it keeps generating new instructions for it. That's actually one of the features we use: there's a JIT that can take a valid program, the intermediate representation of the program, and convert it to native code. Here's a little more detail about the process that's going on. There's a rewriter that takes the C program we feed in and converts it into another C program. Both are C programs; it's just that the second one is something that will convert to valid BPF. After that, the programs go through the standard LLVM workflow, so you'll be generating optimized BPF instructions rather than a naive translation. For example, take the trace_printk we had in our hello world. If you were to load that trace_printk without the translation, it would be rejected by the kernel, because it's accessing an invalid pointer. But the rewriter basically expands it as a macro. You could actually have used a C macro in this case, but for some of the others you couldn't have. And it rewrites the program into this other syntax. Here's another example. When you're running a BPF program, like we said, you're only allowed to access memory that's on your stack. But if you want to do something interesting in the kernel, you probably want to do things like pointer dereferences, and if your pointer points somewhere that's not on the stack, well, what do you do? In this example, we'd attach this program as a kprobe to, say, look at the previous PID when the kernel does a task switch. Every time the kernel schedules from one process to another, you might want to do something interesting with the previous PID, so you'd write this syntax. But you can't actually do this: there's no arbitrary dereference operation in BPF that you can use. What you have to use instead is a helper function, bpf_probe_read, which reads kernel memory and does a runtime check that the pointer you're asking for is a valid pointer, then fills in the data structure for you. It's basically bounds-checking your pointer, and it's doing that at runtime. Here's another example. If you were writing a network BPF program, you might want to access the source and destination IP addresses in each packet. To do something like that very simply, you could write a program that does the arithmetic to figure out the address offset from the start of the packet. But to simplify these programs, you can translate this to a helper function that reads the contents of a packet; this one, for instance, reads from a given byte offset, here 16, and gives a 32-bit result. And going back, there's another example here: we have the hash tables and arrays we can use inside these programs, and the implementation of that is actually a little interesting. When you load the program, a map is created as a file descriptor that's unique to your process; it's not a named data structure, it has to be allocated for you. So to use the map functions, you actually have to know the file descriptor of that table and pass it to the helper function, and the rewriter assists you with that. Any questions on the rewriter?
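To make the pointer-dereference case concrete, here is a hedged sketch of what that kprobe example can look like written out by hand in BCC, with the bpf_probe_read call explicit. The hook point (finish_task_switch, whose first argument is the previous task) is an assumption chosen for illustration:

```python
#!/usr/bin/env python
# Sketch: read prev->pid on every task switch via bpf_probe_read.
# finish_task_switch(prev) receives the previous task_struct, and
# BCC maps extra C function arguments onto the probed function's args.
from bcc import BPF

prog = """
#include <linux/sched.h>

int on_switch(struct pt_regs *ctx, struct task_struct *prev) {
    u32 prev_pid = 0;
    // A plain dereference of prev->pid is not a BPF operation; the
    // helper bounds-checks the pointer at runtime and copies the data.
    // (BCC's rewriter will normally insert this call for you.)
    bpf_probe_read(&prev_pid, sizeof(prev_pid), &prev->pid);
    bpf_trace_printk("prev pid %d\\n", prev_pid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="finish_task_switch", fn_name="on_switch")
b.trace_print()
```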
You can, and we should be able to do that, for instance, with a debug flag that's exposed through this API. Let's take out the null dereference first. Actually it's a one-liner, so you can only see it here at the end: that's the trace_printk, and you can see the translated output here. There's a runtime flag to do that. With some of the more complicated examples here this isn't that interesting, but there are some interesting examples of the rewriter, and I encourage you to look on the GitHub and experiment with them. So I have one demo I want to show, switching gears and looking at networking: what's something you could build with this toolkit for networking? Suppose I had a set of machines, maybe running some cloud workload, with a VXLAN tunnel established between multiple machines, doing an overlay. That's something that's very typical in, say, OpenStack environments. I might have some troubleshooting I'd want to do, and I have an idea for how to add some metrics or analytics to that; I'll show you what I mean. So I'm going to start up a simulation here where we have nine virtual machine hosts, or container hosts, or whatever, that establish a VXLAN tunnel between them. Within each of those hosts I'm going to start a whole bunch of clients talking to each other over that VXLAN tunnel, so let's get a whole bunch of things going back and forth up here. This tool took me maybe a week or two to write, as a demo on top of the toolkit. What I can see here, for instance, is all of the hosts that are talking to each other and what bandwidth they're using. Here, for instance, I'm running this particular analysis on 172.16.1.100, and it's looking at all the other hosts this host is talking to. With this visualization I can hover over the different endpoints and see the utilization. And that's fine; I could do that with tcpdump or NetFlow or various other tools that can do the same thing. But I can also filter. This is a chord diagram implemented in D3, so the front end is D3, and there's a back end in Python that's collecting the output of the BPF hash table holding the statistics in the kernel and presenting it as JSON whenever the browser requests it. So I could, for instance, filter by VXLAN ID, so I can see the different tunnels going across these hosts, and filter by them. I can also dive down into a particular endpoint I'm interested in, to see which inner IP addresses are being carried over it. So I can filter by endpoint and see the packets inside the encapsulation: I can see the statistics for this 172.16.0.3 talking to 0.1 and see what contribution that's having. So let's add a noisy client in there somewhere. Suppose I had some trouble on my network and some service wasn't reachable. I could add this to the tools in my tool bag, so that while I've been paged and I'm under the gun to figure out what's going wrong and stop the problem, I can look at this graph and say, you know what, there's definitely a problem between these hosts, just with a simple glance, and in fact I can tell you exactly who is consuming that bandwidth.
And I can go and track down that client using whatever other tools, stop them, give it a couple of seconds while the bandwidth evens out, and we're back to normal. So how do we do that? Here's one of the more complicated BPF programs, which can parse IP packets and keep counters across multiple layers of encapsulation. There's various C code here I want to show; you don't have to memorize it while you're watching. But we can see that it parses the outer packet, keeps some statistics in a helper location, calls another helper function to parse the inner packet, and then combines the inner and outer IP addresses into a key and increments statistics on the number of packets and the number of bytes seen for that tuple. So that's how we wrote that program, and there are lots of other things possible; like I said, this is just the start. A sketch of that counting idea follows below.
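The following is a heavily hedged sketch of that idea, not the demo's actual code: it counts packets and bytes per (outer IP, inner IP) pair with a socket filter. The fixed byte offsets assume untagged Ethernet, IPv4 with no options, UDP, and VXLAN on every packet, and the device name is an assumption; the real demo does proper parsing across the encapsulation layers:

```python
#!/usr/bin/env python
# Sketch: per-(outer,inner) flow counters for VXLAN traffic.
from bcc import BPF

prog = """
struct flow_key_t { u32 outer_sip, outer_dip, inner_sip, inner_dip; };
struct flow_leaf_t { u64 packets, bytes; };
BPF_HASH(stats, struct flow_key_t, struct flow_leaf_t);

int count_tunnel(struct __sk_buff *skb) {
    struct flow_key_t key = {};
    // outer IPv4 header starts after the 14-byte Ethernet header
    key.outer_sip = load_word(skb, 26);
    key.outer_dip = load_word(skb, 30);
    // inner IPv4: eth(14) + ip(20) + udp(8) + vxlan(8) + inner eth(14)
    key.inner_sip = load_word(skb, 76);
    key.inner_dip = load_word(skb, 80);
    struct flow_leaf_t zero = {};
    struct flow_leaf_t *leaf = stats.lookup_or_try_init(&key, &zero);
    if (leaf) {
        lock_xadd(&leaf->packets, 1);
        lock_xadd(&leaf->bytes, skb->len);
    }
    return 0;   // capture nothing; we only keep counters
}
"""

b = BPF(text=prog)
fn = b.load_func("count_tunnel", BPF.SOCKET_FILTER)
BPF.attach_raw_socket(fn, "eth0")   # interface name is an assumption
```

A user-space loop (or the demo's Python/JSON back end) would then periodically read b["stats"] and hand the counters to the D3 front end.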
So, some things we're thinking about for the future, where we want to take this, at least from the networking point of view: to create this idea of IOModules. Fungible pieces of code, kernel space and user space, that you can download and use to add new functionality to your system at runtime, without impacting production. Remember the Node.js analogy from the introduction: an npm-repository-like thing of IOModules that you can download and just run. We also want to enrich the SDK: right now there's a Python interface that's pretty much direct access to the C API, and we want to clean that up a little, add some nicer APIs on top, so that you can, say, combine multiple modules together, connect them, and create maybe a simulated network, or real applications, using that. We're also working on a UI, so that when you're connecting those modules together you can see what's really implemented on your server, like how packets are flowing through the various components, and be able to deploy them on the fly: download them, create new topologies, or create new use cases. Okay, so at this point I'd actually like to hand off to a colleague who's helping out with the project, from Netflix. Question? Well, the way the Python API is implemented right now, you can't do more than one; we'd have to change the API a bit. The kernel does support more than one, and they run in the order that they're attached. Actually, one of the features within these programs is the ability to call other programs from a program, so I could conceive of a program you attach to a kprobe when you have more than one, so you can customize the order. That's actually possible. First, go ahead. So the question was how I see this fitting in with unikernel development. If those unikernels support BPF as one of the attachments: it's a very lightweight and robust tool, it's usable in lots of different places, and it doesn't have a lot of requirements to do the C-to-BPF translation. The way we built the library, it's a single .so, so you can compile that off-box and take it along with you. LLVM and Clang itself is built to be a library, so you can package it in that form and it works just fine. I don't know if anyone has tried that, but I think it's possible. So the question was what happens if you try to load two hello worlds: the second one would stop. There's a single place in the kernel filesystem where you attach kprobes, and the way it's implemented right now, each one has a unique name depending on what you're attaching to. So if you're trying to attach two hello worlds, both to sys_clone, they both have the same name, so you can only attach one of them. Let's switch into the next talk, and he will answer that question. Do you want to use this or that?

My name is Brendan. I'm also a Brendan, and if you join us to work on the BCC project, it might be easier if we call you Brendan as well. BCC and BPF can do lots and lots of things, so much so that it can be difficult to really comprehend it. So one thing I'd like to do is show a particular use case that BCC makes possible. I've been creating various tracing programs for BCC and publishing them. These are all open source, and the one I'd like to start with began with a post I did on the 18th of January, so this is new stuff. eBPF allows us to write arbitrary programs that the kernel will run. One thing that the Linux kernel has not been able to do, for a long time or forever, is frequency counts of stack traces. I do performance engineering at Netflix, and I deal with stack traces all the time: I use them for profiling CPU usage and for looking at blocked code paths. So using eBPF and BCC, I was able to hack in the functionality so that the kernel can collect stack traces, here on the submit_bio function, and then frequency-count them. So while I was tracing here, I hit this code path one time, and I hit this code path 79 times: I'm submitting block device I/O via vfs_read, ext4, and so on. This is a fairly basic capability. It's really useful to have, and it exists in other tracers. It's really useful for kernel exploration: I'm trying to get my head around this function, what code paths lead to this function. But what's important is that this works on stock, standard Linux 4.3, and how I did this was actually by hacking eBPF, because eBPF can do all sorts of crazy things. We're going to make that a first-class citizen with some API calls for stack traces. What's important is that now that we can do this, we can start to build other tools. So, who's used flame graphs before? Excellent, at least a dozen people. Flame graphs are a visualization for profiled stack traces. This is an example of a CPU flame graph, and I'm using it to understand why I have system time, so time in the kernel. The x-axis has no meaning; it's not the passage of time. I just flip around the stack samples to maximize merging; it's actually alphabetical. The y-axis is the stack depth. And the color is random, so don't worry about colors for this one. What's useful here is: the wider it is, the more often it was on CPU. So I can look at this flame graph and say, well, we spent 57% of our time in do_page_fault. That's great. I can investigate others, like here's sys_open. So why was sys_open on CPU? Well, sys_open called do_sys_open, and that called path_openat and walk_component and so on. The top edge shows you what's on CPU, and everything beneath it is ancestry. So flame graphs have been around for a while. They're really useful; we're using them at Netflix to solve issues. We got them to work with Java, so we can do mixed-mode flame graphs where you can see the Java code and go back into the kernel. That's fantastic. They solve on-CPU issues: when you have CPU usage and you want to know why, you want to split it up.
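As an aside, going back to the stack-trace frequency counting from a moment ago: in today's BCC, where stack walking has since become that first-class citizen, the idea looks roughly like this. A sketch, assuming the BPF_STACK_TRACE table that was later added rather than the original hack:

```python
#!/usr/bin/env python
# Sketch: frequency-count kernel stack traces leading to submit_bio.
from time import sleep
from bcc import BPF

prog = """
#include <uapi/linux/ptrace.h>

BPF_STACK_TRACE(stacks, 10240);    // storage for unique stacks
BPF_HASH(counts, int, u64);        // stack id -> hit count

int on_submit_bio(struct pt_regs *ctx) {
    int stack_id = stacks.get_stackid(ctx, 0);  // kernel stack
    if (stack_id >= 0)
        counts.increment(stack_id);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="submit_bio", fn_name="on_submit_bio")
sleep(10)

stacks, counts = b["stacks"], b["counts"]
for k, v in sorted(counts.items(), key=lambda kv: kv[1].value):
    for addr in stacks.walk(k.value):
        print("  %s" % b.ksym(addr))    # resolve kernel symbols
    print("count: %d\n" % v.value)
```

The aggregation happens entirely in the kernel; user space only reads the summarized counts at the end, which is what keeps the overhead low.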
There's a whole other category of off-CPU issues: when my application blocks because I'm waiting for my turn on CPU, or I'm blocked on a lock, a synchronization lock, a condition variable, disk I/O, network I/O, and page faults as well, like when it's doing swaps and I'm waiting to be swapped in. How we traditionally tackle all of those other issues is with a variety of tools: let's use iostat to look at disk, let's use tcpdump to have a look at the networking, and it's a lot of work. In the kernel, the scheduler deals with all of these off-CPU events; the scheduler switches the thread off. So for a long time, we've wanted to attack all of those off-CPU issues with one approach, and that is to instrument when the kernel takes you off CPU and look at the stack trace, because the stack trace tells you why the kernel was taking you off. The stack trace will say: I'm in a page fault, or I'm in block I/O, or I'm on a condition variable. That was not possible on Linux unless you went and got an add-on tracer. But now, eBPF and BCC have the ability to use kprobes, which is kernel dynamic tracing. I can trace the kernel scheduler functions and I can pull out stack traces. And so now I can do kernel off-CPU time flame graphs, which is really exciting, and it's complementary to on-CPU flame graphs. This shows the stack traces where we were blocked; I'm just showing the kernel path. So I can now see make; I was actually doing a Linux build for this one. Make was blocked in sys_read, vfs_read, pipe_wait. It must've been waiting for something else; I guess the make command had launched some other sub-command and was waiting for its output, if it's in pipe_wait. Oh no, there was something else to the right. What's this guy? Of course, there's a little bit over here where make was also waiting in do_wait. That's probably waiting for a child, or a child's child process, to exit, with the parent doing the wait. And so you can explore this and get a big clue about what's going on. Here's sh, the Bourne shell. There's as, the assembler. Let me bring up... I think I ran a sleep command in here. That's sys_nanosleep. That was over here. This one caught... I ran the sleep command at the command line, sleep 3, for three seconds, and you can see sleep, and it's just in nanosleep in total for three million microseconds, or three seconds. So I'm just sanity-testing my own visualization, making sure the numbers add up. So this is great: off-CPU flame graphs. I can retire as a performance engineer; I've just solved everything. I can do off-CPU flame graphs, I can solve everything... except it doesn't work that well. If you have a look at a lot of these code paths, like sshd, for example (this is on my blog, so you can look at it later), sshd says: I'm blocked in sys_select, do_select, poll, and then that's it; now we're in the kernel scheduler. That's not very illustrative. It doesn't tell me what it's really blocked on, because I've gone into... I'm waiting on a file descriptor. So what am I actually waiting on here? I'm waiting on another process. What happens is that another process, which is at the other end of the file descriptor, will do its thing. It will then do a wakeup, and then I'll go back on CPU and read from the file descriptor. Now, the wakeup is a kernel function, so I could trace the wakeup as well and get information on that wakeup. Enter the next visualization, which I haven't blogged yet, because I only did this in the last few days; this is brand new. Now I'm taking an off-CPU flame graph and, on top of it, above the gray line, I'm pasting the wakeup stacks.
So you can see: I blocked on this code path, and then this is the thread that did the wakeup for me. So it gives you the next level of information. Let me just try to bring something up quickly. What have we got in here? I was looking at sshd earlier, so I can use Ctrl-F to search. Let me just switch this over. There's an sshd code path, so I click on that. So: sshd. Okay, we know it went into poll, because it's waiting on a file descriptor. Now I can see (I've only done five stack frames here) poll wake: it was woken up by the TTY receive buffer path, which shouldn't be a big surprise, because sshd is waiting for TTY I/O. Great. Except there's kind of a problem: now we go into kworker/u16:1. So now you discover the next problem, which is always the case with software engineering: you think you've solved it, but you haven't. That is another thread, and it's blocked on something else, so you need to go to the next level of wakeups. In fact, I can explore this a little bit. That was kworker/u16:1, so I just search for u16:1. It's down here. And kworker/u16:1 blocked in ret_from_fork; it's blocked on these guys, waiting for these commands to finish, like the cc compile and the Bourne shell. And it's because I'm doing a Linux build over SSH, so you have these smaller commands running, generating output, and that's being printed to the screen over an SSH session. By navigating through this wakeup flame graph, I can see not only what I blocked on in the off-CPU stack, but its wakeup, and I can eventually get around to its wakeup's wakeup, and so on. Awesome. All of this information is frequency-counted in the kernel: I was able to frequency-count not just the off-CPU stack trace, but also save the wakeup stack trace, associate it with the off-CPU stack trace, and frequency-count that. I've not been able to do that before with any other tracer; it's possible with eBPF and BCC because they have more capabilities, and I can encode a lot of this in there. What I want to do eventually is turn this into a chain graph, where I can paste wakeup stack on top of wakeup stack, so you can go all the way to metal, because metal wakes up everything. Then you'll see who woke up whom, who woke up whom, and then who woke me up. And I'm not the only person who thought this would be a good idea: I know some of the engineers in gaming have similar issues, because they really care about performance and frame rates when you're playing computer games, and it's not just what I'm blocked on, or the first level of wakeup; you have to walk all the wakeups to fully understand it. So I'm really excited, because that's one thing it solves, but BCC and eBPF solve lots of other things. Just to give a couple of demos: there's opensnoop. On the IO Visor BCC GitHub repository, there are a lot of scripts, and the way I've been sharing them is with a text file for each of the scripts, so things like biolatency and its example file. Biolatency does histograms of block I/O latency, aggregating that in the kernel for efficiency, so it's only printing out the summary at user level. And this starts to look similar to other tracers like DTrace and SystemTap. It is. Some of the things we've done in other tracers before can now be implemented, since eBPF is part of the kernel. And I think I can run a couple of these at the command line just to finish.
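For a flavor of what a biolatency-style tool looks like, here is a hedged sketch of an in-kernel latency histogram using BCC. For simplicity it times vfs_read() with a kprobe/kretprobe pair rather than the block layer functions the real tool instruments; function names in the C snippet are illustrative:

```python
#!/usr/bin/env python
# Sketch: in-kernel log2 latency histogram, biolatency-style.
# Only the summarized histogram crosses to user space.
from time import sleep
from bcc import BPF

prog = """
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);         // PID -> entry timestamp
BPF_HISTOGRAM(dist);               // log2 latency buckets

int on_entry(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 ts = bpf_ktime_get_ns();
    start.update(&pid, &ts);
    return 0;
}

int on_return(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 *tsp = start.lookup(&pid);
    if (tsp == 0)
        return 0;                  // missed the entry probe
    u64 delta_us = (bpf_ktime_get_ns() - *tsp) / 1000;
    dist.increment(bpf_log2l(delta_us));
    start.delete(&pid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="vfs_read", fn_name="on_entry")
b.attach_kretprobe(event="vfs_read", fn_name="on_return")
print("Tracing vfs_read latency... Ctrl-C to end.")
try:
    sleep(99999999)
except KeyboardInterrupt:
    pass
b["dist"].print_log2_hist("usecs")  # ASCII histogram of the buckets
```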
When you're doing some of those analyses, what overhead did you see collecting those statistics? It's pretty low. eBPF is JITed, but also, the way I'm writing these, I'm using in-kernel maps and aggregations as much as possible, so I'm only printing out summaries. So I'm pretty impressed so far. We'll see how it stands when we start to do user-level stack traces as well, because that's a bit more CPU work. But for the off-CPU tracing, just very rough numbers: going up to 100,000 events a second, I was getting about a 3% CPU tax, and that's not too bad. I mean, I do need to understand that better. I forgot to do the sysctl -w to turn on the BPF JIT (net.core.bpf_jit_enable); yeah, there's a BPF JIT to enable, so it should actually be lower than 3%. But the overhead is really impressive so far. So... where's the tools? Yep, question? That makes sense to me. So I'll let Brendan look at the... the error. Maybe your laptop... Sudo. Oh, sudo? Oh, that's "operation not permitted". I see, you're running it. I can't type sudo on his laptop. I'm serious. Okay, great. So: really quick to run, really quick to shut down, frequency counting of in-kernel stack traces. The question is, what about things like Graphite? So in reality, this is awesome, but even at companies like Netflix, there are probably only going to be a few of us who write BCC tools and use them directly; a lot of people will use them from a GUI. We're developing our own GUI, Vector, where we'll be able to get these BPF metrics in. And the same will be true for lots of people: you have your own performance monitoring product or analysis product with a GUI. What BPF means is that the kernel can now do all the cool things, like heat maps and histograms of latency, in-kernel. What you need to do, if you're the developer of such a GUI, or a customer, is get the GUI to access that from the kernel. You can write BPF programs directly in C, and in the kernel source code there are some examples under samples/bpf. But if you want to do this easily, I advise the BCC Python front end; it makes things a lot easier. A lot of the tools I'm now publishing are written using BCC because it's quick, and that's the front end that's been developed. And if Python isn't your thing, there's an underlying C library, and there's work on Go bindings for a different use case, the networking IOModules. So it's pretty versatile. Question? Yeah, the x-axis on the on-CPU flame graph is not time, it's population, and the left-to-right ordering has no meaning. That's different from what a lot of people expect. What I'm doing is sampling stack traces at 99 hertz, throwing them all up, and re-sorting them to maximize merging, so that you can see the shape more clearly. And that's it: the left-to-right ordering doesn't matter, it's just sampled at 99 hertz. The off-CPU flame graph works a bit differently: there I'm tracing the scheduler events, measuring the time you're off CPU, and then drawing that with the widths based on the time, and again reshuffling things to maximize merging. The widths mean how many samples: if I had 100 samples in one function and one in another, it would be 100 times as wide. So it is useful information. Visually, the wider it is, the more it's on CPU, and the other axis is the stack depth as you walk down. I think we're... okay. Yeah, I guess we're at time, but we'll keep taking questions if you have them. We'll take a few more questions, yes. Can you repeat the question, by the way?
If you had a futex that was never woken up, can you figure out which process you were waiting on? One where the other side never does the wakeup. Sometimes there are things tracing can't do, and that includes: in the absence of an event, can I figure it out? So if another thread never does the wakeup, well, I can't trace the wakeup, because it never happens. If you're premeditated, you could trace the futex acquisitions, so you'd have a log. It's kind of tricky. That sort of thing is better suited to a kernel debugger, if you can interrogate the kernel as to... Right, a lot of the kernel debuggers are not real-time. Well, that's an opportunity for you to join the development community and post that as a request, and we can figure out how to... Maybe it's possible; I don't think we've ruled it out. Yep, because you're right: you could have a log of who grabbed things and the time each was last grabbed. It's not the first sort of use case I'd go to tracers for, because tracers are usually based on events, and the absence of an event makes things a bit hard. But it may be a way to shed some light. Right, and as soon as you're talking about mutexes, memory allocation, scheduler events, we're in the territory of 100,000 events a second to a million events a second, and you really care about overhead. That's part of the reason eBPF is making more things possible: since it lowers the overhead, we get to try things like the wakeup flame graphs, which would start to get prohibitive using other tracers. There was another question up there. I've used every tracer. I've used every tracer there is. ktap was great: it got into staging, it was then asked to support eBPF so that it could be a front end, just like Python BCC is, and then development on ktap came to a standstill. And I liked ktap; I thought it was fairly innovative, and I wish it had made it into the kernel, but it did need to integrate with eBPF as the back end, and so ktap seems to have stopped. As for other tracers, SystemTap has done a lot of great work in terms of the front end, and also support for USDT and user-level things. It doesn't appear that it's ever going to be mainline, but if SystemTap were to use eBPF as a back end, just like BCC is doing, it could be a different story. And also, as Brendan said, one of the big values of eBPF and this work is the verifier. I'll write scripts in other tracers that I can write really quickly, and then they panic my system ten minutes later. Here, I write really quickly, and it just says: no, I will not let you dereference that. It's more short-term pain, because I have to make sure it's right, but it saves me the long-term pain, because by the time eBPF will run something, it will not panic the system. So that's another difference, and that's why we want a lot of the other tracers to use eBPF as a back end: because it's safer and has lower overhead. The way it gets compiled, it does depend on kernel headers, but not the sources or debug symbols. I guess one more question, and then we'll break, and you can ask us questions in the break. One more question? All right, I guess it's time to break. Thank you. Thank you.