Welcome, everybody, to the first day of the conference. Today we're going to be kicking off our embedded track with Joel Fernandes from Google.

Hi, thanks for the nice introduction. Today we're going to talk about BPFd, which is a project I started to make it easier to run BCC, a powerful tracing tool set, on an embedded remote target. I'm going to start off with a demo, then build up towards BCC, talking about Linux tracing in general, how that builds up to BCC, and what BPFd is; hopefully the demos will make it clear how I'm using this stuff. A little bit about me first: I work on the Android kernel team at Google, so we ship the kernels on all the Android devices out there. One of my two areas is the Linux scheduler, which is the piece of code in the kernel that takes care of placing tasks on different CPUs, load balancing tasks between the CPUs, scaling the CPU frequency, and things like that. The other thing I work on is tracing, which is ftrace, perf, and all these powerful tracing technologies available in the Linux kernel. And BCC is the latest thing I've been working on. So I'll start off with a demo of a tool that lets you look at all the file reads and writes happening across the system. This also shows how powerful BCC is: because we're in the kernel, we can't miss anything; anything that involves the kernel itself, across all processes, we will see. The example tool is called filetop. Just to show my setup: I have an embedded target, a HiKey960 board with the HiSilicon Kirin 960 SoC, running Android and connected over Wi-Fi to my laptop, so everything I'm going to be doing is over the Wi-Fi network; we will be tracing the Linux kernel remotely over Wi-Fi. Before I run filetop, I had to set up some environment variables: where my kernel sources are, how I'm communicating with the target, what the target architecture is, and so on.
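For illustration, the environment setup described might look something like the following sketch. The variable names and paths here are assumptions for illustration, not necessarily the exact ones BCC/BPFd use:

```python
# Hypothetical illustration of the environment a remote BCC run needs:
# where the host-side kernel sources live, the transport to the target,
# and the target architecture. Names and paths are assumptions.
env = {
    "BCC_KERNEL_SOURCE": "/path/to/target-kernel-source",  # host-side sources
    "BCC_REMOTE": "adb",                                   # transport to target
    "ARCH": "arm64",                                       # target architecture
}
print(" ".join(f"{k}={v}" for k, v in env.items()) + " ./filetop")
```

The point is that everything compiler-related stays on the host; only the transport setting tells BCC how to reach the target.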
So I run filetop now, and you can see some warnings when the tool runs; I don't want to go into those now. But right off the top, you can see that the Wi-Fi service in Android is continuously reading from a Wi-Fi stats file to get information about the Wi-Fi interface. So you can see that you have this global view of what's happening on the Android system. Now I'm going to open the Contacts app. You can see it did a lot of reads; it's reading the contacts database there. And I'm going to create a new contact, just like you would on a regular Android device. I just created a contact, and you can see there's an entry here showing that the contacts database was written to, about 104 kilobytes in total. So it's nice to be able to do this over Wi-Fi and get that kind of observability. You can see there's not a lot of data transfer happening, because all the data collection is happening in the kernel. As these thousands of events fire, the data is collected and aggregated in the kernel and only periodically sent to user space, in this case over Wi-Fi. All of this builds on a lot of technology that's not visible: I just ran filetop, but there's so much going on under the hood that you do not need to know about. That's the beauty of this. So broadly speaking, there are six different signals in the kernel that you can monitor. This is not related to BCC or eBPF or anything; it's just tracing in general. First of all, you have static tracepoints, which are points in the kernel that are inserted at compile time; when enabled, a static tracepoint emits some information. They're called static because they're placed at compile time. These are the trace events that ftrace exposes.
Then you also have dynamic tracepoints, which are probes you can insert at runtime, at points you don't know in advance. That's actually how filetop works: it dynamically instruments vfs_read and vfs_write in the VFS layer. These tracepoints are not decided ahead of time; that's why they're called dynamic, and the kprobes infrastructure in the kernel is what allows you to do that. Then you have the user-space equivalent of that: uprobes lets you do the same for user-space applications, and again it's dynamic, so you don't need to decide ahead of time. Then you have something called USDT, which is static tracepoints for user space. And the last two are perf events. You have your performance counters; those are also signals you can attach hooks onto, like cycles and cache misses. Finally you have the sampling signal, which lets you run some profiling code periodically, a certain number of times a second. That's how perf profiling works: you have an interrupt that fires 100 or 500 times a second, and you can run some code that observes the system. So these are the various signals we can build on top of to get the observability we need. Now let's come to BPF. BPF lets you easily write programs that run when any of those signals trigger. The way it works is you have a BPF program that's compiled to BPF bytecode, which is loaded and verified by the kernel. Once loaded, it sits in the kernel ready to run, and you can attach it to any of the signals I just mentioned. The program runs in the kernel and collects data the signal might have emitted, or you might just want to read some kernel data structure, to get the information you're trying to observe. Finally, the BPF program in the kernel can output that data in different ways using maps.
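The six signal types above can be summarized alongside the bcc Python calls that attach a BPF program to each. This is a rough sketch: the method names come from the bcc Python API, while the example events in the comments are illustrative, not taken from any specific tool:

```python
# Sketch: the six tracing signals, mapped to the bcc Python call that
# attaches a BPF program to each. Event names in comments are examples.
SIGNAL_TO_ATTACH = {
    "static tracepoint":  "attach_tracepoint",   # e.g. sched:sched_switch
    "dynamic kprobe":     "attach_kprobe",       # e.g. vfs_read
    "dynamic uprobe":     "attach_uprobe",       # e.g. malloc in libc
    "static USDT probe":  "USDT.enable_probe",   # user statically-defined tracing
    "perf counter":       "attach_perf_event",   # e.g. cycles, cache misses
    "timed sampling":     "attach_perf_event",   # software clock, N samples/sec
}

for signal, method in SIGNAL_TO_ATTACH.items():
    print(f"{signal:18} -> {method}")
```

filetop, for instance, uses the dynamic kprobe row: it attaches to vfs_read and vfs_write at runtime.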
A map is a data structure that lets you aggregate information as you get it; for example, a histogram is one kind of map. And there are various other ways to emit information from a BPF program. So BPF is the technology that lets you do all that. Now let's come to BCC. What is BCC? BCC puts all of that together. It sets up all of these signals I was talking about, the dynamic ones especially, and activates them. It takes the program that does the observability, like filetop in the demo I just showed you, and builds a BPF program out of it. And it automatically attaches that program to the signals required by the tracing tool. It also has a set of libraries, written in Python and so on, that read the maps and present the information to the user periodically. BCC makes it easy: the beauty of it is that it does all this powerful stuff behind the scenes without you even knowing. When I ran filetop you didn't see any of that happening; you just said "filetop" and it showed you all this information. There are a hundred-plus tools, and you don't need to know how they work, you just run them. That's as opposed to running tools like perf, where running the commands and interpreting the information can be more complicated. So BCC does all the powerful stuff for you, but it's really very easy to run. Feel free to interrupt if you have any questions so far. Okay. Traditionally, BCC has been run on the same device where you have your kernel sources and your Clang stack and all of that, with the kernel running on the same machine where you run the tracing tool. That would be fine, but unfortunately you also need kernel sources to be able to run the tools; as I showed you, to run filetop I needed to set an environment variable pointing to the kernel sources.
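To make the histogram-as-a-map idea concrete, here is a minimal pure-Python sketch of the power-of-two bucketing that BPF histogram maps use. In a real tool, the equivalent aggregation runs inside the kernel, and only the small bucket table crosses to user space, which is why so little data moves over the wire:

```python
# Minimal sketch of log2 histogram bucketing, the aggregation scheme
# BPF histogram maps use. This runs in plain Python for illustration;
# in a real tool the equivalent runs inside the kernel.
from collections import defaultdict

def log2_bucket(value):
    """Return the log2 bucket index for a non-negative integer."""
    return value.bit_length()  # bucket i holds values in [2**(i-1), 2**i - 1]

hist = defaultdict(int)
for latency_us in [3, 5, 120, 130, 4000]:   # made-up measured latencies
    hist[log2_bucket(latency_us)] += 1

for bucket in sorted(hist):
    lo = 0 if bucket == 0 else 1 << (bucket - 1)
    hi = (1 << bucket) - 1
    print(f"{lo:>6} -> {hi:<6} : {'*' * hist[bucket]}")
```

However many thousands of events fire, the histogram stays a handful of buckets, so the cost of shipping it to user space is constant.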
This requirement doesn't work well for a cross-development model, where your target is remote and your development tools are on a different machine; it makes it very difficult to run these powerful tools. So what I did is separate the whole stack into two parts: the stuff that doesn't need to run on the device you're tracing, and the stuff that does. The stuff that needs to run on the target is just the low-level, non-compiler-related work: loading a BPF program once the bytecode is available, or attaching a loaded BPF program to a tracepoint, for example. These things are low level and not related to compiling the tool or compiling the program that produces the BPF bytecode. The way I got that to work is I started this project, BPFd, which encapsulates all the low-level things BCC needs, and BPFd runs on the target. So in the demo I just showed you, BPFd is running on the target and BCC is running on my laptop. Just to summarize why we want to do this: we don't want to sync kernel sources to our remote target. We don't want to download them onto the target; we might not have space for that, or the sources might get out of sync with the target. It's really unnecessary. With this kind of model, you don't need to cross-compile Python, Clang, and all of that stuff; it can all run on your host, quite easily. And this fits the cross-development flow where, as I was saying, you build on your host and run on your target, not the other way around. Also, your host machines are more powerful; they typically have more processing power than battery-powered devices, so it makes sense to do the compilation of the tools on your host. And many embedded targets don't have symbols in their executables, things like that.
On the host you have a greater chance of having that information, and in many cases the tools need the symbol information to produce meaningful results. So those are all the reasons why we want to do it. Any questions so far? No? I'll move on and show you some more demos. I just showed you filetop; this next tool is hardirqs.py, which summarizes the amount of time spent in interrupt handlers over a specified period, in this case 10 seconds. So every 10 seconds it's going to show me the total time interrupt handlers spent on the CPU. That's useful because you might run into real-time issues or latencies: if interrupts are spending too much time, then as a designer you need to make sure your interrupts are not taking up too much time. So these are all the interrupts that spent time in the last 10 seconds; the second column is in microseconds. I'm going to do some UI activity and show you that the GPU interrupts are now going to take a lot of time. You can see that Mali here took 141 milliseconds out of the 10 seconds. So that's hardirqs.py. Just to show you what the tool itself looks like: it has a C program embedded in it that is compiled into BPF. Oh, shoot. Yeah, no problem. The way this tool works is it uses kprobes to instrument the handle_irq_event_percpu function in the kernel, and it attaches the count_only function. There are actually two modes: you can either count or you can time, and I think the default is to time. So for the mode in which it times, let me show you how that works, because that's what I demoed.
For that, trace_start is executed whenever handle_irq_event_percpu runs, and it makes a note of the starting timestamp; trace_completed runs whenever handle_irq_event_percpu finishes, takes the difference from the stored start time, and adds that to the histogram I showed you. Some other tools I wanted to show you: this one I won't be demoing, but it's useful to see how much time was spent doing block I/O operations. It shows you, every so often, a summary of all the processes that were doing block I/O and how long the operations took. These are biotop and biosnoop there in the slide. The difference between them is that biosnoop shows you an entry per event: it doesn't refresh every so often, it shows you a tracing type of output and doesn't clear the screen, whereas biotop clears the screen. Both tools show you the latencies of the block I/O operations, so this can give you an idea of whether the disk I/O is slow, for example. Another tool is cachestat, which is useful for monitoring what's happening in the kernel's page cache, whether you're getting page cache hits or misses. If you have too many misses in the page cache, that can indicate what's causing a performance issue. So I'll show you a demo of cachestat. I'm going to remote into my target board over ADB and run a while loop that reads a 100 MB file. So there's a 100 MB file, and I write a loop that reads it. Then, in another window, I run cachestat every two seconds. You can see in the third column that all the cache accesses were hits, because after the first read the 100 MB file was cached in the page cache. Now I'm going to drop all pages in the page cache, and on the next occurrence of cachestat you can see there were a ton of misses; that's because the page cache had to be repopulated with the file.
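Going back to the timing mode of hardirqs for a moment: the entry/return pattern with trace_start and trace_completed can be sketched in plain Python. In the real tool these are BPF C functions attached via kprobes, and bpf_ktime_get_ns() supplies the timestamps; here the timestamps are passed in by hand:

```python
# Sketch of the kprobe entry/return timing pattern hardirqs uses:
# an entry probe records a start timestamp keyed by IRQ name, and a
# return probe computes the delta and accumulates it per IRQ.
start_ts = {}          # maps IRQ name -> entry timestamp (ns)
irq_total_ns = {}      # maps IRQ name -> accumulated handler time (ns)

def trace_start(irq_name, now_ns):
    """Entry probe: remember when this handler started."""
    start_ts[irq_name] = now_ns

def trace_completed(irq_name, now_ns):
    """Return probe: accumulate how long the handler ran."""
    begin = start_ts.pop(irq_name, None)
    if begin is not None:           # ignore returns we never saw enter
        irq_total_ns[irq_name] = irq_total_ns.get(irq_name, 0) + (now_ns - begin)

# Two simulated interrupt handler invocations:
trace_start("mali", 1_000); trace_completed("mali", 4_000)
trace_start("mali", 9_000); trace_completed("mali", 10_500)
print(irq_total_ns)   # {'mali': 4500}
```

In the real tool the per-IRQ totals (or a latency histogram of the deltas) live in a BPF map, which is what the user-space side reads out every interval.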
Now, the fourth and fifth columns of cachestat's output are not working, because the tool is reading files on my laptop instead of the remote target to populate them; but the other columns are valid. Again, running this tool was very trivial: I just had to run it, I didn't have to specify any complex arguments or anything like that. It's very easy to use. If you want to do more complex stuff but don't want to write your own program, there's a set of tools called the multi-tools. One of my favorites is the trace multi-tool. The multi-tool will automatically generate an eBPF program for you and compile it with BCC based on the arguments you give it, and you can do a lot of powerful things with it related to kprobes and so on. In this slide, I want to see all the file opens happening on the system. There's a function in the kernel called do_sys_open, which receives arguments from user space to open files, and the trace tool uses kprobes to instrument do_sys_open behind the scenes. So I'm telling the trace multi-tool that I want to instrument do_sys_open and print the second argument passed to it; the second argument is the file name. I can just do that, and it's already showing me output. I think that's the while loop I wrote in the other window that's continuously running, so I stopped that. Now, if I do some activity in Android, it shows me all the opens that are happening, SurfaceFlinger and all that. Some files are being opened a lot of times; it might be better to open them just once. That might give you an idea that maybe something unnecessary is being done, and this tool will tell you: yeah, you're doing this and it's not necessary. So it's very useful for that kind of performance debugging. I won't be running a demo of this next one; it's a tool called runqlen, which uses the perf sampling mode.
Every time a perf sampling event occurs, the BPF program runs, looks at the run queues on all the CPUs, and adds information to a histogram about how many times each run queue length occurs; so in the end the histogram has that information. This gives you an idea of whether there are too many tasks running on the system, causing performance issues. You can either have too many tasks running on the system or too few CPUs: if you have too many tasks in the run queue, the run queue is overloaded and you might have performance issues. So this lets you get at that information. Many things are not working right now, but they are very close to working. I call these the boring issues, because there's a clear path to getting them to work; they just need some work to be done. That involves user-space dynamic and static tracing, symbol lookups, and new tools, as well as tools that need to read things on the remote target but currently read them on the local machine, my laptop in this case. There's a full set of issues I've created in the issue list, and I welcome any contributions there. Then we have some more interesting issues where it's not clear how to solve them. One issue is that BPFd itself, with the reads and writes and other system calls it makes, can cause a lot of tracing activity of its own, which sometimes interferes with the tracing tool's output. So I'm thinking maybe we can blacklist the BPFd process from being traced somehow. Some tools can generate a lot of output, so we have to reduce the amount of data being sent somehow, maybe by batching the data together, things like that. Perf output polling is another issue we're looking at, and logging is also not working. So there are a couple of interesting issues where it's not clear how to solve them, but we have to solve them. That's all I had.
This project is open source, so you can go to GitHub, download it, and contribute to it. I wrote an LWN article that goes into much more detail about how it works, and Brendan Gregg has set up nice resources on how to run these tools, so I would encourage you to go check those out. I guess we have a lot of time, so we can just do a Q&A and any discussion.

You said that this works around some of the issues that you often have tracing remote targets, like all those problems...

Usually, as an embedded developer, you have the kernel sources available because you build the kernel on your host. So you just point BCC at them: with the work that I did, it's now possible for you to set an environment variable pointing at your kernel sources, and BCC will look at the kernel sources on your host, not on the target, and use them to build your BPF program. Debug symbols are not needed in the general case to run a lot of these tools; they're needed to interpret the output sometimes. There are tools that do profiling of stacks and things like that, and then the output will just have instruction pointers if you don't have symbols. So for things like that you need symbols on the host, but otherwise you don't really need symbols.

It seems like this has applicability beyond embedded, so I'm curious if you had any intentions for authentication, authorization, and securing the connection between the client and the host, and also what you do in a failure mode. Say I've launched something against my target and my Wi-Fi connection drops or something like that; is my BPF program going to continue running on that target, or does BPFd need a hand?

Yes. On the first question, about the other ways you can use this: the way it's been written, the architecture right now is very plug-and-play.
For Android I have an ADB module that does the Android-specific stuff, but BPFd itself is written with a standard I/O interface: every command is one line, and you send that to it. You can run it locally and just give it a command string and it will respond. To load a program, for example, we give it the bytecode in base64 encoding. Somebody else was telling me they wanted to use it on servers over SSH, so that's definitely possible to add, very easy to add, I would say. The other question was about failure modes. Right now it's very early stage, so we haven't really thought about that, and any kind of cleanup will have to be done. For Android, what happens is that when the ADB connection dies, the processes that were started remotely are also killed automatically, and all the resources get freed that way. But yes, cleanup will have to be thought about; there are a lot of those low-hanging-fruit issues that need to be worked on. So feel free to pull the code on GitHub and ping me if you want to work on this stuff and contribute.

Right now it's just a single device, but potentially that could be done; it's just a matter of somebody taking it and applying it to their use case. I don't see a reason why that cannot be done.
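The one-line command style described above can be sketched as follows. Note that the command name and framing here are made up for illustration; this is not BPFd's actual wire protocol, only the general idea of sending binary BPF bytecode as base64 over a text channel:

```python
# Illustrative sketch of a line-oriented command protocol like the one
# described for BPFd: each request is a single line, and binary payloads
# such as BPF bytecode are base64-encoded so they survive a text channel
# (ADB, SSH, a local pipe). Command name and framing are hypothetical.
import base64

def encode_load_command(prog_name, bytecode):
    """Build a one-line 'load program' request with a base64 payload."""
    payload = base64.b64encode(bytecode).decode("ascii")
    return f"BPF_PROG_LOAD {prog_name} {payload}"

def decode_load_command(line):
    """Parse the one-line request back into (name, raw bytecode)."""
    cmd, name, payload = line.split(" ", 2)
    assert cmd == "BPF_PROG_LOAD"
    return name, base64.b64decode(payload)

line = encode_load_command("filetop_probe", b"\x7fELF...fake-bytecode")
name, code = decode_load_command(line)
print(name, len(code))
```

Because every request fits on one line of plain text, the same daemon can sit behind ADB, SSH, or any other byte pipe without changes, which is what makes the transport layer plug-and-play.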