Good afternoon again. This presentation was finished something like half an hour ago, because the topic is really recent. So, BPF is everywhere. It's getting used more and more, like in security, and I saw today a Twitter post that Dave Miller had retweeted, where people want to have BPF on their other operating systems, so that it doesn't become something that ties you to Linux. You want to have the same bytecode running on the other side, and that's something that's possible; you can even have this thing running on SmartNICs. So, I mean, it flowed the right way, let's say, after all those years. It's flexible, it has improved performance, there's lots of innovation, but then we have a problem as a support organization: how are we going to support this thing? It's popping up everywhere, and up to some one or two weeks ago it was just off the radar. You would not see it easily in a profiler, you could not map it back to source. So this presentation is about all the things that try to solve this problem: to make BPF observable and auditable, so that you can use it and support it. So, BTF, the BPF Type Format. BTF is a type format, but now it has more than that. It has an extension, a second ELF section, that has line number information, and you can ask the kernel for the jitted binary and the original bytecode, so that you can insert this into a profiling session or a tracing session, and later on do post-processing analysis using this information that the kernel provides to you. It has its origin in the equivalent compact C type format (CTF) that came with Sun's DTrace, and the story with DTrace is interesting because you don't have the header files, so you had to have the definitions for the structures in some other place. That other place was the compact C type format.
It's a subset of DWARF; you cannot do with it everything you can do with DWARF, but it's always available. All your programs have this as a section. It's kind of like CFI, the call frame information, which came from the debugging information area to be always available, so that you could use that kind of thing for exception handling in C++. So it starts as debugging information, but it can be used for other things. You have sys_bpf, the BPF syscall, which predates BTF, of course, and in the kernel sources, in tools/lib/bpf, you have libbpf, which now has code to notice BTF in a BPF program that you built. One of the steps from the source all the way to the thing running is building the object file, and Clang will generate the BTF information there, or, as I'm going to show you, there are other ways for you to generate this BTF information; taking existing debugging information and converting it was the first method. Then there is BPF_BTF_LOAD: you load the BTF, and afterwards you associate with the BPF program the type information that was found in its object file. The kernel, as it does with BPF programs, validates this information; there are several checks that we're going to see. And tools can use BPF_OBJ_GET_INFO_BY_FD to get the information back and then use it in any way they like. The first thing you may think about using it for is pretty-printing BPF maps. A BPF map is a structure that is used to communicate between the user space tooling side and the kernel BPF programs. You can have a map that is accessed by multiple BPF programs inside the kernel and by user space, and they coordinate using these things. What kinds of maps are there? There are hash tables, there are maps of maps, there are arrays, there are all sorts of things. You get kernel data structures and make them accessible to BPF programs as a BPF map.
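To sketch the pretty-printing idea, here is a toy decoder in Python. The layout is an assumption for illustration: a 32-bit key and a one-byte value, like the struct syscall map discussed later; the real tooling would read the actual types from BTF.

```python
import struct

# Hypothetical raw bytes as a map dump would show them for one entry:
# a 32-bit little-endian key and a 1-byte value (struct syscall { bool enabled; }).
raw_key = bytes([0x23, 0x00, 0x00, 0x00])   # syscall number 35 (nanosleep on x86_64)
raw_value = bytes([0x01])                    # enabled = true

def pretty_print_entry(raw_key, raw_value):
    """Decode using the (assumed) types BTF would describe: u32 key, bool value."""
    (key,) = struct.unpack("<I", raw_key)
    enabled = bool(raw_value[0])
    return f"key = {key}, value = (struct syscall){{ .enabled = {str(enabled).lower()} }}"

print(pretty_print_entry(raw_key, raw_value))
```

The point is only that, once the type of the key and the value are known, raw map bytes become self-describing.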
Then you have bpftool, which you use to query the kernel about what programs are loaded, what the size of these programs is, to ask for the jitted code, etc. You can enumerate the maps that are in place, then ask for the value of the first one or of a specific entry, and then pretty-print these elements using the BTF information. So, BPF maps. This is an example of a map in the perf trace BPF work that I'm doing, to collect pointer contents and to do filtering using BPF. So you have this struct syscall, which is in the BPF program, and the only information that it has as of now is whether the syscall is enabled or not. So when you do 'perf trace -e open,read', it will get this table that has 512 entries, and the entries for open and for the other syscalls you selected will be enabled. And this bpf_map construct is in the perf sources and can be used in BPF programs when you use them with perf trace. I did it this way because, for you to associate the types of the value and of the key, you get it right here: int is the key type and struct syscall is the value type. If you expand this, you're going to see struct syscall, bpf_map, and then the map type, key size, value size, and the number of entries. That's the way you create a BPF map, be it in BCC or in perf trace. So bpftool queries the kernel about programs, about maps, about map contents, and you can look up values and even add things to the map, etc. So perf trace uses BPF to collect syscall pointer payloads. As I said before, the tracepoint only has six integer arguments, and that's no use for me; most of the time you want to see the name of the file, not the pointer to where it is.
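A toy model of that 512-entry syscalls table, just to make the semantics concrete; the syscall numbers and the Syscall class here are illustrative, not perf's actual code.

```python
# The map described above: an array of 512 entries, indexed by syscall number,
# where the value (struct syscall) for now only says whether it is enabled.
NR_ENTRIES = 512

class Syscall:
    def __init__(self):
        self.enabled = False

syscalls = [Syscall() for _ in range(NR_ENTRIES)]

def trace_enable(nr):
    """User space side: 'perf trace -e ...' flips entries on."""
    syscalls[nr].enabled = True

trace_enable(2)    # hypothetical number for open
trace_enable(0)    # hypothetical number for read

def should_collect(nr):
    """Kernel side: what the BPF program checks on each sys_enter."""
    return syscalls[nr].enabled

print(should_collect(2), should_collect(1))
```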
And it uses BPF maps for filtering as well. For telling it, don't trace the tracer, to avoid the feedback loop, for instance; or, if you asked it to do system-wide tracing and you notice that something is generating a lot of syscalls and you are not interested in that, for filtering out that specific PID. So you create a hash table and put in it the PIDs that you don't want to be tracing. And for telling it what to collect. So in that struct syscall, for now, you just have whether it's enabled or not, but later on you'll be telling it that the first argument needs to be copied and that it is a string. And if you're going to emulate strace, for instance, for write or read you want to take a snapshot of the start of the buffer and then print it, so you're going to say how many bytes you want to copy when you get to that specific syscall. And for syscalls that take a struct, BTF will provide a way to know how many bytes to copy and, later on, how to print it. And even for filtering as well: you can get the offsets and such and then create the BPF program. So, putting a BPF program plus maps in place: if I do 'perf trace -a', which is for the whole system, all CPUs, and '-e nanosleep', and do a 'sleep 100' just to have something sitting there, then, looking only at the nanosleep syscall, because it's configured to do so, besides looking for the nanosleep syscall it will add an event, which is the BPF program that will collect the things. So with that running, you do 'bpftool prog' and take the last six lines. I took the last six because there are two BPF programs that perf trace started; there are more before those, used by other components of the system, but I'm interested only in those two. So this has some information that's interesting. The jitted size of sys_enter, at this stage of perf trace development, is only 381 bytes, it's using 496 bytes of memory, and sys_enter is using three maps.
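The filtering and collection logic just described can be sketched as a toy in Python; the PID, the syscall numbers, and the per-syscall copy specification are all made up for illustration.

```python
# A hash of PIDs to ignore, plus, per syscall, which pointer argument to
# snapshot and how many bytes of it to copy.
filtered_pids = {4242}            # e.g. the tracer's own PID, to avoid feedback
copy_spec = {                     # syscall nr -> (arg index, bytes to copy)
    1: (1, 64),                   # e.g. write: snapshot first 64 bytes of buf
}

def on_sys_enter(pid, nr, args):
    """What the BPF program would decide on each syscall entry."""
    if pid in filtered_pids:
        return None               # don't trace the tracer
    spec = copy_spec.get(nr)
    if spec is None:
        return (nr, None)         # enabled, but nothing extra to copy
    arg_idx, length = spec
    return (nr, args[arg_idx][:length])

print(on_sys_enter(4242, 1, (3, b"hello world")))  # filtered out
print(on_sys_enter(1000, 1, (3, b"hello world")))  # payload collected
```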
There is a map for the syscalls, there is a map for filtering stuff, and the number of maps may grow. And sys_exit is using just two maps, two that are shared with the enter one. Let's say you are filtering open: you want to discard the enter for open and the exit for open, so both have to look at the same map. And if you call 'bpftool map' to look at those three maps that are being used by the two BPF programs, you're going to see the name, augmented_syscalls, which is a perf event array type of map, which is the way that BPF can insert things into the perf ring buffer to be processed by perf trace, the tooling on the user space side. You have the array, which is for the syscalls, to tell which syscalls are enabled and which ones are not, and in the future to tell what to collect and how many bytes to collect for each of the arguments. And you have an extra map, which is a hash, for the PIDs that I'm filtering. If I do a lookup on it... is it there? No, it's not, so... To dump the contents of a map: first I look at what the number for nanosleep is, since that's the one that I enabled. For x86, and this comes from a file that is generated as part of building the perf trace tool, it generates a syscall table, the syscall table says that nanosleep is syscall number 35. So if you dump the contents of that map, 362, which is the syscalls one, I put an excerpt where in the middle there is key 0x23, which is 35. You see that this is not beautified. And the size of the key for an array, for some reason, is hard-coded to be 32 bits: if you ask the kernel for an array map with a key size of one byte, which is what I wanted, it fails. Discovering this was quite something; I could not imagine why it failed, I had to read the code to see. So, it's enabled. With BTF, this thing will be shown as just an integer on one side and a boolean on the other. And you can do a map lookup to close the gap. If you look at...
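Just to make that raw dump readable: 35 in hex is 0x23, and because array map keys are hard-coded to 32 bits, even a syscall number that fits in one byte shows up as four little-endian bytes. A quick sketch:

```python
import struct

NANOSLEEP_X86_64 = 35   # from the syscall table generated at perf build time

# Array map keys are 32-bit, so the key is packed as a little-endian u32.
key_bytes = struct.pack("<I", NANOSLEEP_X86_64)
print(key_bytes.hex(" "))          # how the raw, non-beautified dump shows it
print(hex(NANOSLEEP_X86_64))
```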
If you use 'bpftool map lookup' on the map with ID 362, with key 35, it shows key 0x23 and value 1. With BTF, here again, it would show this beautified. So, generating BTF info: how do you do that? A long time ago I wrote this tool called pahole, which is used to dump data structures using the DWARF information, so that you can see where the cache lines are, whether there are holes that you can use to reorganize, and people have been using it for a long time when they are doing kABI comparisons and things like that, or when they are trying to find space to insert a new field without changing offsets, making the data structure more packed. David Miller, because he works on the Sparc port of Linux, wanted to know about the internals of Solaris, and the internals of Solaris were available via CTF, the compact C type format: all the data structures of the kernel that you could use in a DTrace script were there. So he wrote a set of headers and a simple data dumper and sent it to me, and then I made the pahole code base debugging-format agnostic, so that it could work with DWARF or with CTF, and I wrote a CTF encoder as well, so that I could generate CTF from DWARF and then, from the CTF, pretty-print the data structure, do the same from DWARF, check that the two matched, and then everything was working. Nowadays LLVM, I think from version 8 onwards, generates it directly: when you ask for debugging information, it will generate DWARF and BTF. I don't think that's in all of the targets; it's when you ask for the BPF target architecture and for debugging information, building the thing for the BPF machine, that a BTF section will be there. So we had the encoder in pahole, which was the first way for you to generate a BTF section, to encode BTF: reusing the CTF code from pahole and making it generate BTF.
They took the CTF definitions, said this is not so good, I need this and that, and then they came up with BTF. So basically pahole gets DWARF, puts it into an internal representation, and encodes it into a new ELF section. Here's an example. You have test.c, with a structure, and then if you do 'gcc -g' for debugging, the generated file will be a file for x86-64 (I didn't specify a target) with debug info. If I then look at the ELF sections for the DWARF information, you have all of those, with their sizes, etc. If you just use pahole with this DWARF information, you're going to see an output like this; or, for the kernel, we could look at task_struct or any other data structure. It will say where there are alignment holes, etc. You use pahole with -J (capital J) to encode BTF, and if you add -V, for verbose, then you see how it encoded it. Everything is based on offsets in bits, not offsets in bytes, so that you can encode bitfields in the same way that you encode normal types. Because in DWARF, when you have a bitfield, you have to have the byte offset and the bit offset; so this is one of the techniques used to compress the debugging information. This verbose mode was the only way for you to see how it was encoding things. So the BTF section that's generated is that one, .BTF. And this is Clang 8 generating BTF. You see 'clang -target bpf', with -g and -c, and the end result says 64-bit LSB relocatable, eBPF, not x86-64 anymore, and there are three sections. The .BTF.ext section is not generated by pahole; that's where they insert the line number information. That's something that could even be done from the line number information from GCC, but it's better done by Clang, because it knows the offsets of the equivalent source code lines in the generated binary, the BPF bytecode.
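The bits-versus-bytes point can be shown with a tiny sketch. The struct layout below is assumed for illustration; the idea is that BTF stores a single bit offset per member, while DWARF needs a byte offset plus a separate bit offset for bitfields.

```python
# Assumed layout: struct { int a; int b:3; int c:5; }
# BTF-style member offsets, in bits:
members = [
    ("a", 0),        # byte 0        -> bit offset 0
    ("b", 32),       # byte 4, bit 0 -> bit offset 32
    ("c", 35),       # byte 4, bit 3 -> bit offset 35
]

def to_dwarf_style(bit_offset):
    """Split a BTF-style bit offset back into DWARF's byte + bit pair."""
    return divmod(bit_offset, 8)

for name, off in members:
    byte_off, bit_off = to_dwarf_style(off)
    print(name, byte_off, bit_off)
```

With one representation for both ordinary fields and bitfields, the encoder stays uniform and smaller.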
Last month I wrote the pahole BTF loader, so it reads the .BTF section, puts it into the intermediate format, and then you can use it to pretty-print the BTF contents. So if you do 'pahole -F btf', you get the same output. They didn't do a pahole BTF loader because they were in a hurry; the first loader for BTF was in the kernel. It was in the kernel. libbpf notices when there are data structures that have bpf_map in their name, with the map name in a specific place, and it gets this from the ELF file, and then it calls the BPF syscall with the FD for the BTF and says: associate this BTF information with that BPF object that is already in the kernel, that was already validated. This is the bpf_map macro that does that: it just takes the type and then constructs the variables in a way that the loader will later be able to associate. The kernel validates the BTF: it validates the header, the BTF magic, the version, the flags, a lot of stuff. For instance, if you grep for btf_verifier_log in the specific file that is the implementation of BTF in the kernel, you're going to see that it validates lots of things, several of them related to how a data structure is encoded. And this is an example of using the ftrace subcommand of perf to set up function-graph tracing only for the functions in the kernel that have btf in their name, and then you run perf trace, which will use the BTF to collect things. So you are using perf to trace perf trace and see it loading the BTF information. That's something interesting with all those technologies: you can combine them. You use perf to develop perf; you keep getting those things. The first users of these technologies are the people developing them. So you can see... I mean, this is just an excerpt.
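A minimal sketch of the header-level checks, assuming the BTF header starts with a 16-bit magic, an 8-bit version, and an 8-bit flags field followed by section offsets/lengths; the real in-kernel verifier goes much further (section bounds, string table, per-type encoding rules).

```python
import struct

BTF_MAGIC = 0xEB9F   # magic value used by the kernel's BTF header
BTF_VERSION = 1

def validate_btf_header(data):
    """Toy version of the first checks the kernel performs on loaded BTF."""
    if len(data) < 24:
        return "header too short"
    magic, version, flags = struct.unpack_from("<HBB", data, 0)
    if magic != BTF_MAGIC:
        return "bad magic"
    if version != BTF_VERSION:
        return "unsupported version"
    return "ok"

# magic, version, flags, hdr_len, type_off, type_len, str_off, str_len
good = struct.pack("<HBBIIIII", BTF_MAGIC, BTF_VERSION, 0, 24, 0, 0, 0, 0)
print(validate_btf_header(good))          # ok
print(validate_btf_header(b"\x00" * 24))  # bad magic
```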
There are lots of other things that it does, but it gives an example of what it is doing, using perf to set up ftrace to show BTF loading in the kernel. So btfdiff just uses pahole with '-F btf' on a file: it pretty-prints from BTF and from DWARF. And there is that flatten-arrays step because, as of now, the way that it's encoded by LLVM, and the way CTF did it to further pack the data structures, any multi-dimensional array that you had in a struct would have the values of its dimensions multiplied, so you end up with just one dimension; it was flattened. So when you pretty-print from DWARF, you have to do this flattening step too before comparing. It should produce the same results, and it's being used for regression tests. So, encoding multiple CUs, compile units: in the kernel we have lots of objects, and for each of the objects you have the definitions of all the data structures that it uses. That's one of the reasons why the debugging file is so big. So what one guy at Facebook did, as part of the process of compressing all the data structure information for the Linux kernel into just one place, is to take all those things and put them into just one BTF payload. But you go rewriting the numbers of the types in each subsequent unit to make them all unique. Because in the first object, let's say, in the kernel, you have the integer as type 1; the basic type int is probably going to be type 1 all the time, let's say, with 0 as void. But some object may have task_struct, some may have file_operations, some may have a different mix, and in each of them those types have their own numbers. If you just concatenate two together, then you will mix up the number one of the first with the number one of the second, which may not be the same type. So there is a step to bring all of those into one table in a way that nothing gets mixed up. And then there's the vaporware part that Stanislav was mentioning.
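The renumbering step can be sketched as follows. The tuples are a toy type representation, not real BTF encoding; the point is just that every type reference in the second compile unit must be shifted past the IDs already taken by the first, with ID 0 reserved for void.

```python
def concat_type_tables(cu1, cu2):
    """Concatenate two per-CU type tables, rewriting cu2's type references.
    Each entry is (kind, name, ref) where ref is a 1-based type ID, 0 = void."""
    base = len(cu1)                       # IDs 1..len(cu1) are taken by cu1
    merged = list(cu1)
    for kind, name, ref in cu2:
        new_ref = ref + base if ref else 0   # never shift void (0)
        merged.append((kind, name, new_ref))
    return merged

cu1 = [("int", "int", 0), ("ptr", None, 1)]    # id 1: int,  id 2: int *
cu2 = [("int", "long", 0), ("ptr", None, 1)]   # id 1: long, id 2: long *
merged = concat_type_tables(cu1, cu2)
print(merged)   # cu2's pointer now refers to id 3, where "long" landed
```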
Facebook is writing an algorithm, which they have described publicly but haven't published yet, that takes this big file with all the types from all the compile units of the kernel and deduplicates it. And they claim that they reduce the size by 97 times compared to the DWARF information. So the idea is that the kernel will always have these types available, just like Solaris had in the past. One of the things that arises from this is that, for lots of things, you will not need to have the kernel headers anymore. You can rebuild your data structures from this BTF information. Again, not just for debugging; it's for creating BPF programs, and even for this and that, perhaps. And these are more recent things that were described at the Linux Plumbers Conference in Vancouver last November. So the idea here is that, since you have in the kernel, in a compact form, all the data structures for the kernel, you know what task_struct is: it's this data structure, with those fields, with those types, and those types are like this and this. And your program, which you compiled somewhere, on some machine of some architecture or whatever, is BPF bytecode; together with it comes the definition of the kernel types that it uses. So when you are loading this BPF object, the loader can check that the types it uses are the same as in the kernel that is running at that time. If they're not, it can use some heuristics: oh, the pid field moved by four bytes, or moved from this place to that place because there was an optimization to get it onto a different cache line, as explained in the C2C example for Postgres. Then, as part of loading, it can relocate the accesses and make that thing, which was built for an older, different version of the same data structure, work with the new one.
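The relocation idea above, sketched as a toy: the program ships the field layout it was compiled against, and at load time the accesses are patched to the running kernel's offsets. All the offsets here are made up for illustration.

```python
# struct -> field -> byte offset, as known at compile time vs. at load time.
compiled_against = {"task_struct": {"pid": 1256, "comm": 2632}}
running_kernel   = {"task_struct": {"pid": 1260, "comm": 2648}}  # fields moved

def relocate(struct_name, field, compiled, running):
    """Return (old offset, patched offset); bail out if the field is gone,
    which is one of the cases where you'd need a rebuilt program."""
    old = compiled[struct_name][field]
    new = running[struct_name].get(field)
    if new is None:
        raise ValueError(f"{field}: not in the running kernel, bail out")
    return old, new

print(relocate("task_struct", "pid", compiled_against, running_kernel))
```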
That's because this is trying to reduce, or to eliminate, the need for you to have the toolchain on the machine where you are going to run. You build those object files and you ship just the object files; when you are going to run, you run something pre-compiled. So biolatency, or all those tools, will come with a .o that is going to work anyway. There are limitations to this: there are some changes that are not so simple, and in those cases it will just bail out and say, get a new version of this thing. Now, this next part is something recent. I maintain the tooling side of the perf subsystem; sometimes I work in the kernel as well, and this is one of those cases: the visibility of BPF programs. Two new record types were developed over the course of the last four or five months, with lots of discussions with Peter Zijlstra, who maintains the kernel part, with Alexei Starovoitov, the guy who came up with the eBPF concept, and with several other engineers, both at Red Hat and at Facebook. So PERF_RECORD_KSYMBOL is a new record type for perf where, any time some subsystem in the kernel puts new code in place, be it ftrace, or kpatch perhaps, or any other thing, this gets registered in the stream, in the perf.data file, or in the ring buffer if you are consuming it right away with perf top or perf trace. This solves a problem that's not specific to BPF; BPF puts code there, so it is going to be using PERF_RECORD_KSYMBOL. And then there is another one, which is PERF_RECORD_BPF_EVENT. Let's see. A little bit about KSYMBOL. You have the normal header, then you have the addr where this code is, and then the length of this new code that was put in place in the kernel at addr. And then you have the ksymbol type: is this BPF, is this something else; there can be several. And some flags, which today are there just for future use.
And then there is the ksymbol name and its length, and after that, as usual with other perf metadata events, you have information based on what you set up in sample_type. So you can have the TID, you can have a timestamp; sometimes you don't want the CPU, because you are tracing just that CPU, so, to make the perf ring buffer more compact, to use less space per event, you leave it out. And then BPF_EVENT, which is the last one, which, combined with the other, tells you that a program was loaded, with its ID, and you have this tag which is unique for the BPF program. So in perf record, let's use it as an example, you have a thread listening for this BPF event. It makes several calls to the kernel to get the FD associated with that program, then the BTF associated with that, and gets the information. What's there? The types, the function prototypes, the original BPF bytecode, the jitted code, and the source code line information for the program that was loaded. What to do with all that? Annotation, auditing; and Intel PT needs the jitted code to do its decoding. So that's it, that's the end. There is this information about the dedup algorithm, the initial BPF documentation patch that was sent to netdev, and the patch that introduced BTF in LLVM, which has a nice description of the motivation, why things were done in this way or that way, and so on and so forth. So that's what I had to say, and if you have any questions, I think there's time. [A question from the audience: so this is yet another debugging format; the kernel has DWARF, and ORC, and now BTF, which differs from CTF; there is a lot of overlapping design here.] Can you repeat that? [I mean, the kernel has ORC, which is an unwinder format, another debugging format.] Exactly, exactly.
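The KSYMBOL payload just described (after the common perf event header) can be sketched as a parser. The field widths below are assumptions based on the description (64-bit addr, 32-bit len, 16-bit type and flags, then the name); check the kernel's perf ABI headers for the authoritative layout.

```python
import struct

def parse_ksymbol(payload):
    """Decode an assumed PERF_RECORD_KSYMBOL-style payload."""
    addr, length, ksym_type, flags = struct.unpack_from("<QIHH", payload, 0)
    name = payload[16:].split(b"\x00", 1)[0].decode()
    return {"addr": addr, "len": length, "type": ksym_type,
            "flags": flags, "name": name}

# A made-up record: a 381-byte jitted BPF program registered at some address.
payload = struct.pack("<QIHH", 0xFFFFFFFFC0123000, 381, 1, 0) + b"bpf_prog_tag\x00"
print(parse_ksymbol(payload))
```

With records like this in the stream, a profiler can finally resolve samples that land in jitted BPF code instead of showing an unknown address.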
I mean, the thing is, the kernel grew in complexity, so you need to tame this complexity, and this complexity is better tamed by the people who are putting the complexity there. Like ftrace: when ftrace was being done, nobody would imagine that it would be enabled in production for you to trace all the things in the kernel, but the kernel developers knew that it was something really useful, and they didn't want to ship the kernel with debugging overhead. So they worked over the years to find the most performant, most lightweight way to get that, with security. So I think the kernel community working with this in mind, with observability, and making sure that the thing is safe, is the perfect thing to happen. More questions? Yeah, exactly, that's the tag. Yeah, yeah. [Another question:] have you seen, when you were showing the BPF maps, the numbers? There were 549 maps. Is that because it keeps increasing the ID and not reusing the previous one, like a PID? [And another, about whether it would make sense to] spread this work more widely, because we have a lot of user-space tools that can deal with DWARF; rather than replicating all that work for this new format, it would probably be easier to just convert. What? He was asking if there is any tool to convert back to DWARF from BTF. Not that I know of, but pahole could be that tool: it has a CTF encoder and a BTF encoder, but nobody has talked about doing a DWARF encoder. Well, we could do that if there is some interest. Right, right: DWARF is good for debugging. The way it is put in the justification for the introduction of BTF is this: Clang will not generate just BTF, it will generate BTF and DWARF in tandem, so the old tools will find DWARF there and will keep working; just BTF? What's that? They don't care. Not really, I didn't take the time to check that.
I know that in 2015, I think in Santa Clara, they were starting to see this need, and they were talking with Dave, who is the upstream for BPF, and he said, oh, talk to Arnaldo, because he has this tool that does the conversion, etc. So they got that, looked at the definition of CTF there, and came up with BTF. There's a lot of discussion on the netdev mailing list, and Yonghong Song, that's a guy at Facebook, is really approachable. I'll ask this, I got curious now; I'm going to ask him, and if I get some answers, I'll pass them back to you. Any other questions? So we can go to the party. Thank you. Thank you.