It's great that you came here to see what I have to say, and I hope it's interesting. Let's see what it is about. There were several talks about BPF, so you know what BPF is. It's important to stress that BPF is not just about tracing or debugging: BPF is being used for high-performance networking and all sorts of other interesting things where you add functionality to the kernel in a safe way. That is being experimented with a lot, and you should expect lots of other things to be implemented as BPF in the near future. In this presentation I'm going to be talking about things that will help get this new functionality implemented and in place, and I hope that by hearing about it you can get ideas of how to do even more with the infrastructure being put in place.

So, BTF. It started as a way to have more compact type information, information about data structures, so that you could more easily create BPF programs, so that the kernel can use this information to validate what the BPF programs are doing, and lots of other interesting things. It started with the data types, but nowadays we also have file information, the file name for that specific program, and line number information, so that you can do things like annotation when profiling; I will show some examples of that. And now there is a representation for global variables as well; I will be showing how it's being used in some features. The set of BTF users is spreading: it's being used in most, if not all, of the bleeding-edge BPF features being implemented. I will describe CO-RE and how BTF is used in there. CO-RE is "compile once, run everywhere", which was more or less the motto for Java in the past, but this time it seems to be done right.
I'll talk a little bit about BPF trampolines, which enable some of the features that I'll be describing; about struct_ops, which is a way for functionality in the kernel that is implemented as a table of function pointers, of operations, to be implemented as a series of BPF programs; and a little bit about dynamic re-linking. KRSI I will not be describing, but it uses the features described before.

The first thing to note about BTF is that it's becoming always present. When you wanted to do some analysis using the pre-existing debugging information, DWARF, you had to install a separate package, which was really big, and this gets in the way. So even though that information is quite complete and allows you to do a lot of stuff, you sometimes ended up not using it because there are size constraints on the system that you want to analyze. But with BTF it's different: it's compact. It goes from hundreds of megabytes in DWARF to a few megabytes for the Linux kernel. If you want to use those BPF features, it will be made available in sysfs, in this file, so it will always be there. You don't need to install anything else; the types are going to be always there. Right now, on 5.5-rc6, it's about three megabytes, a little bit less than that.

Talking about the producers and consumers of BTF, the first one was pahole, which is a tool that the kernel community uses to find alignment holes, spaces that are left because of alignment in data structures, so that when adding new fields you can use that space, and to see how the data structure is laid out in memory, in which cache line. I'm going to show some examples. It loads DWARF, which is what GCC generates as debugging information for the Linux kernel, and encodes BTF.
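The kind of alignment hole that pahole reports can be reproduced in user space. The following is a minimal sketch in Python using ctypes, not how pahole itself works: the struct `Example`, its member names, and the helper `find_holes` are made up for illustration, and the offsets assume a typical 64-bit ABI where an 8-byte integer is 8-byte aligned.

```python
import ctypes


class Example(ctypes.Structure):
    # Hypothetical layout, not a real kernel struct: a 4-byte counter
    # followed by an 8-byte member that must be 8-byte aligned.
    _fields_ = [
        ("refcnt", ctypes.c_uint32),
        ("prog", ctypes.c_uint64),
        ("flags", ctypes.c_uint8),
    ]


def find_holes(struct_type):
    """Return ([(member_before_hole, hole_offset, hole_size)], tail_padding)."""
    holes = []
    end = 0          # first byte not yet covered by a member
    prev = None      # name of the member that precedes the current one
    for name, ctype in struct_type._fields_:
        offset = getattr(struct_type, name).offset
        if offset > end:
            # The compiler skipped bytes to satisfy alignment: a hole.
            holes.append((prev, end, offset - end))
        end = offset + ctypes.sizeof(ctype)
        prev = name
    # Bytes after the last member, added so arrays of the struct stay aligned.
    padding = ctypes.sizeof(struct_type) - end
    return holes, padding


if __name__ == "__main__":
    holes, padding = find_holes(Example)
    for after, offset, size in holes:
        print(f"{size}-byte hole after '{after}' at offset {offset}")
    print(f"struct size {ctypes.sizeof(Example)}, tail padding {padding}")
```

A new member of four bytes or less could be placed in the reported hole without growing the struct, which is the optimization opportunity pahole is used to spot.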
So there is an internal representation: it first converts DWARF to this representation and then it generates BTF. In the past there was CTF, which was used for DTrace. I added support for that a long time ago, and nowadays that code was retrofitted to generate BTF. Since it generates BTF from DWARF, and you want to be sure that this conversion happened without introducing problems, there are some tools like btfdiff, used together with pahole: you run pahole to produce output using the BTF information as input, and again using the DWARF information as input, then you compare the results, and they should be the same. Another regression test for that is called fullcircle, which basically gets the debugging information, regenerates the source code, rebuilds the source code with debug info, and then compares this new debug info with the first one. So it does a full-circle regression test.

So pahole is now used in the kernel build: you have this option, CONFIG_DEBUG_INFO_BTF, and when it is enabled, as the last step of the kernel build process, pahole is used to convert the DWARF information to BTF. Then it uses a deduplication algorithm, because DWARF is implemented in such a way that every object file that composes the kernel, and there are thousands of them, has the complete type information for the types used in that specific object. That's why the resulting vmlinux DWARF information is so big. So what pahole does, together with libbpf, which is where the deduplication algorithm lives, is to eliminate the duplicates, and that's why it gets so small.

pahole plus BTF: besides encoding BTF, you can decode BTF, and hopefully that will make it more used, because now we have it available all the time in /sys/kernel/btf/vmlinux. You can use it as a kernel developer in your day-to-day workflow. There are lots of things that were present there already that continue to
work with BTF. For instance, you can ask for the sizes: if you just run pahole --sizes, it will use the kernel type information present in the BTF that is in sysfs. What this screen is showing is basically that: you ask for the size of all the data structures in the kernel, then you sort them and see which ones are the biggest, and the second column says how many alignment holes there are, so you could use this to try to optimize the kernel.

There is another thing that you can do with this BTF information, and with DWARF as well, which is to ask which of the data structures in the kernel contain some other data structure. So if you ask which data structures in the kernel contain a list_head, it will tell you; for instance, task_struct has many of those. And then you can ask with --hex to have the offsets in hex. There are lots of them.

Another thing you can do is ask which data structures in the kernel have pointers to another data structure. So let's say: which data structures in the kernel have pointers to struct bpf_prog? All of these, with these member names. If you look at sk_filter, there is a pointer to it, and here there is one alignment hole: that refcnt is four bytes, and then the next member has to be aligned at eight. So if you want to add some new field here, and it's less than or equal to four bytes, you can put it here. And there is the other one, xdp_attachment_info.

And since it defaults to using the kernel BTF, you can just use the name of the type and it will show it to you. This is really, really fast; it's instantaneous. Let's say you ask for raw_spinlock_t: it is a typedef, and this is a raw_spinlock. Oh, there's something else inside.
So you can ask it to expand the types, and it will show that raw_spinlock has an arch_spinlock_t, that is a qspinlock, that has a union, that has an atomic_t, which has a counter, and so on and so forth. You can expand everything, and you can use these offsets here to see the offset from the start of the struct. It can help you with oops decoding and other things where you have multiple substructs.

So let's get back to bpftool. bpftool is the canonical tool for you to introspect the system and do lots of operations related to BPF. The kernel provides several interfaces via the sys_bpf syscall, and one of them is BPF_BTF_GET_FD_BY_ID. So you have a BPF program, and then you ask for the BTF information associated with that program, and then you can do things with it like pretty-printing map keys and values, or intermixing source code with bytecode or JITed code, things like that.

For instance, there is this new subcommand, bpftool btf, which is recent, and you can see that there are a lot of things that you can do. You can ask for the source file: it will try to get the source file for some specific program. For instance, there is some use of BPF in perf trace, which puts in place several programs. bpftool prog will list those programs, and you can see the tags and the IDs here. And if you do bpftool btf dump map id 168, which is the map that I wanted to see, it will tell me that the key is an integer, four bytes, and that it's signed. And if I ask for the value, what are the values in this map?
It will show you that the values in this map are struct syscall, and then it will expand the types needed for you to create a program that sets up those values, because you asked for format c. And you can as well ask for dumping a file in the C format, for that BTF vmlinux.

So I was asking for struct fpu, and you're going to see that it puts these funny things here, fields without a name. It needs those because BTF doesn't have information about explicit alignment. If you look at pahole dumping the kernel BTF for that same structure, you're going to see that there's a 48-byte hole here, between this member and the next one. If I ask for the same thing using DWARF, DWARF has information about the aligned attribute, so you know why this 48-byte hole is here: the developer asked that the state for the floating-point registers be aligned at 64 bytes.

So pahole encoded BTF, and the first consumer, the first piece of software that was reading that information, was the kernel. You say BPF_BTF_LOAD, you load a program, then you associate the BTF with it, so that later on you can ask for this information in a profiler, or in bpftool, or somewhere else. The kernel validates BTF: it validates the header, whether it has the BTF magic, the version, the flags. And if you look at the BTF verifier, there are lots of things that it tries to verify. Just like it verifies BPF programs, it verifies the type information associated with them.

If you use perf ftrace and run perf trace asking for the nanosleep syscall, perf trace is using a BPF program, and it's loading the BTF information associated with this program. So we use perf ftrace on the functions in the kernel that have btf in their name.
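To make the header validation mentioned above a bit more concrete, here is a minimal sketch in Python of parsing and sanity-checking a raw BTF header. The field layout and the magic and version values follow `struct btf_header` in the kernel's UAPI headers; the function name `parse_btf_header`, the little-endian assumption, and the particular subset of checks are assumptions made for illustration and are not the kernel verifier's actual logic.

```python
import struct

BTF_MAGIC = 0xEB9F
BTF_VERSION = 1

# struct btf_header: u16 magic, u8 version, u8 flags, u32 hdr_len,
# u32 type_off, u32 type_len, u32 str_off, u32 str_len.
# Assumption: the BTF blob is little-endian.
HDR_FMT = "<HBBIIIII"
HDR_SIZE = struct.calcsize(HDR_FMT)


def parse_btf_header(data: bytes) -> dict:
    """Parse a raw BTF header and apply a few basic sanity checks."""
    if len(data) < HDR_SIZE:
        raise ValueError("truncated BTF header")
    (magic, version, flags, hdr_len,
     type_off, type_len, str_off, str_len) = struct.unpack_from(HDR_FMT, data)
    if magic != BTF_MAGIC:
        raise ValueError(f"bad magic {magic:#x}")
    if version != BTF_VERSION:
        raise ValueError(f"unsupported version {version}")
    if flags != 0:
        raise ValueError(f"unsupported flags {flags:#x}")
    if hdr_len < HDR_SIZE or hdr_len > len(data):
        raise ValueError(f"bad header length {hdr_len}")
    # Section offsets are relative to the end of the header.
    payload = len(data) - hdr_len
    for name, off, length in (("type", type_off, type_len),
                              ("string", str_off, str_len)):
        if off + length > payload:
            raise ValueError(f"{name} section out of bounds")
    return {
        "hdr_len": hdr_len,
        "type_off": type_off, "type_len": type_len,
        "str_off": str_off, "str_len": str_len,
    }
```

On a little-endian machine with a kernel built with CONFIG_DEBUG_INFO_BTF, feeding this function the bytes read from /sys/kernel/btf/vmlinux should pass these checks, although I have only exercised it here on synthetic headers.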
We can see that it is doing the validation: getting into the BTF verifier, logging members, verifying several pieces of information.

The kernel's /sys/kernel/btf/vmlinux is raw BTF. It's always available, as I described before, and it's compact. Here you can see the comparison with the equivalent DWARF information: it's way less space. BTF uses this much for this specific kernel, and DWARF uses this, the sum of those ELF sections.

BPF compile once, run everywhere: this is the first serious user of all these things, of BTF. The problem is this. Let's say you have an organization where you have lots of machines, and several of those machines are in different departments or wherever, so you don't have the same kernel running on all of them. You want to deploy some set of BPF programs on all of those kernels, and you don't know whether the data structures used in those BPF programs changed. So this feature saves information in the BPF program, has the type information for the kernel, and can check whether the things being used in the BPF program are at offsets different from the ones in the kernel that is running. It goes way beyond that, and I will describe all the things that are done so that one program can run on different kernels, even kernels where some features are not present, or so that it can fail gracefully, telling you: no, on this kernel I need this specific field in this specific structure, and it's not present, or it changed type.

So there's a feature that comes in Clang, which is __builtin_preserve_access_index. Initially, in every access to data structures that you thought could change from kernel to kernel, you had to use this thing. Of course you would define a macro to have a shorter form, but it was cumbersome. When libbpf is loading the program, it looks at those relocation records, and if the data structures are the same, that's okay.
You load it and run it. If they are not the same, you go to all of those places and fix them up before you load the program. So kABI, for instance, is not an issue for BPF programs; actually, this is kABI that is fixed up on the fly. It records even bitfield accesses.

Another thing that is present is extern variables. If in a BPF program you say, oh, I want to use this extern variable, and the extern variable's name is LINUX_KERNEL_VERSION, Clang will build the BPF bytecode in a way that records that this has to be resolved at link time. And who does the linking? The linking is done by libbpf. libbpf looks at the extern, unresolved ones, and if it's LINUX_KERNEL_VERSION, it goes there and does constant propagation: at the site where it is used, it just puts in the value of this extern variable. And that's the same for all the CONFIG_ entries. So if you want to have a program, like a congestion control algorithm implemented in BPF, and you need to know the value of CONFIG_HZ, let's say, you just declare it as an extern variable, and at load time, on that specific kernel, the right value will be inserted where it's used, as a constant, not as a variable that would incur a memory access and go through the cache. It uses the config in /boot/config-$(uname -r), or falls back to /proc/config.gz, or in your program you can use some specific thing to override this, because perhaps on your system this is not available. All those things are in the selftests, and those selftests are like the canonical examples of using those features.

So, vmlinux.h: that's another thing that you can do with bpftool. You do bpftool btf dump file and give it this file, which is raw BTF, and you ask for format c, and it will generate all the types for the kernel in such an order that it's compilable. And then there is this thing at the beginning.
It's an improvement on those relocation record requests that we saw. You use this #pragma clang attribute push, with the preserve_access_index attribute, for all the types from here to where I do a #pragma clang attribute pop. So for all the types in the kernel, if you use them in a program that includes this vmlinux.h and they change, no problem: libbpf will do the relocations, and your program will continue working even if you change some core kernel data structure. So let's see: the first structs in there are, unsurprisingly, atomic64_t or list_head, so everything compiles. Okay.

So with this in mind, let's take a look at something that was implemented two or three weeks ago, which is called runqslower. Initially it was a BCC program (BCC, the BPF Compiler Collection, or however it's called). It basically traces high scheduling delays. bpftool allows you to create a BPF program, build it, and then, from the object file, generate a skeleton that provides all you need to process the events produced by this BPF program. It's auto-generated from the object file. The relocatable vmlinux.h is this thing that I mentioned, and it's necessary for BTF-typed raw tracepoints. That's another thing which is really powerful. When you are using, let's say, bpftrace, you don't see those things, but those things are implemented in terms of this kind of technology that I'm describing now. We're going to see what this BTF-typed tracepoint is in the example.

But then I was trying to build this thing, and it doesn't work. It doesn't build. I was thinking that was a problem.
I was talking with people, but then: update your system. I have been dealing with BPF in the perf context; perf has some support for BPF. So I have been building LLVM and Clang for a long time, but I was building them from the Git mirror of the SVN repository where they were maintained. They had separate repositories for Clang and for LLVM. Okay, no problem: just git remote update, pull, and build. But then, when I was trying this thing here, I did the updates and it still didn't work. So I was talking with Yonghong Song at Facebook and Andrii Nakryiko, who do the BTF work on the libbpf side and on the LLVM side, and they said: no, you should use this other thing here now. At first he said, oh, it's working for me, it's working for me. So I did some googling around, and then there was this whole GitHub move: okay, they stopped updating the repositories I was using two or three weeks ago. So I moved to the new repository and built it to get close to LLVM 11, to get this #pragma clang attribute with preserve_access_index, and to get some other things that are needed for all the features.

So, with all that's needed in place, and that's something interesting: this thing now lives in tools/bpf/runqslower. The idea is that this is a way for you to create tools that will be self-contained and maintained in the kernel sources. So it's like the kernel source repository for some specific tools. This is the initial one.
So it basically will build libbpf and will build bpftool; it does all those things for you. It makes sure that the latest libbpf is there, that bpftool is there, generates vmlinux.h, compiles runqslower.bpf.o, which is the kernel part, the BPF program, generates the skeleton from that one, and compiles the thing. And what was it that it generated? It generated a binary which is half a megabyte. And that was something that Andrii pointed out: if you strip it, you throw away the debugging information, which is DWARF; it's not useful at all, even more so in this context. You have this tool that you run like this. It's a self-contained binary with everything, the kernel part and the user space part, and you don't need to have any toolchain on the machine. You can run it on any kernel version. So I was running it passing parameters on the command line. So the boilerplate needed to create this thing is being reduced. Everything is a single binary; it's small; it runs on any kernel.

Let's unroll the magic. It's similar to BCC: it has a user space part and a BPF kernel part. It does natural struct field dereferences, so you don't have to use bpf_probe_read or whatever other dance. And then let's see the common user space and BPF header. This include file basically defines what is being communicated from the kernel to the user space part.
That is, what the event is: what is being collected in the kernel, in the circumstances that we'll see, and passed to user space for consumption.

So the user space part is this; you have to write it. It uses argp; it's a user space program. It will initialize global data with the options that you collect from the command line. The skeleton generator noticed that there were global variables, so it makes it possible for the program to alter the value of those variables before loading the program into the kernel. It sets up the perf ring buffers, which is the way for you to communicate from kernel space to user space, and reads the events, everything using libbpf.

So, the event handler. At the beginning you have the includes, the one which is auto-generated and the other one that I showed, and to handle the event you have some standard parameters, and then you cast this data to that struct event. You get the time and you print it, the task, the PID, and the delta.

The main function is basically this: you parse the arguments; you open the BPF object, with a function that was auto-generated by the bpftool skeleton generator; and here you set the minimum, the command line argument. The skeleton generator generated it so that there is this substruct, this field, and there is one struct with that name in the BPF program, as we'll see later. Then it loads the BPF program, attaches it, prints some column headers, and then sets up the event handler. You can handle lost events as well: if it's a high-frequency event you may lose some, and then you want to be notified. You create the perf buffer with the callbacks, and that's it. You run it in a loop, printing things. So that's the user space part.

The kernel BPF part is this runqslower.bpf.c. It uses a per-PID hash map for timestamps. It sets things up, and it connects to BTF tracepoints.
I will explain what these are when we see them; they allow you to use normal pointer dereferences. It sets up the events and pushes them to user space via the perf ring buffer.

So it starts like this: you include that vmlinux.h that was generated from the BTF information for the kernel, include some helpers, include that common part which has the struct event definition, and then here you declare some global variables. There's another one that I suppressed so that the program was smaller to show, but you could set the target PID as well: I'm only interested in scheduling latencies for a specific target PID. And then here is how in C you define a map. You see that the type is hash, with 10,240 entries, the type for the key is u32, the value is u64, and you declare it as start. And here is another one; it's just the perf ring buffer. This is still some boilerplate; it should be more compact. And if you look at the BPF object that was generated, you're going to see those two symbols, which are global, and that's what the bpftool skeleton generator sees and allows the user space tooling to set in the process of loading via libbpf.

The kernel program starts with a simple function that uses some BPF helpers, like bpf_ktime_get_ns and bpf_map_update_elem: it gets the timestamp and stores it in the start map, with the PID as the key and the timestamp as the value. Then starts the interesting thing, which is the BTF tracepoints. You know that for this tracepoint, sched_wakeup, the parameter is just a task_struct. So you can just come here, cast it, and access it directly.
The verifier will make sure that this is safe. And then you have sched_switch, where you have a different prototype, and you have the task that was running and the next one, and you get information from the next one. Then you do the lookup for it, you get another timestamp, subtract the one that you got from the map, and check that the result is more than the specified minimum latency that you want. If everything is all right, I want to pass this to user space. So I create the event, with the PID and the delta, use another BPF helper to get the current task comm, and push this to user space. Then I delete that entry from the map, so that the next delay does not find it there.

As for the object details: the BPF object is 31 kilobytes, and this includes the BTF information, the type information for all the types it is using; task_struct is there, for instance. It's BPF bytecode, and you have the relocation information for the fields that we're using.

This next one I'll just talk about briefly, because time is running out. This is another new feature, which will allow parts of the kernel that are implemented as, let's say, plug-ins, where you have multiple implementations for some specific piece of functionality, to be implemented as BPF programs. In the past they were implemented as modules, but now we have this compile once, run everywhere infrastructure. It should be a general mechanism for any kernel struct ops, and this opens the door for lots of parts of the kernel to be re-implemented as BPF programs. I think that people chose the congestion control algorithms because it's difficult to get one that satisfies everybody; in the kernel right now we have all of those congestion control algorithms, and that's quite a lot. So the first example that they did was DCTCP, which is one of those congestion control algorithms.
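The bookkeeping that the runqslower BPF program does, as described above (store a timestamp per PID at wakeup, look it up at sched_switch, compare the delta with the minimum, emit an event, delete the entry), can be modeled in plain Python. This is a simplified sketch of the logic only, not the BPF code: the class `RunqTracker`, its method names, and the `Event` tuple are hypothetical, a dict stands in for the hash map, and a list stands in for the perf ring buffer.

```python
from typing import Dict, List, NamedTuple, Optional


class Event(NamedTuple):
    pid: int
    comm: str
    delta_us: int


class RunqTracker:
    def __init__(self, min_us: int) -> None:
        self.min_us = min_us              # minimum latency worth reporting
        self.start: Dict[int, int] = {}   # pid -> wakeup timestamp (ns)
        self.events: List[Event] = []     # stand-in for the perf ring buffer

    def on_wakeup(self, pid: int, now_ns: int) -> None:
        # Equivalent of the map update: key is the PID, value the timestamp.
        self.start[pid] = now_ns

    def on_switch(self, next_pid: int, next_comm: str,
                  now_ns: int) -> Optional[Event]:
        # Look up the timestamp and delete the entry in one step, so that
        # the next delay for this PID does not find a stale value.
        ts = self.start.pop(next_pid, None)
        if ts is None:
            return None                   # we never saw this task wake up
        delta_us = (now_ns - ts) // 1000
        if delta_us <= self.min_us:
            return None                   # delay below the requested minimum
        event = Event(next_pid, next_comm, delta_us)
        self.events.append(event)         # "push to user space"
        return event
```

For example, with `RunqTracker(min_us=10)`, a wakeup of PID 42 followed 50 microseconds later by a switch to PID 42 produces one event, while a 5 microsecond delay produces none.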
You have the selftest for it; you can take a look at it. It's not the same; it's just the initial thing to test how this infrastructure will work. It has helpers for tcp_sock and inet_connection_sock, which in the kernel are large structs; I will show a little bit. You keep the structs with the same name as in the kernel, but with just the fields needed for DCTCP. That's something interesting. I mean, the feature initially was for you to realize: oh, this field I want is at a different offset, that's okay. If I remove hundreds of fields from task_struct but the ones I need are still there, okay. That's what this thing is doing. So for passing arguments it's the same thing I described before. If you look at the TCP helpers header for this specific congestion control thing, it has tcp_sock with those fields. Yes, it even fits into just one screen. But if you look at tcp_sock in the kernel, it has 135 members. It's way more than that; what you are seeing here is not the full tcp_sock, it's the subset of tcp_sock that's used by the congestion control algorithm, and this is allowed by compile once, run everywhere. Of course, when you declare something like this in your BPF program, the offset for a field will be different from the offset of the same field in the kernel, so there has to be some sort of relocation while this is being loaded into the kernel.

So you have a new BPF map type that you use to register and introspect this struct_ops. libbpf will receive a pointer to these congestion ops, and you populate the map through a series of operations. There are several details that I don't have time to look at here. And you can do a bpftool map dump and see how many users, how many sockets, are using this specific congestion control algorithm implemented in BPF.
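The relocation idea behind this subset struct can be sketched in a few lines of Python. This is only a model of the concept, not libbpf's implementation: the names `Field`, `Access`, `relocate`, and `RelocationError` are hypothetical, the layouts are made-up numbers rather than real kernel offsets, and real CO-RE relocations cover more cases than the simple name and size match shown here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Field:
    offset: int   # byte offset from the start of the struct
    size: int     # size of the member in bytes


@dataclass(frozen=True)
class Access:
    insn: int     # index of the instruction that performs the access
    field: str    # member name recorded in the relocation record
    offset: int   # offset compiled in from the local struct definition


class RelocationError(Exception):
    pass


def relocate(accesses: List[Access],
             local: Dict[str, Field],
             target: Dict[str, Field]) -> List[Access]:
    """Rewrite each recorded access to use the running kernel's offsets."""
    patched = []
    for acc in accesses:
        want = local[acc.field]
        have = target.get(acc.field)
        if have is None:
            raise RelocationError(f"field {acc.field!r} not present in target")
        if have.size != want.size:
            raise RelocationError(f"field {acc.field!r} changed size")
        patched.append(Access(acc.insn, acc.field, have.offset))
    return patched


# Local definition: just the two members the program needs (made-up layout).
LOCAL_TCP_SOCK = {"snd_cwnd": Field(0, 4), "snd_ssthresh": Field(4, 4)}

# Layout taken from the running kernel's BTF (made-up numbers).
TARGET_TCP_SOCK = {
    "snd_ssthresh": Field(1252, 4),
    "snd_cwnd": Field(1256, 4),
    "snd_cwnd_cnt": Field(1260, 4),
}
```

Matching by member name is what lets the local struct drop every field the program does not use; a missing or resized member makes the load fail gracefully instead of reading the wrong bytes.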
And that's how you do it internally. There's a new type of BPF program, struct_ops, and then you declare some variable, some function here. This will put it into a special ELF section that libbpf will find, and it will do its magic. And here inside, it's just like you implemented a TCP congestion control algorithm in the kernel as a kernel module. It's no different than that, but this is a BPF program. And the reason it looks natural is all those things that I described so far: now the BPF verifier knows that the function this program will connect to in that struct_ops table receives just one struct sock. This is the type, and it has access to the type, and it can validate all of those accesses. So the BPF program now looks just like kernel code, but it's not kernel code: it's bytecode that the kernel will verify, something that was not done before with kernel modules. So kernel modules are extremely unsafe compared to this. And then that's another method that's implemented.

There are BPF trampolines; the presentation is available online, but that's the time I had to present, so I'm stopping here. I should have provided enough pointers for more information. Do you have any questions?

Yeah, BTF for kernel modules should be in that directory when it's done, but it's not yet. So that will be different. Let me go back. They'll be in that /sys/kernel/btf directory. You're going to have, like the /lib/modules hierarchy, one file per module. And then, when you are dealing with modules, libbpf will know where to get that information. I don't know, probably the module BTF will refer back to the types that are defined in the main file, so as to save space.

Well, I mean, he actually said that he would be here. Oh, he's there.
I think it's been used quite a lot already in several organizations, including the one that pays his salary. So it's mature, but it's at the point of experimenting and trying to see how this fits into your organization, into your use cases.

I mean, there are several things that you can do to check this. There is the name of the field, and then there is the type of the field. Yeah, there's a name, there's a type, offset, size. So it could have the same type and offset, but the size of the type, let's say if it's an embedded one, changed, and so you could say: no, this type that I'm accessing changed in size, so I will bail out at program loading.

No, this is all upstream already. I mean, there is some stuff here that is in bpf-next. Some of it, for instance pahole generating BTF information, I think is already in 5.4. 5.5 will come with more stuff, and there are lots of things happening. Some of those things require new versions of Clang and LLVM, as I discussed. Some of the things that I described here, and there are some more at the end that I didn't have time to cover, were done, let's say, yesterday or the day before. Like this one here, let's see: BPF dynamic re-linking. There was no time for talking about it, but it's something that will help a lot with some use cases where you have to replace existing code with a different one, or combine a chain of things that you want to run into one so that you can replace it atomically. There are lots of new features in the area. It's really in flux.

Yeah, getting perf c2c to go into the data structures.
Yeah, right. I mean, that's something that remains to be done. You can get a cache miss at some point, and then you have the information about the offset in the cache line, and you have the physical address. So you have to go back to a type, going back through the series of moves, et cetera, all the way to a global variable or to a parameter to this function. So it's something that somebody has to do, and perhaps this will require some changes in compilers. But for BPF programs this should be much easier now, because we have the relocation records: you know what's being accessed here, you go there and look it up, and then you know. But for the kernel as a whole you have to use all those techniques, because there are all sorts of optimizations in place. So I think that the best thing you could do is to go to the instruction pointer address, see the instruction, see what it's doing, and then try backtracking all the way to a global variable or to a parameter, and then try to navigate from that. Go ahead.

Yeah, exactly, all the way to the push. Yes.

Yeah, there is a cost associated with that. Probably this is going to disable some optimization, or perhaps not; I don't know, I haven't looked in depth at this. I just know that this is required: you have to know where the fields are being used and what the types are, so that you can go and do the relocation.

Up to you. If you are completely sure that the data structure that you are accessing in the kernel is set in stone, like the ABI is, like for syscall arguments, we don't need BTF for that. It's set in stone for that specific architecture. If you access it following the ABI, there is no need for relocation records. Any more questions? Three minutes more. I think I'm going to write some more slides. Okay, thank you.