Hello again. My name is Jiri, I work for Red Hat, and besides perf I also work on eBPF and related things. So basically this presentation is about how bpftrace works internally. If the guy with the question about how printf works in bpftrace is still here, I have that answer on slide 30 or something like that.

So this will be the first part of the presentation, showing how the program that you feed to bpftrace is transformed into the eBPF bytecode, plus one or two suggestions that can make your program faster by omitting some useless stuff. If you survive that, there's a second part of the presentation about recent and planned bpftrace features, where I will go through some of the stuff we are working on and which is about to be merged.

First of all, the big picture. This is the cycle of how bpftrace works. The user provides the programs that Jerome was explaining how to put together. bpftrace uses the LLVM libraries to take that program and compile it to eBPF bytecode, so for every probe that you specify in the program there's a piece of eBPF bytecode. bpftrace then uses libbcc and libbpf to load that program into the kernel and attach the code to the probes. Once this is done, everything is set up and the kernel starts going through the probes: a probe triggers the eBPF bytecode, the bytecode runs and produces some data, and the data are stored either in memory areas that we call maps or in the perf ring buffer. bpftrace is watching all those things, the maps and the perf ring buffer, reading the data and printing the data back to your terminal.
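As a concrete illustration of that cycle, a one-liner like the following goes through all of those stages (the probe and map name here are just an example, not from the slides):

```bpftrace
// Compiled by bpftrace via LLVM into eBPF bytecode, then loaded
// and attached with libbcc/libbpf; the kernel runs it on every
// write syscall.
tracepoint:syscalls:sys_enter_write
{
    // The result lands in a BPF map that bpftrace keeps reading...
    @writes = count();
}
// ...and prints back to your terminal when the program ends.
```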
So this is how it's all glued together. The scope of the presentation, as I said, is how the program that you feed to bpftrace is transformed into eBPF bytecode, so how you get from human readable to almost human readable. Later on the kernel can actually JIT the code, which means it's translated to native instructions, but that is not covered in this presentation; I will just show you how these transformations are made.

To actually understand how bpftrace works internally, it's good to know what environment the eBPF bytecode is running in, and it's not difficult. You have ten registers, a few of which have special functions. Registers r1 to r5 work as function arguments, so whenever you call a function, the BPF code needs to set up those registers, and if the function returns any value, you will get it in r0. There are two special values that every piece of eBPF code is executed with. One is stored in register r10, which is the pointer to the stack; the stack is a piece of memory that you can use to initialize data and then pass pointers into it to the BPF helpers. The other special value is r1, which is the context of the probe. Usually it's the place where you would go for the arguments of the probe: when the probe is executed you are tracing some function, the function has some arguments, and the context value lets you get the values of those arguments.

Then there are the helper calls. The eBPF subsystem provides many helper calls that you can use in the bytecode; they provide things like the PID, the timestamp, all the things that you need. In the assembly, whenever you see `call`, it's a call to a helper. And for how to work with maps: you need the descriptor for the map, and in the bytecode it shows up as the symbol `map[id]`. The id of the map is the map descriptor.
So anytime you see that, it's actually the value of the map descriptor. The bytecode output shown here is what's produced by bpftool: if you run something like `bpftool prog dump xlated id <ID>` with the ID of the program, this is the disassembly you will get, and it's quite easy to read; all the arithmetic operations are quite easily understandable.

So let's start directly with an easy one. You are telling bpftrace to put a probe on the syscalls sys_enter_write tracepoint, and you are telling bpftrace to put the number one into the map value. Not such a useful program, but it serves as a nice example of how you access a map. Maps are accessed through helpers; there are several helpers for accessing maps, and anytime you read the `@` sign in bpftrace, it means you are accessing a map. The map used in bpftrace is almost always a hash map, or a hash map which is per-CPU; those two are the most used kinds.

So when this program is executed, it all comes down to updating the value in the map. It's a hash map, which means you have keys and values glued together, but if you have a program like this with no key (we will get to keys later on), you actually access the zero key of the map. The pseudocode is easy: put zero to the key, put one to the value, and call the helper. So you need to prepare the registers. Every helper has a C-style declaration which you can get from various header files; this is where I get mine. From the C-style declaration of the helper you see that the first argument is the map, so you get the map ID and put it into register r1. Then you prepare the key value zero, put it on the stack, and pass the stack pointer as the second argument. The same way you prepare the value one and put it on the stack.
That gives you the third argument. The last argument is flags, which is zero here, and this is how the program is translated and works: initialize the data, call the helper, the map is updated, the number one is there.

Let's do a somewhat more useful map update. Just a disclaimer: this is not what you should be doing, it's just for the sake of the example. I will get to why this example is wrong in a few slides, but it nicely demonstrates what will actually happen in the code when you write map updates like this. In this example you say: in this sys_enter_write tracepoint, update the map value by one. Every time you go through the probe, the map value is increased by one, and the pseudocode is quite easy. First you need to get the value from the map, because the program itself has no data space of its own; it always needs to look the value up. So first you look up the value with the key (the key is zero because we don't use any key), you check if the value exists, you increase it by one, or if it doesn't exist you create the value one, and then you update the map. Transformed into eBPF code it looks like this: first you call the lookup, then you check the value and increase whatever you get from the map, and finally you update the map.

This scheme, where first you look up the value and then update it, will repeat every time you access, increase, or do whatever to a map value in bpftrace; it generates those two helper calls and everything in between. Now, why you shouldn't be doing this: for counting we actually have a special function called count(), which does exactly the same thing but on a per-CPU map. The first example keeps the value in a hash, but it's just a single hash.
It's a global hash, and when it's accessed from multiple CPUs, you have a race window where the CPUs fight over the data and you end up with a wrong result. If you use the count() function, it does exactly the same thing but in the proper way: it uses a per-CPU hash map, which you can see right here, meaning there is a hash map for every CPU. Every CPU accesses its own hash map, its own value, so when the program finishes there's a value for every CPU, and bpftrace will summarize all those values and print you just one final value, one final correct value.

Just for fun, if you compare those things: this is a benchmark I use when testing things; it creates many processes that read and write between themselves, so there are a lot of write syscalls executed. The base time of the benchmark was 78 seconds. If you use the per-CPU way of counting the syscalls, it's still horrific, but it's not as bad as the 266 seconds for the regular map update. So not only is the regular update wrong, it's also really slow.

We can do better for some of the cases. There's a feature called global data in eBPF that allows you to skip the lookup and update of the map and just provides a pointer to the data to the program. The thing is, it's only usable for simple values without keys. So it's usable for this example, where you have one value and you increase it; then you can use these five instructions that do the same as the previous code, but much, much faster. The code just gets the pointer to the value and uses a locked (atomic) increment; those few instructions are basically all you need. But again, that's only if you have a simple case like this; once you start using keys,
this will not work. There's already a pull request for this to get into bpftrace; the pull request recognizes cases like this and uses this kind of code, and it results in a great speedup. The last measurement adds just about four seconds on top of the base for the same benchmark.

Again, that's just for a map without keys. What do we do with a map with keys? The answer is that we do basically the same thing: when you have the simple map without any key, the key is zero; now you start using a real key. Like this: you can put the `@` sign with square brackets and say `cpu`, and now you are gathering a count for every CPU that executes the probe. What happens in the bytecode is that from some point on it's still the same: you look up the value, you increase the value, you update the value back. What's new is that you first need to know what the CPU is, so you call the CPU helper and use the result as the key for the lookup and as the key for the update; the rest of the bytecode is basically the same as without the key.

You can of course make the key richer, you can put more things into the map. I'm actually not sure what the limit for the key is, but I guess you can get crazy and put many more values there. The thing to be aware of is the logical one: when you use something as the key, you first need to retrieve it. So cpu, pid, comm, all those things need to be gathered first and put on the stack as the key, and then there's the lookup and update of the map; the code is basically the same, just using the new key.

How do we access probe arguments? We have various probes, well, tracepoints and k-probes, and bpftrace allows you to see the argument values of those functions, of those tracepoints.
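For example, reading a tracepoint argument together with a keyed map looks like this (a sketch; sum() is just one way to aggregate it):

```bpftrace
tracepoint:syscalls:sys_enter_write
{
    // args points to the tracepoint's data; comm is the key,
    // gathered first and put on the stack before the map update.
    @bytes[comm] = sum(args->count);
}
```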
So every tracepoint has some arguments, and those arguments are passed to the eBPF program in that special context value, which at the beginning is in the r1 register and holds the pointer to the tracepoint data. But it holds a pointer, and the memory is somewhere else; that means you need to read the data first before you can actually use the value, and we have another helper for that, bpf_probe_read.

So what happens when you use an argument in the probe: you treat `args` like a structure pointer, you use the C-style dereference, and you access the argument. The arguments are what you find if you go to tracefs and check the formats for the various syscalls, or basically all the events; every event has this format file, and you will see there the arguments you can use. That's basically what bpftrace looks at to find out where the value is in the data. This is how the tracepoint data looks, the structure of the write tracepoint data: bpftrace takes the pointer, adds the 35 which is the offset of count, says it wants to read eight bytes, reads the bytes, and returns them in the destination register.

So any time you use this `args->count`, this probe read is generated, and that's actually the problem: bpftrace is not smart enough to know that it was already reading the value.
So anytime you use this `args->count`, this helper call will be generated. The way to work around it, if you know you are using something from args in several places in your program, is to put it into a temporary variable; from then on only that variable is accessed, and that code is generated only once.

For the k-probes it looks different. The k-probe again passes a special pointer to the data, but the data itself contains all the registers the k-probe was executed with, and the arguments are basically passed in the register values, like the first arguments for every function. Well, that's x86; it might be different on other architectures, but most of the time the argument values for functions are put into registers. So what this simple program does when it says `arg2`, which is the k-probe way of accessing the third argument of the probe, is: it goes to the registers, finds the proper register, the one that actually carries the third argument, and reads it. No special helper is needed here; you just read the value from the data that is passed to the eBPF program. So for k-probes:
You don't need any workaround, you don't need to be aware of anything when using those arg values.

Finally, printf. bpftrace has many functions of its own; you can call not just the helpers defined by the eBPF subsystem, which transform into calls into the kernel, but also many built-in functions, and one of them is printf. The way it works is the way most functions in bpftrace work. Every printf is recognized by the format string that you want to print; it serves as a global ID for that printf, and that's basically what bpftrace uses in the program. For this simple example it stores the ID, and because there's just one printf, the ID is zero. It stores the ID to the ring buffer: whenever you use printf or another such function, bpftrace creates a perf ring buffer, which is the way of communicating data between the kernel and bpftrace. So what the program does is store the ID of that string to the ring buffer, and bpftrace is monitoring changes in the ring buffer. It sees the value, gets the ID, knows what format string belongs to that ID, gets the data if there are any (I have another example with data), puts it all together, and prints the data to the console.

So you can see there's a delay, and that's something you should be aware of: with these functions there's always a perf ring buffer in the middle, and if your program is printing really a lot, the ring buffer can fill up. Every time you read from the ring buffer you make space for more data, but when the kernel comes with data to the ring buffer and there's no space, it just drops the data.
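Roughly, the printf path just described looks like this (a sketch; the comments mark which side does what):

```bpftrace
tracepoint:syscalls:sys_enter_write
{
    // Kernel side: the program only pushes the format-string ID
    // (plus any argument values) into the perf ring buffer.
    printf("write entered\n");
    // User side: bpftrace polls the ring buffer, maps the ID back
    // to the format string, and prints to the terminal -- with a
    // delay, and with possible drops if the buffer fills up.
}
```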
So if you're printing just too much, and bpftrace is busy with other things, or is too slow reading the ring buffer, you can start losing data and not see what you are actually supposed to see.

How it works with data: I have this sys_enter_write tracepoint, that's the format string, and these are the values I'm printing, `args->fd` and `args->count`. Again we store the ID, which is zero of course, there's only one string. Now we need to read `args->fd`, which I covered on a previous slide: you read the value of `args->fd` and store it to the stack, you read the value of `args->count` and store it to the stack, and you call the perf_event_output helper, which puts all the data, the ID and the values, into the ring buffer. Again bpftrace checks whether there's data in the ring buffer, sees it, sees that you want to do the printf, puts the data together, and prints it to the console. As I said, this is a similar workflow for the other functions, like print(), system(), cat(), exit(), and probably others; they use the same workflow: put the ID and the data on the ring buffer, and bpftrace in user space does the rest.

Predicates. Predicates are a really nice way to say, syntactically, whether the probe should be executed or not. In this example you are gathering data from the power:cpu_frequency tracepoint and populating a map keyed by CPU; then you profile at 100 hertz, putting the frequency data into a histogram, and there's a predicate that says: don't execute the probe unless there's some value in that map. This works exactly how you would imagine. `@freq[cpu]` is an access to the map with the CPU as the key, so you need to get the CPU ID, use it as a key into the freq map, get the value, and check whether the value exists and is different from zero. If it's zero we just exit, and if it's
not zero, the program continues. So not really interesting. What is interesting is that, once again, if you use the same value `@freq[cpu]` later in the program, the same lookup will be generated, like in the earlier example with the arguments, and probably in many other places where you access something non-trivial: the code for the value will be generated again, the lookup regenerated. So for this example it's actually faster to skip the predicate, read the value into a temporary variable, and later use the temporary variable. You do just one lookup, then you check the value, and the program continues without any other lookup; it just uses the value. Of course there are different predicates, but when the expression in the predicate repeats itself in the program, it's useful to do this; you get rid of one lookup.

Histograms. This example uses histograms; as Jerome explained, histograms are a nice way to bucketize the output. In this example I'm reading the frequency values and putting them into buckets. We have two functions for histograms, the hist() function and the linear histogram function, lhist(), which is the one here. What they do is basically take the value you provide, transform it into a bucket value, and keep a count for the bucket; later on bpftrace displays it. So here the arguments of the lhist() function are: 0 is my minimum, 5000 the maximum, and 200 is the step, the size of the bucket. Later in the output you will see, for example, that for frequencies from 800 to 1000 there were so many values; you get the counts for every bucket.

How does that work internally? You have the `@` sign here with the mhz map, so again, it's a map update.
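The program being discussed, with the temporary-variable trick applied, might look roughly like this (modeled on bpftrace's cpufreq example; the division assumes the tracepoint reports the frequency in kHz, so treat the details as an illustration):

```bpftrace
tracepoint:power:cpu_frequency
{
    // Remember the last reported frequency for this CPU.
    @freq[cpu] = args->state;
}

profile:hz:100
{
    // One map lookup into a local variable, instead of a predicate
    // lookup plus a second lookup in the body.
    $f = @freq[cpu];
    if ($f > 0) {
        @mhz = lhist($f / 1000, 0, 5000, 200);
    }
}
```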
So you need to look up the value and you need to update the value; the lhist() function basically only serves to provide the key for the value. As you can see in the pseudocode, it's what you would expect: if the value is less than the minimum, the bucket is zero; if it's more than the maximum, it's the maximum bucket; and everything in between is put into the bucket it belongs to. Once you have this bucket number, you use it as the key to access the mhz value, so this is the key, and the value of that bucket is basically the count you increase. That's what this pseudocode is doing: you look up the value for the bucket, check whether it exists, increase it, or if not store the number one, and you update.

How it looks in the code: I skip the map lookup, you've already been through that. The lhist() function is basically just computing the key, so it's all just arithmetic instructions implementing the pseudocode from the previous slide. It does the math and ends up with a number b, which is used as the key for the lookup to get the value; you check whether it's zero, if not you increase it, and you update the value back into the map. So this is how histograms work when you use them. The hist() function, not lhist() but hist(), uses buckets that are power-of-two sized, so the function doesn't take the minimum, maximum, and step; those are already defined by the function itself, but it's the same mechanism. Whenever you use hist(), it ends up as just a few arithmetic instructions.
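As a sketch, the bucket arithmetic that lhist(value, 0, 5000, 200) performs can be written out with ordinary bpftrace expressions like this (hypothetical names; the real generated code is a few inlined instructions and may differ in detail):

```bpftrace
profile:hz:100
{
    $v = 900;                    // the value being bucketized (illustration)
    $b = 0;                      // below the minimum: bucket 0
    if ($v >= 5000) {
        $b = 5000 / 200 + 1;     // above the maximum: the overflow bucket
    }
    if ($v >= 0 && $v < 5000) {
        $b = $v / 200 + 1;       // in range: (value - min) / step
    }
    @buckets[$b] = count();      // the bucket number becomes the map key
}
```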
So, no big deal. You can of course combine histograms with a key: you can take the mhz example and say you also want to divide the values in the histogram by a key. Again it's a map access, and again the lhist() function only provides part of the key; what this example does is chain the keys together, so the bucket and the comm are used together as the key to access the map value, increase it, and store it back. OK, that was histograms.

BEGIN and END functions. There's no rocket science behind those: of course, the BEGIN block is executed before anything else, and the END block at the end of the program. What's interesting is how it's actually made: it's done with a uprobe. bpftrace puts some code, the begin and end trigger functions, into its own binary and attaches a uprobe to itself, so when it goes through that code, the BEGIN and END probes get executed. Quite easy, no big science.

Apart from that, you can actually see in a profile the way it's being done now, and it's something you should count on whenever you use BEGIN and END: you define a uprobe in the system, and bpftrace doesn't do this in the smart way, it doesn't remove the uprobe right away. Anytime you have a uprobe defined in the system, there's extra code in the kernel that needs to take care of it: whenever there's a uprobe defined, some mmap accesses, some VMA functions in the system, execute extra code because of that uprobe. So whenever you're using bpftrace with BEGIN and END, this is something you can see in your profile. It's not a big deal, but it probably depends on what your benchmark is.
So this is just something to be aware of. OK, that was the first part of the presentation about the internals; we'll do the questions later.

Now the part about the planned and recent features. The main feature that recently got into bpftrace is the BTF support. What does that mean? BTF means many things, but for bpftrace it means that bpftrace has access to the kernel type information. That's nice because it can use this information to write programs like this one, where you access curtask, which is the pointer to the current task object in the kernel, and you can follow that pointer without using any includes. You don't need the kernel-devel package, or any of the several other ways to get the proper headers, where you always need to take care that you have the headers your kernel is running with; you don't need that anymore. There's a file exported by the kernel in sysfs which contains all the type information about the running kernel, so bpftrace reads that file, and whenever you reference some kernel object, like this current task, or some argument which points to a kernel object, it uses this information and knows where the parent field is, where the pid field is.
So it doesn't need any headers. As of this moment it's actually supported in Fedora 32, and it works out of the box. I actually have a demo for that.

OK, so I get to the Rawhide system and install bpftrace. What I prepared is some file access running in the background, and you can use this program, which puts a k-probe on this kernel function that is used when a file is opened, and you can get the values. So it's printing the command, and if you go and print the parent, you can actually see that it's bash; and if you chain many parents, this way you can get up towards init... not systemd yet, and we should be lucky now... OK, so you actually get to the idle process. So this is a really nice way; you don't need to care about the header files anymore, you can access the data right away.

The next thing we are working on is support for new kinds of probes in bpftrace: raw tracepoints and trampolines. Trampolines are brand new; raw tracepoints are not, they've been there for some time now, but only recently did they get the power of using BTF arguments. What that allows is: when the tracepoint is called with a pointer to some kernel object, you can use that pointer as an argument to a helper, and the helper can treat the object as a kernel object. It goes through the BPF verifier with the BTF information, the verifier knows that this pointer actually points to the kernel object, and the helper can use it directly; it can use any functions the kernel would use to do the job. The raw tracepoints are already in bcc. Another benefit: a normal tracepoint needs perf to actually run it, so there's a perf layer; for the raw tracepoint there's just pure BPF registration.
So it's actually also faster. It's in bcc, and it will be in bpftrace soon. Trampolines are a really nice way to probe functions: it's like a k-probe on the beginning and end of a function, and it works for many functions in the kernel. Again, same as for the raw tracepoints, they give you BTF arguments. Here I have an example: if you have a trampoline probe on vfs_read, which actually takes the file object as an argument, you can use this argument in the file path helper, and the helper doesn't need to check that argument; it knows it's a file pointer, a pointer to the kernel object, and it can use kernel functions to actually get you the path of the file. This is not in bcc yet, not even in bpftrace; this is something we work on.

It's not only the BTF arguments; the trampolines are also really fast. Just to compare: this is the klockstat tool, which gives you an overview of the latencies in your locking. It gets the information by putting probes on the mutex functions, counting the timing between them, and displaying it. If I run my benchmark: this is the base time of the benchmark without the tool in the background; the normal k-probe makes the benchmark ten times slower; if I use the same thing with the trampolines, it's just two times slower. So it's really nice.

New functions. I have three functions that are either almost there or we are working on them. The first, fd path, is a function that gives you the full path from the file descriptor, so anytime in your probe when you have a file descriptor value, you use it in this function and it results in the full path of that file.
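A sketch of the trampoline example mentioned earlier, the vfs_read probe plus the file path helper, in roughly the syntax bpftrace ended up with (this was work in progress at the time, so treat it as an illustration):

```bpftrace
kfunc:vfs_read
{
    // With BTF, the verifier knows args->file really is a
    // struct file pointer, so the helper can walk kernel data
    // to build the full path -- no manual offset digging.
    printf("%s\n", path(args->file));
}
```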
This is quite useful for many people, actually. The state of it: we have the support in bpftrace, but the change for the actual kernel helper still wasn't merged; it's waiting for review, and we need to hold off the pull request until it's accepted in the kernel.

Another function, similar to the previous one and already mentioned, gets the same thing, the full path of the file, but from the file pointer. With the BTF-powered arguments you can pass the argument from the function directly to the helper, and the helper uses VFS kernel functions to get to the root, so it's quite efficient, and it gets you the full path.

Another planned function is skb output. There's already a kernel helper which, when you give it a socket buffer pointer, stores the buffer to the ring buffer; the planned feature in bpftrace is a function where you pass the pointer to the socket buffer plus the name of a file, and all those buffers will be stored in the file as a tcpdump (pcap) file. So maybe the network folks can use it to see what packets are being generated at any tracepoint. And with that, that's it. Do you have any questions?

Q: Is the thread ID part of the data that you can get in a trace function?
A: You mean whether you can get the thread ID? Yeah, I think so. Actually, there's one helper which gets you both the process ID and the thread ID together.

Q: What about thread-local data? Would I be able to get access to thread-local data?
A: So we have the per-CPU maps that you can use...
Q: A thread is not per-CPU.
A: OK, I'm not sure about it. There's something with local storage in the helpers, but I haven't had a chance to look at it. Probably there is, if there's a use case.
Yeah, if the things are there underneath, we can definitely do something about it.

Q: I'm not sure if any of the Jaeger folks, or any of the other tracing folks, have done this, but let's say you figure out a way to instrument your server to add a span ID to thread-local storage; then you could trace everything, because let's say the operation is per thread. That way I could have a map inside the kernel to trace every single write call that happens on behalf of that particular span ID.
A: Yeah, I guess so. It's easy to put it in the thread-local storage...
Q: Sure.
A: ...but you can actually have a map and use something like the thread ID as the key, and you can store whatever you want in the map value.
Q: Sure. So there's another level of indirection there, because I have to go from the thread ID to another map, and from another map to whatever data I'm tracing.
A: Yeah. OK, it's doable.
Q: OK, thank you.

Another question?

Q: Hey, you mentioned you use a uprobe for the BEGIN and END commands...
A: Use what, sorry?
Q: Uprobes. Userland probes.
A: Uprobes, sure.
Q: Yeah, and I was wondering: if you run bpftrace, the same binary, several times for different tracing in parallel, will those uprobes confuse each other, because they are attached to the bpftrace binary?
A: I'm not sure I understand; you mean if there's another uprobe defined, whether this will screw up your measurements? Sorry, I didn't get it.
Q: I was wondering: you attach a uprobe to a specific function in the binary itself, but if there are several instances of bpftrace running in parallel, you will have several uprobes that will trigger each other.
A: So the uprobe is local to the file, right? It's bound to the file. When you insert the uprobe into the file, and you execute that particular file and it hits the uprobe during execution, you get that particular uprobe, so it's tied to the inode of the file.
So that's how it's tied to the file; if you copy the file, it will be different.
Q: That was my point: if you have several processes from the same inode, from the same bpftrace binary, then they will intersect with each other.
A: OK. Yeah, please come to me after the presentation and we can sort it out. It's possible there's a bug in bpftrace; it's quite new and still being developed.

Host: Right. Is there any other question? I guess not. So thank you, Jiri.