Hi everyone, thanks for watching. I'm Dimitri, and today we're going to talk about BPF performance. To make it clear up front: we are not going to talk about how to get performance insight into applications using BPF, but about how to understand the performance of the BPF programs themselves. One note: this research did not happen in a vacuum. All of these results came out of our work on advanced cluster security, which means the use cases we'll look at are somewhat dictated by that type of system.

Now, a bit of background. When we talk about the performance of a single-node application, everything is more or less straightforward: we have a user-space application that implements some business logic and relies on certain kernel features, we understand which metrics we want from the application and where to look in the kernel, we can profile everything, and it is all reasonably transparent. What happens nowadays is that systems quite often include BPF components, and it turns out that performance insight in this case is not nearly as straightforward. It can actually be tricky to understand what is going on inside those BPF parts from a performance point of view.

You may want to object: why bother? Those BPF programs are supposed to be very small and executed extremely fast, lightning fast. And you would be right, except that quite often those BPF programs sit on the hot path of your system. They may be fired, for example, on every syscall, and a loaded server can produce an enormous number of syscalls, which means even the smallest overhead accumulates over time. Ideally, you need to understand where it is coming from.

With that in mind, let's think about how we can get insight into BPF performance, and first of all, how we can measure it. The following slides break down, step by step, what information can be acquired to better understand what BPF programs are doing, starting from the simplest, dark-ages techniques and working up to the most advanced approaches. So let's go.

The first, and probably the easiest and most straightforward one, is simply to collect global statistics about your BPF programs. Fortunately, the kernel collects this information for us; we just have to enable one parameter, kernel.bpf_stats_enabled. Then, when we inspect our BPF programs with bpftool, alongside the usual information we also get the total runtime and the run count, which give us ballpark numbers, such as the average time per invocation, to work with. They are unfortunately not detailed enough for proper analysis, but you can use them as a sanity check, to cross-check other approaches, or just to get a first grasp of what is going on. It's an uptime-style tool for BPF, if you wish.
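Concretely, a session looks roughly like this; the program ID, tag and numbers are purely illustrative:

```sh
# enable collection of run-time statistics (this itself costs a little overhead)
$ sysctl -w kernel.bpf_stats_enabled=1

# bpftool now reports cumulative runtime and run count per program
$ bpftool prog show
# 42: tracepoint  name sys_enter  tag 6deef7357e7b4530
#         run_time_ns 804423 run_cnt 407
#         ...
```

Dividing run_time_ns by run_cnt gives you that average per invocation.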
Easy and straightforward, but we're still in the dark ages. Another extremely flexible and easy approach is bpf_printk: we just sprinkle printk calls around the code with, for example, timestamps or time deltas, whatever suits your implementation, and then read them from the trace pipe. As I said, extremely flexible and straightforward; you don't really have to think about anything, you just measure. But there is one huge drawback: this approach incurs a lot of overhead, simply because of the kernel-to-user-space communication, and you have to be aware of that. It's not necessarily disqualifying; you can still use it when you don't have that many events. But it is most likely not feasible on loaded systems. Still, in certain cases it's fine. Easy, convenient, why not.

Now let's try to be smarter. What do we normally do when we want to understand the performance of one particular function inside a regular user-space application? Quite often we create probes, attach them right at the beginning and right at the end of the function, and then measure the events that happen between those two points: cache misses, say, or simply the time span, how much time we spent in between, and so on. Could we apply a similar approach to BPF programs?

It turns out to be quite a tricky question. Imagine we have a BPF program attached at the entry of a syscall, and that program measures something about the syscall, in the easiest case its latency. How do we instrument it? The very first thing that comes to mind is to attach our own probes to the very same attachment points as the BPF program under consideration. Unfortunately, that is not going to work, because the kernel has no way of ordering BPF programs on the same attachment point: you will most likely not be able to force the order you need, where your introspection probe fires before the program you are trying to introspect. It simply cannot be forced, so this straightforward idea is out.

Instead, we can try to find another attachment point that fires a little earlier than whatever we are trying to profile, and a second probe that fires a little after it finishes. In this particular case we want something that starts slightly before the syscall and exits slightly after it, which theoretically could work. So let's try it out. The slide shows how to do something like this with bpftrace. We are not actually measuring latencies there, we are counting stack traces, but latencies would be just as easy to measure; that is not really the point. What we are doing is attaching a little bit earlier, to the function do_syscall_64, and also creating a kretfunc, a return probe, on the very same function. This way we create a span that includes the BPF program plus the system call itself.
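Reconstructed as a rough sketch (the slide counts stack traces; here the same span is measured as a latency histogram instead; note that on newer kernels do_syscall_64 may not be attachable anymore, as discussed below):

```bpftrace
// span [do_syscall_64 entry .. exit]: the BPF program under test + the syscall
kfunc:do_syscall_64 { @start[tid] = nsecs; }

kretfunc:do_syscall_64 /@start[tid]/ {
    @span_ns = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}
```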
Theoretically, assuming these measurements are stable enough, the span as a random variable will follow some distribution, so at the end of the day we should be able to compare distributions with and without our BPF program and extract its overhead after all.

It works, and it's possible to use, but there are a lot of problems here. The first and biggest one is that we do not measure what we actually want to measure. We do not measure the BPF program itself; we measure the BPF program plus something extra, in this case the program's overhead plus the syscall itself. Especially for syscalls that produce a lot of variance and have long-tailed distributions, comparing the numbers afterwards gets quite painful. Another problem is that it is not that easy to find a proper attachment point sitting just slightly before whatever attachment point your BPF program uses; even in this very example, do_syscall_64 is unfortunately not traceable anymore. That's pretty much it. This approach is still more efficient than the previous one; frankly, almost every approach is more efficient than the trace pipe, but at least here most of the work happens in the kernel. From that point of view it's okay, but the drawbacks are real.

There is one interesting problem here, sort of a side story, but it is interesting enough that I decided to include it. You may have noticed that in the bpftrace example we are not using kprobes, we are using kfuncs. What is the difference, and why did we decide to do that here? Normally, creating kprobes and kretprobes is the conventional way of measuring something in the kernel. The problem is that every time we want to use return probes, which we have to do here, the kernel has to allocate a handler structure for each probe in flight. To bound that memory allocation there is a hard limit on how many kretprobe instances can be processed simultaneously, called maxactive. That sounds fine, but it can have a pretty nasty side effect: you can lose a lot of kretprobe events. If you run this on a loaded server, you will see that with kretprobes you lose a lot of return events: you get thousands and thousands of kprobe events, but only, say, hundreds of returns. That doesn't make sense, you think, until you realize this limitation exists. Pretty annoying. Interestingly enough, there is even a patch floating around that is supposed to solve this, and it's a pretty straightforward patch, but there are a couple of open questions about its approach, so not everybody is eager to apply it.
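For completeness: if you are stuck with kretprobes, the tracefs interface at least lets you raise the limit per probe and shows you how many return events were dropped; vfs_read here is just a stand-in symbol:

```sh
# tracefs kretprobe syntax: r<MAXACTIVE> raises the per-probe in-flight limit
$ echo 'r100:my_ret vfs_read' >> /sys/kernel/debug/tracing/kprobe_events

# the columns after the event name are hit and miss counts; misses are lost returns
$ cat /sys/kernel/debug/tracing/kprobe_profile
```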
It may even happen that maxactive as a limitation goes away at some point, but it hasn't yet, so we still have to struggle with it. The nice way out in this particular example is kfuncs. Essentially, you can think of them as the same kind of probes, but implemented with BPF trampolines rather than the kretprobe machinery, so we avoid the maxactive limitation entirely, which is pretty nice.

So this approach is pretty nice and a little more efficient, but its main drawback remains: we are measuring too much. How could we avoid that? There is an interesting way to make things even smarter: we can create extra instrumentation probes, fentry and fexit tracing programs, for the BPF program under consideration itself. That is, we create two instrumentation probes on top of the program we are trying to measure, attaching the fentry one at the entry of the BPF program under consideration and the fexit one at its exit. It is similar in spirit to the kprobe pair, but the attachment point is exactly what we need: we measure exactly the BPF program we are looking at, without any extra whatsoever. Essentially, this is the best approach in its class. There are, of course, some considerations: to be able to do this, you have to compile your original BPF program with BTF, otherwise it is not going to work, which brings certain complications on older platforms. But it is probably the best way to do this type of thing.

It is also very interesting to look at how this works under the hood. Normally, when BPF programs are created, a prologue is generated before the actual body of the function. If you look at that prologue, you will see something like the left-hand side of the slide: a nop instruction at the very beginning, a sort of placeholder that is used to implement exactly this feature. When you create the extra instrumentation probes and attach one, say the fentry, this placeholder is replaced by a call into the extra instrumentation BPF program, and you get a picture like the right-hand side of the slide: the very same prologue, slightly different because the nop has been patched. This approach is very interesting and very lightweight, and as I mentioned, it is best in class. So, not surprisingly, many other tools build on it: when you run bpftool prog profile, that is exactly what happens under the hood, two extra instrumentation programs attached to whatever program you are profiling. The same story for perf stat for BPF programs: again two extra instrumentation programs, and so on. Very cool.
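As a rough sketch, such an instrumentation pair could look like this. Note the simplifications: "target_prog" is a placeholder that is resolved at load time to the program under test (with libbpf, via bpf_program__set_attach_target() and the target program's fd), and the single global timestamp slot is only correct on one CPU; real tools such as bpftool prog profile keep per-CPU state:

```c
// instrument.bpf.c: measure exactly the BPF program under test
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

__u64 start;              /* simplification: one slot, single-CPU only */
__u64 total_ns, calls;

SEC("fentry/target_prog")
int BPF_PROG(on_entry)
{
    start = bpf_ktime_get_ns();
    return 0;
}

SEC("fexit/target_prog")
int BPF_PROG(on_exit)
{
    total_ns += bpf_ktime_get_ns() - start;
    calls++;
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```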
But here we still have one particular inconvenience, let's put it that way. This whole approach is close to optimal and works very nicely, but we still have just two probes, two snapshots of the event space we are trying to investigate: we collect whatever events happen in between. When working with a normal user-space application, we sometimes want to actually profile it, to understand what happens in between, not only at certain snapshot points but at pretty much every instruction we can get. Could we do something similar with BPF programs? Fortunately yes, but there are a couple of things you probably need to know.

We can profile BPF programs, and perf is a great tool for it. It is smart enough to correlate the BTF information from your BPF program with whatever it sampled, so it can map samples back to the underlying code and annotate it. On the slide you can see an example. There are not that many samples, but it is still interesting: sampling the cycles event for a BPF program shows a hotspot around the ring buffer reserve helper, which is a good starting point for digging deeper into what kind of stalls are happening there, and so on. That is a very powerful approach.

But there is another problem, besides needing BTF: BPF programs are not isolated programs, they extend the kernel, which means you are not able to profile them in isolation. To obtain this information you essentially have to profile the whole kernel, and then filter out the samples that correspond to the BPF program you are interested in. In practice, especially on loaded servers, this means you will get a ton of samples you are not interested in and have to filter them out somehow. So you have to play around: you probably want to increase the maximum sampling frequency, sample only the kernel part of the stack, and, if you have a benchmarking setup, pin everything to a single core to limit the noise, and so on and so forth. It is a little bit annoying, unfortunately.
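In practice it looks roughly like this; the symbol name is illustrative, JITed BPF programs show up as bpf_prog_<tag>_<name>:

```sh
# BPF programs extend the kernel, so sample system-wide and filter afterwards
$ perf record -a -e cycles -g -- sleep 10

# find the bpf_prog_<tag>_<name> symbols for the program you care about
$ perf report

# with BTF available, annotate down to the individual instructions
$ perf annotate bpf_prog_6deef7357e7b4530_sys_enter
```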
One thing I was thinking would be great, but which is unfortunately not supported, is to use the Intel PT feature to trace exactly your BPF program at the moment a sample is taken. That would be great, but the perf documentation says that dynamic parts of the kernel, including BPF, are not supported. That's pretty much it for this part.

We have now covered how to get more information about your BPF programs: we understand what is going on, we have some visibility. So let's talk a little about how to actually improve things, how to make our BPF programs a little more efficient. First of all, a huge disclaimer. This is just common wisdom: if you are trying to improve performance, the very first thing to think about is your implementation. Get rid of unnecessary work, replace inefficient algorithms, and you will gain much more from that than from anything else. It's conventional wisdom; there is no point arguing about it. And using the profiling information from the previous slides, you can do this even more efficiently, because you can see where the hotspots for various events are. What we are going to talk about on the following slides is the situation where you already have a good enough BPF program and you still want to see how things could be improved further, how much more can be squeezed out of it. So let's take a look at a couple of examples.

First, BPF instruction sets. Somehow this is not common knowledge, which is why I decided to put it on a slide. The thing is that, since BPF evolved in the kernel over a long time, there are several supported BPF instruction sets. When you compile your program, it is compiled against one particular set of instructions, and if you use a newer instruction set, you most likely get a more diverse set of instructions to work with, which is of course nice. There are many reasons it happened this way, including compatibility concerns, especially with classic BPF; as a result, some genuinely crucial instructions are missing from the default, version 1 instruction set.

Look at the example on the slide. Imagine absolutely trivial code: one if condition that compares whether one value is less than another. Extremely simple, nothing particularly interesting, but if we compile this BPF program, we see a very interesting situation. The version 1 instruction set does not have a "jump if less than" instruction (BPF_JLT); it only has "jump if greater than". That is annoying, because the compiler then has to create a workaround: it cannot use the instruction that naturally belongs here, so it has to reverse the condition. I have not put everything on the slide, but on the left-hand side you can see the JITed version of this code. It is noticeably more verbose, the code gets a bit bulky, and you can see it uses a "jump if above or equal" instruction: the semantics are inverted, so we jump when one value is greater than or equal to the other, branching around the body based on the reversed comparison. As soon as we start using the version 2 instruction set, things get much better, because now we do have that "jump if less than" instruction.
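In source form the example is just this; THRESHOLD is an arbitrary constant, the interesting part is in the comments:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define THRESHOLD 1000000ULL    /* arbitrary */

SEC("tracepoint/syscalls/sys_enter_write")
int probe(void *ctx)
{
    __u64 now = bpf_ktime_get_ns();

    /* v1 ISA: no BPF_JLT, so the compiler inverts this into a
     * greater-or-equal jump around the branch (x86 JIT: jae).
     * v2 ISA: BPF_JLT exists and is emitted directly (x86 JIT: jb). */
    if (now < THRESHOLD)
        return 1;
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```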
So we can literally use what was meant to be used here, and in the JITed implementation on the right-hand side you can see that we switched from "jump above or equal" to "jump below", which corresponds exactly to what we actually want to do. Pretty nice and convenient.

Why are we talking about this at all? The point is that, as with many other BPF topics, the size of your code matters. Besides the obvious fact that the workaround executes two more instructions, your code gets bigger, your program gets bigger, your instruction cache probably gets trashed more, and other caches along the way are used less efficiently. You want to keep your BPF programs as small as possible, and by utilizing newer instruction sets you can shave off a bit of the overhead that is otherwise there. I won't go into more detail; there is more interesting information, including some benchmarks across the various instruction sets, in the blog post I have put on the slide. Very interesting.

Just a couple of notes on how to actually use this. For LLVM it is controlled by the parameter -mcpu, where you specify one of the available options. The default is "generic", which is another name for version 1; it is used by default simply because it is supported everywhere, so it's the safe default choice. Otherwise, you can pin one particular instruction set version, saying for example v1, v2 or v3, or you can let the compiler figure out the highest version supported by the running kernel by providing the value "probe". Another interesting parameter is -mattr, where you can specify alu32 to gain a few more available instructions. One note: I haven't really investigated it deeply, but for whatever reason, on certain setups probing was not working for me: although the v2 instruction set was available, it was still falling back to v1. As I said, I haven't investigated it yet and I'm not sure of the reason, but if you see a similar problem, a small tip: just pin v2 or v3 directly. And if you are not driving llc directly but compiling with clang, you can use the -mllvm parameter to pass the very same options through. So that's it for instruction sets.
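Assuming the usual clang/llc toolchain, the variants look roughly like this:

```sh
# default is -mcpu=generic, an alias for v1: the safe lowest common denominator
clang -O2 -target bpf -c prog.c -o prog.o

# pin a specific instruction set version
clang -O2 -target bpf -mcpu=v2 -c prog.c -o prog.o

# or let the compiler probe the host kernel for the highest supported version
clang -O2 -target bpf -mcpu=probe -c prog.c -o prog.o

# llc equivalent; -mattr=+alu32 additionally enables the 32-bit ALU instructions
llc prog.ll -march=bpf -mcpu=v3 -mattr=+alu32 -filetype=obj -o prog.o
```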
Now we come to another very important, very exciting topic: BPF-to-BPF calls. What is this about? In the past, when writing BPF programs, everything had to live within the very same function, one flat bunch of instructions. That meant that as soon as we wanted some common function shared between various BPF programs, we always had to mark it as always-inline; until somewhere around kernel 4.16 and LLVM 6, I believe, that was simply required. There was just no other option: all those functions had to be incorporated, inlined, into the main body of the function, because otherwise it just would not work. Interestingly enough, this is relatively old history, but I still see implementations like that every now and then.

What changed since then is that we don't have to do this anymore. The kernel now supports BPF-to-BPF calls, where one subprogram inside your BPF blob, let's say, can actually call another. In the fully inlined world, if you inspect your BPF program and ask for a dump, you see just one single flat set of instructions. If you would rather your common functionality not be inlined, and you can even force that with a noinline attribute if you need to, then as soon as you dump your program you will see two separate sets of instructions with a call in between.

Why is this important? It is the same story as before about BPF program size: if you inline common functionality into many programs, you end up duplicating a lot of code, sometimes quite badly. And it is not necessarily bad either way; the situation is similar to normal user-space applications, where there is always a question of whether to inline something or not. On one side, inlining avoids the jump back and forth, which can itself introduce overhead; on the other side, you probably use more instruction cache and stop fitting into other caches, because your program is bigger. There are always pros and cons, and which one to choose you have to decide for your own particular situation. But at the very least, with modern kernels and compilers the option is available now, which is great.
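A minimal sketch of the non-inlined variant; names and attach points are arbitrary:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* __noinline keeps this as a separate subprogram: the dump will show a
 * real call instruction instead of the body copied into every caller. */
static __noinline int record_event(int kind)
{
    bpf_printk("event kind=%d", kind);
    return 0;
}

SEC("tracepoint/syscalls/sys_enter_openat")
int on_openat(void *ctx)
{
    return record_event(1);
}

SEC("tracepoint/syscalls/sys_enter_close")
int on_close(void *ctx)
{
    return record_event(2);
}

char LICENSE[] SEC("license") = "GPL";
```

Dumping the loaded program with bpftool prog dump xlated then shows the callers and record_event as separate instruction listings with call instructions in between, instead of one flat stream.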
Another very interesting thing: in earlier kernel versions it was not possible to combine BPF-to-BPF calls with tail calls. And tail calls are actually rather important; they are used quite often, for example in the Falco libraries and in many other projects, so that was a little inconvenient. Fortunately, starting from, I believe, 5.9, we can combine BPF-to-BPF calls with tail calls, with certain limitations. The thing is that if you combine them in a sandwich order, subprogram, then tail call, then subprogram again, then tail call, and so on, you could exhaust your stack, which is pretty nasty. That is why there is a pretty hard limitation in the verifier to prevent it. It is just something to keep in mind, so that if you run into a similar situation, you understand where it is coming from. Otherwise, this combination is very powerful, and it is one more instrument you can use to structure the calls in your BPF programs. That's it for this part.

The following section is similar to what we had before, in the sense that we are still talking about how to improve our BPF programs, but the next slides concentrate on something more recent. It depends on your definition of recent, of course, but some of these features landed in the kernel quite recently, and I am pretty excited to try them out.

The first one is batch operations. Sounds interesting, so let me explain. Pretty much everybody knows there are BPF maps of various types in the BPF ecosystem: you create a map, store some information in it, and then you can also read it back, for example from user space. The problem was that with a really huge map this becomes painful: reading or updating elements one by one introduces a lot of overhead, one syscall per element. Newer kernels support something very useful here: batching of those operations, so you can shave off a lot of that overhead when, say, reading from those maps. The implementation differs a bit between map types, for example for hash maps, but the interface is the same: create a map, store some data, then read it back in batches. To give you ballpark numbers, the benchmarks attached to the relevant commit mention, I believe, that for maps around a million records, or that order of magnitude, you get clearly visible improvements even with the smallest batches, even batching by just 10 elements, which is quite an exciting improvement. If you have a use case like that, you definitely want to use something like this to improve your overall performance.
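From user space the batch API looks roughly like this; a sketch for a hash map with u32 keys and u64 values, error handling trimmed down (bpf_map_lookup_batch is the libbpf wrapper; there are matching update and delete variants):

```c
#include <bpf/bpf.h>

/* Drain a map in chunks of 128 entries per syscall instead of issuing
 * one lookup syscall per element. */
void dump_map(int map_fd)
{
    __u32 keys[128];
    __u64 values[128];
    __u32 out_batch, count;
    void *in = NULL;            /* NULL: start from the beginning */
    int err;

    do {
        count = 128;            /* in: batch size, out: entries returned */
        err = bpf_map_lookup_batch(map_fd, in, &out_batch,
                                   keys, values, &count, NULL);
        /* ... process `count` key/value pairs here ... */
        in = &out_batch;        /* resume where the kernel stopped */
    } while (!err);             /* ENOENT means the whole map was read */
}
```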
Another thing I am pretty curious about is the pack allocator. Originally, every BPF program, no matter how small, had to reside in its own page of memory. That is fine and okay, but as soon as you get a loaded server with a lot, a lot of BPF programs, it can happen that you are putting real pressure on the TLB in your system: if you look at the instruction TLB metrics, the miss numbers get quite high, which means performance takes a hit. What happens now is that modern kernels try to pack those programs together, especially when there is a possibility to use huge pages, so that many programs share the very same page, minimizing the impact of TLB misses and in general relieving that TLB pressure a little. It is very interesting functionality, and the nice thing is that you don't have to do anything explicit to use it: it is transparent and applies by default as you load your programs. The only thing it should make you think about is whether, for example, huge pages are enabled on your instance; that is what matters in this context.

Another very interesting, I would even say inspiring, feature, in the sense that it brightens up the possibilities, is the Bloom filter map. If you don't know it: a Bloom filter is a probabilistic data structure that allows extremely fast checks of whether some particular element is present in a set. It is probabilistic in the sense that the answer "this element is not present" is essentially deterministic, but when you ask whether an element is present, you can sometimes get a false positive, so you have to verify a positive answer against the real data. If your use case is figuring out that something is not present, it is the perfect data structure. Sure enough, it was implemented in the kernel, and you can nowadays create maps that are Bloom filters under the hood (the BPF_MAP_TYPE_BLOOM_FILTER map type, pushed to and peeked at with the usual map helpers), which is pretty great. If you have a situation where this is useful, just go ahead and try it out. A great addition.

And probably the last one, but nevertheless very interesting: task local storage. There is actually a bit more than that; there are several local storages: cgroup storage, socket storage, and task local storage. Originally it was implemented only for LSM programs, but it has since been extended to tracing programs as well. So what is it about? First of all, do not mix it up with the ordinary thread-local variables we are used to; this is something different, a BPF entity, and it was designed to address one particular problem, or rather to improve one particular use case. Quite often, when doing some tracing, we are trying to collect per-process or per-task information. What usually happens is that we create a hash map somewhere in memory and store the information there, keyed by the process ID we are interested in. Every time we need to do something, we pick up the process or task ID, do a lookup in this hash map, extract and update the information, and so on. That works, but the problem is exactly this lookup operation into some map sitting somewhere in memory. Task local storage allows you to do this much more performantly: it is essentially an area of memory that is local to the task we are working with, so access to it is much, much faster because of memory locality. It is, of course, smaller than an arbitrary hash map you could create, but the access is much faster. So if you have a good use case for it, storing per-task information such as extra attributes, that is the perfect place to put it, and it can greatly improve performance; I have seen great numbers, for example, when runqslower was converted to use task local storage.
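A sketch of the pattern in a tracing program; the value struct and the vfs_read attach point are arbitrary choices here:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct stats {
    __u64 events;
};

struct {
    __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
    __uint(map_flags, BPF_F_NO_PREALLOC);   /* required for task storage */
    __type(key, int);
    __type(value, struct stats);
} task_stats SEC(".maps");

SEC("fentry/vfs_read")
int BPF_PROG(on_read)
{
    struct stats *s;

    /* no hash-map lookup keyed by pid: the data lives with the task itself */
    s = bpf_task_storage_get(&task_stats, bpf_get_current_task_btf(),
                             NULL, BPF_LOCAL_STORAGE_GET_F_CREATE);
    if (s)
        s->events++;
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```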
The one minor thing you need to take into account is that task local storage has to be allocated the first time you touch it, so if you compare it with a preallocated hash map, the first hit will be a little slower; after that it is blazing fast.

And that's pretty much it, folks. I hope at this point you have a better understanding of how all this stuff works under the hood, and of what capabilities you actually have. When working with BPF, we have somehow gotten used to being pretty limited in what we can do, and it is just nice to see that with newer development we actually get more features, more possibilities. I hope that in the future you will be a little more confident when reasoning about the performance of your BPF programs. That's it, and I would be glad to answer any of your questions.