Hey, I have three topics for today. Maybe not three: I have a slot tomorrow for the hash-functions topic. One thing is BPF static keys, or static branches. There are static keys in the kernel, and it would be cool to use them in BPF. Then there is an update on the wildcard map: what it is about, some use cases, and some benchmarks. Tomorrow, probably, because we don't have time today, I will follow up on how to use different hash functions, and whether it makes sense to use different hash functions for different BPF maps.

So, this is the set of BPF programs in an empty Cilium cluster. We want some functionality to trace what happens with packets as they follow this path. For the kernel we have the pwru program. It's a simple one: it just attaches to every function in the kernel which takes an skb as an argument, and then we can see all events which happen to a packet as it goes through the kernel. But with this tool we don't see what's happening inside the Cilium path, because BPF can mangle packets, drop them, and redirect them. So, in addition to tracing all the functions which touch skbs in the kernel, we want to do the same in BPF. In the best case we run the same tool and it attaches to both places.

There are existing ways to implement this already. The first one is to attach fentry and fexit programs, but the problem here is that Cilium utilizes tail calls (all the arrows here are tail calls in BPF), and for a tail call we can't attach an fentry, because the tail call jumps over the prologue, and the fentry attach point sits at the very start of the program, before the prologue. So for tail calls we would instead need to place some debug call right after the prologue. We could avoid all this by eliminating tail calls from Cilium, but as you've seen, there are a lot of them, and the state machine is not so simple.
So probably we won't be able to do this in the foreseeable future, but we will try. Another solution is just ifdef'd code: reload the programs if we want to debug something. This of course works, but we really don't want to do this in prod, because it's error prone: the program size can change and then programs fail to load, and stuff like this may happen. And in any case it takes seconds or tens of seconds to recompile everything and reload the datapath. So, not a good solution. Another solution is just to use a dummy noinline function and freplace it.

Ah, a question. I thought the tail call... well, you said the tail call skips the fentry, but usually the tail call is at the end of the function: it jumps to another function, and then it should be jumping at the fentry of the other function.

So, in BPF we specifically jump over the function prologue when we do a tail call, and the fentry is the first instruction; it's the instruction before the prologue, actually.

Wait, so the BPF trampoline fentry... Yeah, it's the first instruction of the program, and it is right before the prologue. Normally it's a nop, then the prologue goes.

Oh, so you mean the tail call from BPF itself? Yes, yes. Okay, because in the kernel a tail call would go through it. Yeah, okay, I was a little confused by that.

Yeah, so we can just attach a dummy function which does nothing, and when we do not enable debug we just execute it. It takes about two nanoseconds for an empty function call. That's not that bad, but we have many tail calls per packet, and 99% of the time we don't need this function at all.
So it's just an extra 10 nanoseconds or so per packet in the datapath, and this leads to a visible drop in packet rate. Another solution, which is not actually a solution but a hack I tried when I was measuring things, is to jump back to the fentry, so the tail call does not bypass it; we go through the prologue and then restore the state. But basically it's the same cost as calling an empty function, just the other way around: when we call an empty function we execute a prologue and an epilogue, and here we execute the prologue and then, in reverse, an epilogue.

So another solution, which is what we are looking into, is reusing the static key infrastructure for BPF. This lets us run debug with zero overhead, because we're just executing one nop instruction, which isn't even scheduled.

The interface for this would be something like this: we have a program, and then we do bpf_static_branch on some key. This key is, for example, just an array map. We need to distinguish it somehow from a normal array map; here I just added an artificial flag which doesn't exist, and we can do this somehow else, but it's just a piece of memory. bpf_static_branch utilizes the asm goto functionality, the same as static keys inside the kernel, so this compiles into a goto to the debug label; here the debug code is a bpf_printk printing hello. In the program it looks like this: we have a goto to this label, and we record its offset in the program to use in the static key. When we load this program, the verifier doesn't know whether we jump or not, so it will process the debug code here, so there will be no dead code. But when we load the program, we verify it and then replace this jump with a nop, so the code runs as if there is no debug code at all. Later, if we update the map with a non-zero value, we go to the program's table of static keys and patch the related jumps back to the label, and when we zero the map we do the opposite, so it again becomes
zero-expense debug. And then I had to catch my flight, so I didn't go further. So the question here is how to pass the jump offsets when we load the program, and probably put them somewhere like the auxiliary BTF Daniel proposed, or anything else; but basically, this is the interface.

Is there somehow an option... I mean, this jump table: we do have something like it for tail calls, when we do the direct jump instead of an indirect one. That's in the kernel, but it is populated during the verification process: if the verifier sees that all the paths that lead to this tail call have a constant map key for the lookup, then this is guaranteed and we do a direct jump instead of an indirect one. Is there a way you could infer this jump offset from the verification itself? Like, if it were a special conditional jump, then you would be able to...

Yeah, I mean, you can use some new instruction for this, right? But it's basically the same as providing some offset to the verifier: look, here is a jump which you need to patch.

So you're saying you would basically put this... Yeah: when you load the program, you just put this offset into a table, and then the verifier goes there, looks where the jump is performed, replaces it with a nop, and stores both.

So it might be easier if you do it as, sorry, a global variable, basically. Like in the talk earlier about the red-black tree: that red-black tree had a global mutex that protects it, right? You could do something similar, where you say the static key is a global variable. Yeah.
Yeah, but this is what it is right here: this static key is a global variable.

So, the way you showed it, the static key is kind of a map, and that works differently than a global variable that you define. Yeah, but when you change the value of this global variable, you need to execute some function, right? So you're proposing to add a set of kfuncs to work with the global variable, or what?

Okay, sorry. I mean, it's not a big difference, how you control this. Sorry. I was thinking about how you would pass the jump table to the kernel, and I think the easiest answer would be to treat it like global data and then use the existing infrastructure. Then in the kernel, basically, the way it ends up working is...

So you mean creating a map, populating it with offsets, and then referencing it?

No, no. In BPF, you declare a variable above your function that says static key such-and-such, and you reference that in your C code. Behind the scenes that ends up being a special map, and your loader, libbpf or whatever, is going to take references to that and adjust the instructions.

Or just declare a global variable with some tag, and then based on the tag it would create the underlying implementation, right? Exactly. So yeah, this works.

Yeah, that would be much easier, because you don't have to come up with a new method. There's a little bit of detail to work out. And then the other thing, what I don't know and can't answer, is how you would then toggle the thing on and off from user space. You'd have to come up with a way of doing that, I guess.

Maybe to go one step further: I did something similar for global data, I don't know, maybe half a year back.
I basically rewrote those references into static jumps in the BPF before loading. So do you actually need kernel support for that? Because you can flip this on and off in user space, atomically replace the program, reload it, and you kind of get the static jumps without any kernel involvement, right? You can replace this jump...

No, because the verifier will think this is dead code. Yeah, I guess you'd have to go through verification again, right? Yeah. But yeah, I can do this with recompiling.

If you don't want to do it in user space, having a global variable seems a bit better than having a special map. You can have it with an underscore-underscore prefix, and under the hood it will be the special map, right? I mean, yeah, from the user's point of view I would just declare a variable.

I don't know, like I said, I'm not very familiar with the BPF program layout, but does it have sections at all? Does it have sections? Then why can't you just do what the kernel does: make a section, and make the verifier aware of it?

This works. That would probably work like this: when it's loaded, the verifier... I mean, the section will just say, you have a jump-label section. Mm-hmm. Yeah, this is basically what it is.

So yeah, and then I guess your question is... yeah, you could have that, and it has the section there, so the kernel will then know that's a jump label, right? And then, do we have sections in BTF? No, we just have a set of types, right? But... ah, yes, DATASEC in BTF. Oh, sorry, I apologize.
They aren't... yeah, but you could also get the BTF to carry it somehow or whatever. And then you just have to have some hook to tell the kernel to enable these, because I think you want the verifier and everyone to know about this; I don't think you want to just depend on users. Yeah, of course the verifier will know about this, because after verification it will do this patching. So the question is just how to tell something to turn it on, right? Yeah. The reason I used a map here is because it's really simple to control: bpf_map_update_elem, and it toggles the key.

So, I missed the first minute or two: what was the motivation to do this as a static key? What is the overhead for not executing it, how much is the overhead of literally one if, especially if you can lay it out so that branch prediction usually takes the optimal path?

So one if will involve reading the map, right? So it's not just an if, it's another...

No, with global data you're literally just reading memory directly. When you do a global variable, you don't do a map lookup; you just know where to look.

And you do it for every packet, and not just once per packet, because we have multiple tail calls. So I think that's several nanoseconds per packet. By the way, think of the branch predictor as a cache: you're using cache capacity that you don't have to be using. You're right, the branch predictor will do it and it'll be great, but it will only save so much. In fact, I just found out by analyzing the Chromebook config that they had jump labels disabled, and when I enabled them, several tests sped up by like 10%. So that's a huge hit.

Could you go back to the asm inline part?
So, if I got you right, you say the verifier will only verify the path where this is true, right? Or both?

No, I think it's both: the lookup and the bpf_static_branch here, it returns true or false, so the verifier will do both. Okay, yeah.

So again, what I still didn't quite get with this jump table: why can't this be inferred from the instruction? As I mentioned, when you have this if condition, the outcome is not known at verification time, so the verifier looks at both; it has to look at both paths to make sure the program is safe. But you also know where to jump, right? Yeah, but how do you distinguish it from a normal jump? Only if it's a special jump instruction. Yeah, but it's probably simpler just to add something to BTF than to create a new instruction. I mean, okay. And then it can be changed later as well, because creating an instruction creates new UAPI. Well, this is in any case some new UAPI.

Wait, can you go back a couple of slides? I've lost track of why you're doing this, to be honest. I've lost track of the point.

Okay: we want to debug Cilium, and you want to get a call trace of the tail calls. Yeah. When we look at how a packet travels through the kernel, we can trace it inside the kernel, but we can't trace it inside the BPF the same way.

So I wouldn't worry about performance. Why? Because you're debugging the stack. If you have, for example, XDP programs running on a hundred-gigabit card, are you going to turn this on and throw every packet through your tracer?

Right, so that's basically what we want to avoid: the "if I'm in debug mode" check doing nothing.
That's what you want to avoid, and I think what I'm saying is: you have production systems with hundred-gig NICs with SLAs, and as soon as you turn this on, you're perhaps going to break all of those. I don't think that even in the debug case you should throw every packet into this mode.

And then the other question I didn't quite understand, maybe just really quick: are you suggesting we add code to Cilium so that this debugger can work with Cilium, or are you trying to interpose on the tail calls... Exactly, it's like adding code to Cilium.

No, no, it actually will help Cilium developers to work with Cilium. I've been in this position many times, so I understand the value of that. But I would say: can we do this without modifying the Cilium code? Because if you need to modify the Cilium code, why don't you just write the debugger into Cilium directly? Why are you even using pwru at all? Just put a filter inside there and turn it on when you need it.

Yeah, but that's exactly it: putting a filter inside Cilium, we want to do with zero overhead. We can add it with non-zero overhead, or at the compilation stage, but we want to add it to every production Cilium instance and not spend any time when it's not enabled.

And the other thing is, you've probably run pwru, right?
So it will basically attach to all the skb functions, and you can add a filter at all the layers, and you see where the packet is going through the stack. And then you can also have visibility into the BPF part itself, the part that is in Cilium, because right now you would only see it from the kernel layer: you would drop the packet and that's it, but you don't see exactly how far it went, into which tail call and so on. So you don't have this visibility all at once, right?

The overhead of turning pwru on is... you're worried about this micro detail when... No, no, no: multiply it per packet, and by time, because we want to enable it only for a short period of time.

You're going to turn this on and hook every function in the kernel that has an skb struct argument, right? The overhead of that is way worse than this. Attaching kprobes to alloc_skb, build_skb, everything: the overhead of that is much greater than worrying about this tail-call thing, is all I'm saying.

But yeah, I think Anton is saying: how do we do this so that the common case, where you don't turn it on, doesn't get slow? Yes, yes, right.

Okay, I want to ask a follow-up question: if you could have fentry tracing for these things, that was...
It's cheap in that case. So if we had fentry tracing for tail calls, it would be more or less the same, but...

What is "more or less the same", sorry?

Enabling fentry for tail calls, I believe, would require a little bit more overhead than just calling an empty function, or about the same. Okay.

But calling the empty function is something you do from the Cilium BPF code base? Yeah: at the beginning of a tail call you just call an empty function. At the end of a tail call you can attach a fexit; at the beginning I can't.

And a follow-up question: the reason we can't do fentry tracing right now is because of the way tail calls work, they jump into the function kind of too late? Basically, yeah: they jump over the prologue, so we don't push the stack frame.

Okay, once more then. Maybe this is not a good idea, but what if we had a static key that said: actually, jump to the fentry before the prologue? You could toggle that on or off, globally I guess, and you could say: enable fentry tracing of BPF tail calls, and then in the full case...

Okay, Alexei says what I'm describing already exists for certain specific cases, for certain kinds of tail calls. So why don't we do it for the other kinds of tail calls? More overhead, okay. And could we dynamically toggle that, or is that impossible?

This thing is not only for tail calls, right?
So we can enable specific pieces of code in programs as well, in this form. Not only that: in the middle of a program, it's a static branch, so you can enable and disable pieces of code inside the program, as many as we want. An example is when we use some maps to filter things, but they're not always looked up, because they can be empty: if you populate the map, then you jump to this code; if you don't, then you just jump over it.

I think you need to consider what happens as soon as, say, we try to optimize Cilium and remove tail calls: you're just going to lose it again, right? So I think being concerned about tail calls is maybe not the point.

I think, to be honest, to not bog down in arguing whether the use case makes sense: even aside from that use case, static keys exist in the kernel and they're useful, right? So somehow we should add them to BPF programs. How we add them, that we should probably argue about. Whether this particular use case makes sense, I'm not a Cilium developer...

Yeah, as a Cilium developer, I was just going to say: I think static keys make sense in general. I don't know about the pwru use case, I don't know about tracing everything, but yeah, if you see a reason for static keys in general, then definitely we should discuss that.
Yeah, so I think we're on... okay, let's find another, better use case and implement this. John says: well, what Alexei says.

Okay, I'm out of time, but I will go on with a small presentation, an update on the wildcard map. The wildcard map is a thing which lets us filter different kinds of structures of rules. You can imagine that we want to create a map to filter 4-tuples or 5-tuples, or a map to filter some set of identities, or to filter port ranges, or multiple sets of ranges. There is a way to represent this in a generic way, and even to implement it to a more or less good state.

The interface here is that we create a map, we define some particular structure for this map, and then we just do bpf_map_lookup_elem and bpf_map_update_elem. Here is an example: we have a map which checks one identity, which refers to a pod or a process or something like this, and then a port range. The key here has some meta information, namely whether it is a key or a rule and its priority among lookups, and then the identity and the range minimum and range maximum. This is used when we create rules, insert them into the map, delete rules, and look up rules. If we want to match input against this map, we create a key of a different type and just specify the identity and the port. The identity is matched against the identity in the rule, and the port is matched against the port minimum and port maximum.

To make this map really generic, there are several types of rules which can be inserted. One is a prefix; think of an IP prefix, a CIDR address and prefix length. Then there is a range; we can create either wildcard ranges or small ranges. And then there is a match, which can be of two types: either we match exactly, or zero inside the rule means a wildcard match. We can combine several rules of these types to create a combined rule for the map, and then we define the map. So, as an example, take this identity
and port range: we create a map description which says it has two rules, one is a prefix (no, it should be a match, anyway) and another is a port range, and then we just create a BPF map referencing this policy key.

Initially, when I first posted this, I added a lot of fields to describe the map structure, but it turns out that I can combine everything inside just the map key description. So this magic macro creates this big structure, and the union is there to distinguish between rules and keys: one part has all the fields for rules, and another for keys. And then there is this hack, which is actually the biggest question of this presentation: this struct policy_description is an object which describes the map. It's an array of size zero, so it doesn't affect the key size at all, but it is placed inside the key's BTF, and when the map is created, the description is verified and the appropriate map structure is created.

So, next I will show some use cases and benchmarks, but the main question here is: does this look okay, to create a map based on the key's BTF structure?

If Kees Cook was in the room, he would say no. He's been doing an enormous amount of work to get rid of this GCC extension, and he is determined to get rid of it everywhere possible.

It's even worse than that: this is GCC-only; clang never supported this. What do you mean? I don't think you can compile this with clang. No, I definitely compiled this with clang. C as a language, I think, allows zero-sized...

So, the cleanup he's doing is for when we use a struct as a kind of extensible struct, where we declare a zero-sized array at the very end, as the last field, and he's converting those to flexible array members. This is not that case: it's a zero-sized array at the beginning, so it's not a flex-array struct at all.
I think this should be fine with Kees, at least. Don't you also use this in the kernel, like in the sk_buff even, where we have markers? Yeah.

I have a bit more of a high-level question. The wildcard map: does it enable some kind of functionality that you otherwise wouldn't have, or does it make something that is hard to implement easier to use, or maybe something else?

Yeah, for us the primary use case is to support port ranges in Kubernetes. There is no way to combine a port-range lookup with a hash lookup in one operation, or a simple operation. Okay. Yeah, maybe we can chat after; I've done that with a different algorithm. Maybe not in a single operation, but it's pretty fast, too.

So you know a better algorithm to do this? Yeah, I gave a really short talk at the eBPF Summit where I laid out the algorithm. It's called a linear bit-vector search; it's a linear algorithm, but it's very fast because it's cache efficient.

Okay, a related question: is there some efficient implementation of this data structure? Yeah, let's look. So, one of the use cases is Cilium network policy. Currently we do up to six hash lookups: we take a rule and break it into different patterns, with the wildcard parts of the key zeroed, and do lookup after lookup until we find a match, and we always have a full wildcard match, all fields zero, at the end. In particular we have the port here, and Kubernetes now, by standard, needs to support port ranges, so that doesn't fit this scheme. With the wildcard map we do exactly the same algorithm under the hood, but it supports port ranges. Performance-wise it looks like this: the wildcard map performs a little bit poorer than the hash implementation. It supports port ranges, but it behaves a little bit worse.
I didn't... maybe it can be optimized by unrolling some things as well, but this is the algorithm which is used. There is a different algorithm which I didn't try yet; it's trie-based. It just takes time to switch things over and to benchmark all of this.

Another case, which I don't have a real benchmark for, is the GeoIP case. In GeoIP we have a huge set of IP address prefixes which map to a country or city or something like this, and here the wildcard map behaves just like an LPM trie. To match, the LPM trie takes an address and finds the longest matching prefix; with the wildcard map we do the same: we put in the address and prefix and set the priority to the length of the prefix, so the UAPI is the same as LPM. And for bigger sets of keys it actually behaves better. For IPv4, with about four million entries, one lookup is about 700 nanoseconds for this implementation, versus roughly 1100 for the original LPM, and for IPv6 it's also about 40 percent faster. For random data it looks like this: the wildcard, hash-based implementation of LPM works faster as the key set grows; for IPv6 it also works faster, but not initially. There is some overhead at first, and it starts to win from about one million IPv6 entries, which fits the GeoIP data case. But in the general case it is not as efficient.

The problem with this wildcard map is that you really need to understand the use case and how to use it, because it looks like it is not possible to have a really generic algorithm that beats all the other specialized implementations. If you misconfigure the input... this is the same set of rules as here, but here the wildcard map can degrade a lot. And there are ways to fix it.
I didn't finish that, because it breaks how the RCU works and how we update things; it's work in progress.

So yeah, the summary is that for this wildcard map, I think the UAPI looks okay, minus this zero-array thing, but I will probably keep working on the actual implementation, because it doesn't perform as well as I want. It outperforms in some specific cases, but it's not generic enough: people can shoot themselves in the foot with it if they misconfigure it.

I just want to confirm: the zero-length array as a marker is used in the kernel, actually in the sock_common structure. What are they called... skc_dontcopy_begin and skc_dontcopy_end, I guess; they're used as markers within the structure to know what not to copy. Great. So I just want to confirm that it does exist.

I guess when I was asking about the implementation, I was wondering if it's some very fancy trie that allows you to... It's the fancy algorithm from the TupleMerge white paper. It's the last time I implement an algorithm from a white paper, because it took too long to actually put all the pieces together, and it's not yet finished either.

Because the other part of the question would be: can you just implement it in pure BPF code?
You know, with loops and all that stuff, because you can do a lot of work now in BPF, right? And then you won't have to define it as part of your UAPI, with all this complex macro, this sort of mini-language for how you define the rules and everything. If it was tuned to your use case, maybe this is the way to go.

Especially if we have some hash-function kfuncs or helpers or something like this. In BPF we definitely plan to add a hash function, and now we have dynptr, so it's very easy to do. So in this case it might be possible to make it the most efficient for this particular case.

Seems like a good first step, right? Try to implement it in BPF, see what works, and then...

Yeah, I mean, this map in particular is a generic interface which works, so you can prototype things in any case and get more or less good performance with it. But yeah.

All right. Thank you very much.