Okay, so I have a few loosely connected things I wanted to talk about, based on annoyances that come up in practice when building BPF applications, usually some sort of tooling and observability. The first one is that the hash map requires presizing, and very often it's impossible to pick a proper size, especially when you don't know or don't control the workload. This problem came up recently when I was working on USDT support in libbpf, which requires some BPF-side code. On the BPF side we have a map, and that map scales with the number of USDTs the user application attaches to. There is no way for me as a library maintainer to know what the appropriate size would be. So having some solution that bypasses this — not requiring a preallocated size and just doing allocations as needed — would be awesome.

The first slide is a reminder of John Fastabend's talk from last LSF/MM, where he was experimenting with a map based on the kernel's resizable hash table. Maybe we should revive that, benchmark it a bit more, and see if it's worth adding as a separate map. This is the slide from that presentation. I think the yellow line is the resizable one, right? I don't even remember what the other one was, to be honest.

I can be quick: yellow was literally putting the BPF map operations on top of the kernel's rhashtable, and green was going into the rhashtable itself and fixing up a lot of things — improving the core data structure in the kernel.

Yeah, so green looks good.

Green did look better. If you can do that, then I think it's a pretty good alternative to the hash map.

And I think we should do that, right? So I highly encourage it.

I can tell you what I ended up doing instead: we just started allocating giant blocks of memory in a map.
So you create the map with no_prealloc, make max_entries very big — it doesn't take up memory up front — and then every time you need a new block, you just get a new entry from that hash map. That entry causes an allocation, but instead of doing one for every object you actually have, you shove like a hundred elements inside a single hash entry and then do some refcounting on top. But I'm not sure that makes any sense.

Yes, it all works, but it's probably not a very usability-friendly solution for a generic data structure in BPF, right? Anyway, I think that's one good direction, but probably not the only one. The other alternative is a conceptually different implementation of the lookup table — a hash map is just a lookup table, a generic data structure. One well-known alternative is the red-black tree, and guess what, that's being worked on as a map. So we might have a red-black tree as a BPF map soon, which would be great, right? I suspect it might have quite a high memory overhead overall, especially when the values are small, so there will be trade-offs, but it's one alternative. Yeah, John?

A generic question: at some point all of this can be done inside BPF itself and you don't need a map for it, right? We have the new pointer stuff coming and the ability to allocate. Could this start to move into a library?

Yeah, well, right — that's what we're still debating: whether it will look more like a map, or more like kernel helpers where it won't look like a map anymore and pretty much everything is explicit in the program. It's still TBD.
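The block-allocation workaround described above can be sketched in plain userspace C. This is only an illustration, not actual BPF code — the names (`pool_get`, `ELEMS_PER_BLOCK`) are made up for the sketch. Each "map entry" stands for a block holding many elements, so one real allocation serves a hundred insertions:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace sketch of the workaround described above:
 * instead of one allocation per element, each "map entry" is a block
 * holding ELEMS_PER_BLOCK elements, amortizing the allocations. */
#define ELEMS_PER_BLOCK 100

struct elem { long key; long val; };

struct block {
	struct block *next;             /* chain of blocks, like chained map entries */
	int used;                       /* elements handed out from this block */
	struct elem elems[ELEMS_PER_BLOCK];
};

struct pool {
	struct block *head;
	long total_allocs;              /* how many real allocations happened */
};

static struct elem *pool_get(struct pool *p)
{
	if (!p->head || p->head->used == ELEMS_PER_BLOCK) {
		/* One real allocation per ELEMS_PER_BLOCK elements. */
		struct block *b = calloc(1, sizeof(*b));

		if (!b)
			return NULL;
		b->next = p->head;
		p->head = b;
		p->total_allocs++;
	}
	return &p->head->elems[p->head->used++];
}
```

Handing out 250 elements costs only 3 allocations here, which is the whole point of the trick — and also why it's awkward as a general-purpose answer: the refcounting and lifetime management land on the program author.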
I think there will be some usability challenges if every basic computer-science data structure always has to be implemented in pure BPF. And besides the usability issues, there's the memory overhead. For example, for a malloc'ed dynptr we have to store a refcount and some additional metadata, which you can completely avoid if you implement the structure in the kernel. So I'm pretty sure that with bpf_loop and dynptrs we'll be able to implement something like a red-black tree in pure BPF, but I would still argue that we need something lean and mean inside the kernel itself — a proper BPF map — for convenience, performance, and efficiency.

Have you looked at — we have the, what is it, the prefix map, I always forget what it's called. We already have a map for IPs.

LPM?

Yeah, the LPM trie is there, which is almost this, except it kind of wants data sizes that look like IPs.

Well, that's a good segue: there are still alternatives to the red-black tree, namely the general class of trie-based implementations. This is the best picture I found on Wikipedia — I don't think it actually gives you any insight into how it works. But the idea is that you have a multi-child node, you logically split your key into chunks, and the split points define how much branching you do at each level. The classic trie — the truly trivial implementation — is just super memory hungry, so in practice I don't think anyone uses a plain trie; they use some compressed trie implementation. And in this talk I wanted to highlight one specific implementation that I came across during some research.
It's called the qp-trie. According to everything I've read, and according to some recent upstream comparisons between the qp-trie and the ternary search tree (which targets more efficient string storage when keys share a lot of common prefixes), the qp-trie actually stands up pretty well despite having no special optimizations for common prefixes and so on. So it would be really interesting to try it. I have some numbers from a blog post: the author took a 200-megabyte text file of words and built a lookup table from all of them, benchmarking four implementations — a radix tree; a crit-bit trie, which is kind of like a qp-trie but less memory efficient, because every node branches only two ways, while the qp-trie uses clever tricks to store multiple child pointers compactly; the qp-trie; and a red-black tree. According to that benchmark, the qp-trie stands up against the red-black tree very nicely in memory usage and even in performance — it's one benchmark, I know. But so far it looks very promising, the idea is pretty simple, and the implementation is pretty simple, so it shouldn't be too hard to fit it into the BPF ecosystem. A good property of the qp-trie versus the recently discussed ternary search tree is that the qp-trie doesn't attempt to split the key: the key as allocated is still one contiguous blob of memory, which makes it fit nicely with the BPF map element iterator and all that. Once you start splitting the key into distinct pieces of memory, you cannot really iterate — even if you could, how would you present that memory to the program? So the qp-trie seems like a pretty nice stand-in implementation for an ordered map.
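A minimal userspace sketch of the qp-trie's core trick (this is not the actual qp-trie and not kernel code; all names are invented, and the key is fixed at 32 bits for brevity): each node branches on a 4-bit nibble of the key, but instead of sixteen child pointers it keeps a 16-bit bitmap plus a dense child array, and popcount of the bitmap bits below a nibble gives that child's index:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sketch of the bitmap + popcount trick that lets a qp-trie-style
 * node branch 16 ways while storing only the children that exist. */
struct qnode {
	uint16_t bitmap;          /* which nibble values have children */
	struct qnode **kids;      /* dense array, one slot per set bit */
	int has_val;
	long val;                 /* value stored at full-key depth */
};

static int slot(uint16_t bitmap, int nib)
{
	return __builtin_popcount(bitmap & ((1u << nib) - 1));
}

static struct qnode *child(struct qnode *n, int nib, int create)
{
	if (!(n->bitmap & (1u << nib))) {
		if (!create)
			return NULL;
		/* Grow the dense array by one, inserting at the slot
		 * position so popcount indexing keeps working. */
		int nkids = __builtin_popcount(n->bitmap);
		int s = slot(n->bitmap, nib);
		struct qnode **kids = calloc(nkids + 1, sizeof(*kids));

		for (int i = 0, j = 0; i < nkids + 1; i++)
			kids[i] = (i == s) ? calloc(1, sizeof(struct qnode))
					   : n->kids[j++];
		free(n->kids);
		n->kids = kids;
		n->bitmap |= 1u << nib;
	}
	return n->kids[slot(n->bitmap, nib)];
}

static void qp_set(struct qnode *root, uint32_t key, long val)
{
	struct qnode *n = root;

	for (int shift = 28; shift >= 0; shift -= 4)
		n = child(n, (key >> shift) & 0xf, 1);
	n->has_val = 1;
	n->val = val;
}

static long *qp_get(struct qnode *root, uint32_t key)
{
	struct qnode *n = root;

	for (int shift = 28; shift >= 0 && n; shift -= 4)
		n = child(n, (key >> shift) & 0xf, 0);
	return (n && n->has_val) ? &n->val : NULL;
}
```

A real qp-trie additionally skips runs of nibbles where no keys diverge (that's the compression), which this sketch omits — but the dense-array-plus-popcount layout is what keeps the memory overhead far below a naive 16-pointer trie node.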
So regardless of whether we do the red-black tree, the qp-trie, or both, it would be good to have an implementation that doesn't assume a predefined maximum size or a preallocated size. In addition, we also get the ability to do ordered iteration: we have get-next-key from the syscall side, but if you get it from the BPF side, with extended loop support and all that, you can actually traverse your data structure inside the BPF program, as you need it, according to your own logic.

So one topic I wanted to touch on — because when I think about implementing some new map — yeah, go ahead.

Before you get too far off the topic of resizable maps: correct me if you already have this, but I think you'd want the same thing for stack and queue maps, right? You can't necessarily predict what the size is going to be ahead of time, so you'd want dynamically resizable ones. I don't think you have that right now — I know we don't on the Windows side, and I don't think it's on Linux either. You've got to know the maximum stack or queue depth up front, and if you have a peak you almost never hit, you're having to overallocate.

Yeah, we don't have that. But to be honest, I haven't seen anyone use the queue and stack maps much, because they're so simple that you can actually implement them yourself on a block of memory.

Yeah, it goes back to the discussion we were having about dynptrs and such: is that something you leave up to the programs and put into a library, or do you optimize the map itself? And is there any performance difference? I don't know. There's also making programs easier to write, so at least a library might be useful. My point is that maybe the hash map isn't the only thing to look at, even if it's the top one.
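The ordered get-next-key traversal mentioned above can be illustrated with a small userspace sketch (the signature is hypothetical, loosely modeled on the BPF_MAP_GET_NEXT_KEY convention; a sorted array stands in for an ordered map like an rbtree or qp-trie). The iterator returns the smallest key strictly greater than the one passed in, so the loop visits keys in order and survives deletions between calls:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of ordered get-next-key semantics over a stand-in ordered
 * "map" (a sorted array). A NULL key means "start from the first
 * key"; a negative return means iteration is done. */
static int get_next_key(const int *keys, int n, const int *key, int *next)
{
	for (int i = 0; i < n; i++) {
		if (!key || keys[i] > *key) {
			*next = keys[i];
			return 0;
		}
	}
	return -1;      /* no key greater than *key: iteration done */
}
```

Because "next" is defined as the successor of the key's value rather than a cursor position, an element deleted mid-iteration doesn't invalidate the walk — which is exactly why this interface is pleasant to expose to a BPF program with proper loop support.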
So I have one question regarding the stack and queue maps. I haven't seen them used in practice either, but in general, the maximum number of elements — we did that because of the rlimit, right? There was no particular other reason. And now that it's all converted to memcg accounting, could we actually lift that limit? Maybe it makes sense — also for the hash table and others. I mean, who cares if the user screws it up. Maybe it could be an option at least.

Well, all of these limits started before anyone knew Spectre and Meltdown would be discovered; I'm pretty sure the limits everywhere were there for that kind of reason. Now everything is root-only anyway, so max_entries doesn't make that much sense anymore.

I mean, at least for the hash map it's not trivial to do resizing, especially when the hash map has to work in NMI context, so it's not just an accounting limitation — it's an implementation one. Similar for the stack map. The concurrency model BPF maps have is sort of lockless on read, and you don't want to pay the price of locking if you don't have to. With resizing you basically have to lock, or do some very clever scheme of atomic replacement. So it's definitely not a trivial "let's just remove this limitation" — it's "let's think through how we can even support this with the same performance."

I'm not sure if I have a slide about the sleepable stuff, because for sleepable we still can only preallocate, which kind of makes no sense at all. Sleepable programs should be able to allocate as usual — not only GFP_ATOMIC like normal hash maps use; GFP_KERNEL would be just fine too.
But the whole thing is there because map lookups are RCU protected, and sleepable programs use RCU Tasks Trace. So there's still something to be addressed. I don't have it on my slides, but I wanted to touch on how we can generalize not caring — when implementing a map, how can I not care whether I'm in sleepable or non-sleepable mode? Like this trick we do with the GFP flags: can we somehow fetch it from the context, based on whether we're running as sleepable or not? It's not on the slides, just a topic for discussion. The way the recent changes to socket local storage did it — was that it? Yes — requires verifier magic, explicit code to pass this extra argument. It works, but it's super magical and takes work. It would be nice to have something like what we have with the run context: an ambient thing where you can just ask, "what's in the context right now?" Maybe we could do something like that for sleepable versus non-sleepable too.

So I think we tried making local storage maps not care about sleepable versus non-sleepable. The issue is the synchronization part — the RCU stuff, call_rcu_tasks_trace — it was causing a performance impact. But Paul McKenney says he's working on some patches, and once we run a few benchmarks and those patches work, maybe we could generalize all of that.

All right. So, getting back to NMI, my favorite topic. Some people who don't have to deal with perf events usually don't care about NMI, but performance profiling usually happens in NMI, at least some portion of it. And the problem with all these new maps I'm proposing to add, right —
— is that they do dynamic memory allocations, and it's hard, if not forbidden, to allocate memory in NMI context, which almost immediately rules out using those new maps in NMI. The question I have is: can we do anything about that? Because it doesn't seem like we can realistically preallocate for an rbtree or a qp-trie without arbitrarily hitting some limit because we didn't anticipate some pattern.

When you gave the resizable hash table talk, you suggested having a cache of objects. So maybe you could cache a set of preallocated elements, but not the entire thing.

But how much do you cache?

Well, you make up some number; maybe it'll work.

Yeah, that's the problem: you make up some number, and it will work with some workloads and not with others. For this we also realized that right now all of the maps have two modes — either everything is preallocated, or everything is allocated dynamically. So we're thinking of adding a new operation, potentially to all map types, that would preallocate N elements, and then the actual update would do no malloc and no free — a hybrid mode. If you're careful and know what you're doing, you allocate the elements in a context where you can, and in NMI you just do the update, which grabs those cached, preallocated elements. You can say, "I will extend this map." But you still see the problem, right? First, it's kind of complicated; it depends on the workload; and it can still fail because you didn't anticipate a spike in the workload.

So before I go to the alternative idea — for the qp-trie, for example, even if I knew I need up to one million elements, right?
It's still impossible to correctly preallocate everything, because the node size in a qp-trie varies: a node can have from two up to sixteen children or so. So even when you know you need a million nodes, do you preallocate the maximum possible node size? Then you always waste memory. Do you preallocate some average? Then it starts depending on the distribution of your data. Having to think about this is kind of maddening, I would say, because I don't want to think about this. In user space you don't care, right? You just add an element, it allocates, and if it couldn't allocate — too bad.

So I was thinking about how to bypass the NMI limitations, and one possible solution is to offload the work. Do the minimal amount of stuff — capture the stack trace and some context, store it in some preallocated per-CPU memory or whatever — and then, if you need to populate hash maps and so on, offload that work into a different BPF program or into a callback. I don't remember what we decided about irq_work, but if we had something like that, we could just schedule something to run at the earliest possible opportunity in a slightly more permissive context, and that would be good. Alternatively — and I'll talk about this a bit later — if and once we have a BPF ring buffer that lets a BPF program send data to another BPF program, you can use that as a queue of work for a program that runs in a more permissive context. So I see that as one way of bypassing the NMI limits: do a minimal, preallocated, fixed amount of work, and do everything else somewhere else.
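The offload strategy just described can be sketched in plain C (an illustration, not BPF; the record layout and names are invented, and real code would use per-CPU buffers and memory barriers). The NMI side only copies a fixed-size record into preallocated storage — dropping and counting when full, never allocating — and a deferred worker drains the buffer later in a friendlier context:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of NMI offload: fixed, preallocated staging ring written
 * from NMI, drained later by a worker that does the real map work. */
#define SLOTS 8

struct record { uint64_t stack_id; uint64_t ts; };

struct staging {
	struct record slots[SLOTS];
	unsigned head, tail;       /* single producer/consumer sketch */
	unsigned dropped;
};

static void nmi_capture(struct staging *s, struct record r)
{
	if (s->head - s->tail == SLOTS) {
		/* Ring full: drop and count, never allocate in NMI. */
		s->dropped++;
		return;
	}
	s->slots[s->head++ % SLOTS] = r;
}

static int worker_drain(struct staging *s, struct record *out, int max)
{
	int n = 0;

	/* This is where the expensive work (hash map updates,
	 * dynamic allocation) would happen, outside of NMI. */
	while (s->tail != s->head && n < max)
		out[n++] = s->slots[s->tail++ % SLOTS];
	return n;
}
```

The trade-off is explicit: the NMI path is bounded and allocation-free, and overload shows up as a drop counter instead of a crash or an unbounded preallocation.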
On the approach with the work queue stuff — is that handled by the framework itself, or by the kernel-to-kernel ring buffer? How would the map allocation work?

I'll get to the kernel-to-kernel ring buffer, but basically the idea is that you'll have your program on the other side of the BPF ring buffer that gets called for every submitted record. We'll get to where exactly it runs, but there you can do whatever that context allows, even sleepable mode. So: get out of NMI, but preserve enough state to do something useful. That's the idea.

To wrap this part up: the problem with adding new data structures is that they're all fancy and all promise good numbers, but only in some cases, depending on the workload, and it's really hard to make a good decision when you don't have good benchmarks. So while it's probably not the sexiest or most exciting work, I think we need to invest more in good, more or less representative benchmarks for hash maps and the typical kinds of data structures. That would be great. We have a benchmark framework of sorts, but we need to simulate real workloads — loading kallsyms and things like that. We've done that ad hoc previously, but it would be great to have something where you can say, "I have a new experimental data structure I want to develop; I'll just plug it into the existing set of datasets and benchmarks and see how it fares against the existing implementations." So it's sort of a call to action: if someone is excited about writing benchmarks, do it — I'll be happy to review.

Yes — I think that's a good point about benchmarks. It would be nice to have some representative workloads there. For example, argv copying, right?
Or the BPF ring buffer: if we implement something using it, measure how much time it takes to copy a large argv, or handle a large number of processes spawned at a given time, or compile the kernel and see how much overhead a logging program adds on top of that. That would be a nice overall benchmark. It won't micro-benchmark the ring buffer, but it will focus on the ring buffer aspects, because that's what it's built around.

Yeah, the problem is that it's hard to simulate realistic workloads without having actual production traffic.

For this particular case, the realistic workload is stress under heavy command execution, right? The end game is a kernel compile, or worse, a Clang compile — that's where you get really large command lines, on Clang and the kernel. So it's a stress test; we do that, and we could probably send something that measures it.

Yeah. So, another topic —

Would the benchmarks be part of the selftest suite, or would we have to write a separate runner where you can set the iterations and treat it as a proper benchmarking framework?

It's already part of selftests. There's a bench binary that has a set of different benchmarks, and you can plug into it — we just need more. We have the benchmarking infrastructure; we need the actual benchmarks. You initialize your test, you define the operation to run over and over, and the framework measures everything and gives you throughput per second and so on. But simulating the workload and testing everything is up to you.

All right. Another thing: hashing algorithms. The kernel — and consequently we, in BPF maps and around BPF code — uses jhash, which is kind of old.
It was designed in 2006, and since then we've had 15 years of people tweaking, tuning, and improving hash algorithms, so I'm pretty confident the Jenkins hash is not the best hash anymore. Why is this important? A story from my life: with the stack trace map, at some point we were running into a lot of hash collisions in production because we were running long profiling sessions. With the stack trace map, on a hash collision you have to pick your poison: either you evict the old stack trace and corrupt previously captured samples, or you drop the new stack trace because you couldn't add it. We tried playing with the size of the stack trace map, made it huge, and still encountered quite a significant number of hash collisions. So: the Jenkins hash is not the best hash; we need a faster and better one. Similarly for the Bloom filter map and the hash map — they all use hashing, and if we can improve it, we'll improve performance overall across a wide variety of workloads.

So the question I have is: is xxHash, created by Yann Collet, the answer? I don't know, but according to some open-source benchmarks it's actually pretty good. And I'll give you an insight that's not obvious: as I read a bit more, I realized that a lot of the hashes you've probably heard of — MurmurHash, SpookyHash, CityHash — very often have to optimize for either long inputs or short inputs, and usually they're good at only one of the two. Very often the published benchmarks are done on long inputs — basically optimized for computing the digest of a file — which makes them rather unsuitable for a hash map, where the key is usually pretty small.
The kernel has xxHash64, which is the older xxHash. When I tried to plug it into the hash map in place of the Jenkins hash, it was actually slower — even though it's a better hash in terms of value distribution and so on — because of the short keys. But it seems like Yann's latest creation, XXH3, is actually good at both. I don't know how he managed that. By the way, this table is sorted by bandwidth — by throughput — from top to bottom, and you can see that SpookyHash, which as far as I understand is like a newer Jenkins hash, is about 50% faster according to this benchmark. The quality properties of xxHash are pretty good too: there's the SMHasher project that tests distribution properties and so on, and this hash has no known problems, let's say. And this is a benchmark from the same Yann showing how different hash algorithms fare on relatively short keys — XXH3 just blows everyone out of the water starting from four-byte keys. So I haven't found anything better. Please.

But we cannot use it, because of SSE — we cannot use SSE instructions.

Well, then we need to benchmark a non-SSE implementation. I'm pretty sure there is a non-SSE implementation; it just might be slower.

Yeah. You specifically mentioned SSE — is it using SSE2?

It was probably developed with SSE2 in mind, to take maximum advantage of it, yes. So potentially it's even faster with later revisions of SSE.

Sorry, again?

SSE4 — if I remember right, the SSE extensions have moved on from 2. Or have they reset the numbering scheme since I last revisited this?

No idea.

I mean, but still, that would only be for user space. You would need to benchmark the non-SSE one.

Non-SSE, yeah.
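A tiny harness for the kind of short-key comparison being argued for here — illustrative only: MurmurHash3's 32-bit finalizer (`fmix32`) stands in for a candidate hash, and the interesting exercise would be plugging jhash, xxHash64, and XXH3 into the same loop. It feeds the hash many short sequential keys (like PIDs) and counts how many land in an already-occupied bucket:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* MurmurHash3's 32-bit finalizer: a small, well-dispersed stand-in
 * for whatever candidate hash is being evaluated. */
static uint32_t fmix32(uint32_t h)
{
	h ^= h >> 16; h *= 0x85ebca6bu;
	h ^= h >> 13; h *= 0xc2b2ae35u;
	h ^= h >> 16;
	return h;
}

/* Byte-at-a-time wrapper so the harness can take keys of any
 * (short) length, like a hash map key would be. */
static uint32_t mix_hash(const void *key, size_t len)
{
	const unsigned char *p = key;
	uint32_t h = 0;

	for (size_t i = 0; i < len; i++)
		h = fmix32(h ^ ((uint32_t)p[i] << (8 * (i & 3))));
	return h;
}

#define NBUCKETS (1u << 16)

/* Count how many of nkeys sequential 4-byte keys land in an
 * already-occupied bucket of a power-of-two hash table. */
static unsigned count_collisions(uint32_t (*hash)(const void *, size_t),
				 unsigned nkeys)
{
	static unsigned char used[NBUCKETS];
	unsigned collisions = 0;

	memset(used, 0, sizeof(used));
	for (unsigned i = 0; i < nkeys; i++) {
		uint32_t k = i;    /* short key, like a pid or cpu id */
		uint32_t b = hash(&k, sizeof(k)) & (NBUCKETS - 1);

		collisions += used[b];
		used[b] = 1;
	}
	return collisions;
}
```

With a well-dispersed hash, 1000 keys into 65536 buckets should collide only a handful of times (the birthday-bound expectation is about 8); a hash with weak low bits on structured keys shows up immediately as a much larger count — which is exactly the stack-trace-map symptom described above.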
Why can't we use SSE in kernel code?

Because if you're in NMI — in NMI, in the case of SSE.

SSE is not just about floating point; GCC uses SSE for a bunch of other things.

It was the same for the whole program. Well —

The other thing I remember — we did this maybe six years ago in the kernel, so it's really a long time — we measured the performance of OVS, which also required hashing and was also using jhash, and we implemented the CRC-based stuff. But it had bias, so in the end it was reverted again, even though it gave better overall performance. The thing to also consider is that some portions would need architecture-specific code as well, but it's probably doable.

Yeah — I don't have any emotional attachment to XXH3 or whatever. My point is that we should look at what's available out there. As I said, it's been 15 years; people have actually been tweaking, and there are tons of hash implementations, each strong in one regard and maybe worse in others. We just need to look at this and consider contributing back to the kernel. The xxHash64 — I think that's what we have in the kernel right now — was added in 2017, when kernel folks were trying to add zstd support or whatever. So I guess this is the cry: let's reconsider, let's maybe bring in better algorithms that can improve performance without users having to do anything — they magically get better.

That's what TCP uses, right? Some of the —

TCP, and IPv4 and IPv6, have been moving from Jenkins to SipHash. So that implementation is already there; it might be easy to benchmark.

SipHash?

Yeah, SipHash.

Yeah — though seeing that you have ten different entries on top of it there…

Just pointing out that if you can't rely on SSE.

We can definitely benchmark it, right? And see if it gives immediate benefits. Simple.
Yeah, nice chart. I think I've said all of that, and it would make me happy to review a better hash implementation if someone has one. So I highly encourage someone to take an interest and do the benchmarking.

And the last topic I wanted to touch on is the evolution of the ring buffer. Right now the BPF ring buffer is a multi-producer, single-consumer ring buffer: you can submit data only from the kernel — from a BPF program — and consume it only from user space, which is generally what you want. But I've been thinking about how we can extend it to allow submitting data from user space into the kernel. I think we can reuse most of the ideas of the current implementation, but maybe add a little more safety checking, because right now we have guarantees that our helpers — bpf_ringbuf_submit and friends — properly set the record header bits and all that. Once we allow user space to write these, we cannot trust it, but I think we can do a minimal amount of checks, and if user space screws up, we just don't process the data. The idea is: don't crash — basically, don't run past the preallocated memory of the ring buffer.

That's one part: making it work with user space submitting data. But what do you do with that data? The second part of the proposal is to have a BPF program of, sort of, syscall type — one that runs in a nice context where you can sleep, call syscalls, and so on — attach it to the other side of the ring buffer, and call it for each record. Thankfully we have concepts like the dynptr in the works, so the variable-size nature of ring buffer records is no longer a problem.
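The "minimal checks, don't crash" idea for user-submitted records might look like this sketch. The header layout and names are invented for illustration — the real BPF ringbuf record format differs — but the principle is the same: every length coming from user space is bounds-checked against the mapped area before the payload is touched, and the consumer bails out on garbage instead of reading past the buffer:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header written by untrusted user space. */
struct rb_hdr { uint32_t len; };

#define RB_MAX_RECORD 4096

static uint32_t seen_bytes;

static void count_cb(const uint8_t *p, uint32_t len)
{
	(void)p;
	seen_bytes += len;   /* stand-in for the consumer BPF program */
}

/* Walk records in buf[0..size), calling cb for each valid payload.
 * Returns the number of records consumed, stopping at the first
 * bogus header instead of reading past the ring buffer's memory. */
static int consume(const uint8_t *buf, size_t size,
		   void (*cb)(const uint8_t *p, uint32_t len))
{
	size_t off = 0;
	int n = 0;

	while (off + sizeof(struct rb_hdr) <= size) {
		const struct rb_hdr *h = (const void *)(buf + off);

		/* Untrusted length: reject zero, oversized, or
		 * out-of-bounds values — "don't crash" in practice. */
		if (h->len == 0 || h->len > RB_MAX_RECORD ||
		    h->len > size - off - sizeof(*h))
			break;
		cb(buf + off + sizeof(*h), h->len);
		n++;
		off += sizeof(*h) + h->len;
	}
	return n;
}
```

Note what is and isn't promised: a misbehaving producer can make its own data unprocessable, but it can never push the kernel-side consumer outside the preallocated ring buffer memory.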
You just pass a dynptr pointing into a portion of the ring buffer, and that's what the BPF program works on. Another benefit is that you can think of this as a super generic batched syscall interface: you submit a bunch of requests and notify about the data once, with one syscall. So I think it would be great to have this, because there are potential use cases where you want to initiate work in the kernel from user space in a fast, low-overhead way, and calling a syscall for everything is expensive. You can sort of do this with test_run today — you can pass custom data to test_run — but you incur a lot of overhead for each syscall, relatively speaking. This would allow you to do batching.

And then another extension of the same idea: once you allow a kernel-side BPF program to process ring buffer samples, you can let the kernel submit work and have a BPF program on the other side process the requests. With that, you'd basically be able to build multi-stage servers: one stage accepts — I don't know — an skb, parses the request, and then decides, "let's look something up in the cache," but you don't want to do that in the network processing path, so you offload it into the ring buffer, and you have another program that does the work. And so on.

I guess two thoughts. One: it's always tricky to know whether staging things like that is good or bad, but okay, fine. The other: would you have a doorbell then, so you could give it a bunch of stuff and then kick it?

Well, we have this right now, right?
The ring buffer has this default mode of trying to notify user space as soon as possible, unless user space was already notified and hasn't processed the data. So it's kind of self-scaling, but you also have the ability to suppress notification or force notification. We'd just reuse the same mechanism.

I like it — it's really useful. You can build a full message notification system both ways. For example, right now we have some fancy stuff in Cilium where we still use the old perf ring buffer, but we can queue messages up to user space and push them through a Go channel to wake up some Go thread, and you can do it the other way around too. I think it's great; we should pursue this. And definitely there are —

Just a time check, because we —

It's the last slide. So basically, there will be a bunch of different issues we'll need to think through. One of them is where we run this BPF program. Do we allocate a separate kthread for it? For user-to-kernel we can say that the syscall you make to notify just processes everything in the BPF program right there, but for kernel-to-kernel that's impossible. So we should at least have a mode where you can say, "I want a dedicated kthread that does this." And with a dedicated kthread, we can talk about minimizing latency — having the kthread basically burn CPU waiting for samples to arrive. There are probably going to be a bunch of discussions, and it's not clear what the right combinations to allow are, but it's just an idea.

I mean, maybe — just one thought — maybe you would need something like ksoftirqd, where you just have one thread per CPU and it processes all the different ring buffers.
Right — do we share threads? Do we pool them? Do we have a dedicated thread? Do we allow both? How do we wire all this into the API? But basically: the ring buffer is a queue, it's pretty fast, it's pretty convenient to work with — let's generalize it so we can do more stuff with it. All right, that's all I had. There's no thanks slide, so I forgot: thank you.