Hi, everyone. I wanted to talk about a sketch data structure for quantiles in BPF, and I'll explain what I mean by that. At this point, this is mostly a use case and some ideas about how to implement it. I started the talk with something else in mind and ended up elsewhere, so the goal here is to present a use case, sketch out some approaches, and then solicit feedback from you. I'll talk about the use case, then the data structure, and then I'll take a step back and look at the bigger picture.

By the way, I work on Tetragon, which is an observability system. Basically, we insert hooks into the kernel, we observe events, and then we present them to users. One of the things we would like to do is create summaries of these events, which are typically statistical metrics about what is happening. We want to present these metrics to users so that they can draw pretty graphs or make decisions, but we would also like to act on them. For example, if we see that something like an anomaly is happening, we can take a stack trace and generate an event so that we get the information needed to figure out the issue. And we want to maintain these summaries on the kernel side, which allows us both to avoid copying to user space and to make decisions inline with the operation.

So what are summaries? The typical examples are things like means, standard deviations, and min/max values. These are easy to compute incrementally, but they're also not very useful from a statistics point of view. Things like quantiles, the median, or tail latencies are much better and provide much more information, but they're also hard to maintain incrementally. The naive approach would be to keep all the measurements.
Then, for the median, you find the middle point. You would need to keep all the data you've collected, which is not a good idea. There are data structures that try to help with that, and they are called sketches. The basic idea is to have a constant size and return an approximation of the result. There's already a BPF map that is a sketch: the Bloom filter, which is a sketch for a set. But there are others, useful in different applications, and in this case I'm going to focus on quantiles. Basically, you would be able to ask this data structure, "give me the median of this metric," and you would get back a value together with an approximation error describing the range the true answer falls in.

One of the data structures I looked at is called q-digest. There is a paper about it that you can read. The important thing is that it provides two guarantees: error guarantees about how big the error can be if you maintain the invariants of the data structure, and memory guarantees. If you keep the invariants, you have both bounded error and bounded size, which is a really good property to have. It's basically a binary tree that maintains ranges on its nodes. So I selected this data structure to see how far I could get.

It looks like this: you have a root node that covers a range, in this case from 0 to 1023, and then you branch out. For each node, you maintain how many elements you've seen in that particular range. For this specific range we've seen zero elements, but in total we've seen 13. And at the leaves we know that, say, this one is for the value 8, and we've seen seven instances of 8. The trick is that you can play with this information.
You can compress the data structure and lose information. For example, this node here, which says we have one value of 1000, can be removed and merged into its parent node: instead of maintaining that we had one element equal to 1000, we record that we had one element between 0 and 1023. You lose information, but you gain space. The basic idea of the algorithm is that you can do this in a way where both your error and your memory stay bounded. So this looked like a good starting point.

And this is the lookup. If somebody says "give me the median," you translate this into a rank query. If you have 13 elements, as in this very simple example, your rank is 7, so you're looking for the seventh element. You iterate the tree until you find the node covering that rank, and then you return the node's range. This particular case is a good one, because the range is just the single value 8; but you might end up at an internal node, in which case you return the full range, which includes the error. So basically there are two operations: the lookup, and the insert together with the compression, which you run periodically, depending on whether you're running out of memory.

Initially, the idea behind this talk was to propose a new BPF map. But after reflecting on it, I realized this doesn't really make sense, because this data structure is not as fundamental as arrays and hash maps, and there are probably tens of different implementations you might want for quantiles. So it's better to implement these data structures in BPF itself. To be honest, I haven't been following all the latest developments on the list, but I think there are many things that can help with that.
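To make the shape of the structure concrete, here is a minimal user-space sketch of a q-digest in Python. This is my own illustration, not code from the talk: a real version would be BPF code, the compression pass is simplified to a single bottom-up sweep, and all names are made up.

```python
K = 10  # universe [0, 2**K) = [0, 1024), matching the 0..1023 example

class QDigest:
    def __init__(self, compression=5):
        self.counts = {}   # implicit-heap node index -> count
        self.n = 0         # total number of inserted elements
        self.k = compression

    def insert(self, v):
        leaf = (1 << K) + v  # leaves live at heap indices [2**K, 2**(K+1))
        self.counts[leaf] = self.counts.get(leaf, 0) + 1
        self.n += 1

    def _range(self, node):
        # value range [lo, hi] covered by a heap node index
        depth = node.bit_length() - 1
        span = 1 << (K - depth)
        lo = (node - (1 << depth)) * span
        return lo, lo + span - 1

    def compress(self):
        # merge a node and its sibling into the parent whenever the combined
        # count stays under n/k -- the invariant that bounds error and size
        threshold = self.n // self.k
        for node in sorted(self.counts, reverse=True):
            if node <= 1 or node not in self.counts:
                continue
            parent, sibling = node >> 1, node ^ 1
            total = (self.counts.get(node, 0) + self.counts.get(sibling, 0)
                     + self.counts.get(parent, 0))
            if total <= threshold:
                self.counts[parent] = total
                self.counts.pop(node, None)
                self.counts.pop(sibling, None)

    def rank_query(self, r):
        # visit stored nodes in order of increasing upper bound (ties: smaller
        # ranges first), accumulating counts until rank r is covered; the
        # returned range is the answer plus its error bound
        order = sorted(self.counts,
                       key=lambda n: (self._range(n)[1],
                                      self._range(n)[1] - self._range(n)[0]))
        seen = 0
        for node in order:
            seen += self.counts[node]
            if seen >= r:
                return self._range(node)
        return self._range(order[-1])

    def median(self):
        return self.rank_query((self.n + 1) // 2)

# the example from the talk: 13 elements, seven of which are the value 8,
# so the median (rank 7) lands on the leaf for 8
qd = QDigest()
for _ in range(7):
    qd.insert(8)
for v in (100, 200, 300, 400, 500, 1000):
    qd.insert(v)
```

After `compress()`, the singleton leaves get merged into their parents, but the total count is preserved, so rank queries still work; they just return wider ranges where information was given up.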
The first approach, which works even with not-very-recent kernels, is to take existing maps and implement the same idea on top of them. For example, you can take a hash map and store the tree in it, using keys as node identifiers, so that every pointer dereference becomes a map lookup. You would need something like bpf_loop to do the iteration where you traverse the tree, and a lock for the data structure so that you don't have concurrent writers or readers. This would work; I haven't done it, but I think it should. So that would be the first step.

The second step is to look at what I call here "modern BPF" and see whether we can write the traversal code in BPF directly. Again, I don't have an implementation of this, but I think the ideas developed for the graph data structures, the linked lists and the rbtrees, are really good foundations to build on. The concept of owning and non-owning references maps really well to this tree too, and something like an open-coded iterator loop to walk the tree would be a good approach, I think. How well it will work I don't really know, but having an implementation in BPF of a custom binary tree where you can do lookups and insertions is a really good exercise to see how far these new facilities can go.

So that's one part. The second part is: let's say we fix this problem somehow and we have a map implemented in BPF. What we want in Tetragon, and I'm guessing in many other applications, is to be able to access this map from user space; basically, to perform from user space the same lookup operation we implemented in BPF.
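A rough sketch of the first approach, with a Python dict standing in for the BPF hash map and a fixed iteration bound standing in for bpf_loop. The node layout here is made up for illustration; the point is that each step of the traversal is one map lookup on a node id rather than a pointer dereference.

```python
MAX_DEPTH = 16  # fixed loop bound, as the verifier would require

# hypothetical node layout: (lo, hi, count, left_id, right_id); 0 = no child
tree = {
    1: (0, 1023, 0, 2, 3),
    2: (0, 511, 2, 0, 0),
    3: (512, 1023, 4, 0, 0),
}

def find_leaf_for(value, root_id=1):
    """Descend from the root to the leaf whose range contains `value`."""
    node_id = root_id
    for _ in range(MAX_DEPTH):                    # bounded, bpf_loop-style
        lo, hi, count, left, right = tree[node_id]  # one hash-map lookup
        if left == 0 and right == 0:
            return node_id                        # reached a leaf
        mid = (lo + hi) // 2
        node_id = left if value <= mid else right
        if node_id == 0:
            return 0                              # missing child
    return 0                                      # depth bound exhausted
```

In a real BPF program the dict access would be `bpf_map_lookup_elem` on a hash map keyed by node id, and the whole traversal would run under a lock so concurrent updates can't tear the tree.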
As far as I can tell, the only way we can do that today is with something like a BPF program, where we set things up so that we call into the map and somehow figure out the interface. But this is not very flexible, I think, and it's a bit tricky. So another thing I was thinking about was a map whose operations are implemented in BPF. If you have the BPF code for the lookup operation, you would be able to use the normal map interface from user space to say "perform a lookup," and it would execute the BPF lookup code and give you back the result. The idea would be to reuse the existing map interface for that. Some ideas about how to do this: we already pass file descriptors when linking, so maybe you could pass the program's file descriptor when you create such a map. And I think something like BPF struct_ops is a good first thing to look at, to see whether you can use it to implement something that looks like the map operations, so that you have methods for each operation.

Yeah, that's all I had. Thank you. Are there any questions? Feedback?

You can probably guess what my preferred approach is. But I wanted to ask...

I'd love to hear it.

Well, we talked about this during lunch, but I do think this is a good use case for the graph collection stuff. One follow-up question I had: you're essentially making use of the max_entries map feature. I'm curious whether it's possible to adjust the data structure to change max_entries. For example: I'm willing to trade more space for higher-quality data; I already have some data in the data structure, can I add some more nodes?

From the data structure's perspective you can definitely do that. There are bounds, and you can have knobs for adjusting them. So the data structure itself definitely supports that.

Cool.
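The map-ops idea might look something like this in miniature. This is purely hypothetical Python of my own, using a callback table the way struct_ops lets BPF programs fill in a kernel-defined set of methods; none of these names exist in the kernel.

```python
class BpfImplementedMap:
    """A map whose operations are supplied as pluggable callbacks, standing
    in for BPF programs attached via something like struct_ops."""

    def __init__(self, ops):
        self.storage = {}  # backing storage (a real map's memory)
        self.ops = ops     # table of "programs", one per map operation

    # the generic map syscall interface just dispatches into the ops table
    def lookup(self, key):
        return self.ops["lookup"](self.storage, key)

    def update(self, key, value):
        return self.ops["update"](self.storage, key, value)

# "programs" implementing the operations for one particular map flavor
ops = {
    "lookup": lambda st, k: st.get(k),
    "update": lambda st, k, v: st.__setitem__(k, v),
}

m = BpfImplementedMap(ops)
m.update("p50", 8)
```

The appeal is that user space keeps using the familiar lookup/update interface, while the semantics, here a trivial key-value store, but in the real case a q-digest rank query, are defined by the attached BPF code.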
I think that's an additional push toward using the new kfunc and kptr infrastructure, just because that's tough to do with BPF maps. I definitely share your concern about the story for accessing kptrs from user space not being super well fleshed out yet; I think that's currently an issue. And I also wanted to say, when you were on the slide proposing to do it with the new graph collection stuff: I pretty much agree with your implementation ideas, especially open-coded iterators, open-coding as much of it in BPF as possible, et cetera. It's definitely worth a try, and I'm definitely interested in making verifier changes if necessary to make it happen.

Have you heard of a more recent sketch data structure called t-digest?

Yes.

I think t-digest doesn't really require a tree structure. Its implementation can basically be an array, and it doesn't need special support: you can just map that array to user space, and user space reads the array contents to get the quantiles. I actually tried t-digest about half a year ago. I think it can be implemented entirely in BPF without kernel changes, and without kptrs or special data structures beyond ordinary maps. The one thing that would make it better: t-digest requires sorting the data you store in that array. That's probably still a little difficult in BPF, but you can use a two-level bpf_loop to implement a simple bubble sort. With that, I can already get a pretty accurate estimate, depending on the configuration, of the quantiles of the data stream, with t-digest implemented purely in BPF.
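A sketch of the sorting step described above, in user-space Python of my own: a fixed-size buffer, a bubble sort written as two bounded loops (the shape a two-level bpf_loop would take), and a nearest-rank quantile read from the sorted array. This illustrates the array-based approach, not any particular t-digest implementation.

```python
N = 8  # fixed buffer size, as in a BPF array map

def bubble_sort(buf):
    # two nested loops with fixed bounds -- the form a nested bpf_loop
    # can express, since the verifier needs bounded iteration
    for i in range(N):
        for j in range(N - 1 - i):
            if buf[j] > buf[j + 1]:
                buf[j], buf[j + 1] = buf[j + 1], buf[j]
    return buf

def quantile(buf, q):
    # nearest-rank quantile read straight out of the sorted buffer
    idx = min(N - 1, int(q * N))
    return buf[idx]

samples = [42, 7, 99, 13, 56, 21, 88, 3]
sorted_buf = bubble_sort(samples[:])
```

This also shows the latency pattern mentioned later in the discussion: inserting into the buffer is cheap, but whichever insertion triggers the sort pays the full quadratic (or, with a better sort, O(n log n)) cost in one go.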
But one thing I thought about after implementing it: why not just export the data through a ring buffer and do the estimation in user space, by polling the ring buffer? The tradeoff is that with a ring buffer you may have some data loss; but if you don't care about that, doing it in user space could be better. If you do the processing in place in the kernel, whether with q-digest or t-digest, you get this batch behavior: with n elements in the array, the first n-1 insertions are O(1) operations, but the last one triggers a sort that costs O(n log n), so you get inconsistency in your per-element processing latencies. If you do the in-place processing in BPF, you have that latency inconsistency; if you push the data through a ring buffer and process it in user space, it's probably not a problem.

As for t-digest: yes, I've had a look at the data structure, but I didn't try to implement it, so if you have your implementation available somewhere, I would appreciate a link.
I think that makes sense. I haven't looked at the details, so I can't comment on how to implement it; but on the second part I do have an opinion: what we've seen specifically in Tetragon is that we induce a lot of overhead if we push the events to user space, not only for the processing but also for the copies.

Right, so the use case is, again, a couple of hundred-gig NICs, and you can't throw 80 million events into user space. If you want quantiles of, say, packet lengths over a NIC, it's just not feasible to push that many events to user space; that's the problem with that approach. We do similar things today to work around this: we push things, or summaries of things, to user space and then try to build quantiles there, or we hard-code the buckets, so we don't have this ability to compress and shrink. We have a lot of that today, but we have to make a lot of assumptions about the data we're going to get. Something like this would let us be a lot more flexible in how we build those percentiles.

And the other thing you lose is the ability to react inline in BPF code when something happens, because you can ask, for example: I have this new value; is it over the 99th percentile? If yes, that's a very simple anomaly-detection algorithm: something's wrong here, this value is beyond the 99th percentile. You can do it in user space, but you get more flexibility about what to do if you do it in the kernel.

Could you go back to the slide where you talk about how you could access it from user space? There's one way, with prog run or something. You mentioned that it would be cumbersome; could you elaborate a little on why?

So if you access a BPF map from user space and, let's say, you do
a lookup operation, it's clear what the keys and the values are, so you have a data structure interface, more or less. Again, I don't know another way to do it, but the only thing I could think of is a BPF program, which means you would have to set up some sort of program, say an XDP program, and pass your input, whether for the lookup or for the insert operation, in between data and data_end, and then somehow get the result back. So you need to do some additional work.

You don't need XDP; you can use any tracing program, and the context is your own: it will have the key as its only context. You don't need a special XDP construction. You have two different programs: your real XDP program sharing the map, where you collect all the data, and on the side a tiny little program, not XDP, just a tracing- or syscall-style program, whose context is just the key, and whose output is whatever you want. So effectively it's a syscall into the kernel with arbitrary input and arbitrary output. You can do that today with existing programs.

OK, so prog run would be the approach for doing that, or is there another way?

Prog run is the thing.

What I was also wondering, from the discussion earlier about how we access the other data structures from user space: could we extend prog run to basically let you call a BPF function directly?

OK, I just need to look at that, I guess. The other benefit, which is a small one, is that if you have a program which implements a map, then it's abstracted, and you can use it both from BPF programs and from user space. But that's a convenience; it doesn't make a big difference.

You did the old one, q-digest? Sorry, I just came in, so what's the question? I think t-digest is a nice data structure; we used it in both BPF
and non-BPF contexts; it's great.

Is there an implementation that is open sourced?

Yeah.

OK, can you send the link?

Sure. We actually had someone, it was Dan Schatzberg, who implemented it in BPF. There are a few different ways to implement it: using AVL trees, using sorted arrays, merged arrays, and things like that; arrays are obviously easier to do in BPF.

I was talking about a different data structure, though, which is q-digest.

Right, so the guy who invented t-digest, Ted Dunning from MapR, implemented t-digest because q-digest was too slow and too big, and his use case was streaming applications: you have tons of data and you want to maintain percentiles and so on. He designed t-digest for that use case.

And that's a great use case, yeah.

Yeah, it's all in the paper.

All right, cool, thanks. Thank you very much.