So, hi, it's me again. Hope everyone had a good lunch. The second of my verifier-related presentations is about the BPF graph collections that we recently added, focusing specifically on the verifier changes that were necessary to make them work. My goal is to show you some things that we made significantly easier in the verifier in the process of shipping this, and I hope to get people interested in using some of these verifier changes in their own work as well. So first we're going to talk about what's so bad about BPF maps, then there will be a short demonstration, just looking at some simple code, of how the new-style data structures look, and then we will talk about the interesting verifier changes that were necessary.

The motivation behind all of this work was the sched_ext project that Alexei touched on earlier. I'm not going to talk about it too much. There is a great summary on LWN, and folks have probably heard of the project. But Tejun, who's one of the main people driving this work, is a kernel expert but not a BPF expert, or at least he wasn't when he started working on it. As Alexei mentioned earlier, he pushed us very hard to improve the UX of writing BPF programs in general, and he would always ask us questions of the form: why can't I just write normal-looking kernel code and have it work? And one of the things that he didn't really like was BPF maps.

There are some problems with the BPF map abstraction in general, and I classify them into two groups. One of them is unfamiliarity. Everyone has used the basic hash map or array map, and those types of data structures fit really well into the map API. Others less so. For example, how do you remove something from the bloom filter map? There is a map delete helper, bpf_map_delete_elem, but it basically doesn't do anything for the bloom filter.
Programs that interact with the latter group of data structures, the ones that don't fit the abstraction well, can be hard to parse for people who are kernel experts but not BPF experts. And even if they understand what's going on, those maps are generally unwieldy to use. Since interacting with data structures is a big part of familiarity, that's a problem.

The other class of problems with maps, I would say, is inflexibility. When the map API was developed, BPF programs weren't as complex as they are now, and there were constraints placed on BPF programs that are no longer relevant. I think a great example of this is max_entries for BPF maps, where you need to declare the size of the map ahead of time. Obviously this was partly addressed with the preallocation flags, but that was how things were done for a while. Adding new data structures to the BPF environment leads to a square-peg, round-hole problem. Here's the bloom filter example again. And then when I was implementing, for example, the RB tree, which takes a custom comparator when it's adding nodes to a tree, how would that work with bpf_map_update_elem, which has a fixed number and type of arguments? Well, it basically just wouldn't.

Adding to the inflexibility issue, all of the maps and their helper functions are UAPI, just by virtue of helpers in general being UAPI. So when we make architecture or implementation decisions that turn out not to be optimal with the benefit of hindsight, we can't roll them back very easily, and therefore the cost of adding a new helper or map is pretty high. But this is a self-imposed limitation. We now have kfuncs, which you can think of as unstable helpers. We have BTF, and we have kptrs, which Alexei talked about last year at LPC. There's a good presentation talking about old-style BPF versus new-style BPF.
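To make the comparator mismatch concrete, here is a minimal sketch in plain userspace C. It is not kernel or BPF code: tree_node, tree_root, node_data, less_by_key, tree_add and tree_first are all hypothetical names chosen for illustration, and the tree is a simple unbalanced binary search tree rather than a real red-black tree. The point is only that an ordered insert wants a caller-supplied callback, which a fixed (map, key, value, flags) update call has no slot for.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace types, not the kernel's bpf_rb_node / bpf_rb_root. */
struct tree_node {
    struct tree_node *left;
    struct tree_node *right;
};

struct tree_root {
    struct tree_node *top;
};

/* Recover the enclosing struct from a pointer to its embedded member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* User-defined payload with the tree node embedded inside it. */
struct node_data {
    long key;
    long data;
    struct tree_node node;
};

/* The ordering is supplied by the caller, one callback per node type. */
typedef bool (*less_fn)(struct tree_node *a, struct tree_node *b);

static bool less_by_key(struct tree_node *a, struct tree_node *b)
{
    struct node_data *na = container_of(a, struct node_data, node);
    struct node_data *nb = container_of(b, struct node_data, node);

    return na->key < nb->key;
}

/* Unbalanced insert: walk down using the comparator, link at the empty slot.
 * A real red-black tree would also recolor and rotate here. */
static void tree_add(struct tree_root *root, struct tree_node *n, less_fn less)
{
    struct tree_node **link = &root->top;

    n->left = NULL;
    n->right = NULL;
    while (*link)
        link = less(n, *link) ? &(*link)->left : &(*link)->right;
    *link = n;
}

/* The leftmost node is the smallest one under the comparator. */
static struct tree_node *tree_first(struct tree_root *root)
{
    struct tree_node *n = root->top;

    while (n && n->left)
        n = n->left;
    return n;
}
```

A caller would embed a tree_node in its own struct, write a comparator for that struct, and pass both to tree_add. The third argument is exactly what the fixed helper signature cannot carry.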
And finally, on the inflexibility side, the lifetime of objects within the map is tied to the lifetime of the map itself, since the map holds all of the objects within itself.

We implemented two of what we call graph collections in this new style: a linked list and an RB tree. Instead of using helpers, all the interaction uses kfuncs. They leverage the any-context allocator that was written last year, so instead of declaring max elements ahead of time, you allocate the nodes yourself and you shove them in the collection. The nodes are defined intrusively, similarly to what you would do in normal kernel code: you define your own struct, you give it a bpf_list_node or bpf_rb_node field, and all of the plumbing understands what's going on. And instead of the locking being hidden from the person writing the BPF program, it's exposed. You have to grab the spin lock associated with the tree or list yourself.

So let's look at some code. This is a pretty long example, so I've split it up across multiple slides, but here's how we define an RB node. As you can see, we have this special bpf_rb_node field within the struct definition, and that says: hey, this user-defined type has a bpf_rb_node, so it is an RB tree node that goes into RB trees. We associate the bpf_rb_root with a spin lock, and we have to put it in a private section because it can't be mapped to user space. Because the root is in the same section as a spin lock, the verifier knows that this spin lock protects the tree. We also added this __contains BTF tag that ties the node type to the root and basically says: hey, verifier, this tree contains struct node_data, and its bpf_rb_node field is called node.

Okay, now that we've defined our type, let's allocate some nodes. We use bpf_obj_new to do this, which is basically a wrapper around the BPF allocator that gives you back an object of a specific type, in this case the user-defined type node_data.
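The intrusive, allocate-it-yourself, lock-it-yourself style described above can be illustrated with a small userspace C analogy. This is a sketch under stated assumptions, not the BPF API: list_node, locked_list, task_item, item_new, list_push_front and list_pop_front are hypothetical names, a plain bool stands in for the spin lock, and the "is the lock held" check happens at run time here, whereas the talk describes the verifier rejecting the program at load time.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Recover the enclosing struct from a pointer to its embedded member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical link type, standing in for something like bpf_list_node. */
struct list_node {
    struct list_node *next;
};

/* The lock sits next to the head it protects. A bool is only a stand-in. */
struct locked_list {
    bool locked;
    struct list_node *first;
};

/* User-defined type: the link is a field of the item, not a separate box. */
struct task_item {
    int id;
    struct list_node node;
};

/* The caller allocates nodes itself, so there is no fixed capacity. */
static struct task_item *item_new(int id)
{
    struct task_item *it = calloc(1, sizeof(*it));

    if (it)
        it->id = id;
    return it;
}

/* Refuses to modify the list unless the caller holds its lock.
 * Returns 0 on success, -1 if the lock is not held. */
static int list_push_front(struct locked_list *l, struct list_node *n)
{
    if (!l->locked)
        return -1;
    n->next = l->first;
    l->first = n;
    return 0;
}

/* Unlinks the first node and hands the enclosing item back to the caller. */
static struct task_item *list_pop_front(struct locked_list *l)
{
    struct list_node *n;

    if (!l->locked || !l->first)
        return NULL;
    n = l->first;
    l->first = n->next;
    n->next = NULL;
    return container_of(n, struct task_item, node);
}
```

Pushing without setting locked fails, which loosely mirrors the rule that the collection may only be touched while its associated lock is held.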
When adding nodes to the tree, it's important to grab the correct spin lock, otherwise the verifier will reject your program. When you call bpf_rbtree_add, ownership of the node's lifetime is passed to the tree. So even though we allocated it in our program, while the node is in the tree it's the tree's job to deallocate it if the tree happens to go away. As you can see, this looks more or less like normal kernel code, at least much more so than if we were doing this with the map helpers.

We also implemented shared ownership for nodes. There's an asterisk here because I recently shipped this, but I also shipped a bunch of bugs, so we had to turn it off. I plan on fixing them shortly. We implement shared ownership with this bpf_refcount field, which you interact with by calling bpf_refcount_acquire, which bumps the refcount and allows you to add nodes to multiple data structures.

So, just to summarize some of the interesting verifier changes. I touched on bpf_obj_new and bpf_obj_drop before. They are basically the malloc and free of these new-style data structures: bpf_obj_new says, give me a typed object of my user-defined type. btf_field and btf_record are generalizations of a pattern that existed before this work. Let's ignore BPF data structures entirely and just talk about bpf_spin_lock and bpf_timer. Those are special fields that you could put in your user-defined type, and maps needed to know about them. A map needed to know: I contain elements of type X, and within type X there is a spin lock at offset 40. The btf_field and btf_record work centralized this logic, which used to live in various disparate places, and made it much easier to add new fields. We also implemented strong and weak references for local kptrs. In the code these are called owning and non-owning references, but that's more or less what they are.
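As a rough userspace analogy for the allocate, acquire and drop trio mentioned above, the sketch below shows a reference-counted object in plain C. It assumes nothing about the kernel implementation: shared_obj, obj_new, obj_acquire, obj_drop and live_objs are hypothetical names, and the counter is a plain int, so this version is not safe for concurrent use.

```c
#include <stdlib.h>

/* Hypothetical user-defined type with an embedded reference count. */
struct shared_obj {
    int refcount;
    long value;
};

/* Number of objects allocated and not yet freed, for demonstration only. */
static int live_objs;

/* Allocate a typed object; the caller starts out holding one reference. */
static struct shared_obj *obj_new(long value)
{
    struct shared_obj *o = calloc(1, sizeof(*o));

    if (!o)
        return NULL;
    o->refcount = 1;
    o->value = value;
    live_objs++;
    return o;
}

/* Take an additional reference so a second owner can hold the object. */
static struct shared_obj *obj_acquire(struct shared_obj *o)
{
    o->refcount++;
    return o;
}

/* Give up one reference; the memory is freed when the last one goes away. */
static void obj_drop(struct shared_obj *o)
{
    if (--o->refcount == 0) {
        live_objs--;
        free(o);
    }
}
```

With two references outstanding, dropping one leaves the object alive for the other holder, which is the property that lets one node sit in more than one collection.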
Owning references express ownership over a kptr's lifetime, and non-owning, or weak, references express lack of ownership. That allows us to interact with things that we have added to a data structure without necessarily taking ownership of their lifetime. And finally, we have bpf_refcount, which integrates pretty seamlessly with bpf_obj_new and bpf_obj_drop. Most of these things are not tied very heavily to the BPF collections. I hope that if you need shared ownership of user-defined types, or you just generally need to interact with user-defined types, you check out the implementation behind these things and use them for your own ideas. That's it for me.

Quick question about bpf_obj_new. Are these objects kernel objects, or objects defined in your program?

It is an object defined in your program.

So one interesting thing that we use maps for is communicating with user space, which is great for debugging, et cetera. Do you have any ideas how that would work with these structures implemented in BPF? What could that look like? How could user space get at them?

I think that's a very good question, not just from the communication-with-user-space perspective but also from the visibility perspective. One thing that's missing currently, that we're aware of and need to improve, is visibility from user space. Currently, if you ask bpftool to show you all the maps, it does a very good job dumping information about them. If we go back to this example, and if you're familiar with libbpf internals, this private section happens to be implemented as an array map. So in this example, the RB root and the spin lock are part of a one-element array map. They're just part of the one map value that's there. So presumably we could extend bpftool to recognize that there's an array map with an RB root in it and dump its information, or make it possible to otherwise interact with it.

But it should work in general, right?
I mean, bpftool aside, your user space could just read that memory and then it would have access to the RB tree. As long as you knew the structure, you'd be fine, right?

So actually, the reason this private define exists is that we don't want this memory mapped into user space, because we can't map a spin lock into user space.

Meaning you're concerned that user space would write to the spin lock? Could you do read-only?

Did the folks who implemented spin lock have opinions here? I don't know if mapping it in read-only has safety issues.

Because if I could get it read-only in user space, then user space can figure out how to deal with the RB tree, because we'll know the structure, right? That would allow me to have the kernel work on the RB tree, and then at some arbitrary time I could dump whatever statistics I needed out of the RB tree. That would be my use case for it.

You probably can make it read-only, but it still won't allow you to dump the RB tree without involving some extra BPF program that will dump the kernel memory. Because what you have in this section is just the root node, and the rest of the tree is somewhere else. So you need a BPF iterator or something like this.

Anyways, so. All right, last comment in the interest of time, then we need to move on. What? All right, OK. So thank you very much for your session.