Good afternoon, everyone. Thanks for coming to my talk. This is a talk about data structures, primarily. It's not a very kernel-focused talk: while we are developing this for use in the kernel, there's actually very little background I'm expecting you to have on kernel-related programming. I'm just expecting you to have a general sort of low-level programming background.

So we all know that kernel developers are the sexy thing to be, right? All kernel programmers are super geniuses, and so obviously we use the most advanced data structures known to mankind. Generally, we reach for a linked list. And that's because a linked list is really, really straightforward to use; we have a great API for making it easy to use. And, of course, there's the performance of a linked list. By the way, a doubly linked list is the standard kind of linked list used in the kernel. We do have singly linked lists too. Both kinds of lists — aren't we smart?

We do have other kinds of data structures in the kernel. We have an rbtree, which is a red-black tree. Red-black trees have certain properties which keep them somewhat balanced. This particular example I have here is a red-black tree which lets you look up an age, and this one is perfectly balanced — you really can't do better than this. Now, I have an inducement for people to participate. Would anyone care to raise a hand and suggest: what is the perfect age to be? No one? OK, I'm going to pick on an audience member. You, sir. What is the perfect age to be? No, you. No, don't look around you. 21. OK. So we start at the top of the tree. Are we in the range 35 to 44? No, we're less than 35. So we go down to the left, to the 18-to-24 node. We find our number, 21, is between 18 and 24. And there we have it: we've walked two steps down, and we've found the age we're looking for. Mattia, would you care to suggest a different age to look for? A different age. A perfect age. 48. OK, again we start at 35 to 44. We move to the right, because 48 is greater than 44. Have a candy. And you still have a candy. OK, so we move to 55 to 64. We are not between 55 and 64; we're less than 55. So we walk down again, to 45 to 54. We found it. We found our age, and we took three steps to do it. As opposed to walking a singly or doubly linked list, where we would have to start from the beginning and walk all the way along, we found it in fewer steps than that would have taken. So this is an improvement.

Now let's look up the same ages in the maple tree. I haven't really described what the maple tree is yet, but you can see an illustration of it here. This is actually a cut-down maple tree, because with the actual maple tree we're working on, this slide would be very boring — it would be a single node. So for the purposes of the slides it's cut down, and we actually get three nodes instead of a single node. For our first example, 21, we look at our pivot and say, well, we're less than 44, so we move to the left. Then we scan through: are we less than 17? No. Are we less than 24? Yes. So we've moved through two nodes, and we've found the answer we're looking for. Similarly for 48: we look at our 44 pivot and move to the right, and we see, oh, we're less than 54. OK, we've found what we're looking for. So fairly similar — but those were best-case trees.
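The lookup being walked through on those slides is just a pivot scan followed by a descent. Here's a minimal sketch, assuming a simplified node layout — the real node formats come up later in the talk:

```c
#include <stddef.h>

/*
 * Minimal sketch of a maple-tree-style lookup: scan the pivots in a
 * node, then descend through the matching slot.  The layout here is
 * illustrative only, not the kernel's actual structures.
 */
#define MT_SLOTS 8

struct mt_node {
	unsigned long pivot[MT_SLOTS - 1];	/* range boundaries */
	void *slot[MT_SLOTS];			/* children or entries */
	int leaf;				/* do slots hold final values? */
};

static void *mt_lookup(struct mt_node *node, unsigned long index)
{
	while (node) {
		int i;

		/* The first pivot we're <= tells us which slot to take. */
		for (i = 0; i < MT_SLOTS - 1; i++)
			if (index <= node->pivot[i])
				break;
		if (node->leaf)
			return node->slot[i];	/* the entry itself */
		node = node->slot[i];		/* descend one level */
	}
	return NULL;
}
```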
Here are some worst-case trees. Now, for the 21, this is great: we found it right there at the top. We only had to look at one node, and we found it. That's perfect. Mattia, you let me down. You said 48. Again, we only had to move through two nodes this time. OK. But if we'd chosen anything else, we might have had to walk quite a long way down the tree. This tree is balanced according to the properties of a red-black tree, but it's not perfectly balanced — it's good, but not perfect. And most trees have this property that you only approximately balance them, because getting them perfectly balanced wastes too much time. Similarly, the maple tree can become somewhat unbalanced. This is really too small an example to show you a deeply unbalanced tree, but it's the best I could do: you end up with three nodes, each with two entries. The final one has three entries, because you can't have a maple tree node with a single entry in it.

One other kind of tree we have in the kernel is the radix tree. Over the last three or four years, I've been working on the radix tree, giving it a better interface and making it easier to use, and I call that interface the XArray. And I realize that I have committed quite the crime: I have made it easy to misuse the radix tree. So people might now start misusing it. The radix tree really is not a good data structure for ranges. Because of how it's indexed, how it works, it just blows up in memory. And by the way, this is also a cut-down version — the nodes are one-eighth the size they would be in real life, and the amount of memory you'd end up wasting is far worse than shown on this slide. It's just not the right data structure for this purpose.

So I've done a little comparison here. The radix tree has some great properties. One is that it's RCU-safe. For those who aren't familiar, RCU-safe means that you can walk the tree without taking any locks, which means we have great scalability. It's why the radix tree is used for things like the page cache, where performance and scalability are absolutely essential. So that was a property I wanted to keep for the maple tree. As for support for ranges: originally the radix tree had no support for ranges. Without changing the data structure too much, I added limited support for ranges. But people then started trying to use it as if it had generic support for ranges, and that was a bad idea. The maple tree does not support arbitrary ranges either; it only supports ranges which don't overlap. So if you want to use it for something like file locks — where you can have a read lock that goes from A to B, and another one that goes from C to D, and C can be less than B; that's fine, that's allowed — this is not the appropriate data structure for it. This is really for when one value has one pointer associated with it.

The radix tree has a very, very short height. Not in that particular example, where the maple tree did better, but for the kinds of things the radix tree should be used for, it is theoretically the best possible tree to use. The maple tree sacrifices some of the best-case performance in order to make the worst-case performance significantly better. The biggest problem I have — well, one of the two big problems I have — with the rbtree is that it's a very hard API to use. You have to write a significant amount of code in order to search it and to rebalance it. That makes it very flexible, but it also makes it very hard to use. Whereas the radix tree and the maple tree have an easy-to-use interface where you don't have to write any code; you just have to call the functions.
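To make the non-overlapping-ranges point concrete, here's a hypothetical usage sketch. The function names mt_store_range() and mt_load() are made up for illustration, not the final kernel API:

```c
struct maple_tree;

/* Hypothetical API, for illustration only. */
void mt_store_range(struct maple_tree *mt, unsigned long first,
		    unsigned long last, void *entry);
void *mt_load(struct maple_tree *mt, unsigned long index);

void example(struct maple_tree *mt, void *vma_a, void *vma_b)
{
	/* Non-overlapping ranges are fine: each index maps to one value. */
	mt_store_range(mt, 0x1000, 0x1fff, vma_a);	/* [0x1000, 0x1fff] */
	mt_store_range(mt, 0x2000, 0x2fff, vma_b);	/* adjacent range   */

	void *v = mt_load(mt, 0x1234);			/* returns vma_a    */
	(void)v;

	/*
	 * Overlapping ranges, as file locks need, are NOT supported:
	 * storing [0x1800, 0x27ff] would overwrite the tail of the first
	 * range and the head of the second, not coexist with them.
	 */
}
```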
One of the things which makes the rbtree similar to the list_head — sorry, the doubly linked list we talked about earlier — is that you embed a node into your own data structure, so you don't have to allocate memory. Allocating memory inside the kernel is harder than in user space: it may fail, and you have to be prepared for the possibility that allocating memory is going to fail. The radix tree and the maple tree both use nodes allocated externally to the data structure you're storing, rather than embedding them. The rbtree embeds its node, and it's only 24 bytes: a pointer to the parent, a left pointer, and a right pointer. The radix tree has 64 pointers at each level of the tree, so it's got a huge node size. Whereas the maple tree has somewhere around eight pointers, so its nodes are 128 bytes — or, as I like to say, two cache lines.

Now, that affects the height of the tree. One example: if you put a million entries into the tree, numbered from 0 to 999,999, the radix tree gives you a height-four tree, so you only have to dereference four levels of the tree to get your data. The red-black tree has height 18, because you're only halving the number of nodes at each level of the tree. And the maple tree is of height seven, so it's intermediate in depth between the radix tree and the red-black tree.

So these are the nodes which make up the maple tree, at least in our current code base. We have range_64 nodes, which are used both as non-leaf nodes and as leaf nodes. The value you're looking up — the index you're looking up in the tree — you compare against these pivot values, and the first pivot you find that you're less than tells you to take the pointer out of the corresponding slot. We mark the range nodes as to whether they are leaf or non-leaf. If a node is non-leaf, you go down to the next level of the tree and repeat the process; if it's a leaf, the pointer you found is the answer you're looking for, and you return that pointer. The arange_64 nodes I'm going to talk about on the next slide, but the important point about them is that they're only non-leaf nodes: they provide summaries of the nodes below them.
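Laid out as C structs sized to two cache lines on a 64-bit machine, those two node types look roughly like this — the field names and exact counts are illustrative, not the kernel's actual definitions:

```c
struct ma_node;

/* range_64: used as both leaf and non-leaf.  8 + 56 + 64 = 128 bytes. */
struct ma_range_64 {
	struct ma_node *parent;
	unsigned long pivot[7];		/* boundaries between the slots */
	void *slot[8];			/* children, or entries in a leaf */
};

/*
 * arange_64: non-leaf only.  It trades fan-out for a gap[] array that
 * summarizes the largest free range anywhere below each slot — the
 * summary used for address-space allocation, discussed next.
 */
struct ma_arange_64 {
	struct ma_node *parent;
	unsigned long pivot[4];
	void *slot[5];
	unsigned long gap[5];		/* largest gap under slot[i] */
};
```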
Now, the radix tree works perfectly well for many of its users, so just replacing the radix tree is not going to win us very many friends. We decided to tackle a problem which really exists, and which will actually help customers with their workloads — and us with our own laptops. Whenever you call mmap, you create something called a virtual memory area, which we all call a VMA, because life's too short to type out "virtual memory area" every time. Every time you take a page fault — that is, the CPU accesses a page of data it hadn't accessed before — we walk our way down a tree, currently a red-black tree, until we find the virtual memory area which corresponds to the address of the data you're looking for. We find the VMA, and then we handle the fault; how we handle the fault doesn't matter to this talk, because that's not the important bit. And we have various other things we need to do. It's not just about finding — it's also about allocation. Whenever you call mmap and you don't specify the MAP_FIXED flag, you have to find some empty address space.

And that's what the gaps on the previous slide were all about: at each layer of the tree, we record the size of the largest gap to be found underneath each slot in that particular node. That lets us skip large parts of the tree — we don't have to search parts of the tree where we know there aren't sufficiently large gaps. One of the biggest problems we're having right now is that if you call mmap with MAP_FIXED, that actually overwrites existing VMAs, so you may end up having to delete some. The easy case is when you split an existing VMA, and honestly, that's the case most people actually hit. But it is not impossible for MAP_FIXED to delete tens or hundreds of VMAs, and that is a possibility the literature on these trees does not teach us how to handle. So we've had to devise our own algorithm for that. An easier problem to solve: if you look at this particular file in /proc, it will show you all the existing VMAs a given process has, so we have to be able to iterate over them. Fortunately, B-trees are really good at that kind of thing. Another tricky part is stacks: if you have a stack in your process, you can mark it specially so that when you access off the bottom of the stack — or the top of the stack, on some architectures — the kernel will automatically allocate you another page of memory and let your process continue growing its stack.

I was hoping to have actual numbers for you today. Unfortunately, we are not far enough through this project to give you those numbers, so here's our analysis instead. Assuming we have a perfectly balanced rbtree, these numbers are for Firefox, which on the day we measured it had 1,415 VMAs. On average, you find the VMA you're looking for in 9.5 dereferences: there's a 1 in 1,400 chance you find it at the first node, a 2 in 1,400 chance you find it at the second level of the tree, et cetera, et cetera. The maple tree actually needs to store null entries — because of the way the maple tree works, it handles storing nulls between ranges as if it were just storing another range. So I did the calculations, and if you have the worst-case tree, it's going to take you about 8 dereferences. So that's a win: the maple tree's worst case beats the rbtree's best case, which is fantastic in terms of dereferences. The average tree is 7 dereferences, and the most compact tree is also 7. And you can see from these numbers that if a process only had half as many VMAs, we might be able to get down to 6 or 5. Firefox maps far more VMAs than anyone else — it's probably the process on my system that maps the most different things. Indeed, if you look at cat, it's only got about 30, so that's a very short tree.
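For readers who want to check that 9.5 figure: in a perfectly balanced binary tree, 2^(k-1) nodes sit at depth k, so the expected number of dereferences for a uniformly random hit among 1,415 nodes works out to roughly 9.5. A quick sketch of the arithmetic:

```c
#include <stdio.h>

/* Back-of-envelope check of the "9.5 dereferences" figure. */
int main(void)
{
	double total = 0;
	int n = 1415, seen = 0;

	for (int depth = 1; seen < n; depth++) {
		int here = 1 << (depth - 1);	/* nodes at this depth */

		if (seen + here > n)
			here = n - seen;	/* partial bottom level */
		total += (double)here * depth;
		seen += here;
	}
	printf("average dereferences: %.1f\n", total / n);	/* ~9.5 */
	return 0;
}
```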
Yes, question? The question was: how many cache lines are you going to touch for each of these lookups? It's actually a very good question, and Mattia has made a very important point, which is that your performance is going to be dominated by how many cache lines you touch — that's just how modern CPUs are limited. At each level of the maple tree, it's two cache lines; at each level of the rbtree, it's a single cache line. However, the two cache lines you touch are consecutive, and CPUs are really, really, really aggressive about prefetching cache lines. If you just touch one byte, the CPU is probably going to load in at least two cache lines. I've seen traces from CPUs where they'll load five, eight, ten cache lines just because you touched one byte, because it's almost free for them to do it. I used to work for Intel; if you have a suitable development machine, you can tweak all kinds of things about how aggressive the prefetchers are. I do not recommend touching that on your own machine, because chances are somebody smarter than you has already tuned it for the common case. If you really, really, really know what you're doing, sure, go ahead and play with it — but on your laptop, just leave it alone. The prefetches certainly go as far as L3; I don't know if they come as far as L2 or L1, and I think if I did know that, I wouldn't be able to tell you, because Intel would consider it proprietary. All I know is that they're getting closer to the CPU: the lines aren't coming from memory anymore by the time you're touching the second cache line.

OK, so that's performance. Now let's look at memory consumption. I already said that inside a VMA, you're going to have these 24 bytes: a pointer to the parent, a pointer to your left child, and a pointer to your right child. But for the VMA tree, we also store next and previous, because fairly often in the memory management subsystem, once you have a VMA, you want to know where the next one starts or where the previous one ends. So instead of walking back up the tree to the nearest common ancestor and then back down, you just follow the pointer. That's another two pointers. And then, because this tree is also optimized for doing allocations, we also store the gap. So that ends up being six pointer-sized fields that we allocate — 48 bytes per VMA — which works out to 66 kilobytes for those 1,415 VMAs. And the calculations then work out like this: the worst case of the maple tree does not beat the best case of the rbtree, but the average case does. It's about 20% worse in the worst-case scenario, but the best-case scenario is actually 50% better, and the average case is 20% better. So I'm going to call that a win too, even though the worst case is, well, yeah. Once we have our implementation fully debugged and we're able to provide data, I feel like this will be pretty exciting.

But the rbtree problem is not the only one I want to solve; I also want to solve the radix tree problems I discussed earlier. We have a function in the kernel called idr_alloc_cyclic. The IDR data structure — I don't know what the letters IDR stand for; it might well be "ID" and then something, but the guy who wrote it appears to have disappeared, so we don't know what the letters stand for. The idea of idr_alloc_cyclic is: here is a pointer, tell me what ID you stored it at; here's another pointer, tell me what ID you stored that at. And the IDs should keep incrementing up to a certain point, at which point they wrap around to zero. So the first pointer you hand it gets index zero, the second one gets index one, and so on until you get to the end, when it wraps back around to zero. And it won't reuse an ID that's still in use. If you're storing process IDs in an IDR — which Linux does — the radix tree data structure is pretty inefficient at that, because you've got gaps in between. If you look at the PIDs in use on your system — type ps and look at the range of PIDs in use — you'll see that there are gaps. Every gap you see represents at least one pointer that is not being used, and you have to keep track of which ones aren't in use anymore.
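idr_alloc_cyclic() is a real kernel API; this usage sketch is mine (the session_idr and struct session are made up for the example), showing the cyclic behaviour just described:

```c
#include <linux/idr.h>

/* Hypothetical object and tree, purely for illustration. */
struct session {
	int id;
	/* ... */
};

static DEFINE_IDR(session_idr);

int new_session(struct session *s)
{
	/*
	 * Hand in a pointer, get back the next ID after the last one
	 * allocated, wrapping from 9999 back to 0 and skipping any
	 * IDs that are still in use.
	 */
	int id = idr_alloc_cyclic(&session_idr, s, 0, 10000, GFP_KERNEL);

	if (id < 0)
		return id;	/* -ENOMEM, or -ENOSPC if all IDs busy */
	s->id = id;
	return 0;
}
```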
So the idea with the maple tree is that we're going to add a couple of different kinds of nodes. What we've talked about so far has been optimized for storing a range of indices which map to a single value. Now we also want to optimize for single values which map to a single pointer — and not only that, for single values which map to a single pointer but are distributed in a sparse fashion. So we've added a couple of new node types — or at least we've written down how to deal with them: a dense node and a sparse_64 node. With a dense node, you'll notice we don't store any pivots. Dense and sparse_64 nodes are only for use as leaves; you can't use them higher up in the tree. By the point you're indexing into a dense node, you have an index which is between 0 and 14, relative to the pivot of the parent you came from, and you just load the pointer you're looking for out of that slot. That's it. You don't scan through the slots — there's nothing to scan for. All you do is load the value you're looking for. Sparse nodes are a bit more interesting, in that you look at each of the pivots in turn. If any of those pivots is the index you're looking for, you load the corresponding pointer; if you get to the end of the pivot array and it was none of them, the answer is no. So this is literally a list of six values and the six pointers that correspond to them, plus a seventh pointer, which is used for the entry corresponding to the pivot of the parent. So we found a way to use that last slot.
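In struct form — again with illustrative names, using the field counts from the slide (15 dense slots; six pivot/slot pairs plus the inherited seventh slot) — the two lookups look roughly like this:

```c
#include <stddef.h>

struct ma_node;

struct ma_dense {
	struct ma_node *parent;
	void *slot[15];		/* offsets 0..14 below the parent's pivot */
};

static void *dense_lookup(struct ma_dense *node, unsigned long offset)
{
	return node->slot[offset];	/* no pivots: a direct load */
}

struct ma_sparse_64 {
	struct ma_node *parent;
	unsigned long pivot[6];	/* exact indices, not range boundaries */
	void *slot[7];		/* slot[6] holds the parent-pivot entry */
};

static void *sparse_lookup(struct ma_sparse_64 *node, unsigned long index,
			   unsigned long parent_pivot)
{
	int i;

	for (i = 0; i < 6; i++)
		if (node->pivot[i] == index)
			return node->slot[i];
	if (index == parent_pivot)	/* the "last slot" mentioned above */
		return node->slot[6];
	return NULL;			/* not present */
}
```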
I'm going to skip over the API. I do want to note that the API we have is very, very reminiscent of the XArray API, and the intent is to unify the two and convert the maple tree over to the XArray API. But first we want to make this suitable for merging, to handle the VMA problem; then we'll come back and unify the two later. So I'm just going to skip over these slides.

This is our short-term plan. Obviously, we need to finish what we've started with the VMA tree and do a whole bunch more benchmarking and testing to make sure we're not introducing any bugs. We need to think about 32-bit CPUs: the whole way through this talk, I've been talking about pointers being 8 bytes. Well, if pointers are 4 bytes, that changes the calculations on a number of the things we've been looking at. Honestly, 64-bit CPUs are not quite ubiquitous yet, but they're certainly the ones we care about most and have been thinking about so far. One of the things that's missing compared to the XArray API is search marks. What search marks let you do is tag each of the pointers with a particular mark — 0, 1, or 2 — and then you can search and say things like: show me all of the pointers which are tagged with mark number 1. Some users make a lot of use of that, and others make no use of it at all. So we need a new node format for supporting search marks. That's fine.

We started out doing an implementation of the dense nodes, and then realized that nobody was going to give us any prizes for replacing the radix tree with something that worked not quite as well as the radix tree in the best case. We've put that on hold for the moment, but we'll need to go back to it and dust it off. Like I said, we want to implement sparse_64 nodes, and to get rid of the maple tree API in favor of the XArray API.

We have some fun longer-term plans, too. One is to compress the pivots. If you look at a real-life tree, you'll see that it very quickly degenerates to a point where the pivots all start with very nearly the same pattern of bits. And we said to ourselves: well, what if we figured out a way to just look at the bottom 32 bits? We actually have calculations going all the way down to nine-bit pivots, but we're starting to run out of bits to encode which kind of node we're looking at, and the benefits really do start to drop off after you go down as far as 32 bits, so that's probably as far as we're going to go. You can see that for a range_32 node, we get 10 slots rather than the eight slots we were getting with range_64. And that's pretty valuable — the more fan-out you get at the higher levels, the shorter your tree is. So I think that's worth having.

Another thing we want to use the XArray for is the file descriptor table. Right now, the file descriptor table is its very own custom implementation of a resizing array: every time you exceed a certain size, the file descriptor code goes off and allocates twice as much memory as it used to have, copies all the pointers over, and says, here you go. Well, that seemed like a really bad idea to me, and I said, whoa, we should convert that over to using an IDR — it's perfect. And Google came to me and said: we have an application that has half a million file descriptors and accesses them randomly, because they're network sockets and we don't know when the next network packet is coming in. If we do that, we're going to incur seven cache misses per file descriptor lookup, as opposed to the one cache miss we currently incur, because it's a vmalloc'ed array. And I said: fine, OK, I will stop touching this code for now. But I haven't forgotten, because I don't think anybody should be implementing their own custom resizing array of pointers in the kernel.
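For reference, the pattern being complained about looks something like this — a userspace-flavored sketch, not the actual fdtable code. The appeal of it is that lookups in the flat array cost a single cache miss:

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of a grow-by-doubling pointer array, not the real fdtable. */
struct ptr_array {
	void **slots;
	size_t capacity;
};

static int ptr_array_grow(struct ptr_array *a, size_t want)
{
	size_t new_cap = a->capacity ? a->capacity : 64;
	void **bigger;

	if (want <= a->capacity)
		return 0;
	while (new_cap < want)
		new_cap *= 2;			/* double until it fits */
	bigger = calloc(new_cap, sizeof(void *));
	if (!bigger)
		return -1;
	/* The copy the talk objects to: every pointer moves on resize. */
	memcpy(bigger, a->slots, a->capacity * sizeof(void *));
	free(a->slots);
	a->slots = bigger;
	a->capacity = new_cap;
	return 0;
}
```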
So I'm suggesting that we start out by allocating an entire page for this kind of situation: once you grow past maybe 256 pointers in your tree, we just allocate an entire page to you, and you get 512 pointers per page. That ends up taking three levels out of the tree for dense regions of it, because you don't have to use dense nodes as the only thing in your tree — you can have a mixture of different types of node in different parts of the tree, depending on how your indices are distributed within the index space. So I feel like this is going to be a useful thing. We may never end up being able to use it; it's going to depend on benchmarking. But Google's not going to say no to this if it improves their performance.

One of the other things we want to do is replace hash tables. It's really, really hard to size a hash table effectively. If the top level of your hash table is too small, then you get long hash chains, and walking hash chains is expensive — it's basically a singly linked list. No, it's basically a doubly linked list, pardon me. But if you oversize the top of your hash table, then you're just wasting memory. And we see these things all over the place, like with the dentries: we assume that on a larger system you're going to be accessing more directory entries, more filenames, which is not necessarily true — it really depends on how your system is used. You can override it, but honestly, having sysadmins tweak the dentry hash size is just a ridiculous thing to expect anyone to do. So we should try to size these things more automatically. Complicating this, someone has actually introduced an automatically resizing hash table, so we're going to have to benchmark against a real data structure soon. And that'll be cool — may the best data structure win. I feel like we have a good shot, but it's by no means a guarantee.

And then we have a whole bunch of open questions, and I don't have much time to go into them deeply. Here's one: I want an API to let us remove or insert a whole batch of entries all at once, and I don't know what that API should look like. That would be a good topic for conversation later, I'd say. We're also talking about using a larger node size. Right now we size our nodes at 128 bytes, two cache lines. We're talking about maybe going to 192 bytes, which would bring the height of the tree down, but would increase the number of cache lines you potentially touch each time you go down the tree. It's an interesting trade-off, and we're not quite sure what to do yet; again, we need to get the code working as it currently is and then do some benchmarking. I would deeply love to get rid of the rbtree data structure altogether, but the XArray does not handle overlapping ranges, and quite a lot of the rbtree users in the kernel today do use overlapping ranges. I think that's actually going to have to be a different API — maybe some shared underlying code, maybe not. That is definitely a project for the more distant future. If we find users which often have gaps between their ranges, we may end up wanting to introduce a new node type to support those. It's on the table, but right now we're optimized for ranges which are adjacent to each other: where there are two ranges which are not adjacent, we insert a range which maps to NULL, rather than having a (start, length, pointer) triplet. The radix tree supports only three search marks. I've been asked to support five, or seven — and somebody said, well, if you're supporting seven, then I'm going to ask for ten. So I really want to get a sense from the people who might use it of how many search marks they would like, because the maple tree can trade off how many search marks it supports more readily than the radix tree can.
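For those who haven't met search marks: this is roughly what using the XArray's existing three marks looks like. The pages array and write_page() helper here are made up for the example:

```c
#include <linux/xarray.h>

static DEFINE_XARRAY(pages);	/* hypothetical tree for this example */

void write_page(void *entry, unsigned long index);	/* made-up helper */

static void mark_dirty(unsigned long index)
{
	xa_set_mark(&pages, index, XA_MARK_0);	/* tag one entry */
}

static void writeback_all_dirty(void)
{
	unsigned long index;
	void *entry;

	/*
	 * Visit only the entries tagged with mark 0; subtrees containing
	 * no marked entries are skipped entirely.
	 */
	xa_for_each_marked(&pages, index, entry, XA_MARK_0)
		write_page(entry, index);
}
```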
And that is the end of my talk. Particularly if you have an account on kernel.org already, I would really appreciate it if you would sign my PGP key, because I lost access after the kernel.org break-in, and I've recently asked to reactivate that account. In order to do that, I need signatures from kernel developers on this key, so please do. I ran a minute over — OK, thank you all for coming. I'll take questions for a couple of minutes, if anyone wants. Yeah, Matthew. I have looked into Judy arrays, and I think you and I can have a very fruitful discussion about why I think this is probably better for our purposes than the Judy array. Sounds like a great idea. Yes, sir. How does this compare to cache-oblivious B-trees?

OK, so we've done some looking into those. The B-trees in the literature generally don't handle ranges; they're optimized for, essentially, a single value mapping to a single pointer — or even just "is a single value present in the tree?" You find this quite often in computer science textbooks: they're literally just storing the value, which amuses me, because most of the time you want to associate something with that number. The better textbooks say adding this information is an exercise for the reader, which is lovely. But yeah, how do they compare? They're really optimizing for different things. We are very, very conscious of caches as it is. One of the things I didn't talk about earlier is that B-trees were originally developed for use on disk, back in the 1970s, and the ratio between CPU speed and disk speed then is about the same as the ratio between CPU speed and memory speed now. So it really is appropriate to start using B-tree data structures for main-memory data structures. The big difference is that you can read memory one cache line at a time; you don't have to read four kilobytes at a time. So that encourages us to use smaller nodes rather than larger nodes. Obviously, larger nodes do give you the advantage of a less deep tree, but then you have to spend more cache lines at each layer of the tree looking for your entry. So there are definitely trade-offs to be made there, and I won't say we've definitely made the right ones yet. Thanks — great question. Yeah. That is actually what I meant, yeah; I just wasn't speaking very precisely when I said "compressed". There are all kinds of interesting ways you can choose to compress the indices, but I think the easiest one is simply to say: instead of these being absolute values, they're offsets from the parent's pivot. Yeah. Anyone else? Cool. Thanks for coming.
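A sketch of that last point from the Q&A — storing offsets from the parent's pivot instead of absolute values. The layout and names are illustrative, sized to show how 32-bit deltas buy the tenth slot mentioned earlier:

```c
struct ma_node;

/* Illustrative only: 32-bit pivot offsets instead of 64-bit absolutes. */
struct ma_range_32 {
	struct ma_node *parent;
	unsigned int pivot_off[9];	/* deltas from the parent's pivot */
	void *slot[10];			/* fan-out of 10, up from 8 */
};					/* 8 + 36 + 80 = 124 bytes */

static unsigned long ma_pivot(unsigned long parent_base,
			      const struct ma_range_32 *node, int i)
{
	/*
	 * The whole subtree spans indices near parent_base, so a 32-bit
	 * delta is enough to reconstruct the full 64-bit pivot.
	 */
	return parent_base + node->pivot_off[i];
}
```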