So, some words about me. I live in Lund, in Sweden. I have been working at Sony since 2011. The main areas I am responsible for in the company are performance, power saving, and thermal, for mobile devices and mobile workloads. Apart from that, I work on RCU together with the community, and I also work on MM.

So let me start with the problem. Basically, when we manage the vmap space, the global vmap space, we have three global data structures, and those three data structures are protected by three global spinlocks. Let me go through them one by one and explain a little bit about each. First of all, there is the free_vmap_area_lock. This lock protects access to the global free vmap space when we allocate and when we deallocate, so we need it. Then we have the vmap_area_lock. That lock deals with the bookkeeping data, that is, the mapped data, or as we also call them, the busy areas. And the last one is the purge_vmap_area_lock. It is used when we access the lazily freed areas, which are yet another separate global data structure. So if we summarize, we can conclude that these three spinlocks protect three different data structures and are a bottleneck for systems with many CPUs. If you have a single-CPU system, then I guess this is not a problem for you.

The next slide covers this in more detail, starting with access to the allocated bookkeeping data. On the right side you see the mapped areas. It is a data structure consisting of a linked list and an RB-tree. Here in the middle we have the vmap_area_lock, and on the top you see the users, basically the CPUs that invoke our APIs such as vmalloc, vfree, and so on. This is a bottleneck when it comes to the bookkeeping data, because each CPU has to wait for the others in order to accomplish any request. So this is problem number one.

Why do we need this tree and this bookkeeping data? First of all, we need it in order to find the vmap area that a particular address belongs to, for example when we are going to free it. We also need it because we would like to dump the mapped areas via /proc/vmallocinfo: if users would like to understand what kind of mapped areas we have in the system, they basically do "cat /proc/vmallocinfo" and get the list. And the last reason is that when we need to generate a kernel dump, when we run into a panic or produce a crash dump, we would like to include the vmap data in that dump; that is kcore.c. So this is the problem with the first global data structure.

Then we go to the next one. Again, on the right side you see a picture. It is a high-level picture with a high-level description, because I do not want to go into deep detail, otherwise we will lose focus; that is why I have tried to simplify it as much as I can. At the bottom we have the global free vmap space. Then we have another spinlock that protects this second global data structure; its name is the free_vmap_area_lock. Here we also have users, four users in this example, and you can see that access is not parallel at all: everything goes through one single lock, and every CPU, every user that deals with the global vmap space, has to wait for the others in order to have its requests served. What we can conclude is that such an approach does not scale with the number of CPUs at all.

And the last one is about the lazily freed areas; on the right side there is a similar picture. Here we have the lazily freed areas. It is the last one, the third global data structure.
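For reference, the three locks and the data structures they protect look roughly like this in mm/vmalloc.c (a simplified sketch of the declarations as they stood at the time of this talk; fields and helpers are omitted):

```c
/* Busy (mapped) areas: the bookkeeping data, an RB-tree plus a list. */
static DEFINE_SPINLOCK(vmap_area_lock);
static struct rb_root vmap_area_root = RB_ROOT;
static LIST_HEAD(vmap_area_list);

/* Free vmap space: where new vmap areas are allocated from. */
static DEFINE_SPINLOCK(free_vmap_area_lock);
static struct rb_root free_vmap_area_root = RB_ROOT;
static LIST_HEAD(free_vmap_area_list);

/* Lazily freed areas: waiting to be purged and returned. */
static DEFINE_SPINLOCK(purge_vmap_area_lock);
static struct rb_root purge_vmap_area_root = RB_ROOT;
static LIST_HEAD(purge_vmap_area_list);
```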
And here we have the purge_vmap_area_lock, which protects access to this data. Here we also have users, and we have a similar situation, exactly the same situation. So we can summarize that simultaneous freeing or allocation is a problem and does not scale with the number of CPUs.

Let me summarize a little. On the left side you see the allocation path; on the right side, the free path. On the allocation path we need to access at least two global data structures. The first one is the free_vmap_area_root, which is where we allocate from. Once we get a VA from that global heap, we have to place that VA into the bookkeeping data associated with the globally mapped areas, and that data is protected by the vmap_area_lock. On the vfree path we also need to access the bookkeeping data: we try to find the address within the bookkeeping tree, find the vmap area associated with that VA, and unlink it. Then we can place it into the purge vmap area tree, and that tree is protected by yet another spinlock. So here are number one, number two, and number three: three global data structures. And what we can conclude is that these three global data structures cannot be accessed concurrently. This is the main problem I am talking about.
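As a rough illustration of the two paths just described, the lock acquisitions look something like this (simplified pseudocode, not the exact mm/vmalloc.c control flow; the helper names are hypothetical):

```c
/* Allocation path: two global locks are taken in turn. */
spin_lock(&free_vmap_area_lock);
va = alloc_free_area(&free_vmap_area_root, size);     /* hypothetical helper */
spin_unlock(&free_vmap_area_lock);

spin_lock(&vmap_area_lock);
insert_busy_area(&vmap_area_root, va);                /* hypothetical helper */
spin_unlock(&vmap_area_lock);

/* Free path: look up and unlink, then queue for the lazy purge. */
spin_lock(&vmap_area_lock);
va = find_unlink_busy_area(&vmap_area_root, addr);    /* hypothetical helper */
spin_unlock(&vmap_area_lock);

spin_lock(&purge_vmap_area_lock);
merge_or_insert_purge(&purge_vmap_area_root, va);     /* hypothetical helper */
spin_unlock(&purge_vmap_area_lock);
```

Every CPU funnels through the same three global locks, which is why none of these steps can proceed concurrently.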
May I ask a question? Yes. You said the global locks and data structures are not scalable, but on which workloads do you see the contention most, on vmalloc or on vfree? At least I see contention when I run my performance tests. Then I know that the XFS people, who work on the XFS file system, see contention under their workloads, and they would like to get maximum throughput out of the vmalloc code. So at least we know about XFS as of now. Oh sure, thanks.

On this plot I would like to show you vmalloc scalability. On the X axis we have the number of threads, from one up to 32. On the Y axis we have the time per vmalloc/vfree pair, measured in microseconds. I ran the vmalloc test suite on my AMD system with 32 physical CPUs; that is why I go up to 32 threads. If we look at single-thread performance, on my system one pair takes approximately two microseconds, which is pretty small, of course. If we look at 32 threads, we need approximately 50 microseconds. And please note that I am doing this on a super powerful machine. What we can conclude is that the difference is approximately 25 times, so in theory we could improve the throughput by around 25 times. In theory; in practice, I do not think so. You can see the time grows steeply as threads are added one by one.

And on this plot I would like to show you lock statistics, because I wanted to understand the lock contention within vmalloc and vfree. The blue line is the vmap_area_lock. Then we have the free_vmap_area_lock, and then the purge_vmap_area_lock. On the X axis we have the number of jobs running the test, and on the Y axis we have the number of contentions. As you can see, the biggest contention is on the vmap_area_lock, and it is due to a large number of fragments: we do not merge any data in that tree, and that lock is responsible for protecting the bookkeeping tree, the bookkeeping data. The other two locks show much lower contention, because for those two structures we use merging techniques to minimize fragmentation.

So what we can conclude is that the vmap_area_lock has five times higher contention than the other two. This is the same test, the same number of CPUs, and the same hardware as before.

Now let me talk about the proposal. I would like to propose a new per-CPU cache to mitigate the contention. The first step is introducing the per-CPU cache. Second, we prefetch into the cache from the global vmap heap when it becomes empty, or when it is empty initially. Third, the cache size can be configured via a kernel boot parameter, or it could be changed at runtime, but I think that is a bit tough and not so easy; it has not been decided. And then, as a fourth step, at a high level, the cached, prefetched chunk is split based on the allocation requests. It is basically clipping: we clip the cache.

This picture shows what I am proposing in more detail. Let me start from the bottom. Here we have the vmap space, going from zero up to ULONG_MAX. CPU one, CPU zero, and CPU two are just users; we can say user one, user zero, user two. As a first step, when CPU number one does, for example, a vmalloc, it checks its own cache. If the cache is empty, we do a simple prefetch: we access the global vmap heap and take a chunk of fixed size. After the prefetch, we can allocate directly from the cache; in this scenario you can see 2020-2035, which are just start addresses. Once we have allocated from the cache, we need to place the VA into the bookkeeping data. The question is how to convert the address to the correct bookkeeping data, because after this change the bookkeeping data is also per-CPU. So we do an address-to-CPU-vmap-zone conversion, using the simple formula here at the bottom, to understand which zone the addresses 2020-2035 belong to; you can see they belong to zone number one. We just prefetched the chunk, and the conversion identifies it as zone number one. Once we have identified the zone as number one, we go to the per-CPU bookkeeping data of CPU number one, take its lock, and do a simple per-CPU insert.

It is the same story with CPU number zero, user number zero. We check the local per-CPU cache; if it is empty, we do a simple prefetch. In this picture, the chunk happens to be prefetched from zone number two. Then, as the next step, we do a simple conversion according to the formula, take the lock of CPU number two, and insert into the bookkeeping data of CPU number two. Please note, this is all per-CPU. Let us consider the last one, CPU number two, user number two. In the same way: its cache is not empty, so it allocates at address 60, converts the address to the right per-CPU bookkeeping data, which is number zero, and inserts it there. That is basically the whole thing. Please note that the fetch size can, as I explained before, be configured either by passing a kernel parameter or in some other way; we have not decided yet. The size is fixed, and based on that, because we know the fixed size, we can identify the zone easily.
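Here is a minimal sketch of the scheme just described, assuming a fixed stride; every name in it (vmap_zone, addr_to_zone, zone_alloc, ZONE_SIZE, NR_ZONES, and the helper functions) is hypothetical, for illustration only, and not taken from the actual patch series:

```c
/* Hypothetical per-CPU zone; all names are illustrative. */
struct vmap_zone {
	spinlock_t lock;            /* protects this zone's busy tree only */
	struct rb_root busy_root;   /* per-zone busy (mapped) areas */
	spinlock_t lazy_lock;       /* protects this zone's lazy-free data */
	struct rb_root lazy_root;   /* per-zone lazily freed areas */
};

#define NR_ZONES	32            /* e.g. one zone per CPU */
#define ZONE_SIZE	(4UL << 20)   /* fixed prefetch size, e.g. 4 MB */

static struct vmap_zone zones[NR_ZONES];

/* The fixed stride makes the owning zone a pure function of the address. */
static struct vmap_zone *addr_to_zone(unsigned long addr)
{
	return &zones[(addr / ZONE_SIZE) % NR_ZONES];
}

/* Per-CPU cache: a prefetched chunk [start, end) clipped per request. */
struct vmap_cache {
	unsigned long start, end;
};
static DEFINE_PER_CPU(struct vmap_cache, vmap_cache);

static unsigned long zone_alloc(unsigned long size)
{
	struct vmap_cache *vc = get_cpu_ptr(&vmap_cache); /* pin this CPU */
	struct vmap_zone *vz;
	unsigned long addr;

	if (vc->start + size > vc->end) {
		/* Cache empty: take one fixed-size chunk from the global
		 * heap; the global lock is taken only once per chunk. */
		vc->start = prefetch_from_global_heap(ZONE_SIZE); /* hypothetical */
		vc->end = vc->start + ZONE_SIZE;
	}

	/* Clip the request off the front of the cached chunk. */
	addr = vc->start;
	vc->start += size;
	put_cpu_ptr(&vmap_cache);

	/* Insert into the bookkeeping data of the owning zone. */
	vz = addr_to_zone(addr);
	spin_lock(&vz->lock);
	insert_busy_area(&vz->busy_root, addr, size); /* hypothetical */
	spin_unlock(&vz->lock);

	return addr;
}
```

With the slide's toy numbers the stride would be much smaller; with a stride of 2000, for instance, address 2020 lands in zone one and address 60 in zone zero, matching the picture.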
I have a question. Yes. So, Vlad, I was wondering: this whole vmap, vmalloc thing looks very similar to the user-space mmap virtual address space and the VMA structs. It is almost the same problem. Can we not use the maple tree and RCU, which I believe is lockless now or going in that direction, and get this sort of concurrency instead of what you are proposing? Is there a reason that won't work? So, basically, when it comes to the maple tree and all those structures, the problem is not the data structure. I am not sure how the maple tree behaves when it comes to serialization, and I am not sure how it performs when, for example, we need many parallel, simultaneous inserts and removals. That is one point. Second point, I guess it is quite heavy, and it is still under development. I also checked different kinds of B-trees and so on, and the problem is not in using another data structure like the maple tree or a B-tree, because the RB-tree performs quite well. The problem is the serialization. That is the biggest problem. Did I answer your question?

Can we go back a couple of slides to the locking, please? So, in that graph, is it fair to say that the blue line, the area lock, is being used for both operations, the free and the purge? The blue line, it's basically, yes, right. You're right. It's vfree, sorry. So we're talking about this one, right? Which? The blue one. Yeah, the blue one is a pair of vmalloc and vfree. So the free does the purge. Yes. Right. So if you could mark areas as purged but not freed in the tree, would you need another tree? Right now you have three trees. If you go back another slide, I think, another one, sorry. So here you have the free areas, you have the ones that are allocated, and the ones that you want to free, right? And each one of these trees has its own lock, and each operation needs to use the area root one. So if you could just not modify that one in one of these paths, then you could potentially reduce your contention.

In theory, it could work. But the problem is that, for example, when we are talking about the bookkeeping tree, we need to protect it somehow anyway, because we need to access it, to do a traversal. We could of course use some kind of read lock, but then we could not modify the tree, and we would still end up with contention anyway. And the underlying problem is that all the accesses serialize against each other; this is the main problem. Potentially, yes, it could improve things, as you say. But it would not improve them completely, because there would still be a lot of room for improvement. From my point of view, the ideal improvement is when we partition the work so that CPUs serialize against each other as little as possible.

Right. So you could greatly reduce your blue line by only marking them. Let me see: you have a free list, you have the to-be-freed list, and you have a used list. So if you were to mark something as not free, but also not used, then you could get rid of the yellow line, and the free and the used lines would be closer to what you want. Yeah, but then you end up with contention on another line, for example the red one. Yeah, but you cut it in half. Yeah, it might, like you say; yes, we can improve things. But I am not sure; I cannot say right away that what you propose is a straightforward way. It might be that I am missing something. Yeah, so we have kind of the same sort of problem with the VMAs, where we have these VMAs and we need to track free and not-free space. And we are going toward RCU freeing, which would be kind of like your purge here. So there are a lot of parallels between what you are doing and what we are trying to solve in the VMA space.
And then when you get into the zoning and the per-CPU side, it is what we have been discussing this week about where we want to go in the same area. So it is very interesting to see the parallels here, and whether we could learn from each other and not go down completely different paths. It is also interesting that you find the RB-tree performs well, with a branching factor of, well, essentially three.

Yeah. Actually, initially the idea of splitting into these three trees was to mitigate the contention, because before that we used to have only one tree, a single tree that consisted of both gaps and allocated areas, and we had to access that tree both when we allocated and when we deallocated. But let me go on; since we are talking about the free path, I will explain more about it.

So here we go with the free path. On the left side we have the CPU context; on the right side we have the drain vmap work context. Once somebody invokes vfree, we have an address. We need to convert that address to its zone using the formula I showed you before; you can see that it belongs to zone number zero. What we do then is lock that particular zone only, because if somebody wants to free in another zone, it can do so without any problem, and you can see how the work is partitioned. So, returning to the CPU-zero zone: we take the lock of the bookkeeping data of zone number zero, do a simple traversal to find the area, and unlink it. Then we store it into a separate, lazily-freed data structure on the same CPU: lock it, and insert or merge. Here you see that we can merge; here we can insert. And that is it; from the vfree side, that is it.

On the right side we have the drain vmap work context. This context does the actual draining, returning the lazily freed data back to the global vmap space, or somewhere else; that has not been decided yet. Before doing that, we need to flush the TLB: we simply calculate the range, min and max, and do the flush. Then the final step is either to return the memory to the global vmap space (I have a prototype, and right now I return it to the global vmap space), or, as another approach, to return it to its own free zone, the per-CPU zone, basically back to the cache instead of the global space. So that is pretty much it about vfree. The main idea is that it is quite well partitioned: one user can take this zone's lock while another user takes that zone's lock, as sketched below.
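Here is a hedged sketch of that vfree side, reusing the hypothetical zone structure and addr_to_zone() helper from the earlier sketch; again, the names are illustrative, not the actual patch:

```c
/*
 * Hypothetical vfree fast path: only the owning zone is locked, so
 * frees that hit different zones proceed in parallel.
 */
static void zone_vfree(unsigned long addr)
{
	struct vmap_zone *vz = addr_to_zone(addr);
	struct vmap_area *va;

	/* Find and unlink from this zone's busy bookkeeping tree. */
	spin_lock(&vz->lock);
	va = find_unlink_busy_area(&vz->busy_root, addr); /* hypothetical */
	spin_unlock(&vz->lock);

	/* Store on the same zone's lazy-free structure; merge or insert. */
	spin_lock(&vz->lazy_lock);
	merge_or_insert_lazy(&vz->lazy_root, va);         /* hypothetical */
	spin_unlock(&vz->lazy_lock);
}

/*
 * Drain worker (separate context): flush the TLB once over the
 * accumulated [min, max] range, then return the memory either to
 * the global vmap space (the current prototype) or back to the
 * per-CPU free zone (the alternative, still undecided).
 */
```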
So, next steps. Can I ask you to conclude, because we are at the top of the session? Yeah, let's conclude. You mean conclude this slide, or in general? Okay, it seems you can get ten more minutes, because Mike is cutting his slot shorter. So let's say you have ten minutes. Okay. Then let's go to the next steps; I am more or less finished, so I will do it briefly. First of all, I would like to send out the proposal; I have not done it yet, but I will send it out as a patch series to the community. Then I will provide some results regarding the performance difference, because the aim, as the XFS people say, is to get maximum throughput. Okay, I will show the difference. And then we can adjust the series based on community feedback, or go with another solution, or something like that. And now, before questions: thank you for your attention, and thank you for giving me the time for this talk. Any ideas, comments, or concerns are welcome.

Vlad, a question. Essentially, you are pretty much statically fragmenting the entire vmalloc address space, right? If you have 32 CPUs, you have 32 distinct ranges. Yeah, yes, I need to mention this. Actually, yes, the number of ranges corresponds to the number of CPUs; it makes sense to make them match, though you could also make it half as many, or whatever. But yes, it makes sense to keep them identical. But if, say, CPU zero for some reason loads lots and lots of modules, and afterwards something running on CPU zero wants to vmalloc again, it will run out of address space, despite there being a lot of free address space in other zones, right? We actually allocate from the global heap, and we do not try to allocate within a particular zone, because that makes it even more complicated: if we are CPU number five and we would like to allocate within CPU number five's zone, it is a bit complicated from a search point of view. According to my tests and my test results, it does not make sense at all to do that. We just prefetch sequentially, a chunk at a time, each chunk with a fixed size, for example four megabytes, and that is it. The second CPU also prefetches sequentially, the same size. And the idea of doing it sequentially is that we do not want to allocate in other regions, like where modules reside; we do not want to wander far across the address space.

I have a question. I wonder where we see this contention in real-world applications, because we are not aware of such issues. Could you share whether you have seen it in some application, and what type of application? That would be helpful. I can share that. First of all, it is clear that there is contention; it depends on how heavily vmalloc is used and how heavy the test case is. For example, on Android devices we see quite big vmalloc traffic because of video, calls, audio, and so on. But the first time I saw it reported was on LKML: I saw a patch from the XFS people where they showed a perf trace and complained about vmalloc contention. I can share all such details when I send out the proposal. Thank you. You're welcome.

Hi, just a quick suggestion. Several times you mentioned configuration tunables or kernel parameters. I think it would be best if there were some automatic default that did not require people to decide anything. If there is a tunable for somebody who needs to tune it for their workload, it can be a kernel parameter, as you said, but it should not be required; there should be some sensible default, just to be user-friendly. I see your point and I agree with it. Yeah, we need to think about this.
For example, if you introduce a config option that asks for a number, then Linus would probably yell at it. Yeah. At least for 64-bit systems we have a quite big vmap space. When it comes to 32-bit systems, well, we are really limited there; as far as I know, we have only one gigabyte of address space. But yeah, I agree with you. Thank you. It seems there are no other questions in the room. So unless somebody online has any questions, I guess we just thank you for the presentation. And I am very much with Liam in that this seems like a very similar problem to the one the VMA space needs to address as well, so hopefully we can go work on that together and not end up with two different implementations. Yeah, so thank you. Thank you.