Okay, hi. To bring everyone up to speed: once you start futex locking, you go into the kernel with the uaddr, which is actually the address of the lock. In the kernel we have a huge hash table; we hash the lock address into one of the hash buckets, and you can have hash collisions, so it happens that two completely different processes share the same hash bucket. That's where we have problems on RT, when one process slows down the other. Additionally, on NUMA machines you have the problem that the memory for the hash buckets is always allocated from one NUMA node — well, it does some kind of vmalloc thing right now. So we are trying to get rid of the hash buckets; we'll get to that later.

We did some testing of four versions of how we could fix it. This was the box we used — you could say it's big iron. All four versions are in the git tree if you want to look them up and test them. We used benchmarks based on perf's futex bench, which we extended with a parameter to specify the NUMA node you run on.

So this is version zero, the state of the art, mainline as is. On one axis you have 64, 128, 512, 1024 — that's the number of futexes. On the other you see T8, which is eight threads. So on the left you have eight threads working on 64 futexes over and over again, and the four bars are NUMA nodes zero, one, two, three. Eight threads per node, four nodes running in parallel, doing a futex wait with an invalid argument, which means it does a hash bucket lookup and comes straight back. It does that about six million times a second. So that's what we have now; you see the NUMA nodes mostly perform the same way, and it gets slightly worse as you go to higher thread counts — we go down to about three million operations a second.

Okay. This is what I call version 13: we do an RCU lookup. Each lock gets per-futex state attached, which we put into a global RCU-protected list. We put it in there, and each time you do the wait there is no locking — we use RCU for the lookups. You only take a lock to add and remove. So it's just an RCU read-side lookup in the RB tree, an unlock with an increment of the user count, and you get it back. Right, so we need an rcu_head, and we need an attach from user space to say this lock is now attached to the process — so we need a kind of hint for that.

This is version 10. This is where we had a hash per task, which got knocked down because we could still have hash collisions within the task, between two separate threads — but that's not something we have now, maybe in five years or so. It scales pretty nicely for all nodes.

This is unique IDs. This is where, when you do the attach, the syscall returns your ID number. The ID number is the position in the array, so we don't do RCU lookups; we just index into the array and take the entry out. We still have RCU for resizing the array, and I made it really huge to fit all of them. Right, so this performs best. You see this is slightly better than the per-process approach, but the global hash had 64,000 hash buckets while here we have only 256, so it's no surprise that it performs slightly worse.

Now these are the four approaches in comparison, and the question is: which one will it be, or does anyone have another approach?
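To make the mainline scheme described above concrete, here is a rough, simplified sketch of the idea in C. The real kernel code differs (it hashes a futex key and uses per-bucket spinlocks); the names, the mixing function and the bucket count below are illustrative assumptions only.

```c
/*
 * Sketch of the mainline futex scheme: the lock's user-space address is
 * hashed into one global bucket array shared by every process, so two
 * unrelated tasks can collide on the same bucket and serialize on its lock.
 * Names (futex_queues, hash_futex) and sizes are illustrative, not kernel code.
 */
#include <stdint.h>

#define FUTEX_HASHBITS 12                       /* e.g. 4096 buckets, size is made up */
#define FUTEX_HASHSIZE (1UL << FUTEX_HASHBITS)

struct futex_hash_bucket {
    /* in the kernel: a per-bucket spinlock plus a list of queued waiters */
    void *waiters;
};

static struct futex_hash_bucket futex_queues[FUTEX_HASHSIZE];

/* Hash the user-space address of the lock into a bucket index. */
static struct futex_hash_bucket *hash_futex(uintptr_t uaddr)
{
    uint64_t h = (uint64_t)uaddr * 0x9E3779B97F4A7C15ULL;  /* any mixer works here */
    return &futex_queues[(h >> (64 - FUTEX_HASHBITS)) & (FUTEX_HASHSIZE - 1)];
}
```

For reference, the stock benchmark the speaker extended can be run as something like `perf bench futex hash -t <threads>`; the NUMA-node parameter mentioned is their own extension.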
Userland needs to attach upfront, and the operations for lock, unlock and so on get a modifier for "attached", so we can distinguish the two cases — whether we look in the global hash or in the per-process hash. We cannot attach unconditionally, because you could run out of memory. I tried to avoid any glibc modification, but this really only works with the glibc one, because then you can do the preload thing upfront, whether you want it for RT or not — but ASLR and stuff like that make for all kinds of fun. And the whole thing is complicated because glibc has almost no space to store the identifier. I found a variable you could cut in half, so we have one 32-bit field and you get two 16-bit halves — then we are limited to 65,000 locks. It could work, but that's per process, right — and then you get the Java thing, and everyone uses Java...

There is some space, so that's not an issue then; we can store those bits for the ID together with the other bits and just extract them.

Yes, we could spare some bits there — the mutex kind, for example, is only about...

Well, that's the ID, that's what I said: if you go for 16 bits you can have 65,000 locks, and that's it. Give us more bits and we can use more locks.

Well, the thing is, if you have something like pthread_mutex_init, you can hide this attach operation inside the init; then we have no API changes at that point. But according to the man page pthread_mutex_init cannot fail, and if you run out of memory in the kernel it can fail.

You can do it, but then you have different behavior and no way to query it, because you have no API.

Well, but the thing is, if you have two threads sharing the global hash bucket and you want to query that — it doesn't have to return a failure, you can just query the mutex.

Well sure, but that is an API change, so we'd get an _np variant, I guess. So we cannot do it without that, and that's why... Because two different processes can share the same slot; that's what we are addressing with the unique IDs.

So there are two ways: one which basically makes it work for both private and shared futexes, and then the new scheme where we just use per-process storage space in the kernel, which only works for private — and private is what we are really concerned about. Screw the shared ones; I mean, they shouldn't have been there in the first place. But they were there in the first place; the private ones came later. Right, so everything here is just for the private futexes; shared is out of the picture, no one cares. Because if you want to do shared, then you need storage space per process in user space, per mutex. That means you even need storage space per thread, so you have to store it in TLS or whatever and then connect it to your stupid futex.

My follow-up question is whether in the real-time use cases you actually don't care about process-shared, or is this specifically for this optimization?
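To illustrate the space problem being discussed — squeezing a 16-bit attach ID into an existing 32-bit field of the mutex while keeping the other half for its original use — here is a hypothetical sketch. The struct, field and helper names are invented for illustration; this is not glibc code, and the 16-bit width is exactly what produces the 65,000-lock limit mentioned above.

```c
/*
 * Hypothetical layout: one existing 32-bit field cut in half, low 16 bits
 * keep their original meaning ("kind"), high 16 bits carry the attach ID.
 */
#include <stdint.h>

struct my_mutex {
    int32_t  futex_word;     /* the value the futex syscall operates on */
    uint32_t kind_and_id;    /* low 16 bits: kind, high 16 bits: attach ID */
};

#define MUTEX_ID_SHIFT  16
#define MUTEX_KIND_MASK 0xffffu

static inline void mutex_set_attach_id(struct my_mutex *m, uint16_t id)
{
    m->kind_and_id = (m->kind_and_id & MUTEX_KIND_MASK)
                   | ((uint32_t)id << MUTEX_ID_SHIFT);
}

static inline uint16_t mutex_attach_id(const struct my_mutex *m)
{
    return (uint16_t)(m->kind_and_id >> MUTEX_ID_SHIFT);
}
```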
The shared case has the problem that you have to go through mmap_sem, and you can have the problem that one thread with low priority is acquiring memory while the other is doing something completely unrelated, and that's where you get stalled — and you don't want that. A high-priority thread usually does not do shared. So if you really want to share information between a high-priority process and some low-priority process doing logging or whatever the hell, then you are way better off doing some lockless ring buffer scheme or something like that and avoiding the whole locking mess completely.

So it wasn't clear — the background to my question is that if we can ignore process-shared, then we can use dynamic memory inside glibc, and we can potentially build our own wait queues and things like that, which might help us solve the condvar problem, for example. So if process-shared doesn't matter, we can do a lot of things in user space.

I think we can ignore process-shared for that particular performance thing. I mean, even Java is not process-shared; it's process-private. Java is one big pile of threads — in any variation of the pronunciation. You're right. So the real question here is: if we want that performance gain, and actually a mechanism to guarantee that we do not run into arbitrary priority inversions, we can document that there is no such guarantee for shared and tell people not to use it. We have a lot of such limitations in RT, obviously. And then the question is whether we can really get away with hiding it completely below pthread_mutex_init and things like that, or is there other stuff required which we need to think about?

There are ways to initialize a pthread mutex in user space with static assignment, so you don't actually need an init call. So you can do it — yeah, you can do that easily.

Sorry — that's also interesting for non-PI, because that's what you're seeing. Can you go back to the slide with the mainline behavior? What you're seeing there is an effect of hash bucket contention — the degradation is just an effect of hash bucket lock contention. So you have the lock bouncing around, which is problematic in terms of performance.

Yeah, but in terms of general-purpose lock performance I'm much more concerned about the lack of proper spinning and backoff in glibc — we have a very, very simplistic lock implementation. So I think the point where we start suffering from contention in the futexes, in the kernel's blocking mechanisms, is much further in the future than what we can fix right now.

Some unnamed database company could not or would not use the futex system call — they weren't even considering glibc, they're not interested — because of cache bouncing on the global hash on their NUMA systems. Having this per-task state avoids the entire global hash and solves the NUMA issues for them as well.

But I think we need to look at the bigger picture, and the bigger picture is: if you're really a high-performance database, you don't want to oversubscribe your cores with threads, right?
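As a concrete illustration of the lockless ring buffer scheme suggested above for passing data (for example log records) from a high-priority producer to a low-priority consumer: the high-priority side only does atomic loads and stores and can never block on anything the low-priority side holds. This is a minimal single-producer/single-consumer sketch under assumed names and a fixed power-of-two size, not code from the discussion.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RB_SIZE 1024                 /* must be a power of two */

struct ring {
    _Atomic size_t head;             /* written only by the producer */
    _Atomic size_t tail;             /* written only by the consumer */
    int slots[RB_SIZE];
};

/* Producer (high priority): never blocks, just reports overflow. */
static bool ring_push(struct ring *r, int value)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RB_SIZE)
        return false;                /* full: drop or count the overflow */

    r->slots[head & (RB_SIZE - 1)] = value;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer (low priority): polls or sleeps on its own schedule. */
static bool ring_pop(struct ring *r, int *value)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail == head)
        return false;                /* empty */

    *value = r->slots[tail & (RB_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```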
So ideally you have one thread per core running — or whatever resource, hyperthreads or whatever — and then you want to keep them busy and you don't want them to block. So when you actually block, you're not in the optimal case anyway. At least that's the world view that I have.

That may be right. But if you think about parallel programs and where high-performance programs are moving, then you don't really want to block either: you either want to get sensible work done, or you want to do nothing at all, in which case you might actually sleep using futexes. But I don't think we should consider the kernel-side blocking as part of the fast path, or the performance-critical path. We have a lot of stuff in glibc to fix to actually get there — the simplistic mutex implementation we have is part of that — but I think that should be the goal.

User-space spinning just does not work in the oversubscribed case, just like user-space spinlocks. But even in the general case: the atomic operation you need to do is some 20-odd cycles, kernel entry is maybe a hundred cycles, so there's a very small number of cycles left in which user-space spinning can amortize anything.

That's true, but if you enter the kernel you will most likely also touch more cache lines than you would in user space. And the kernel entry and exit overhead is just part of the problem; other parts are where you could actually get information about the critical sections. For example, user space might know much better whether the critical sections are short or long-lived, so it might make better automatic tuning and adaptivity decisions. The kernel, on the other hand, has other information, like whether something is going to be scheduled soon. So if we really want to look at this, we need to find a good balance between the different information available on both sides, because it also affects the interface we put in between: if we make a simplistic interface in the middle, we lose information from one side or the other, regardless of whether we solve it on the user-space side or on the kernel side.

So in general we should rejoice if user space gets its locking right to begin with. In the high-performance case people don't care: they'll fix their stuff any which way, they won't use glibc — they'll either roll their own or even hack the kernel. And I have not yet seen anybody request anything across this boundary like what you describe. The more abstract things they request — restartable sequences, for example — are per-NUMA-node or per-CPU data; it's about coping with being preempted while doing stuff.

You know how you would build a high-performance lock implementation, right? With optimistic spinning the kernel makes sure it doesn't keep spinning when it's about to be preempted, and things like that. Maybe there's something similar here, right?
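To make the spinning-versus-blocking trade-off discussed above concrete, here is a sketch of the classic three-state futex mutex (0 = unlocked, 1 = locked, 2 = contended) with a bounded user-space spin bolted on before falling back to FUTEX_WAIT. The spin limit and overall structure are illustrative assumptions, a minimal sketch rather than a tuned implementation.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SPIN_LIMIT 100   /* how long user space is willing to spin (made up) */

static long futex(atomic_int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

void lock(atomic_int *m)
{
    int expected = 0;
    if (atomic_compare_exchange_strong(m, &expected, 1))
        return;                                /* uncontended fast path */

    for (int i = 0; i < SPIN_LIMIT; i++) {     /* bounded user-space spin */
        expected = 0;
        if (atomic_compare_exchange_strong(m, &expected, 1))
            return;
    }

    /* Slow path: mark the lock contended and sleep in the kernel. */
    while (atomic_exchange(m, 2) != 0)
        futex(m, FUTEX_WAIT_PRIVATE, 2);
}

void unlock(atomic_int *m)
{
    if (atomic_exchange(m, 0) == 2)            /* someone may be sleeping */
        futex(m, FUTEX_WAKE_PRIVATE, 1);
}
```

This also shows why the earlier cycle-count argument matters: the spin loop only pays off if the critical section is shorter than the kernel entry/exit it avoids.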
So I think the underlying abstract problems are similar, so I think there could be a use for that, and I think there's interest in it. Although I agree that right now a lot of people would probably just build their own and not use glibc, I'd definitely like to get to the point where glibc's pthread locks are fast, because they are used in C++ applications and so on and so forth.

But I don't know — have you seen Waiman's recent work? Waiman Long — is he in the room? He was implementing a separate futex operation just for very, very simple futexes, not the whole op space and whatever, and he does optimistic spinning in the kernel. The advantage the kernel has here: of course it doesn't know how long the critical section is going to be, but what the kernel knows and user space can't know is that the owner is no longer on a CPU — no longer running on the other core — so that's the perfect point to give up the CPU and go to sleep.

So I think we're running into a whole bunch of questions: do we need more special-purpose futex operations instead of that whole insanity of Swiss-army-knife futex operations we have now — or not instead, but maybe alongside them — to solve particular problems?

I'm all in favor — futexes hurt my brain — of special-purpose ones. We can't get rid of the ones we have, though; that's just not going to happen.

No, they're not going away. But wouldn't the condvar thing be easier if you had special-purpose-built stuff just for condvars?

That's true, but I think it would be too tightly coupled a solution if the kernel tried to solve exactly the condvar semantics, and exactly the need for blocking that we have in barriers and in semaphores and in read-write locks and in whatever else people might want to build, right? The condvar, for example — I'll touch on this later in the talk — could benefit if the kernel were aware of more semantics about what kind of condition you're actually waiting on; right now it's more like a flag. It is specialized, but not specialized in the sense that we say, okay, now we're building a mutex or we're building a lock. It's specialized in the sense that we make the abstract thing more powerful: we allow for specialization by letting the user of the futex decide the policy of what they encode in the futex word. The problem we actually run into is when we break that by imposing our own policy; that seems to be where a lot of the pain in futexes comes from — all those places where we impose our own policy. But we'll talk about that.

So maybe we should give Sebastian his... There's probably no resolution of that problem right now, and we are running short of time as well. Thanks, Sebastian, for putting the information together.
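As a footnote to the "policy in the futex word" point above: for plain futex operations the kernel treats the 32-bit word as opaque and user space picks the encoding, but for PI and robust futexes the kernel does impose a policy — owner TID in the low bits plus flag bits. A small sketch using the real constants from <linux/futex.h>; the helper and sample value are illustrative.

```c
#include <linux/futex.h>   /* FUTEX_TID_MASK, FUTEX_WAITERS, FUTEX_OWNER_DIED */
#include <stdint.h>
#include <stdio.h>

/* Decode the kernel-imposed encoding of a PI/robust futex word. */
static void decode_pi_futex_word(uint32_t val)
{
    printf("owner tid   : %u\n", (unsigned)(val & FUTEX_TID_MASK));
    printf("has waiters : %s\n", (val & FUTEX_WAITERS) ? "yes" : "no");
    printf("owner died  : %s\n", (val & FUTEX_OWNER_DIED) ? "yes" : "no");
}

int main(void)
{
    /* e.g. a PI mutex held by tid 1234 with a waiter queued in the kernel */
    decode_pi_futex_word(1234u | FUTEX_WAITERS);
    return 0;
}
```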