Hello everyone. My name is Sebastian Siewior and I'll try to talk today about the private futex work we did this year and last year, about the patches we sent, the things we did in the past and what we tried to optimize. The futex code started initially with Rusty in the early 2.5 series. Back then we did not have any kind of mechanism to suspend a process and wake it up later. So what Rusty did was introduce futex_up and futex_down, and this was modeled after what the semaphore did in userland and what was designed in Unix: you did down and up as the locking mechanism. A few weeks later he renamed it to wait and wake, which was mostly what it did. The basic concept was to be really fast: you tried to acquire the lock in userland, and if you succeeded you never went to the kernel in the first place; but if you failed because the lock was contended, you needed to go to the kernel to sleep, not busy-wait, for some time. So you went to the kernel, the kernel queued your state, and after the lock holder released the lock, the kernel woke up the process that was stuck in down. This is the userland part that Rusty released along with the futex code. As you see, you had that __down function with the counter, which is the only thing that was passed to the kernel during the operation. If it succeeded you returned with a zero and everything was done, and if you failed, the futex operation was the thing you called in the kernel. And that was it. It's a very thin layer which did all the work. After that the code evolved: a few months later Rusty came along with FUTEX_FD, which was similar to what he did with up and down but was based on signal handling. So you could do all those signal things, which were not that perfect. They were racy all over the place, and later it was completely removed from the kernel. So if you go back in the history of the kernel you see it being removed.
Just for the reason of being too racy, it couldn't be used. And that's pretty much the history of the futex code all over the place. Then we introduced FUTEX_REQUEUE, which was a glibc requirement. They had these condition variables where they give away one lock and have an inner lock, so they need to take the one and then the other lock to check the state and see which one is the new owner after the wakeup. The requeue thing worked well, but if you had many, many waiters, then it was possible that by the time you looked at it another waiter came in and changed the state, and on four-CPU boxes back then, around 2003, 2004, it was possible the system locked up because it was racy. So later they came up with FUTEX_CMP_REQUEUE, and the original requeue opcode isn't used at all anymore, just because it was racy. Later we got FUTEX_WAKE_OP, which was another optimization for the glibc code. And later we got the PI futex, the priority inheritance version, where you have a process with a higher priority, and if that process is going for a lock, then it boosts the other processes out of the way so it can acquire the lock really, really fast. After we had that, there was another guy coming along because he wanted the requeue thing for PI as well. And he didn't think it through, because in June 2007 it was removed again: it was buggy, racy, and didn't work at scale. Around 2007 we got the private futex. Private means it can only be used within a process. So if you have five threads, all five threads can use the same lock, but you cannot share it between two processes, which is fine if that's your use case, and for the locking within the kernel it scales way better, because you don't have to take the mmap_sem, for instance, which protects the whole memory map of a process. And yeah, that's pretty much it.
What I left out are the thousands of patches in between, fixing things which were obvious by the time the patch was merged in the kernel but which no one had seen in the first place. So we had bugs all over the place. If you look back, we have the two operations, sleep and wake, and it looks so simple, but in the long term it's not. So that's basically how futexes work in userland. Since we try to do the atomic operation, usually the 0 to 1 transition, you need some assembly code to make it work. Back then Rusty released the futex layer with an implementation for x86 and PowerPC, and each architecture that wants to use futexes needs to implement this assembly code. The concept remains the same whether you do the wait, the wake, the requeue or whatever: if you manage to do it in userland because nobody else is using the lock, you remain in userland, but if it's contended you need to visit the kernel. And that's the first time the kernel learns about the lock. So you don't do any kind of initialization of the lock beforehand. And that's kind of problematic within the kernel, because you never know if the lock the user is pointing you to is valid, or whether he is the rightful owner or not. Technically he can point at whatever memory he wants and say "I want that lock that process X is owning", and that process might not be involved in this locking scenario at all. To manage this, Rusty added a global hash table which is shared among all lock users, and this is what's called a futex hash bucket. This is what Rusty invented, and each process that is going to wait has a futex_q which it usually allocates on the stack. For PI it's slightly different, but mostly it's the same thing. And the futex hash bucket has a lock which is held while a process is queued on the hash bucket during wait, and dequeued again during the wakeup.
And since it's a hash operation, you can have multiple processes which share the same hash bucket due to the hash function, although they are not related. So the problem we get here is, on the one hand, that two different processes can lock each other out during lock acquisition in the kernel even though they are not related, so it doesn't scale well for those kinds of use cases. And if you go for NUMA, for the big boxes we have today, you have the problem that the memory for the global hash table can be allocated on NUMA node zero while NUMA node three is trying to access that memory to perform the lock operation. And since we have the global lock in the hash bucket, we see on RT what we call a ping-pong boost. What you see here is a scheduler trace where I removed all the things that are not that important, to get it to fit on the slide. On the left you see med, high, low, which are the names of the processes; sched_wakeup and sched_switch are the names of the events; and what you see on the second line is med/29, which means the process name is med and the priority is 29, in the kernel's way of writing things down, where the lower number is the higher priority. So you see here the med process is waking up high, and high wants to get a lock which is owned by the low process. So what it does is boost the low process from 120 to 9 in order to acquire the lock. So low is on a CPU, it grabs the lock, it does whatever it needs and releases the lock immediately; at that time it's in the kernel, and it deboosts itself after it has released the lock, but it still holds the hash bucket lock, which it needs to hold in order to unqueue itself from the list.
So we switch from low to high, and high is then in the kernel going for the hash bucket lock, and then we have the same thing again: it boosts low again in order to acquire the hash bucket lock, and after that completes, we switch back again from low to high, and after this is done, high terminates. So the middle part is the useless ping-pong boost we would like to avoid. We see this problem on RT in such a trace, but we don't see it on non-RT, although it happens there as well. What happens on a non-RT kernel on SMP is that the process that has been woken up is spinning on the hash bucket lock the whole time while it's blocked, waiting for the other side to release it. It's not that visible in tracing, but it's happening. So from the RT side we identified it, we fixed it, and we sent patches upstream, and Peter looked at it and said: hey, this looks good, but in general we have other places in the kernel suffering from the same issue. And he came along with something he called wake_q, lockless wake queues, where each task can be enqueued on the wake queue, so we can have multiple tasks on the wake queue and release them all at once. And Davidlohr converted the futex wake operation after that, which was one of the users, and we got ipc/mqueue converted as well.
So those two were also suffering from the same issue while being completely unrelated to our use case, and later I got around to fixing futex requeue PI, which was actually the key function we needed in RT for our PI things. As of today we got ipc/msg merged, and the patch I posted has performance numbers: we got something like 10-15% on a slower AMD box after the lockless wake queues. And, to follow the spirit of breaking futexes while sending patches, my futex unlock PI change broke RT due to the early deboost. I didn't notice it in early testing; we found it in 4.2-RT, though it went back to 4.1, and it was just fixed recently, like two weeks ago. After looking at the code it was obvious, but at the time nobody noticed. After fixing it on RT, this is what you see, and this is what you would have seen in the first place if it had been done right: the med task wakes up high, high goes for the lock and boosts the low-priority process, and after low switches away, the high process can get on the CPU and exit immediately, because there are no more locks required to complete the lock acquisition.
So after that we came along to the global hash bucket problem. The global hash is still the same thing we used back then in 2006 when Rusty first implemented it; we switched from hash_long to jhash2 for the hashing algorithm itself because it spread nicer, but the ground rules we had then and have now are the same: we have an amount of hash buckets that is allocated globally based on the number of CPUs at boot time, and, as I said before, two tasks can share the same hash bucket. It's not always the same two tasks, because due to address space randomization the hash of the address can differ, so one time these two share a bucket and another time those two, and it's not always obvious which processes are sharing the locks. Additionally, we managed to trigger an unbounded priority inversion on RT. The problem is that you can have one task A which runs on one CPU at high priority, and another task B which runs on CPU B but could run on any other CPU. A holds the hash bucket lock and gets preempted by another task on its CPU; B wants the lock, so it blocks and waits on it, and it cannot boost the A task, because A is already blocked by another task which has higher priority, and B is not high enough from that point of view. So what remains is that a task C with a much lower priority gets on the CPU and runs, and this is not what we want. So after we identified it, we were thinking about what can be done, and the more or less obvious thing is that you want a unique hash bucket struct for each lock that comes in from userland, which is more or less challenging, because you never know when the user gets in and where the lock lives, so it's hard to allocate the memory and everything you need upfront. Then we came up with version one, where we said: okay, we need an additional function in glibc which we expose via the futex syscall; we call it futex
attach, and this is where we allocate an extra hash bucket for the lock you want. We also added another flag: like you have the private flag to distinguish the private futex from the shared futex, we had another flag called the attached futex, so we could distinguish whether a futex is attached or used in the global hash array. We started small, with a small hash array, and this thing was per thread, so each thread had to attach itself to the lock after it was created. In review this came out as kind of challenging for people, so they didn't like it, but from the implementation point of view it was pure and simple, because you didn't have to do any kind of locking for the hash array, for the resizing or whatever: the thread itself could be either in userland or in the kernel, so there was no need for locking at all. The result was that people didn't like it at all, mostly because we changed the ABI, and the user of the pthread mutex had to decide: am I special enough to need this attach/not-attach thing, or is it not required? We tried to explain when it's needed and when not, but what it came down to was that neither Linus nor the glibc folks agreed on it. Another thing is that we had changes in the kernel and required changes in glibc, and from that point, when you want to get users to use it, it's hard, because they need to backport the kernel and they need to backport the glibc, and the glibc part, where you have to backport and replace it, seems tough. So what we learned is that what is expected from us is what we did in the past: things work out of the box, automatically. Back then with the attach we considered the shared futexes, which are shared among all processes, and the private part, and it came down, in the talks with people, to the point that the shared futexes aren't an issue at all, because they are slow in the first place, so there's no need to optimize them further. So we tried to
stick only with the private futexes, which are used within a process and shared among threads. The part where we have to attach them in each thread was also considered not very user-friendly, so we dropped that part as well. What we came out with is version two, and if you look at the email thread, nobody cared about the other patches; they only looked at the first patch, where we came up with the algorithm for the hashing. Now, the current kernel has what is called hash_long, which is close to the thing we implemented in version two, and we didn't go for hash_long, even though Peter and others pointed us to it, because hash_long didn't scale well back then. And this is how we tested why we didn't choose it. We had a larger box and we ran the perf bench futex operation for wakeups. What it does is invoke the futex wake operation in a loop, over and over. What futex wake usually does is wake up a process, but because we specify an invalid value for the wakeup, it never does any wakeups at all. What it does is simply a lookup of the hash bucket and a return with an error, and this is actually what we wanted to have: we wanted to see how long it takes for a given implementation to find the correct hash bucket and do the wakeup. The -n option we have there is something we patched in so we can limit the process to a given NUMA node, so all the threads run only on that NUMA node, and we had four NUMA nodes for testing in parallel. While we started it, this is perf top: we started the benchmark on one CPU, then we started perf top, and we were looking at how things progress and where we are stuck. So 23 percent was spent in the workerfn; the workerfn is the thread that is doing the futex wakeup operation over and over again. While we were looking at this, it was 23 percent; in comparison, there was the futex wait setup, which was something we
didn't change at all, and it was at 3 percent, so it looked like: okay, this is how it has to be. But then we wondered: what is it doing in the worker function? So we went into perf, clicked on the worker function, and this is what we saw. There is the loop, and at the beginning it checks how many futex operations it did so far. The 93 percent is what came up at the top, and it was like: oh, that's a lot, and we don't see anything else. This is where we got curious why this add operation is at the top, and we went further in perf to look at what it is doing here, and this is the striking part. The futex pointer is the huge array of locks each thread is using; it goes through the array, the first, the second, and the next one, and this is shared among all threads. And the ops thing at the bottom is what it increments after each iteration of the loop; once perf completes, it says: I was able to achieve like one million operations a second, and that's what the ops counter is there for. Now, anyone have an idea what is wrong here? Well, it's actually simple if you look at it: that struct was shared among all threads, so we had cache-line ping-pong doing the ops incrementing, and this is what killed us. So after we fixed the tools we use, we went back to step one, and this is how it looked the second time, and you see the worker isn't there anymore; it's not an issue anymore. Step two was: okay, what's now the key of interest? hash_futex at the bottom is the thing we modified, because we had jhash, and jhash, well, it's still fast for what it's doing, but it takes like three parameters, and for the shared futex that's in general okay, because you have the mm, the offset and another offset, which is page based, so you have three things to put into the hash; but we have only one thing, we have only the user space
address, and this is not the general use case for jhash; this is bad. So what we went for was hash_long, but hash_long didn't distribute well enough among all the slots we had, so we went back to modulo a prime number, which was scaling very, very well. Unfortunately, modulo a prime is a division, and the division operation was popping up at like seven percent, and as we looked at it, it was expensive; the ARM people know very well that division is very expensive on CPUs. So instead of doing the mod prime as we did in the code, we tried to optimize it further, and what we did is multiplication, the inverse one: since the mod operation is the remainder of the division, we had to do two multiplications to get the same thing, but with multiplication instead of division. With this change, hash_futex went down to only two percent instead of five. Then we were curious again and looked into the hash function to see what it is doing, and the bottom line is that the two multiplications we have in here are still way cheaper than the one division operation we had in the first place. GCC, as smart as it is, will, if you tell it to do a division or a mod operation with a constant number, replace it with the inverse multiplication all on its own; but since we didn't have a constant, but a number read from a struct, it had to do the division, so doing the multiplication on our own was faster. And this is actually why we came up with our own hash function in the first place. What happened after that was that hash_long got fixed, the reason why it didn't spread so well. So for version three we used hash_long again, which was fixed by then, and we dropped some things, but it was mostly the same as what we had in version two. So we had again a per-process operation to pre-allocate the hash for the task, and this was only done for the RT case, so we could have larger pools for RT and for
people that cared, but everyone else could use the small hash table, which was scaling well. We had to do some locking, and we dropped the rehashing of the hash table. The locking was pretty bad, because we needed it in several places, and after we dropped the rehashing of the per-process table, that's where most of the bad things began. There were two things which were not that nice. Number one was that we could run out of memory the first time the user got into the kernel, because it never did a pre-allocation, so we had to allocate the memory right then, and if you ran out of memory, we moved the futex to the global hash bucket table again. So you would run into the same problems as before our modifications, but you wouldn't know it, because nobody wanted to change glibc. And to make it worse, we could have hash collisions, now process-wide rather than system-wide, and this is what killed it. Okay, so we tried again, and now we thought: okay, the glibc interaction, maybe we can do something about it, but the part where we have to be collision-free and are not allowed to allocate memory is tough. We looked at different hashing algorithms and what others do in similar situations, and we didn't come across anything that would help us at all. There were hash algorithms which guaranteed collision-free operation, but what they do is hash to a certain slot, and if that slot is taken, they take the next slot, and the next, until they find an empty one. This would work for us, except that hash buckets can be removed again when their users are gone, and then we wouldn't find the ones behind them anymore, because lookups don't skip the now-empty slots. And this is just one example of the few we looked into which did not help us. So this is an idea we came up with; it was actually one of the first ones we had before we posted version one, but we dropped it because it was
too hard to use back then. In version one we had all operations per thread, and we had IDs per thread, and this was not usable at all in userland, because we did not have per-thread memory. So we came along with those IDs process-wide: the first user doing a pthread_mutex_init does the attach and receives a cookie, and after that, every operation that follows, like pthread_mutex_lock, unlock, requeue, is not using the uaddr but the cookie instead. And the cookie is simply an index into our array in the kernel, so we have like O(1) access to the hash bucket we want. This is a slight change in glibc, but we could hide it, so there would be nothing the user has to change in his program: we could hide the attach part in pthread_mutex_init, and there is another part for the cleanup of the futex. But the pthread_mutex_init man page says that this operation cannot fail, and since we call into the kernel for memory allocation, we can fail. So it's problematic again, because if we fail and do some fixup so that we go for the global hash bucket again, then the user doesn't know that it's now working differently; there's no way of telling. The part where we have to replace the lock with the ID is also problematic, because the pthread struct from glibc is ABI and it's fixed size, so it's kind of tough to hide the ID somewhere within the struct without modifying it. And to make all those things worse, there's fork, and on fork we have to copy all the IDs, because the locks might need to work after the fork. Usually what people do after fork is exec another process; we have vfork for that, but most people are not aware of it. So the bad part is that we make fork slower, for no good reason. Okay, so we came along with idea two, which was, I have to say, a weak moment of mine. The idea was: each thread is usually in userspace or in the kernel, but it's nowhere in
between, and requeue is the only function that requires two hash buckets. So the idea was that each process, on fork, comes along with two hash buckets and puts them in a global pool, and each time we get into the kernel for a lock operation, we go to the pool and take one out and use it for this operation; once we're done, we give it back to the pool. The problem is that we need a global lock for this, and that doesn't scale well at all. I tried to optimize the linked list for the lookup with RCU, but the linked list wasn't the problem in the first place; the problem was the global lock, which has to be shared among all threads for the lookup. Then we tried again, the third one, where we do the attach thing again, but the uaddr mapping to the hash bucket is not done with an ID but with an rbtree we walk, and the whole thing is RCU protected, so we don't have any locks in the hot path. This is nice: since we don't have the ID, we don't have to do anything on fork, but we need small support on the glibc side for the attach part, and we cannot do it without glibc modifications. We do auto-attach, because, first, we can run out of memory and we are not allowed to tell glibc about it, and second, if we do attach but not detach, we can have hash buckets that no longer belong to a process, because the memory the lock was part of is gone. And to make it worse, people could do the attach on four megabytes of memory and thereby allocate one million hash buckets in the kernel. So as perfect as it sounds, it's not that good. These are some performance numbers I gathered with the implementations we made. Each bar is a NUMA node; you see 64, which means we have 64 futexes, and on the lower one you see t8, which means we have eight threads. So we have eight threads doing operations on 64 futexes, and then we have NUMA nodes zero, one, two and three, and this is how the current implementation scales so
far. So we have something like six million operations a second, and if we go up to 64 threads and 1000 futexes, we go down to like two and a half million operations. So this is how it scales. This is the version-three lookup, the rbtree one, and it's not that bad: we still have roughly five million operations a second, and if we go to the right, to the 36 threads per NUMA node, we still have a nice three million lookups a second. So it's not that bad, but the rbtree goes like 16 levels deep, so it's a lot of lookups to do. This is version two, the one that got knocked down because we could have hash collisions, and it is slightly slower than the version zero we have with the global hash table; but we have only 256 hash slots, while the global hash table has 64k, so memory-wise we use way fewer of them, but due to the hash collision part it's not considered. And this is the variant where we use the unique IDs, and performance-wise it's the best thing we have so far; it outnumbers even the current version we have in the kernel. I presented these numbers yesterday at the RT Summit, and the outcome is that we don't really agree on what we want to do, from the kernel side and from the glibc side. So right now we are trying to have another conversation with people about what we can do and how we follow up. Any questions, anyone? Okay, thank you.