So, I'll do the first part and talk about the new glibc condition variable. There were clarifications of the POSIX requirements that required us to build a new algorithm. I'll talk about how blocking with futexes makes this complicated, and I'll give a brief overview of the algorithm. So this will be a bird's-eye view; there are lots of details, more details than we have time for, but our goal is that you become a bit more aware of the problem. We'll maybe get to discuss this a little, and we certainly won't solve it today, but I think it's good to have the conversation. And Darren will then do the second part, talking about the actual problem of how to support PI with the new condition variable. So, a quick reminder of what a condition variable is. It's a way to wait for a certain condition to hold. You have a thread that does the waiting: it grabs a lock, and then, while the condition does not hold, it calls pthread_cond_wait. That call is specified to atomically release the mutex and start waiting, and it returns only after reacquiring the mutex. The signaling side then, optionally, and that's good practice, acquires the lock, sets the condition to true, and signals. We have the while loop there because pthread_cond_wait is allowed to wake up spuriously. So that's the thing we want to implement. Now, the problematic thing is that the condition variable is not a counter. It's not like a semaphore or a lock, which either has tokens in it or not, or is acquired or not. The condition variable is essentially an order of events. This is what POSIX clarified, and it's also what C++14 specifies in its requirements on condition variables.
So the requirement there is that signals must wake one of the waiters that started waiting before them. And the reason is that a program can observe, or construct, an ordering of waiters and signals, because the mutex is released atomically with starting to wait. The condvar implementation, the synchronization that we do, must then adhere to any ordering that the program may have observed. So if you look at the small figure down here below: if this is an ordering of events that the program could have observed, it means that this first signal, S1, must wake exactly one of those first two waiters; it must not wake the third waiter. The last signal, S2, is allowed to wake any of W1 to W3; that's not a problem. And you see here the ordering that I'm talking about. I'll give that a try: S1 can wake W1 and W2, S2 can wake all of them. There you go. So the orderings determine eligibility: in this case, for signal S1, waiter 1 and waiter 2 are eligible to wake up. And so the condvar synchronization must model an order of waiters and signalers, and, as I already mentioned, there's a set of eligible waiters that it has to adhere to; only those are allowed to consume a signal. The problem with the previous condition variable implementation was that we could violate that. Not with two waiters, maybe, but with three waiters there was a problem, and we actually got bug reports from users and so on. So what did we do? I went to the Austin Group and to ISO C++ and asked them what they think. They clearly said they want these semantics and they want this ordering preserved. So I built a new condition variable algorithm. The first attempt that I tried didn't actually work. I now have a second one that I think works and that is currently under testing and review. So how do we actually do that?
So let's start simple and say that we don't do any kind of kernel-level blocking at all; we just spin-wait. In that case it's rather simple, because then the ordering that we saw down at the bottom is just a sequence: we ignore the whole partial-order aspect and say it's a total order, a sequence. And then the eligibility for wake-up is determined just through a sequence of waiters, which we call wseq, and which can be as simple as a shared counter. Waiters in the pthread_cond_wait implementation basically do three steps. The first is that they hook into whatever sequencing the program might have observed: in our example, they acquire a position in wseq, for example by incrementing the counter, and thereby become eligible for signals that happen afterwards. Signals can also observe this counter and the sequencing. Second, they release the mutex. And third, they actually start spinning. We'll see later that one problem is that the order we establish in step one doesn't necessarily match the order of the calls to futex_wait, for example, and that creates all kinds of interesting problems. But with spinning it's not a problem, because we're always spinning and we can just stick to the order we determined in step one. What signalers then do is: if the number of signals sent, S_sent, is already larger than or equal to the number of waiters we've had so far, they just don't need to do anything, because there's no waiter to wake. Otherwise, they increment the number of signals sent, and one of the waiters spin-waiting on S_sent will see that it's its turn and will go. So this is similar to a ticket lock. But what about doing the futex calls immediately? The waiter would do a futex_wait immediately, without spinning.
And the signal would be a futex_wake. Because spurious wakeups are allowed, and a wake will never wake something that is not waiting, your ordering would be preserved. If you do not do the spinning, it doesn't change anything: you can be preempted right before the comparison of the futex word against the expected value in the futex_wait call, so it doesn't change a thing. You never change the value of the condvar. If you do the futex_wait call, the kernel will eventually load the current value of the futex word and compare it against the expected value, which should match, otherwise we don't go to sleep. Right, but this is the problem: this is just like spinning, and I'll show you in the next slides why there's a problem there. You go back and then you consume a signal you shouldn't consume. Let me come back to that in two more slides, okay, if it isn't clear enough by then. So what the spinning gives us is a simple FIFO wake-up. So far so good. And timeouts and cancellation, which add more fun to all of this, we can model like sending artificial signals, so we can deal with that. Now, the first attempt at using futexes: instead of spin-waiting, we want to eventually call futex_wait with S_sent as the futex word. So if there are not enough signals sent for us, we want to sleep. The problem, however, as I mentioned previously, is that the wake-up order of the futex does not necessarily match the order that we determined when we entered the waiter sequence and acquired our position in the sequencing of waiters. Step one and step three on this slide here in the middle: the order in step one is not necessarily the same as in step three, because we can be preempted in the middle and so on. The reason is that we do the futex_wait after releasing the mutex,
which we obviously have to, because we need to allow others to make progress before we block everyone out. And the futex system itself provides no wake-up ordering guarantees in the non-PI case, at least according to the specification. And in the PI case, we don't have a means to tell the futex to wake us in this particular order; it could be, for example, that they wake in priority order and things like that, which is (a) good semantics, but not sufficient for what we need to do here. And we don't want to wake all the threads blocked in futex_wait, because that would obviously be bad for performance: every pthread_cond_signal would be a pthread_cond_broadcast, and we don't need a condition variable for that, we can do something else that is better. Now, one possible workaround is that we could be clever and say: okay, eligibility for wake-up can also be argued when a futex_wait actually happens before a futex_wake, because then clearly parts of the wait were before the signal. So far so good. What can happen then is that waiters wake up if S_sent is larger than their position in the waiter sequence, which is the spin-waiting side, and could also happen through the comparison in the futex_wait, and they also wake up if futex_wait returns zero, as in any situation where a futex_wake really woke them. Does this work? Anybody? In the interest of time, I can already tell you that it does not, obviously. The first bug is that the difference between the number of waiters we have and the number of signals we sent can be smaller than the number of waiters that are actually blocked. The program can count, with accurate knowledge, how many waiters are actually still blocked on the condition variable, and it can legally and correctly send only that many signals to wake up the remaining ones.
Now, if two waiters wake because of one cond_signal call, one through actually observing the shared-memory value of signals sent and the other through the futex_wait, then S_sent is not incremented by two but by one. So in the end S_sent is too small and we can get fewer wakeups. Now, if you think about workarounds: can waiters themselves increment S_sent if the futex_wait returns zero? They can, but then the problem is that the check in cond_signal, the one we have so we don't make unnecessary futex_wake calls and don't increment signals unnecessarily, will hit early, and so one of the wakes might not run, which also means we have one futex_wake call missing. So if we try to work around one of the things, we've created another problem elsewhere. Maybe we'd be able to count these events and find a workaround. However, every workaround I looked at, and I looked at this for a couple of days, not counting the time I spent on it before, was really bad for performance. There were a lot of them. They all result in spurious condvar wakeups, so bad performance, and it gets even worse than that: we can't distinguish spurious futex wakeups from non-spurious ones. Now you'll say, well, the futex implementation in the kernel doesn't wake up spuriously, and you would be right. The problem, however, is the combination of POSIX and C++ requirements for when you can destroy mutexes and condition variables, combined with the general futex design. POSIX requires that mutexes can be destroyed as soon as no thread is blocked on the mutex anymore, same for condition variables and barriers and so on. The general futex design in turn means that we have a user-space fast path, and this fast path and the actual futex operations in the kernel are not one atomic step. They can happen at different times.
Now, because of this, and memory reuse, we can get spurious futex wakeups in practice. What happens is: thread one, for example, releases a mutex in user space and then gets suspended, so before it can do the futex_wake call. Then thread two acquires the mutex in user space, destroys it, and reuses the memory for another futex. Then thread one gets resumed, calls futex_wake, and this futex_wake hits a different, unrelated futex variable. It happens to be at the same address, but it's not the same thing. So in practice we get a spurious futex wakeup. That is not a problem for all the mutex implementations and so on that we have, because of what the synchronization problem looks like at an abstract level; there it's harmless. But for the condition variable in particular, and the first attempt that we had, it's a problem. Because if we can't distinguish between the spurious and the non-spurious wakeup, we're back to having the first problem. But it's even worse, because we don't even know whether the cond_signal happened or not. The condvar is allowed to wake spuriously, so it's never a crime; we can wake all of them, but then we're back to really awful performance. So we're in a bad place here: because we cannot decide what it was, we have to err on the conservative side, and then we wake everybody else, and that's not nice. So. Question? Yeah. Perhaps I'm missing something big, but can you implement a FIFO queue in user space that would keep track? So every waiter that goes to wait basically links itself into this queue, and the wake-up pops the queue and wakes one. You could do an atomic exchange to enqueue a waiter. Then you'd have your ordering and all would be fine. The issue is process-shared condition variables.
So POSIX allows condition variables, mutexes and so on to operate process-shared, where you just map the variable but you don't have any extra space that you could use. In the process-private case, yes, it's different, because we can build our own wait queues and we can do such things. But in the process-shared case, there's nothing like that we can do. So, the second attempt, and this is what I currently have in my patch. It's quite a bit more complex than the first algorithm, but it avoids these problems. The basic idea here is to maintain groups of eligible and non-eligible waiters, each with their own futex words. New waiters always start in the non-eligible group; we're limiting this to two groups right now, so this is group G2. In contrast, the eligible group, G1, contains only eligible waiters, and each signal always wakes some thread in G1. I'm stressing that it's some thread, because they're all eligible, so you can wake any of them. Which means that, from a synchronization perspective, we're back to having something like a counter: you can ask, okay, is there a signal? and just say, yes, I'll grab it. And this avoids the problem with the ordering, because through the groups we build and represent this partial ordering that we have. And then, when group G1 is completely signaled, group G2 becomes the new G1. Let me try to use the pointer now. In this part here, first step, we have just two waiters. This might be the initial state, and we start with only a group G2, all right? Then a signal comes along, signal S1, and it sees that there's no group G1 yet. So it makes group G2 into group G1 and signals one of its waiters. Then the next example, so how does it play out? We have a new waiter coming in, W3, and a new signal. The new waiter sees that group G1 isn't yet completely signaled, so it just starts in group G2.
It's a new one here. And the signal S2 that comes along, where's my mouse pointer, this one, also sees that G1 is not completely signaled. So it just sends the signal to G1, which then makes G1 completely signaled. And the last step in this example is that eventually, when all the waiters in G1 have confirmed that they have woken up, we can make G2 the new G1, and the third signal can then wake the third waiter. So that's the really high-level perspective on the algorithm. I don't want to get into the details here; you can ask me later and I can walk you through all of that. The important part for us, and also for priority inheritance, is that the groups G1 and G2 are roles. They are virtual groups, mapped to two fixed group slots in the pthread_cond_t data structure. There's no space in pthread_cond_t for more than two groups, and it's already tight. What that means is that we need to reuse the memory. So the condvar keeps track of which slot has which role, and it always has a G2 for waiters to enter. It also maintains the waiter sequence, so waiters can detect aliasing of groups, but they can only do that when they actually look at shared memory; they cannot do it inside the futex_wait operations. So reusing a G1 slot as G2 requires quiescing that particular group, to avoid an ABA in the futex_wait. An ABA situation, in case some of you are unaware of it, is a name for something that you often find in synchronization problems, where you have values representing states. You see a value of A and you think it's state X. Then you have a value of B, something else changes, and then somebody again sees a value of A. But this value of A doesn't have to represent the old state; it can represent a new state. This is what we call an ABA problem.
For example, in a concurrent linked list, if you remove nodes, you might see the same pointer, but after memory reuse it might actually be a different node in the list. The ABA in this problem is that you see a number of signals, and one value that you can see, for example the one meaning there are no signals available in this group, is the same value that you'd see if it's actually a reused group. So we need to quiesce. Note that because we have the counters, and because we reduced it to a counting problem, spurious wakeups are not a problem when you just need to count: you only need to find out whether the number of signals you have available is nonzero. But we do need to avoid this ABA problem, so we need to quiesce futex_wake calls, which means we need confirmation from all the waiters in the group that, if they started or are about to start a futex_wake call, they either are not going to do it or have finished it. This is important later on and we'll discuss it later on. The good thing, though, is that, for example, the switch from group G2 to G1 is simple: we don't need to change the futex words. So it's not as inefficient as the complexity might sound. And the quiescence is also something that we can do in user space, and so on and so forth. So, ignoring PI, I'm pretty happy with this algorithm. And now Darren will continue with how it looks once you consider PI. For any of you that are still following along, we're going to fix that now. Okay. So just for a quick recap: when we talk about an unbounded priority inversion, we're looking at this red line here taking an unknown amount of time and preventing the higher-priority task from running. This is intended to show a single CPU executing three tasks over time. And the general idea is just: you have a low-priority task running, and it gets preempted by a high-priority task.
The high-priority task then goes to sleep because it needs a resource that the low-priority task holds. That's why you see the blue task stop and the green task pick up again. But then, while the green task is running with that resource held, the medium-priority task is scheduled. And because it has no dependency on that resource, the medium-priority task can run unchecked, and the high-priority task is therefore held off by a medium-priority task until such time as the medium-priority task relinquishes the CPU. Only then can the low-priority task complete, release the resource, and the high-priority task run. What priority inheritance is intended to do is shorten that red period: it boosts the priority of the low-priority task, which will then preempt the medium-priority task and make the inversion minimal. Okay, that was just the required background. With futexes and priority inheritance, we had two goals. I say "had" because this was in 2009, when we first went to attack priority inheritance in futexes with condvars. The way a signal would work is on the left: when a signal happens, we'd wake a single task, and it would then lock and take the mutex in user space. One of the problems was that we wanted to be able to avoid a thundering herd. Because we couldn't requeue directly to a priority-inheritance mutex, which has an rt_mutex as its back end inside the kernel, we would have had to wake every single task that was blocked, and then they would all contend, and then the rest of them would go back to sleep. So that led to a lot of unnecessary wakeups and then going back to sleep. We also wanted to make sure, though, that we woke up the highest-priority eligible waiter. And I say eligible to refer back to Torvald's definition of eligible: everything before the observed order point in the application.
So, everything before S1 when S1 is issued. Next, an implementation restriction. I mentioned that we had to wake everything instead of just waking one, and one of the reasons for that is that, in order to queue to an rt_mutex, rt_mutexes cannot be in a state where they have waiters but no owner. Because if you're in that state, you have waiters that need to boost the priority of the owner, but you have no owner. So that's an invalid state which we had to avoid. So what we implemented, the gray box represents the kernel, was, if this cuts out, is that the battery? I don't know. Inside the kernel, we allowed requeueing from a normal futex to a futex with an rt_mutex backing it. We would do that inside the kernel, and as each task woke, it would take ownership of that mutex. And what that meant is we never left the rt_mutex in a state with waiters and no owner. However, this imposes a couple of challenges. One is that PI futexes enforce a policy on the futex word. The futex word cannot be used for anything else, because it is used to encode the TID of the owner and whether or not there are waiters, and that's how we reflect that kernel state to user space. So, in terms of considerations for this problem, we wanted to note that we are only concerned about unbounded priority inversion with respect to the target mutex of the condition variable, as well as any locks associated with the locking mechanism itself. One of the nice things about Torvald's new implementation is that it eliminates the internal data lock the condvar previously had. And half of what the glibc modifications that we made originally did was take that internal data lock and, if through a non-POSIX modification we determined that we would later requeue to a PI mutex, change that internal data lock to a PI-aware lock.
Otherwise we had a secondary priority-inversion risk. But he eliminated that internal data lock, which eliminates half of the glibc changes we needed to make. So that's nice. But we're still faced with the problem of the value encoding in the futex word. The other concern is that all this makes a lot of sense for SCHED_FIFO and SCHED_RR, but as we move toward SCHED_DEADLINE it's a little less obvious. Even for FIFO it's not ideal, because in the previous slides S2 would only wake W2, since W3 was not yet eligible in the algorithm, whereas W3 might be the highest-priority one, and strictly speaking it is eligible to wake at S2. Let's go back to this slide so we're all talking about the same thing. There's a simpler one, the group one. So at S2, the only possible wake-up is W2 in this scheme, even though W3 might be the highest-priority one, and that's the one you want to wake. You're correct, but it can't. Not with this scheme. Sequence-wise, it is allowed to wake; it's semantically correct, and we want to wake according to PI rules, but this scheme does not permit doing so. Noted. Let's see. Lastly, Torvald, you should probably come up for this part. So this is what I touched on briefly before. The problem with the quiescence is that we need to avoid the ABA, and so the threads that ran a futex_wake, or are about to try a futex_wake, need to confirm once they are finished doing so. Only when they all confirm that they are not going to do that, or that they have finished it, can we reuse the memory, because only then have we avoided the ABA: nobody is still in the first A, so to speak. And so we would need to boost the priority of threads that have not acquired a lock. They are just there.
And we could have something like a helper futex per thread, or another wake queue, except for the process-shared condvars. Now, if we say that, for example, in the real-time case process-shared doesn't matter, maybe that's our solution: we could actually build user-space wake queues. Might be the easiest thing to do, perhaps, or at least an approach. What about one futex operation that, atomically from a user-space perspective, unlocks one futex and goes to wait on another? Would that help? That would be your three steps all wrapped up in one. So you're thinking about releasing the program's mutex and waiting on the other futex? Yeah. The problem is that everything in between is not covered. What could help is if we had something like "increment the reference count and go to sleep on a futex", or something like that. So the sequencing that we do before the mutex release is one thing. The other thing, as I just mentioned, is that we need to know whether there are pending futex_waits, and kernel support for figuring out whether there are pending futex_wakes, without running into this problem here, that we cannot boost something for which we don't have a handle. That might help. But if the futex syscall would unlock and wait atomically, then you already have your correct order for a futex_wake, because if the futex_wake observes anybody in the futex hash bucket, they must have arrived before. That's true, if we then do everything inside the kernel. But do we want to do that? I don't know. I'm just saying: for the PI cases, performance is less of an issue than functional correctness. For PI cases that might be true, yeah. Though, quickly stepping in this direction, because it's related: a question that we perhaps really need to ask is, do you really want a condition variable at all?
We just briefly discussed the case where the ordering constraints of the condition variable specification say one thing, and priority says another thing, or FIFO says another thing. So maybe we need to think hard and ask: what do you actually need? Do you really need this, or perhaps something more like a semaphore, or a latch, or something like that? Now, POSIX only requires waking waiters that happened before the signal, which I think is a sensible thing. We further require that we wake not just any one waiter, but the waiter with the highest priority as seen by the scheduler, which is not something user space can determine even if it wanted to. So I don't see POSIX as ill-specified here; the solution you crafted for the condvar is sufficient for POSIX, but insufficient for the PI case, because, as I previously said, the S2 wake-up should also consider W3 and not only W2. And I'm not saying that the specifications are incorrect or poorly chosen. I think I agree with POSIX; that constraint makes sense. It's just that implementing it is slightly awkward, maybe. The problem with what you just mentioned is that you cannot easily implement it, this one. The reason I'm crunching S2 into group G1 is that if you don't do that, the state, the memory that you'd need to cover everything that you're allowed to do, just becomes too large, right? Well, yeah, but that's the problem. So that's why I'm thinking the futex ops could help you here by doing the unlock-and-wait atomically, but that again reduces to always doing the system calls, which is something that might not be desirable in the generic case. I mean, Darren is right. Maybe we need to do something different entirely for PI. But I really want to encourage you to think about what I said here, even though I already said it once.
So yes, the requirement that POSIX makes is a sensible one, and I think it's useful for many things, but maybe you don't actually need it. In a lot of cases, a semaphore is just as good. And maybe for the... I don't know what you are using condition variables for in a real-time setting. Somebody is making a request for this, and this person or these people should probably think about whether they can do something else, and maybe get back to me. I can tell you who came up with the request first: the Java people. Okay, so I know what you're asking. And they are doing something totally scary; they call it real-time Java. Okay, there you are. Stop wondering. So, you know, maybe we don't need to solve the hard problem if the simpler problem is easier. Right. Is there a simple solution for getting rid of Java? But with respect to Java, I believe the condvars were a requirement for the implementation of their monitors, and they are certainly open to alternatives. I mean, they've been doing all kinds of weird things for the monitor for a while. They had the scheduling issue for a long time, and this was one way they addressed it. But condvars are also widely used for all kinds of nonsense. Yes, and that's the other point. Outside of Java, the condvar is heavily used in general-purpose programming, and one of the advantages of using a real-time Linux kernel is the ability to reuse pre-existing software. But it shouldn't need PI. Still, the period between the wait and the signal should be bounded somehow, because if it's unbounded, you get, again, unbounded behavior, and all your determinism is out the window. It's perfectly deterministic: if you don't call signal, it's not going to wake. Yeah, very good. That's the same as: if you don't get an interrupt from the audio device, you're not going to wake. So this should basically only be used for external events, where you're not waiting on software, and there is no owner, so there's nothing to boost.
So condvars in a real-time and PI scenario are indeed a very dangerous construct. There are valid use cases, but it's very, very tricky. Yes. One use case I know about is where people come up with replacement functions for things found in other operating systems, from before futexes existed or anything like futexes existed. OS-9 had a very similar construct; it's called events in that space. And you can replace it one-to-one with condvars. That's unfortunately what people out there have been using for 20 years, and they just expect it to work. Is it the best solution? Certainly not. But are we going to educate all the people out there? Well, if it's events, for example, then maybe a semaphore is as efficient. The reason the condvars... We tried to implement it with a semaphore, and it didn't work. So maybe we should talk offline. It has very similar semantics to what the condvars have. Okay. Unfortunately. Because the interesting thing about condvars, what's special about them, is that you can actually detect when somebody started waiting, which you can't for anything else, like barriers or semaphores or something like that. And that makes it special here. If we didn't have this property, then it would be much easier synchronization-wise. And the other problem is we can't recall the deployments. I mean, the code is out there. But they don't have PI support for the condvars currently, so there's incentive to change. They patched glibc. Yes, for seven years. So they have a broken condition variable, because it's the old one. But the printing presses work. It's working. Kind of. Okay. Any other questions, comments? Great ideas? And with POSIX semantics, you don't have it, so you should not use them. If you're not using the patched version... are you using the patched version? No. If you're not using the patched version of glibc, you should not be using condvars with a PI mutex, because the internal condvar lock is not PI-aware.
And so you can just hit unbounded priority inversion on that alone. Regardless of the first half of the presentation, the condvar implementation is broken. Any other questions? It's more a question of how long before you can find a solution, because if I look at the bug tracker, it's about six years now. Seven. Okay. So practically, if we want to use condvars, we should rather use the patched version? If you're going to use condvars today, you should use the patched version, but understand that you will have the partial-ordering issue, where we might wake W3 when we should only wake W1 or W2. That's the issue today. It's a trade-off. Yeah. It is a bit of a subtlety, whereas if you... The problem with what we had originally was that you would use a PI mutex and intuitively expect that you would be getting priority inheritance, but the internal lock would mean that you could actually not get it, and you could suffer from the inversion. And it would be difficult to debug, because it's a glibc internal which you never really see. Yeah. You've said a couple of times that it's better to use semaphores instead of condition variables. Could you please kind of... We're out of water. Let's suppose I have a simple problem, like a queue, and another thread puts events in the queue to wake it up and start processing stuff. How is a semaphore better? So you could... What I tried to say specifically is that we could potentially, I think, implement PI support more easily for semaphores than for condition variables. We don't have PI support or anything like that for semaphores right now. So currently it's the same for me. I've seen this priority inversion; it happens to me. So there's no use for me in switching to anything else at the moment. Did you use semaphores or condition variables? No, just condition variables. So the semaphores, especially the new ones that I have, work a little differently.
So I would have to look a bit more closely at whether you might be less likely to run into a priority inversion problem. The semaphores don't use an internal lock or anything like that. You just run up and whoever grabs the token first wins. Of course, this can still be kind of unfair, for example, if low-priority threads are running. What I really tried to say is that for the semaphore, we might have an easier time giving you PI support. And I think for the queue example you mentioned: if you really just need to wait for something being available to consume, then it sounds like a semaphore to me. Okay, thank you very much. I'll think more about it. We should discuss offline; it's just too complicated to do off-hand. I spotted this interesting 64-bit futex operation thing. I mean, it has been discussed in the past. Linus shot it down back then. But the main reason why he shot it down was that Ulrich wasn't able to explain it properly. That's true. I read the email conversation. And I think, if I may interrupt you here, I think the problem is that, going back to futexes and stuff like that: they are flags, essentially, if you look at it, right? They're either acquired or not acquired. Or, you know, do I have tokens available on a semaphore or not? This here is a sequence. Right. And that's different, right? And you can argue whether 32 bits are sufficient to avoid an ABA. But, you know, it's hard telling a customer that you shouldn't wait three months, or have something suspended or not running for two months, or else you get this ABA and you get this and that. And we definitely can't do PI plus sequences in 32 bits; that doesn't work at all. That's true as well. I have the 64-bit futexes in there because then we could, reliably in practice, version our futexes so that we can avoid all kinds of ABA.
And it also would make it easier in some cases to use futex words for things that are pointers, if we wouldn't need that. Okay, but I mean, I'm not totally opposed, but what frightens me is that we might have to replicate the whole futex code once more. Which... No. So... It's just one possible approach. There are others. Maybe we really need to do more with wait queues, build more wait queues in userspace or something like that. So do something closer to what Java has done with park and unpark. Okay, so, I mean, if you come up with something which just uses this, has some special functionality for using that 64-bit variant for some special simple operation, then we can certainly do it. I think one of the issues, though, is that even if we did do the 64-bit, we still would not have solved the PI problem, because we'd be back to the 32-bit problem but with PI, since the other 32 bits are the TID and the waiters bit. Yeah, but you could argue that, okay, the 32 bits for PI are sufficient. But in any case, I would rather do one custom futex op than all of the ops again in 64-bit. No, no, no. I mean, that only makes sense if we can come up with something which requires that 64-bit thing for a particular operation, and the rest of the operations still operate on 32. So for the normal stuff, it certainly would be sufficient. We probably also don't need the special ops that we have right now, like maybe requeue or the PI stuff, in 64-bit. We might, depending on what kind of requirements we actually put in terms of PI on the condvar, be able to solve it with 64-bit, because if we assume that the incoming signal always has high enough priority, then all the signal needs to do is requeue onto a PI mutex, namely the program-supplied PI mutex. And then if we have the versioning problem solved, we don't need to do the quiescence stuff. So it may solve it if we have something like a 64-bit requeue or something, just thinking out loud.
So maybe there would be an option. You picked the most complex call. I said a simple one. No, but as Peter said, I mean, if we can help with something doing an atomic wake-wait switchover, that would be rather simple. Unlock-wait. What? Unlock-wait. Unlock-wait. Wake... Wake-wait. Unlock-wait, yeah. The other problem that causes the spurious wake-ups is that with their mutex implementation, they do not know if there are waiters, so they always have to issue the futex system call unconditionally, which is not good for performance, because then you waste a hundred or more cycles doing a no-op syscall and, in the worst case, cause these spurious wake-ups. I suspect the lock-and-unlock ops that were proposed aren't that weird either. Right, though I'm not quite sure I understood you correctly, but for the current implementation I have, we already try to avoid the futex wakes as much as possible. I still have a few cases where I need to implement it, but we do the usual stuff with, you know, having a bit for whether there are waiters. Yeah, I know, but there are these edge conditions, and it makes it rather more complex than it needs to be. It's a little complex, yeah. So, 64-bit for some things; then we have to come up with a proper explanation for why we really need it and why it helps. I'm not opposed, and I guess we can get it past Linus as well if we come up with something sensible, and not just "I want to have it because I'm Ulrich." The previous explanation that I saw wasn't convincing. He probably had his reasons, but Linus probably wouldn't have been convinced by anything else either. Okay, so any other questions on that futex stuff, or any other questions at all? Thank you both for coming here and preparing this.