So, yeah, I was going to talk about migrate_disable, but that's a bit boring, and I didn't have a presentation until yesterday anyway, so I figured I'd do something else. Well, it is bad, and I can talk about it for a few minutes later if you want. So I've been working on proxy execution on and off for a few years now; usually I get stuck, give up, and then get busy with other stuff. This time I've stuck with it for a bit, during the holidays and while I drive my daughter to ballet lessons, the odd hour here and there. So why proxy execution? We have priority inheritance for the PI futexes in our static priority schedulers. Beyond static priority we've got the deadline stuff; we've got deadline inheritance, which sort of gets there, but because of the CBS we also need bandwidth inheritance. We might actually do that if this takes a little bit longer. It shouldn't be too hard now that I've fixed the crashes we had in there with futexes and rt_mutexes. We have a stable pi_top_task pointer in the task struct, so if we change the deadline accounting to use that top task's state instead of current, it should be fairly trivial to make bandwidth inheritance work. It shouldn't be too hard, but I've not done it. And there are people doing priority inheritance on CFS; I think binder, the Android people, are doing absolutely crazy stuff, but they're inheriting nice values and that's just plain wrong. It's not proper. But you see, for every scheduler we need a different scheme, and that's a bit iffy and woolly. So who came up with this work? A student of Douglas Niehaus at Kansas dreamt up proxy execution, and Thomas has been lobbying for it for many years. Yeah, well, on UP it's simple. We'll get there.
So it integrates priority inheritance, though it's not really priority, with the scheduling function as such, because the scheduling function simply picks the most eligible task to run at any one moment. It does, however, split what we currently have as one task into two concepts: the scheduling context, which is what the scheduler selects on, and an execution context, which is basically the stack and the register state and all that stuff, the bits we actually let the CPU run. These are currently intermingled in struct task_struct, but conceptually we can pull them apart into separate things. Another thing that is very different from regular priority inheritance is that we keep tasks that are blocked on mutexes runnable, and this is an important distinction. Only if you block on a mutex do you stay runnable; you stay on the run queue, you don't actually get pulled off. So yeah, I gave it a go again. Let me see where the slide is. So very simply, this is schedule() picking a new task. It used to be just next = pick_next_task() and off we go. We add the proxy thing: the one we initially pick is the scheduling context, and if that task is blocked on a mutex, we walk the blocked-on chain, and we'll get to that a bit later. Then we get next, and the proxy function basically follows blocked_on, and from blocked_on, which is the mutex, we find an owner, and off we go. Seems simple, right? Trivial to do, doesn't work. So this was the case some years ago, before I rewrote mutex the last time: we had a fast path on mutex which set an atomic lock variable; if that succeeds, we're done. Then another task comes in, goes into the slow path, and blocks there. But this can interleave such that at the point where we need an owner, there is no owner set yet. So there are a number of constraints on mutex implementations.
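The pick-then-follow loop is easy to model outside the kernel. Here is a minimal userspace sketch of walking the blocked-on chain from the selected scheduling context down to the task that can actually execute; all names (task, mutex, proxy) are invented stand-ins, not the actual kernel structures:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of the split between scheduling
 * context (what pick_next_task() selects) and execution context
 * (what actually runs). Illustrative only, not kernel API. */
struct mutex;

struct task {
    const char   *name;
    bool          on_rq;        /* stays enqueued even while blocked */
    struct mutex *blocked_on;   /* mutex this task blocks on, or NULL */
};

struct mutex {
    struct task *owner;
};

/* Follow the blocked_on chain from the task the scheduler picked
 * (the scheduling context) to the task that can actually run. */
static struct task *proxy(struct task *next)
{
    while (next->blocked_on) {
        struct task *owner = next->blocked_on->owner;
        if (!owner)     /* a held mutex that exposes no owner: stuck */
            return NULL;
        next = owner;
    }
    return next;
}

/* Build H -> m1 -> M -> m2 -> L and check the walk ends at L. */
static int demo(void)
{
    struct task  L  = { "L", true, NULL };
    struct mutex m2 = { &L };
    struct task  M  = { "M", true, &m2 };
    struct mutex m1 = { &M };
    struct task  H  = { "H", true, &m1 };

    return proxy(&H) == &L;
}
```

The NULL check in the walk is exactly where the interleaving problem bites: if a fast path took the lock without publishing an owner, the walk has nowhere to go.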
One of them is that the fast path, or anything that claims the mutex as owned, also needs to provide an owner. You can either do this with a big state and a spinlock, or you just use the owner pointer as the content of your atomic word, which is what the current mutex implementation actually does. So yay for me. Optimistic spinning is another thing we added to mutex, and it's another thing you obviously cannot do if you want proxy execution, because your higher priority blocking task shouldn't sit there spinning and idly waiting for the mutex. You want it to contribute to, or boost, the thing you're waiting on. So we need to get back on that run queue in order to boost. If the task selection here does not pick the highest eligible task, it does not follow the chain, and the actual owner will not get to run earlier. So no optimistic spinning. And this is new: it must set the blocked_on relation, otherwise there is nothing to follow. So yeah. Optimistic spinning works because the owner is running, so technically you don't need to boost it; it's on the CPU already. Yeah. But on SMP it just gets tricky. It just got lost with the rest of the complexity; we can look at it later. It got very tricky to make this work, so I killed it. If we ever get this in, we can look at adding some of it back. So here we get the atomic owner thing, and off we go. We must be good. But earlier on I said we only stay on the run queue if we're blocked on a mutex. The corollary is that you can be blocked on anything other than a mutex: obviously the waitqueues, the condition variables, and just plain schedule(), the sleep stuff. You can go off the run queue for whatever other odd reason. So we need to deal with that, because if the owner of the mutex you're boosting is not on the run queue, you happily select a task to go run that is not in fact runnable. It's not a pretty situation. So we need to deal with that.
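Backing up to the owner-in-the-atomic-word constraint for a moment: the trick is that an aligned pointer has free low bits, so one lock word can hold both the owner and contention flags. A hedged standalone sketch; the flag name merely echoes the mainline mutex flags, and none of this is the kernel code:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Low bits of an aligned task pointer are free for flags. */
#define FLAG_WAITERS 0x01UL
#define FLAG_MASK    0x07UL

struct task { int dummy; };

static _Atomic uintptr_t lock_word;   /* 0 == unlocked */

/* fast path: atomically install ourselves as owner */
static int trylock(struct task *me)
{
    uintptr_t expected = 0;
    return atomic_compare_exchange_strong(&lock_word, &expected,
                                          (uintptr_t)me);
}

/* even after the fast path, the owner is always recoverable */
static struct task *lock_owner(void)
{
    return (struct task *)(atomic_load(&lock_word) & ~FLAG_MASK);
}

/* slow path marks contention without losing the owner */
static void set_waiters(void)
{
    atomic_fetch_or(&lock_word, FLAG_WAITERS);
}

static int demo(void)
{
    static _Alignas(8) struct task t;   /* alignment frees the low bits */

    if (!trylock(&t))
        return 0;
    set_waiters();
    return lock_owner() == &t;          /* owner survives the flag */
}
```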
So the solution we came up with is to simply enqueue everything that ends up selecting a non-runnable task on that task itself. You tear them all off the run queue and put them in a list. So here we go: if we find an owner that is not runnable, jump out, and here we do another new thing. We keep back pointers; we create a stack. As we traverse the chain, we create back pointers so we can find where we came from. Then we go back up, and all the tasks we found in the boost chain get put on a list and taken off the run queue, and we return NULL. Where did I put that? Oh yeah, here it is. If we return NULL, we try again; we just tore a whole bunch of tasks off the run queue, and those are not eligible to run anymore. So this means that when we do wake a task, we need to iterate the list we just built and put all those tasks back on the run queue. It's not too difficult. And this is the regular blocked-on chain thingy. And this works. This is UP. This boots. It's not too hard; like Thomas said, we had it on UP, it's brilliant. So getting to this point wasn't too hard. There is this one little hiccup with the mutex, though. Our current mutex does handoff. If we have a blocked_task, which is the back pointer we had from the stack, then that is the task that boosted us, and that's the highest priority task waiting for this resource, so we should hand off to that one. However, this can be NULL, and this is something I have not actually solved or know what to do about. Say we have three tasks: high, medium, and low priority. The high priority task gets there first and acquires the resource. Then the low priority one blocks, and after that the medium priority one blocks. When the high priority task releases, blocked_task is NULL, because it never needed boosting.
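That three-task hand-off scenario can be played out in a toy model; the prio numbers and names are invented for illustration, and this is not the kernel's wait-list code:

```c
#include <stddef.h>

struct waiter {
    int prio;                     /* bigger = more important */
    struct waiter *next;
};

struct mtx {
    struct waiter *head, *tail;   /* FIFO wait list */
    struct waiter *blocked_task;  /* boosting back pointer, may be NULL */
};

static void block_on(struct mtx *m, struct waiter *w)
{
    w->next = NULL;
    if (m->tail)
        m->tail->next = w;
    else
        m->head = w;
    m->tail = w;
}

/* hand-off on unlock: prefer the waiter that boosted us; when the
 * owner never needed boosting, all we have left is FIFO order */
static struct waiter *handoff(struct mtx *m)
{
    if (m->blocked_task)
        return m->blocked_task;
    return m->head;
}

static int demo(void)
{
    struct mtx m = { NULL, NULL, NULL };
    struct waiter L = { 1, NULL };       /* low prio */
    struct waiter M = { 2, NULL };       /* medium prio */

    block_on(&m, &L);   /* low prio blocks first */
    block_on(&m, &M);   /* medium prio blocks second */
    /* high-prio owner releases; blocked_task is NULL, so the
     * low-prio first blocker wins over the medium-prio waiter */
    return handoff(&m) == &L;
}
```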
In this case we fall back to FIFO, which means the first blocker gets it, which in this instance is the low priority task. If we want to order that wait list, we need to reproduce the scheduling function, which is the exact thing we wanted to avoid. So yeah, that's icky. Also, if we do the handoff and give it to somebody that boosted us, we must of course reschedule the moment we've broken this chain, so that the higher priority task can acquire the resource and get on with work. So far so good, though even on UP there's a little hiccup that I'm not quite sure what to do with. Yeah, SMP. If only this were easy, because like Thomas said, 10 years ago we had the UP thing running. So we had the unlock... No, no, I know, we should disable and delete all the CONFIG_SMP code and go back to UP. Yeah, I know. Somehow I can see some people objecting to that idea; I mean, we can't even delete all the 32-bit code, which would be awesome to do. So on SMP, our proxy chain-following thing can race against an unlock on another CPU. This can result in the owner you select being a task that has just been freed, which is generally a bad thing. It can also be you, which is a fun case; I'll get to that exact case later on. You can also race against wakeups, because as we've shown, we have this list for wakeups. If you try to boost a blocked task, you get parked; you get put on a list. And then on wakeup, we do them all. You don't want a task to go missing there: a task thinks it has blocked itself on that list, things get woken up, and you end up with a task that is just not there anymore. Been there, done that. It's not good; you sit staring at your machine wondering why it doesn't move, because a task went missing. And then there is the whole affinity thing. The obvious example is one task running on two CPUs, or even three or four. That is a fairly bad situation to be in; it gives very creative crashes.
The reason this happens is that if you have multiple higher priority tasks, one on CPU 1, one on CPU 2, one on CPU 3, all blocked on the same resource, which is owned by the one task, then all these CPUs will traverse the blocked-on chain, end up with the same runnable task, and schedule it on different CPUs. Not a good thing. I can't remember exactly, but I'm sure I saw a lot of that when I was working on this. So the idea is, when you traverse the blocked-on chain and you see that you've just crossed to another CPU, stop your traversal and migrate your entire chain up to that point towards the CPU that you see your next owner to be on. And then people ask: what if my task has affinity to this one CPU, not that one? Well, it's a blocked task, and blocked tasks don't have affinity. Affinity is for the code you run, for the execution part. If we're blocked, we don't run code, so that's okay; we can break affinities if we're not running, right? So blocked tasks have no affinity. The first thing, the mutex unlock race, is not that hard to fix. The mutex, if it is contended, will go through the slow path and will acquire the wait lock for unlock. So if we take it over our traversal, the unlock will wait for us. That means either the unlock has already happened, in which case this one can happen, or it'll wait for us and guarantee that the owner we see will still be around, which is a nice thing to have. If we are the owner, or the current iteration in our descent, then that's awesome: we can just complete the wakeup, and we'll get there later and start running it. Also not too hard. Let me see. The race against the wakeups, also not too hard: we add another lock. So here we have four locks necessary already; I'll not bore you with the details. We add a lock around the blocked list.
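The stop-at-the-first-CPU-crossing rule can be sketched as a toy walk; a single owner pointer stands in for the blocked_on → mutex → owner hop, and "migrating" is just rewriting a cpu field. Names and structure are invented for illustration:

```c
#include <stddef.h>

struct task {
    int          cpu;     /* where this task currently lives */
    struct task *owner;   /* blocked_on->owner collapsed into one hop */
};

/* Walk the chain on our own CPU only; the moment the next owner is
 * elsewhere, migrate everything collected so far towards it and let
 * the caller retry over there. Returns the target CPU, or -1 if the
 * chain stayed local. */
static int proxy_walk(struct task *start, int this_cpu)
{
    for (struct task *t = start; t->owner; t = t->owner) {
        int target = t->owner->cpu;

        if (target != this_cpu) {
            /* migrate the chain from start up to and including t;
             * blocked tasks have no affinity, so this is allowed */
            for (struct task *p = start; ; p = p->owner) {
                p->cpu = target;
                if (p == t)
                    break;
            }
            return target;
        }
    }
    return -1;
}

static int demo(void)
{
    struct task owner = { 1, NULL };     /* final owner on CPU 1 */
    struct task mid   = { 0, &owner };
    struct task top   = { 0, &mid };

    int target = proxy_walk(&top, 0);
    /* top and mid moved towards CPU 1; the owner itself untouched */
    return target == 1 && top.cpu == 1 && mid.cpu == 1 && owner.cpu == 1;
}
```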
Then there is the obvious situation that, yes, we acquired the lock, but meanwhile the other guy did the wakeup, iterated the list, and now we appear to be running again. Also not too hard: we jump back to the iteration and try again. So this also works; it's not so very difficult. And this is the last of the code, because you can see the font is getting smaller and smaller, and I did not put in the code I have for the SMP bits, because that's just horrific. It boots, but that's about everything it does; I've not run anything other than booting on it. So I'll try and describe some of this. We need to migrate at the first CPU crossing: as we traverse the chain, as soon as we see that the owner is on a different CPU, we need to migrate, because the locking and all the other stuff only provide guarantees for the current CPU. If we were to cross over to another CPU, the locking doesn't guarantee stability of pointers and we're off in the woods. Ideally you'd like to follow the chain all the way to the runnable task and migrate everyone there. We can't do this, or at least I couldn't make it work. So: migrate at the first CPU crossing. And then all we can do is migrate towards the CPU that we see the next owner being on at that point in time. We take the CPU number, which is not us, then release all the locks so everything can move again. We unwind our stack, because we kept the back pointers, put the chain on a list, shoot it over to the other CPU, and then back off and retry, and see what happens. By then that task, the runnable one, may well have migrated to us, and then the other CPU will select it, find it's gone, and need to shoot it all back over. So we can have a bit of ping-pong. I couldn't find any solid means of avoiding that; it's just too horrific. Another fun point is that we need to migrate from the idle task. Currently, in load balancing, we try and avoid migrating current, because current is on the CPU.
If you try to take that off, it's obviously icky. So what we normally do is use the stopper task, which is the highest priority task in the system. We schedule to that, which means our previous current is not current anymore, and then we can move it, and then we switch back and all that. I did not want to use the stopper task, because all we really need is to schedule again, and this is all inside schedule(). So I switched to the idle task. But like I already said, we need to reschedule: we schedule to idle and immediately schedule out again, but that's enough, because it means current is not current anymore. Otherwise this happens, and it took me a wee bit to figure out how. So: we block and we try to pick one, and we find B, which is owned over here, and we try to migrate ourselves to there. We can take ourselves off the run queue, that's okay. We can enqueue ourselves there. We can reschedule the other CPU, and it can pick A and switch to A, and boom, because now A is running on two CPUs at the same time. It is a fairly difficult race to make happen, but if it does, it leaves you in very, very funny water. So yeah, that wasn't good. This is why, before we take A off, we first switch to the idle thread and have it redo the schedule, which redoes the walk, and then you can migrate whatever task you do find, because it's not current anymore. So okay, we've talked about moving tasks towards somebody that might have something to run. This also means you need to undo this when you start running again, and that is tricky as well. So this is the situation where we end up with the owner being us. CPU 1 does a reschedule, does a proxy pick, and does the iteration. CPU 0 does an unlock and assigns the owner to us. Since it did an unlock, it needs to do a wakeup of the owner, but the wakeup needs to take the runqueue lock, and this CPU, CPU 1, already owns the runqueue lock because it's in the middle of schedule().
So here the wakeup is waiting; it's stuck. This is why we need to finish the wakeup ourselves and then run, except that if the owner has, say, a strict affinity to CPU 0, we can't actually run it here, nor can we migrate it. So what we do in this case is, instead of finishing the wakeup, we put it back to sleep and return NULL, which says: this proxy pick failed, retry. Then the other CPU will continue with the wakeup, see a sleeping task, and say, oh, I need to move you over, and it works again. It turns out that in all the cases I've found so far this seems to work, because in all cases a pending wakeup is happening. Either you're doing it from the wakeup path, in which case you fail your first wakeup and then continue to a second wakeup, which works, or you have this funny case where a wakeup is waiting on another CPU, which will then fix up the state. It's all a bit iffy, but it seems to work. So yeah, that's it, I think. I have patches, in case anybody asks for them; I've not done a lot of work on it recently, so I've not posted them yet. But like I said, it boots, and it's got a whole heap of tricky in it, and there are some unresolved issues, but it's fun to play with. So Steve asked why I don't like migrate_disable. Which was the original title of this talk? Right. It artificially limits concurrency. You end up in situations where you have four runnable tasks all stuck to CPU 0 and three idle CPUs. That is a difficult situation because people don't expect it. And from a design point of view it's also not ideal, because the reason you do migrate_disable is to use per-CPU memory. But if you're preemptible and you're using per-CPU memory, you already need locks for your data anyway, because you're sharing that per-CPU data between these four tasks that are all stuck on CPU 0. So you're not actually avoiding locks; you're limiting concurrency. And I don't really see a benefit.
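The "preemptible per-CPU access needs a lock anyway" point is essentially the local_lock pattern. Here is a userspace sketch with pthreads standing in for the four tasks stuck on CPU 0; all names are invented, and the mutex is just a stand-in for whatever lock the per-CPU data would need:

```c
#include <pthread.h>

/* If tasks using "per-CPU" data can be preempted (or, on RT,
 * scheduled), the data is effectively shared between them, so it
 * needs a lock: roughly what RT's local_lock ends up providing. */
struct pcpu_data {
    pthread_mutex_t lock;    /* the "local lock" for this CPU's data */
    long counter;            /* the per-CPU state itself */
};

static struct pcpu_data cpu0 = { PTHREAD_MUTEX_INITIALIZER, 0 };

/* one of the tasks pinned to CPU 0 */
static void *task(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&cpu0.lock);    /* "local_lock" */
        cpu0.counter++;                    /* now a safe update */
        pthread_mutex_unlock(&cpu0.lock);  /* "local_unlock" */
    }
    return NULL;
}

/* four "tasks" all stuck on CPU 0, like the migrate_disable example;
 * without the lock the result would be unpredictable */
static long demo(void)
{
    pthread_t t[4];

    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, task, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return cpu0.counter;
}
```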
So you're saying everyone who is using per-CPU variables should be using spinlocks in the first place? No, it doesn't mean that; a lot of that is under preempt_disable. Currently you do preempt_disable, and then you know this per-CPU data is yours, because there ain't nobody else. Except for the few cases that use it from interrupts, but then they know the data is only used from interrupt context, or they already use a lock or some lockless data structure. But most per-CPU stuff, for example the memory allocators and all that, disables interrupts or disables preemption, and then: no, this per-CPU memory is just for me, we don't need locks, there isn't anybody else. And this makes sense, and this is okay. But because of RT and how it changes some assumptions, for example that spinlocks in mainline disable preemption and therefore also disable migration and therefore implicitly allow use of per-CPU memory, we need to do this. But if you allow preemption while using per-CPU memory, you need locks, and at that point the use of per-CPU state is of dubious value. Yeah, I know, it's all my fault; I came up with it. The problem is I didn't find a sane way to deal with the whole per-CPU memory frenzy which was breaking out around the 3.x-RT timeframe. Yeah, so you did local_lock, right? Yeah, but local_lock is a related thing; they all rely on the fact that you're preemptible and can access this per-CPU memory, because that's how the code is written. Right, and doing it from a small and bounded preempt-disabled region is the sane thing to do. The problem is that mainline is not sane: it has very long and unbounded preempt-disabled regions. This is why we need to break up the spinlocks in the first place. If we were to... Which then resulted in everybody assuming it's safe to do per-CPU access inside spinlock regions, because mainline gives you that guarantee. Yeah, and then...
So now you break the preemptability and... Yeah, mainline built a house of cards, and we're tearing cards out of it to see what happens. So yeah, we need this for RT, but it's an ugly hack. So Sebastian asked me: why can't we do this in mainline? Can we do migrate_disable in mainline? We had this discussion a number of years ago, and I think the slab guys started it, and I said no then and I say no now, because we already have people using too-large non-preemptible regions. If you give them this, it'll only get far, far worse; they'll slap it on everything and anything. It's a patch for RT, because mainline relies on semantics that are implicit. I guess the rule is that migrate_disable can only be used in conjunction with a lock. Yeah. And in RT it is. Right. And if we ever do bring it to mainline, it will only come together with the local_lock annotations, and only for that. But then you have to deal with those driver writers; we just had a talk about how they're using... Julia, I think, was just showing that drivers like to use raw spinlocks and suchlike. Driver writers are like userspace people; they have no frigging clue. Right. So they'll probably start using migrate_disable, and we'd have to have some sort of feature so that migrate_disable cannot be used outside of a lock, with linker tricks or something. So one thing, if we want to go mop up and avoid the entire migrate_disable thing, and this means fixing a lot of code, is building a detector. You could use lockdep or whatever, so that if per-CPU data is used outside of an explicit IRQ disable, preempt disable, or raw spinlock, it'll yell. And then we... No. Because currently in mainline, spin_lock also disables preemption, and you can use per-CPU memory under a spinlock, but on RT the spinlock becomes a mutex and that introduces all this preemptability.
Also, I think RT still has local_irq_disable_nort; those would also need to yell, and preempt_disable_nort would also need to yell. So if you made per-CPU accesses yell outside the stricter conditions of RT, and then fixed up all the per-CPU usage, that would be best. I guess what we need is a way of having two preempt_disables: a special one that spinlock uses, which doesn't flag in lockdep, so that lockdep ignores the preempt disable for anything done by spin_lock. Right, but that's not too hard; lockdep knows, or can be taught, the lock types. Actually, Thomas and me just talked about a patch that validates lock nesting. Currently, in mainline, you can nest a spinlock inside a raw spinlock and nobody complains; it'll just work. It's obviously a problem as soon as you start doing RT, because then the regular spinlock becomes a mutex, and that does not work if you're holding a raw spinlock, which is still a spinlock. So back in '14 I posted a patch for lockdep to check this. It never went anywhere, because the core x86 interrupt stuff had a violation of this and at the time we couldn't fix it, but Thomas has just rewritten everything there, so it might just work out. But yeah, this is why I don't like migrate_disable. There are no upsides; it is a stopgap to guarantee the things mainline now does. I mean, I was looking into something else recently: whether I could do something magic with the whole per-CPU access stuff for RT. If we hold a lock that says this is for this particular CPU, then we basically get some magic. I think we talked about this before. Yeah, we did. It's an extension of the local_lock: use local_lock, and on RT make it an actual proper lock and relax all the... I think we can actually make that happen.
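The per-CPU access detector mentioned above might, very roughly, look like the check below. The counters are stand-ins for the kernel's preempt_count and IRQ state, and nothing here is actual lockdep code; it only illustrates the "yell unless under a strict protection" rule:

```c
#include <stdbool.h>

/* Toy state: what the RT-safe protections would bump. A plain
 * spin_lock() on RT would bump none of these, so code relying on
 * the mainline implicit guarantee starts failing this check. */
static int preempt_count;     /* raw preempt_disable() nesting */
static int hardirq_disabled;  /* raw local_irq_disable() nesting */
static int raw_spinlocks;     /* raw spinlocks currently held */

/* the "yell" predicate: per-CPU access is OK only under an explicit
 * IRQ disable, preempt disable, or raw spinlock */
static bool percpu_access_ok(void)
{
    return preempt_count || hardirq_disabled || raw_spinlocks;
}

static int demo(void)
{
    bool before, during;

    before = percpu_access_ok();   /* fully preemptible: would yell */

    preempt_count++;               /* raw preempt_disable() */
    during = percpu_access_ok();   /* per-CPU access now fine */
    preempt_count--;               /* raw preempt_enable() */

    return !before && during;
}
```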
The thing I was thinking about was not getting rid of migrate_disable completely, but changing the implementation of this_cpu_ptr and all this fancy stuff we have, by saying: on RT, if you take the lock, you store which CPU you're targeting with the this_cpu accesses. And then, as long as you hold the lock, you can be scheduled or migrated to some other place, but you still reference the right CPU's data. We played with this, I think. It went up in flames at some point. Yeah, of course. This should work in theory; as long as you have a proper lock around it, it's just any old data. It's just the this_cpu machinery which blocks you from actually doing it. Yeah. But we could hide all the nastiness in there. Maybe. It's ugly enough already. It is. Although I've not seen the very latest versions; I've not looked at RT in a while. Any other questions? While I'm here. No? Okay, so there is more icky in the patch that I've not talked about yet. One thing is that all those tasks that are blocked on a mutex but still on a run queue are exempted from load balancing, for obvious reasons. The way I'm currently doing that is really icky, mostly because of the RT classes, which have the pushable and non-pushable lists. When the scheduler finds such a task, it actually dequeues and requeues it, which is fairly heavyweight just to update those push lists. But it's the easy way to exclude them from load balancing, because if load balancing randomly moves these non-runnable tasks around, it's just pain. That was a fairly quick thing for me to do. So I can also talk about why inheriting nice values is wrong, in case anybody's interested. If you look at what proxy execution ends up doing for a weighted fair queueing algorithm, which is what CFS basically is, you'll see that it is adding up all the weight from the entire blocked chain.
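A little arithmetic makes the difference concrete. The weights come from the kernel's nice-to-weight table (nice 0 is weight 1024, nice +5 is 335); the three-task setup itself is a made-up example, and neither function is actual scheduler code:

```c
/* From the kernel's sched_prio_to_weight table. */
#define W_NICE_0  1024   /* nice  0 */
#define W_NICE_5   335   /* nice +5 */

/* Effective weight of the lock owner under proxy execution: its own
 * weight plus every blocked waiter's weight, because they all stay
 * on the run queue and pool their weight behind the owner. */
static int proxy_weight(const int *waiters, int n, int own)
{
    int sum = own;

    for (int i = 0; i < n; i++)
        sum += waiters[i];
    return sum;
}

/* Effective weight if we instead "inherit the nice value" of the
 * heaviest waiter, the way the Android/binder hack does. */
static int nice_inherit_weight(const int *waiters, int n, int own)
{
    int max = own;

    for (int i = 0; i < n; i++)
        if (waiters[i] > max)
            max = waiters[i];
    return max;
}

/* nice +5 owner with two nice-0 waiters blocked behind it */
static int demo_proxy(void)
{
    int waiters[] = { W_NICE_0, W_NICE_0 };

    return proxy_weight(waiters, 2, W_NICE_5);        /* 2383 */
}

static int demo_inherit(void)
{
    int waiters[] = { W_NICE_0, W_NICE_0 };

    return nice_inherit_weight(waiters, 2, W_NICE_5); /* only 1024 */
}
```

The owner should run with the pooled weight of everyone waiting on it, not merely as if it had the nicest waiter's nice value.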
So it's a sum of weights, not just the heaviest weight or whatever; that ends up being the inherited quantity. All the tasks that are blocked pool their weight together to make the owner run faster, run more, run sooner. For this, with cgroup scheduling, you indeed need to keep these things on the run queue. I did a very ugly patch for the Android guys a whole bunch of time ago that would emulate this, but the problem is that with the regular mutex, when you block, you're taken off the run queue, the cgroups are also taken off, and the entire weight distribution of the system changes. So the sum over the blocked chain is incalculable; you just wing it. Here, the quality of the blocked task actually remaining on the run queue, being persistent in the system, helps make the sum doable, because if you take them out, the sums change. So yeah, that's why what Android is doing is absolutely bonkers. All right. How is that going to screw with the whole CPU accounting, keeping the task on the run queue? It'll change a bit, but I doubt anybody will notice. Of course, somebody will complain, and we'll just tell them to stuff it. So ideally tasks won't block and they'll stay runnable indefinitely, and this is your upper bound on load; you'll basically be stuck at the upper bound. The actual blocked time will disappear from the load calculations. I don't know; if you have enough blocked time in your system to see it in your load averages, you're doing it wrong anyway, right? Right. So yeah, I don't care. Yeah, I was assuming that you don't care, but I was just asking whether there is... Somebody will care; they're out there lurking. I mean, Brendan Gregg recently did an entire blog post, and it's a very good read, on system load averages and how Linux differs from all the other Unixes, because people get really upset about this.
Seriously, people get worked up over it. There's a comment in the loadavg.c file we have in the scheduler saying it's a silly number, but people think it's important. It really is just a magic number, but people get really, really worked up over it for some odd reason, and I can't figure out why. It's just one of those things. Because in an age where we have 200, 400, 600 CPUs in a system, do we care about one global number? Yeah, I think not. It's the same thing with top: if you run RT, suddenly all your interrupt threads and softirq threads get accounted, and RT appears to consume 50% more CPU time than your mainline kernel. But if you add threadirqs to the mainline kernel command line, suddenly the mainline kernel eats exactly the same amount of CPU. So we have an option for fine-grained interrupt accounting using timestamps at interrupt entry and exit, and that does some of it, but it's not as visible, because we don't feed it through the hardirq reporting magic number, because that's difficult. Other architectures, I think s390 and maybe Power, actually do this; they have had fine-grained accounting from day one, so for them it's not a loss in performance. But yeah, who cares? I'm way too quick, aren't I? Yeah, you're good. Hooray. Well, food, people. Thank you, Peter.