This is a talk about the state of the real-time preemption project. So let me talk a little bit about the project itself. The mission of this project is, in the very end, mainlining RT, but it's also about documentation, establishing a community, long-term maintenance, testing, and things like that. One of the questions I regularly get asked is why establishing a community is important. When I was talking to Linus a couple of years ago about the chances of getting preempt-RT into the mainline, he told me very clearly: if there's not enough backing behind that project, he's not going to take it. There's a very simple reason for that. He can take random drivers, random file systems, whatever; they can bit-rot away in the kernel forever and eventually get removed. But RT is a different beast. It gets into the guts of the core kernel, it fiddles around everywhere, and it might, to some degree, impose limitations on the freedom to develop the kernel further. Or at least not in the way people do it now, because they would have to respect that RT is there and think about how to solve their problems with RT in mind. That's what we are doing after the fact right now, which is a painful whack-a-mole game.

So, funding: we have member companies paying funds into the project, and we have that funding in place. Out of these funds we can have five developers working on the project. They are working on the code, on the testing infrastructure, on documentation and all the related things. That's pretty good. I'm happy this happened; it's much more fun to get paid for things.

So let's talk a little bit about the tasks we have accomplished since the project started in 2016, which is about 21 months behind us now. One of the two things we were really, really busy with, and which took us a lot of time, was the CPU hotplug rework and the CPU hotplug locking rework. Why would we actually tackle that? One reason was that the complete CPU hotplug infrastructure in the kernel had been known to be fragile for about 10 or 15 years or longer, ever since it got there, and it really was literally duct-taped to death. People just applied workarounds over workarounds over other workarounds, and all this stuff completely fell apart when you tried to do CPU hotplug with preempt-RT. Not fun, because people want to run RT on a laptop for whatever reasons and expect suspend/resume to be working, and there are actually useful use cases: you have battery-operated devices for data acquisition which require an RT kernel, and you want to suspend them in order to save power. So you really want to have that support.

The first step was rewriting the CPU hotplug infrastructure itself, which was basically based on notifiers. And notifier chains have interesting properties. They are randomly ordered, despite the fact that you can assign notifier priorities. We did have notifier priorities: ten of them documented and actually defined in the header file, and then about twenty others which were randomly chosen numbers hard-coded in particular code files, which is very easy to understand and to debug. The other thing about notifiers that was interesting in the hotplug case was the teardown ordering.
You would expect that if you bring up something like a CPU, and you initialize all the facilities and all the drivers and whatever else needs the hotplug notification that the new CPU is there, then when you pull the CPU out again you tear the stuff down in the reverse order you brought it up. No: the notifiers get called in the same order. So we had situations where we had two notifiers for one piece of code because the mechanism was asymmetric and the code required symmetry, and other code which would have required symmetry as well just worked around it by hacking really crappy stuff into the code to make it work.

In 2012 I really got tired of fixing, or trying to fix, hotplug in RT, and I posted a first patch set which introduced a state machine: a state machine with explicit states, with documented ordering, and, that was the assumption at that point, symmetric states. Then I ran out of spare time, and a big corporation promised to pick it up when we had a big discussion about CPU hotplug at a conference. What happened was they went away, didn't do anything about it, and applied some more duct tape to the hotplug code, which was pretty much fun. So at the end of 2015 we started to look into it again, when we knew that the RT funding was in place, and it's roughly finished now. Well, there is still one thing to fix, but that's a new one, and it's an easy one.

There was a lot of groundwork to do to get there. First of all: look at all the places, and that's hundreds in the kernel, which use CPU hotplug notifiers, and stare at the code; sometimes you need special glasses to do that. Analyze it, document the ordering requirements. That was interesting, because you have the explicit ordering by priorities, but that covered only 10% of the total notifier base; the rest was ordered by chance, either by link-time ordering or by runtime ordering, i.e. particular init calls happening in a certain order, which was mostly due to link-time ordering again. This was interesting because when we converted it to the state machine we didn't know, because we were too lazy to really figure it out, in which order they were coming. So we just assumed an ordering, which broke stuff. Which means this code worked by chance, not by design. While we were analyzing all that stuff we found literally several dozen bugs in the CPU hotplug notifier callbacks: bogus code that was either never executed or not exploding for whatever reason.

After that I revived the old state machine core patches and mostly rewrote them, and then we did a one-by-one conversion of the notifiers to states. This required a lot of analysis, because we tried to do not only a one-to-one conversion but also a symmetric conversion, so that we really get into the state where we build things up in one order and tear them down in the reverse order. And it involved the gradual removal of the old infrastructure.

Once we had that finished, we went on to the next interesting problem, which caused us a lot of headaches in RT: the hotplug locking. The hotplug lock was basically a kind of lock, a homebrew counting semaphore, not really covered by any of the lock debugging mechanisms; it evaded lockdep almost completely. So what we did was rip that homebrew counting mutex out of the code base and replace it with a per-CPU reader-writer semaphore, for scalability.
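To make the state-machine conversion concrete, here is a minimal sketch of what it looks like from a subsystem's point of view, using the cpuhp_setup_state() interface that replaced the notifiers; the subsystem name and the callback bodies are made up for illustration:

    #include <linux/cpu.h>
    #include <linux/cpuhotplug.h>

    /* Hypothetical subsystem callbacks: the startup callback runs when a CPU
     * comes online, the teardown callback undoes it symmetrically when the
     * CPU goes away again. */
    static int foo_cpu_online(unsigned int cpu)
    {
            /* set up per-CPU resources for @cpu */
            return 0;
    }

    static int foo_cpu_offline(unsigned int cpu)
    {
            /* undo exactly what foo_cpu_online() did */
            return 0;
    }

    static int __init foo_init(void)
    {
            int ret;

            /*
             * Instead of a notifier with a magic priority, the subsystem
             * registers a named state.  States run in a documented order on
             * bring-up and in reverse order on teardown, and the startup
             * callback is also invoked for CPUs that are already online.
             */
            ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "foo:online",
                                    foo_cpu_online, foo_cpu_offline);
            return ret < 0 ? ret : 0;
    }

On the locking side, in current kernels the reader side of that per-CPU rwsem is exposed as cpus_read_lock()/cpus_read_unlock(), which the old get_online_cpus()/put_online_cpus() callers map to.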
The percpu rwsem was something people wanted to have anyway, because the mutex-counter-based CPU hotplug locking was not scalable at all, and a lot of code actually uses it in hot paths, so you get contention under certain workloads. So I tried to do that. Yeah. For a first period of time the answer was still "no hot path uses this, so you can rely on it, please go away." So yeah, I think we would have found the problem anyway. Yeah, and Linus said, I remember that discussion, Linus said no hot path uses it, or should use it. But he couldn't prevent the hotplug lock from being used in hot paths. And I succeeded in that. Right. I know, but fortunately he didn't argue with me when I changed it. One of the reasons he didn't argue, I think, was that I could show him that now that the locking was under lockdep coverage, we could actually prove that there were tons of deadlocks hidden in the code, and it was an amazing amount. Some of them took literally weeks to solve. So we had this interaction between tracing, perf, kprobes and the jump label code, which all took the CPU hotplug lock in random places, because nobody noticed that there might be a problem; lockdep didn't complain, so it must be fine.

We have, Peter just mentioned it, a similar discussion right now on LKML about the cross-release lockdep feature, where we want to track dependencies which come from one task waiting for something and another task releasing it, which is completions or wait queues, where we can have interesting deadlocks. Task A is waiting for completion B and task B is waiting for completion A: not covered. And you can have the same thing with a lock: task A takes a lock and then waits on a completion, task B needs the same lock in order to complete it; doesn't work. These kinds of things, and other more complex scenarios. People are now complaining, "oh, we don't want to deal with that, there are too many false positives," and I had a nasty discussion with one of the SCSI developers in the last couple of days where I told him: well, we had the same problem with lockdep when we introduced lockdep. The first years of lockdep were just annotating code and teaching lockdep about false positives. He was falsely claiming that lockdep is perfect and never sees false positives. I mean. You should talk to Dave Chinner. Pardon? You should talk to Dave Chinner. I did some statistics actually: we have about a thousand places, literally a thousand places, in the kernel where we do lockdep annotations. Oh, I spent a good year doing lockdep annotations when I started at Red Hat. Pardon? I did a good year of lockdep annotations when I started at Red Hat, and this is how I got sucked into all this. Yeah.

So, that was pretty interesting work, curing these lockdep problems in the code, especially the convoluted tracing/perf/kprobes/jump-label thingy. That was going around in circles. Thank you. Thanks, Steven. Yeah, that was another story. We are mostly done with curing it. The last fallout was the watchdog thing. For some stupid reason I didn't trigger it in testing, and it got reported, and I looked at it and said, oh, that should be easy. Forty patches later it was fixed, because it turned out the code was so buggy: they had tried to work around the CPU hotplug problems by papering over a design problem in the underlying code base, and that simply didn't work. And yeah, that was another two weeks wasted.
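As a minimal sketch of the lock-plus-completion pattern mentioned above, which classic lockdep cannot see because it only tracks lock acquisition ordering (the names here are made up; this is not actual kernel code):

    #include <linux/mutex.h>
    #include <linux/completion.h>

    static DEFINE_MUTEX(foo_lock);
    static DECLARE_COMPLETION(foo_done);

    /* Task A: takes the lock, then waits for the completion. */
    static void foo_task_a(void)
    {
            mutex_lock(&foo_lock);
            wait_for_completion(&foo_done);   /* never completes, see below */
            mutex_unlock(&foo_lock);
    }

    /* Task B: needs the same lock before it can signal the completion.
     * It blocks on foo_lock, held by task A, so complete() is never reached
     * and both tasks are stuck.  Classic lockdep sees only one lock and no
     * ordering problem; cross-release is about also tracking the
     * wait_for_completion()/complete() dependency. */
    static void foo_task_b(void)
    {
            mutex_lock(&foo_lock);
            complete(&foo_done);
            mutex_unlock(&foo_lock);
    }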
So, there are a couple of lessons learned. If you unearth existing bugs, you're expected to fix them. Nobody else cares. I literally got told: "this bug was not there before you changed the hotplug locking." Okay, I could prove to the person that it was there before; it was just not reported by lockdep, but it was clearly there. "No, I have no interest in fixing that." Okay, then you go and fix it yourself. The amount of crap you find is insane. I mean, a lot of people talk about the code quality of the kernel. Yes, it has a lot of code quality in certain places, but don't look outside of those certain places. It's amazing. But it's a lot of fun to see that, to fix it up, to clean it up. And if you think you've already seen the worst: no, it's going to get worse. So if you really want to help with such work, and Julia is doing things like that, Kees Cook is doing it for a different reason, for security: he is reworking the timer API, so he ends up touching thousands of files. And if you want to do that, or ever get into the situation that you need to do a big tree-wide cleanup: don't worry, don't give up, just do it. Pardon? Huh? Refcount. Refcount. Oh yeah. Oh yeah.

The other lesson learned was: never expect that corporations keep their promises. That's, yeah. I was young back then. I was not a grandparent yet. So that's an excuse. But the other thing I learned is that estimation of effort is extremely hard. When we started the RT project and got funding, of course we had to come up with a timeline, a project plan and things like that. So I took out a crystal ball and estimated it. No, seriously, the estimation was based on a lot of knowledge; I had been doing this stuff for years, so I thought I could estimate it pretty well. But particularly for the hotplug work I was off on the timeline by a factor of two and on the total work hours by a factor of three. Which, in terms of software projects, is not that bad, but I would have expected to be better at it. So it took almost two years of time in total to clean that up, something like two man-years of effort, and the resulting patch flow out of that was something like 1.2 patches per workday. You can extrapolate how many patches it took to get this cleaned up.

Other stuff we were doing during that timeframe: we rewrote the internals of the timer wheel, because the timer wheel had a big problem in RT. We couldn't make the timer wheel base lock a raw spinlock; I actually had to convert it to a sleeping spinlock due to the cascading nature of the timer wheel, and I wanted to get rid of that. That was done by giving up precision for timers which are far out in their expiry time. But that allowed us to get NOHZ and the NOHZ_FULL stuff working on RT again, because there's a code path where we have to take that lock from idle, and we can't take sleeping spinlocks from idle. That simply doesn't work. We tried it; it simply didn't work. You can, but it's not recommended.

For hrtimers there's a patch set that was posted and reviewed, and it's about to be reposted. It's about the distinction between timers whose callbacks expire in hard interrupt context and timers whose callbacks expire in soft interrupt context. In the initial version in mainline we had this distinction, but at some point Linus looked at the code and yelled at us, rightfully so: it was ugly.
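Jumping ahead a little: from a timer user's point of view, the distinction in the reworked patch set (described in the next paragraph) boils down to a mode flag. A minimal sketch, assuming the HRTIMER_MODE_*_SOFT flags from that series; the callback and the interval are made up:

    #include <linux/hrtimer.h>
    #include <linux/ktime.h>

    static struct hrtimer foo_timer;

    /* With a _SOFT mode the callback runs in softirq context instead of hard
     * interrupt context, which is what RT needs for most timer callbacks. */
    static enum hrtimer_restart foo_timer_fn(struct hrtimer *t)
    {
            /* do the actual work here */
            return HRTIMER_NORESTART;
    }

    static void foo_start_timer(void)
    {
            hrtimer_init(&foo_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_SOFT);
            foo_timer.function = foo_timer_fn;
            hrtimer_start(&foo_timer, ms_to_ktime(100), HRTIMER_MODE_REL_SOFT);
    }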
And we ripped that initial version out, and for those users who needed the softirq expiry we came up with a weird construct, the hrtimer tasklet, which basically scheduled, from the hard-interrupt-context callback, a tasklet which then executed the actual callback of the driver in softirq context. Pardon? You're welcome. I won't tell anyone who came up with that. So there are quite a few uses of that in-tree, and we had other discussions where some facilities wanted to use hrtimers for good reasons but needed the same indirection, which is inefficient and ugly. And RT needs it as well, because most of the timers whose callbacks can be executed in hard IRQ context in mainline cannot be run in hard IRQ context on RT. So we still have that workaround in RT, moving the timers out of hard interrupt context into a separate expiry queue, and it's still inefficient and still ugly. So we reworked it, and it looks pretty sane right now. Basically what we do: the high-resolution timers are queued in RB trees sorted by expiry time for the various clocks we have in the system, clock monotonic, clock realtime, clock boottime and clock TAI. What we do now is duplicate those bases, so for each clock base we have a corresponding soft-expiry clock base. That way we don't have to reshuffle expiring timers from the hard interrupt context into a separate expiry queue; they are queued right away in their own RB tree, and we just say: this is the first timer of this set, please make sure the hardware timer fires at that time. That removes quite some crap from the RT patch again.

We did quite some work in the area of futexes and rtmutex, together with Peter and others. That was fun. The separation of pagefault disable and preempt disable: we had that for a long time in RT, then the s390 people figured out that they needed it for mainline, I think for virtualization purposes, and we helped them along to get that done. Then there was a huge amount of code in the tree which was abusing task affinities for obscure purposes, and we extended a lot of the debug facilities in order to catch abuse of various things in the kernel.

So by now, in this one and three quarter years, we have about 700 patches merged: either already in Linus' tree or on the way to 4.15 in one of the maintainer trees. There are roughly 50 patches under review or pending for repost, and in summary we fixed about 40 real and 80 latent bugs in that time. It was partially very obscure stuff. The latent ones I categorized as basically impossible to trigger, but they exist. The real ones were actually rather easy to trigger, but people got lucky and never triggered them, or they never got reported. We saw the same thing with the early work on RT, where we spent a huge amount of time fixing up locking problems, and then people told us: oh yes, that's the thing which stops my server every three months.

So, let's talk a little bit about the long-term stable versions, because that's something which comes up regularly and came up recently in a discussion. These are the trees we are currently maintaining. 4.14 will be the next one; I don't know what the end of life is for that one yet. Pardon? Long-term stable: 4.13 is not a long-term stable, 4.14 is the long-term stable. 4.14, yeah. Steven is doing a lot of the work, but Steven doesn't scale, so we have to have a discussion in the meeting on Monday.
So, Julia took over 4.1, right? But there's still stuff which is rarely updated, and Steven has a huge backlog, he told me. So we have to do something about that, but that's a discussion we'll have on Monday in the project meeting. Our current development version is 4.13. The patch is out there, it kind of works, but we are not going to stabilize it; we're going to drop it in favor of 4.14, because that's going to be a long-term stable version, and we agreed on supporting long-term stables instead of picking a randomly chosen kernel version. Whatever. You know I'm stubborn.

So, what do we have on the development task list? The most complex one right now is the dcache locking. Then there are the soft interrupt modifications, where there are still some rough edges we have to figure out; the memory management interaction, which is pretty straightforward, but the memory management maintainers are interesting people; and the local locks and annotations, which is something I've wanted to get into mainline for a long time but got distracted from by the hotplug locking and other things. Yeah, I talked about that with him and he didn't come up with something reasonable, but then he didn't care much, because it works in mainline by some definition of "works".

So, what's wrong with the dcache locking? The dcache locking has one problem which hits us in RT: the trylock loops. In the dcache you sometimes have to take locks in the reverse order. If you walk the tree in a particular direction, then in order to achieve that you lock one node and then you try to lock the parent node, or is it the other way around? I keep forgetting what the regular lock order is. So the regular order, let's say, goes up, and when you go the other way you have to do the trylock dance. Trylock loops in mainline are pretty cheap, because you know that the other side is inside the critical section and is going to leave it sooner rather than later. Now, that doesn't work on RT, especially not if the one who holds the lock is on the same CPU and you preempted it: it cannot get back on the CPU because you preempted it, so you would trylock forever. And you get unbounded priority inversions from that, up to the point of a livelock. So right now we have a butt-ugly workaround in there: if we fail the trylock, we just go to sleep for, I think, a couple of microseconds or something like that; I can't remember, maybe it's a millisecond, it doesn't matter much. The reason why I did this was, A, it was the cheapest solution to the problem, and B, I was just saying: if any real-time-relevant task is doing file system operations on the dcache, I can't help them anyway, so screw it. You still have my old solution for the trylock, the marker thing; it worked. I know it worked, Steven, but it's even more convoluted than the multi-reader boosting. Well, at least it's more deterministic. It's deterministically ugly, it's just not. Yeah, I mean, there's some deterministic behavior in the ugliness when patches come from you in that area; we know that. Now, seriously: I was talking to a couple of people involved in the dcache itself, and some of them actually think that the trylock loops are not needed at all. So there is some experimentation out there already, and some of that was initiated by me trying to do more RCU stuff in the dcache. It kind of works, but there are a couple of corner cases which really do not work, and we have no idea yet how to solve that.
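To make the trylock-loop problem concrete, here is a heavily simplified sketch of the pattern and of the current RT workaround; it is not the actual dcache code, and the msleep() stands in for the cpu_chill()-style "just go to sleep briefly" hack mentioned above:

    #include <linux/spinlock.h>
    #include <linux/delay.h>

    /* Simplified: walking in the "wrong" direction means the second lock can
     * only be tried, because taking it unconditionally would invert the
     * regular lock order. */
    static void lock_both_sketch(spinlock_t *node_lock, spinlock_t *parent_lock)
    {
            spin_lock(node_lock);
            while (!spin_trylock(parent_lock)) {
                    /*
                     * Mainline: the holder runs on another CPU and will leave
                     * the critical section soon, so retrying is cheap.
                     *
                     * RT: the holder may have been preempted on this very CPU
                     * by us, so retrying can spin forever and becomes an
                     * unbounded priority inversion.  The current workaround is
                     * to drop the lock and sleep briefly before retrying.
                     */
                    spin_unlock(node_lock);
                    msleep(1);      /* stand-in for cpu_chill() in the RT patch */
                    spin_lock(node_lock);
            }
            /* both locks are held here */
    }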
And unfortunately Nick, I mean, he's back in kernel development, but he vanished for a couple of years in some dark hiding place, and it seems he has forgotten everything he ever wrote about the dcache. Pardon? Yeah, I probably would have forgotten it by now as well. So this is the most challenging issue right now. If you have an idea how to solve it which does not involve trylock boosting, something we can actually sell to Linus, that would be appreciated. At the moment I'm drawing a blank, but that has been the case several times before, and we always came up with some solution for the problems. So let's see where that goes.

If you want to help: there's testing and documentation, that's stuff we really need help on. If you really want to dive into the code, there's a task list on the real-time wiki. It's halfway up to date; I updated it two weeks ago. Yeah, I did my homework. So if you have something particular in mind, please shoot me an email and let's talk about it. Other than that, just grab the code, fix it, and send me the patch saying it's done; I'm happy to pull it in.

People always expect a roadmap from my talks. Here is the roadmap this time: due to the evolutionary nature of Linux, the roadmap will be published after the fact, but it will be a very, very precise roadmap. Questions? Nope. No, I gave up on that. You all have these smartphones now and can figure out where the pub is on your own. The roadmap with the pub was back in the time when most people didn't have these things. So, no more questions, that's good. Everybody is looking forward to the next speaker, which is Steven Rostedt and his 500-words-per-minute talk with 6,000 slides in half an hour.