I'm Gratian Crisan. I work for National Instruments. To give you some context, National Instruments makes hardware and software for the test, measurement, and control markets. I'm part of the real-time OS group at National Instruments. We've been using PREEMPT_RT for the last six or seven years on all new hardware. We still maintain some old legacy hardware, but I won't talk about that, and we're trying to get rid of it. Our two main architectures currently are ARM and 64-bit Intel CPUs. Our main platform is this embedded CPU plus FPGA combo, but we support many more than that. And we use OpenEmbedded to build our distro. I do have a couple of disclaimers. This is work done by multiple people; I try to link at the bottom of slides to their mailing list posts or commits and so on. I intentionally picked some RT troubles that have ugly hacks in RT currently, so don't throw too many stones or frozen sharks at me. And some data might be stale by now. The Linux kernel moves very fast; I tried to get numbers off the latest branch, which was two weeks ago, and lots of things have changed since. This is my agenda for the rest of the presentation. I'm going to do one slide on the problem space and then go through describing an RT trouble area we've run into. And since this RT Summit is supposed to be about discussion, hopefully we'll have a good discussion around it. I'm going to do this as long as I have topics and we have time. So the problem space for us is defined in terms of control loop rates. There are two ways you can run a control loop. You can reserve a core and run it full tilt in polling mode. That's possible with Linux; there are some usability issues with that, but that's not the area I'm going to concentrate on. Instead, I'm going to concentrate on control loops that are triggered by a timer expiration or some other event.
And in that space, for us, the trouble area is control loops that need to run between 10 kilohertz and 100 kilohertz, which roughly corresponds to wake-up latencies between 10 microseconds and 100 microseconds. That's because control algorithms are usually fairly simple things: PID is just a couple of multiplications and additions. So the control loop iteration time is actually dominated by wake-up latency. I also listed, let me see if I can get a pointer here, where our slowest platform currently stands, and our fastest, just to give you an idea. With that, I'm going to start with the first trouble area we've run into. It was discovered by just bumping an Ethernet cable. I think Julia did that while running a cyclictest run. And then more recently, we found it with a TPM chip we were testing. This time we were looking for it intentionally, because we knew it could be a problem. The symptoms are that the CPU appears stalled in the middle of a memory-mapped IO read instruction, and the timer interrupt gets delivered late even though interrupts are enabled. It was discovered with the e1000, e1000e drivers and, like I said, recently with the tpm_tis driver. And this is how it looks. This is a screenshot of KernelShark. There's a lot of stuff going on there; I just wanted to show you how it looks on a KernelShark trace. If you look in the upper right corner, between those two cursors, you can see that long stretch of time where the CPU just does nothing. And this is how it looks if you plot a cyclictest histogram while accessing that TPM chip in parallel from a SCHED_OTHER low-priority thread. You get this weird histogram with a really long and flat tail that's over 400 microseconds in length. That's the added latency caused by accessing that TPM chip.
And by comparison, on this graph I overlaid a hackbench run that maxes out at 56 microseconds on this platform. Why this happens is essentially because the CPU can write exponentially faster than the IO device can sink the writes. So the writes get buffered between the CPU and the IO device. This is generally not an issue if the number of writes is small; you usually don't notice it. But if you get a stretch of writes in a row, like with that Ethernet driver, when you plug and unplug the cable there's a lot of reconfiguration of the MAC that needs to happen, so there are a lot of writes in a row. Then you get this stall in the middle of the MMIO read instruction, because that read instruction for the same memory region needs to be ordered with respect to the writes. And it results in latency spikes of hundreds of microseconds. It's pretty obvious why that happens if you look at this: it's a fairly simple CPU, just a four-core embedded Intel CPU. If you look at the path the data has to take, for example from that core all the way to this TPM chip that's connected to a fast SPI bus, it has to go through all this IO fabric running at different frequencies and different bus widths. There are all these buffers, and arbitration that needs to happen along the way, until you get to that TPM chip. So we have a couple of hacks that are really ugly currently. For the Ethernet driver, we just added a delay after stretches like this where there's a large number of register writes. For the TPM chip, one of my colleagues, Haris Okanovic, did a more involved patch that essentially replaces the iowrite call with an inline function that, for PREEMPT_RT, does a read after each write and forces the flush of that write. So in practice you only ever pay a single-access delay on that platform. Yeah. So that's the description of the trouble area, and I'm hoping now we can talk a bit about what we can do.
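To make the read-back hack concrete, here is a minimal userspace mock of the pattern; `mock_iowrite32`, `mock_ioread32`, and the buffer-depth counters are invented purely for illustration, and real kernel code would operate on an `__iomem` mapping instead:

```c
#include <stdint.h>

/* Userspace mock of posted MMIO writes -- all names are made up.
 * Real kernel code would use iowrite32()/ioread32(). */
static volatile uint32_t fake_regs[16];
static int posted_writes;  /* writes "in flight" in the fabric buffers */
static int max_posted;     /* worst buffer depth observed */

static void mock_iowrite32(uint32_t val, int reg)
{
    fake_regs[reg] = val;          /* posted: the CPU moves on immediately */
    if (++posted_writes > max_posted)
        max_posted = posted_writes;
}

static uint32_t mock_ioread32(int reg)
{
    posted_writes = 0;             /* a read forces all posted writes to land */
    return fake_regs[reg];
}

/* The RT hack: read back after every write, so a later read never
 * has to wait for more than one posted write to drain. */
static void mock_iowrite32_flushed(uint32_t val, int reg)
{
    mock_iowrite32(val, reg);
    (void)mock_ioread32(reg);
}

static int run_mmio_demo(void)
{
    max_posted = posted_writes = 0;
    for (int i = 0; i < 40; i++)   /* ~40 writes in a row, like the e1000e case */
        mock_iowrite32((uint32_t)i, i % 16);
    (void)mock_ioread32(0);
    int burst_depth = max_posted;  /* the read had to wait for all 40 */

    max_posted = posted_writes = 0;
    for (int i = 0; i < 40; i++)
        mock_iowrite32_flushed((uint32_t)i, i % 16);
    int flushed_depth = max_posted; /* never more than one write in flight */

    return burst_depth == 40 && flushed_depth == 1;
}
```

The trade-off is visible in the model: with the flush, every write now pays a device round-trip, so throughput drops, but the worst-case stall a later read can hit is bounded to a single access.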
I'll preface this by saying we know this is a hardware problem. Unfortunately, it seems to be a fairly common one, so I'd be interested to hear from the RT community if there are other ideas about what could be done. I have a couple of prompting questions here: I don't know if you've encountered this in other drivers, or on other architectures? Yeah, that's what I thought. I think it's a function of the buffers along the way. It's not fixed for the Ethernet case; it depends also on the PCIe bus topology, how many switches are on the way. I don't have data for the TPM; I know for the Ethernet case it was about 40 writes in a row followed by a read. When it's a single write, it's not noticeable, essentially. Yeah, I would say probably over 32 or something like that. We also did a test where we saturated the buffers to see how bad it could possibly be. Do you know if there's any way to track these buffer states? I don't know if there are Intel people here. Are there any PMU counters or anything like that we could use to track the buffer states? The only thing you can see in the core PMU counters is the fact that you're actually stalled, and how long you stalled. But that doesn't tell you anything about why. Yeah, it doesn't help you try to prevent it by issuing a read or something. So did you look generally at the code to see if there are a lot of these patterns lurking around? That's kind of the first question here: what tools are available to actually pick out those pieces? You can't just grep around the source tree and figure out what the access patterns will be, because some of these access patterns are going to be not obvious at all. Right. The problem is it's the writes at runtime. So it's not necessarily that they're after each other in the code; it's the runtime path.
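A minimal sketch of what such runtime instrumentation could look like, counting consecutive writes before a read hits; all names below are hypothetical, and in the kernel the counters would be per-CPU and live inside the accessor functions:

```c
/* Sketch of lightweight instrumentation: count consecutive MMIO
 * writes since the last read and remember the worst run.  In the
 * kernel this would be a per-CPU counter bumped inside the iowrite
 * and ioread accessors; these names are invented for illustration. */
static unsigned int writes_in_a_row;
static unsigned int worst_write_run;

static inline void trace_mmio_write(void)
{
    if (++writes_in_a_row > worst_write_run)
        worst_write_run = writes_in_a_row;
}

static inline void trace_mmio_read(void)
{
    writes_in_a_row = 0;   /* the read drains the posted writes */
}

/* Simulate a driver doing `nwrites` back-to-back register writes
 * followed by one read; return the worst run the counter saw. */
static unsigned int simulate_write_burst(unsigned int nwrites)
{
    writes_in_a_row = worst_write_run = 0;
    for (unsigned int i = 0; i < nwrites; i++)
        trace_mmio_write();
    trace_mmio_read();
    return worst_write_run;
}
```

As noted in the discussion, a plain count ignores the time between writes, so it only gives a rough upper bound on how bad a burst could be, not an accurate latency prediction.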
I mean, the question is whether we just hack some of the accessor functions or inlines or whatever, and either instrument them in the first place to see how that accumulates, or you have to come up with some smart, non-intrusive instrumentation for that. But the other thing is, what we might actually do for RT is force the read-back, which is a performance problem. Will we then find hardware that breaks because the writes don't land fast enough? Yeah. Yeah, I mean, hardware is broken; we know that. So, yeah. So I actually added that question: could we add a load to the iowrites? But there would have to be exceptions, because some loads have side effects for some things. I think Julia posted that to the list; there wasn't any follow-up after that. Maybe one good thing you could do is document in the wiki that this is a problem, so people who test their driver and hardware can look at it and figure out: can you find any of those patterns in your driver? Maybe we integrate some lightweight instrumentation, which means just counting consecutive writes before a read hits. That should be fairly trivial if we do it per CPU, because it's unlikely to happen otherwise. Yeah, you can get scheduled over to a different CPU, but by then the problem is gone. Yeah, the problem is you also have to account for time, because it's okay if there's enough time between those writes. So maybe a count. Yeah, I mean, at least it will give you a rough idea, but it won't be very accurate. Right. Being out of ideas for this one, I'm going to move to the next topic. Which is actually not a trouble anymore, because, fortunately, I took two weeks of vacation right before the RT Summit, and Sebastian submitted a patch the day before I left that fixes this issue. But I didn't have time to change the presentation, so I'm going to talk about it anyway, because people might not be on the latest RT development tree.
So the symptoms for this one are: if you have multiple timed sleeps or timeouts coming from SCHED_OTHER threads, they can stack up to fairly large latencies. And it's not just clock_nanosleep you need to track; there are other things that use high-resolution timers, like locks with timeouts, futexes with timeouts. In fact, that's how we first found it: there were a bunch of, I think it was condition variables with timeouts, that stacked up to a fairly large latency. This is an old trace from back then. It's a bit more complicated, because first you get the high-priority thread wakeup, but then immediately there's another timer interrupt that gets stacked up. Then there's the scheduler tick processing that takes about 40 microseconds with interrupts disabled. This is with event tracing on, running on a fairly low-power ARM. And what you see in the traces is these 15-microsecond sections where hrtimer expirations get processed, and they keep stacking up; they can stack up to large latencies. When we first saw it, we went and wrote a pathological test to see if we could reproduce it. It just launches a configurable number of SCHED_OTHER threads that do random nanosleeps of up to one millisecond. Using this test, you can reproduce the same pattern: you see the cyclictest wakeup, in this case we were running cyclictest in parallel with this stress test, and then you see these 15-microsecond sections of hrtimer expiration processing. If you're curious how it looks on a histogram, I compared it with a hackbench run. The hackbench runs are the green and blue traces up here, and they max out at about 15 microseconds on this Intel Atom running at 1.3 gigahertz. By comparison, running cyclictest in parallel with that stress test, you get almost double the max latency, and you also get this ugly-looking histogram, ugly-for-RT-looking histogram.
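The measurement core of such a stress setup can be sketched in userspace like this, assuming CLOCK_MONOTONIC and an absolute-time sleep; this is the same idea cyclictest uses, not the actual test we ran:

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>

/* Minimal cyclictest-style measurement: sleep until an absolute
 * deadline `interval_ns` from now with clock_nanosleep(TIMER_ABSTIME),
 * then return the wakeup latency in nanoseconds (time woken past the
 * deadline).  A stress run would do this in one thread while many
 * SCHED_OTHER threads issue random nanosleeps in parallel. */
static int64_t measure_wakeup_latency_ns(long interval_ns)
{
    struct timespec now, target, woke;

    clock_gettime(CLOCK_MONOTONIC, &now);
    target = now;
    target.tv_nsec += interval_ns;
    if (target.tv_nsec >= 1000000000L) {   /* normalize the timespec */
        target.tv_sec += 1;
        target.tv_nsec -= 1000000000L;
    }

    clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &target, NULL);
    clock_gettime(CLOCK_MONOTONIC, &woke);

    return (woke.tv_sec - target.tv_sec) * 1000000000LL
         + (woke.tv_nsec - target.tv_nsec);
}
```

Collecting these latencies into a histogram over many iterations is what produces the plots shown on the slides; on a non-RT kernel under load the tail grows exactly as described above.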
And it's the same story on ARM; it's a universal problem. There the max latency was even worse. Before Anna-Maria's rewrite of the softirq processing for hrtimers, and before Sebastian submitted his patch, we had a fairly ugly hack that we tried. We didn't really ship it because we weren't sure it was safe: what we were doing was marking only the RT tasks' timers as irqsafe. And like I said, right before I left for my vacation, Sebastian submitted this patch that actually fixes the issue for us. By comparison, this is the new histogram here with the patch, and you can see that histogram looks nice and spiky and has a much lower max latency. And it's the same for ARM. Now, because this is a solved problem, I don't really have a discussion, so I'm just going to do lessons learned. Mainly: don't leave on vacation before your RT Summit presentation. Sorry. Yeah, go ahead. There is one thing that might need to be discussed, and that is the use of hrtimers in general in the kernel, because there's quite a large number of more or less misconfigured usleep ranges and things like that that need to be fixed. And there does not seem to be awareness that these things can cause problems. I wonder how much of these quite useless high-resolution timers are actually part of the problem here. Which ones are useless? Well, having something like usleep_range with min and max being equal, or usleep_range calls with a range of 100,000 to 200,000, where it just doesn't make sense to use a high-resolution timer because you're not in that kind of precision context anyway. The problem is the timer wheel, at least as it is now after the rework of the timer wheel, because it had its own problems: you can't use it for anything which should expire halfway precisely. And it doesn't matter whether your expiry time is 100 microseconds out or 5 seconds out. So if you want halfway precise expiry times, basically what we decided in the kernel community is that the userspace interfaces get the precise timers.
And that's what a lot of things rely on today, like poll timeouts and other things: you want a halfway precise expiry time. The timer wheel degrades in precision the further out you go. The reason behind that is that once we had high-resolution timers as the precise timers, we basically left the timer wheel for timeouts. And a timeout means it catches something which shouldn't happen anyway. If you look at the statistics of how often timer wheel timers actually expire: if you do heavy networking loads, except for a few things, most of the time 99% of the timer wheel timers are canceled or reprogrammed before expiry. TCP timeout timers and stuff like that. So what we decided to do in the timer wheel is basically only keep the first wheel precise on the tick, and then the next wheels degrade in precision. That has two effects. One, we can get rid of cascading, which was a major pain in the neck with the timer wheel, because if you had 100,000 TCP connections on a machine, you eventually had to re-cascade a gazillion timers into the next, more precise wheel, only to cancel them before expiry. So that was work done for nothing. So now, if you have something which you want to happen in exactly five seconds from now, the timer wheel is the wrong thing, because it will expire either in five seconds from now or in five seconds plus 12% or something like that, in the worst case. Yeah, but if I have a usleep_range and I'm putting min equal to max, it prevents optimization. And if I have a usleep_range with a min of 100,000 and a max of 200,000, and you look at the context where that's used, it's very hard to see why that would need to be a precise timer in most cases. There are just a lot of timers that were converted when usleep_range was introduced where I think there was not that much thought given to whether that needs to be a high-resolution timer or not.
So it's just a matter of keeping the kernel clean with respect to high-resolution timers. Yeah, I mean, if there are things where we could replace them easily, without changing the functionality, with timer wheel timers, then there's no reason why not. I mean, the range thing was basically introduced for power saving. And now that we have the implicit batching in the timer wheel, which gets into less granular ranges the further out your timeout is, we might get away with converting some of those hrtimer users to the timer wheel as well, without violating the power constraints. But you cannot do a wholesale conversion; you have to look at it on a case-by-case basis. So we can put that on the ever-growing to-do list. Yes. Volunteers, volunteers. Yeah, but a Coccinelle script will just identify the places; it can't make a decision. That's the problem. You have to look at the thing in order to make a decision, and that's not something a script can do. Moral of the story: report problems early. They might just get fixed for you; there are great guys working on it. All right. RT trouble number three. It's actually not a kernel issue; it's a glibc issue. And that's the lack of priority inheritance support for the locks in the pthread library. A couple of months ago I did a survey looking at the locking primitives in the pthread library and what futex operations they use. At the time, and hopefully it hasn't changed since, I don't think the code moves that fast in glibc, the only locking primitive I could find that actually supports priority inheritance was the mutex, if you enable it through its attributes. All the other ones either have an internal lock, so the reader-writer lock has an internal lock that uses a non-PI futex WAIT/WAKE, and the semaphore is the same thing. And pthread spinlocks are not that RT friendly; they're just userspace spinning.
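For reference, enabling PI on the one primitive that does support it, the pthread mutex, looks roughly like this; `init_pi_mutex` and `pi_mutex_smoke_test` are just illustrative names:

```c
#define _GNU_SOURCE
#include <pthread.h>

/* Create a mutex with priority inheritance enabled via its
 * attributes -- the one pthread locking primitive where PI works.
 * Returns 0 on success, an errno-style error code otherwise. */
static int init_pi_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int err;

    err = pthread_mutexattr_init(&attr);
    if (!err)
        err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (!err)
        err = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return err;
}

/* Quick sanity check: init, lock, unlock, destroy. */
static int pi_mutex_smoke_test(void)
{
    pthread_mutex_t m;

    if (init_pi_mutex(&m))
        return -1;
    if (pthread_mutex_lock(&m))
        return -1;
    if (pthread_mutex_unlock(&m))
        return -1;
    return pthread_mutex_destroy(&m);
}
```

On Linux this maps to the PI futex operations in the kernel, so a low-priority holder gets boosted while a higher-priority thread is blocked on the lock; none of the other pthread primitives surveyed above get this behavior.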
And the condition variables have a really long history. I was hoping Darren would attend, but he's not in the audience. He filed a bug report in 2010, and as you can see, it's still open: basically, pthread condition variables don't support priority inheritance. There's been a patch set that Darren wrote, together with, I believe, Dinakar, I'm not going to say his last name because I'd mangle it, and I helped forward-port that patch a couple of times. It never got accepted. And following the RT Summit last year, Torvald Riegel had a presentation around this, because the condition variables had to be rewritten due to a C++11 standard change. The last comment on this bug report says essentially that there's no known way to achieve priority inheritance support with this design. A personal gripe of mine: the POSIX standard actually hasn't changed yet. The issue is still open with the Austin Group; they need to clarify the ordering, essentially whether a signal can wake up a waiter that started waiting after the signal was sent. That clarification hasn't been made in the POSIX standard yet, but the glibc code has changed now. So pthread condition variables don't have priority inheritance support, and the new implementation also has a priority inversion issue. As I understand it from the presentation last year, it's implemented with two groups of waiters, one active and one on standby, and waiters in the standby group won't get signaled until the active group has been completely signaled. That can create priority inversions for RT threads that get put into the second group and have to wait for a long time. Yes, okay. So for discussion, I was going to ask if you know of work in progress, so apparently there might be some, or of any other changes since last year. Or if you know of a library for RT locking or RT data structures that does have priority inheritance support.
I mean, the only known-to-be-working PI implementation is the pthread mutex. I'm not aware of any other things out there, and I mean, you don't want to use userspace spinlocks anyway, right? Neither on RT nor on anything else. There are use cases where you can do that, but you have to be very, very careful in your system design to make them work. I mean, multi-reader boosting. Multi-reader boosting is dodgy. I mean, Steven has done it, but a timing analysis, I'm not sure that's ever been done on it. I know that the old Solaris kernel used to boost the first reader and then just give up, because, as it turned out, there were very many cases that only had the one reader. So it more or less worked. Yeah, that's another question. I would like to ask for a clarification. If I have, in a hard real-time thread, a broadcast on the condition variable, and a non-critical thread waits on that condvar, and there is no touching of the mutex, if I do not lock the mutex in the waiting thread or in the broadcasting thread, then is the actual broadcast non-blocking, or is it blocking? Do you understand what I mean? The problem is not the external mutex associated with it; it's the internal implementation. The internal implementation has its own mutex, which is non-PI, and you have to take that one in order to broadcast. So even the broadcasting thread? Yeah. So yeah, the internal glibc implementation has a lock inside. In many cases, it would be interesting to have just a primitive which allows you to wake up low-priority threads and which is really non-blocking. So basically using a futex, which should allow this without the consistency guarantees or anything else, but only as a firing event to the other low-priority tasks. So what is the suggested primitive to use in this case? A raw futex? Yeah, I mean, I guess either a raw futex or a mutex.
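A raw FUTEX_WAKE as a fire-only, never-blocking signal could be sketched like this; `futex_wake` and `wake_demo` are made-up helper names, and a real design would pair this with FUTEX_WAIT plus a value check on the waiter side to avoid lost wakeups:

```c
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Signal-only primitive: FUTEX_WAKE takes no userspace lock, so a
 * high-priority thread can fire it without risking a priority
 * inversion on a condvar-internal non-PI mutex.  Returns the number
 * of waiters woken (0 if nobody is waiting), or -1 on error. */
static long futex_wake(uint32_t *uaddr, int nwake)
{
    return syscall(SYS_futex, uaddr, FUTEX_WAKE, nwake, NULL, NULL, 0);
}

static long wake_demo(void)
{
    static uint32_t futex_word;

    /* Nobody is waiting on futex_word here, so this returns 0
     * immediately -- the signaler never blocks either way. */
    return futex_wake(&futex_word, 1);
}
```

This gives you exactly the "firing event" semantics discussed above and nothing more: no predicate consistency, no ordering guarantees, which is why the waiter must re-check its condition in a loop around FUTEX_WAIT.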
Yeah, the problem is the condition variable internals are really complicated, and they need a lock to mess around with their internal data structures. So topic number four I had here is around managing interrupt thread priorities and the priority inversions that are possible with that. I'm going to start with another big disclaimer: this is work we did a couple of years ago; we didn't know better. So for the first two patches, I'm going to ask your opinion on whether it makes sense to send them upstream, for that reason. But it's still a good discussion to have; we're looking for best-practice ideas here. The first one comes from the fact that it's hard to associate an interrupt with its corresponding thread PID. There's already a file in /proc/irq that lets you set the CPU affinity for an interrupt, and what we did was add another file that lets you set its priority. So I'm curious if you know better ways to do this, or what we should do about this patch, I guess. Yes, that won't work. I mean, what we could do is expose the thread ID in the proc interface. Yeah, that would work for us. So a thread ID per interrupt action. Just the security people will probably complain. Exporting the thread ID would work. I would love to have workqueues use a scheme like that, but instead they created an alternative universe for setting their things. Yeah, we'll see. Yeah, that will work for us, if that's acceptable. The second patch question I had relates to interrupts that get created after boot. For example, some Ethernet drivers don't create their interrupt threads until you plug in the cable. So we have this internal patch that allows you to poll on /proc/interrupts, and what that enables is that you can use, I think Red Hat has a daemon called rtctl or something like that, something that can watch for new interrupts and associate a priority with the interrupt thread. Is this a sane thing to do, enabling poll on /proc/interrupts?
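Absent such a patch, a userspace daemon can do the chrt-style equivalent once it knows the irq thread's PID (for example by matching the `irq/<N>-<name>` naming convention in /proc/&lt;pid&gt;/comm); this sketch only shows the final scheduling step and is an assumption about how such a tool would work, not our actual patch:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/types.h>

/* chrt-style priority setting for an irq thread whose PID we already
 * found.  Needs CAP_SYS_NICE; returns 0 on success, -1 with errno set
 * otherwise (EPERM without the capability, ESRCH if the thread is
 * gone -- e.g. the cable was unplugged again). */
static int set_irq_thread_priority(pid_t pid, int prio)
{
    struct sched_param sp = { .sched_priority = prio };

    /* pid 0 means the calling thread, handy for self-tests. */
    return sched_setscheduler(pid, SCHED_FIFO, &sp);
}
```

A watcher daemon would loop: detect a new interrupt (today by rescanning /proc/interrupts, or via the proposed poll support), resolve its thread PID, then call something like the above; the race window between thread creation and priority assignment is the part the in-kernel /proc/irq file avoids.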
Are there other ways to do this? Yeah. Yeah. Okay. All right. The only hassle is we have no IRQ information in sysfs as far as I know, so you would have to add all that IRQ stuff to sysfs as well. Don't know. And now for the more interesting problem of the priority inversions. This is an example we encountered: we have some watchdog functionality that's implemented in a CPLD chip that's connected to an I2C bus. For some complex cases, it can be configured to do more than just reset; it can fire an interrupt. The reason for that is to do things like saving the IO state and stuff like that. So what happens is the watchdog interrupt is configured as high priority. It fires, and all is fine until it tries to acknowledge the interrupt, which needs I2C transfers: it usually needs to read the status register and write some register back to acknowledge the interrupt. And that I2C interrupt is lower priority, so some mid-priority interrupt or some other thread can create the priority inversion. We have a solution to this: don't use I2C for things which are important. That's true. Yes, I2C is one of the slow ones; SPI is not much better. So I was curious if there's any ongoing work around preventing priority inversions here, or more generally a solution to the priority inversion problem with completion objects. There's no general solution with completion objects. One interesting thing that's happening right now, at least on the instrumentation side, and that might be something we could hook into, is that we got lockdep support for crossrelease. That covers completions, because you can create deadlocks with completions as well. Lockdep didn't know about that before; it does now. And people are complaining because, of course, it triggers false positives.
But we've been there with lockdep before, and we'll teach it that false positives exist. It might be something we want to look into, at least for instrumentation purposes. That is, to make it easier to decode scenarios where something waits for a completion, and then you can see who the other guy is who is going to complete it. Is that going to create a priority inversion problem over time? It might, it might not. I mean, if you're waiting for this guy while you're in your high-priority thread, it's your problem anyway; you can't do anything about that. But there are other things where you wait on completions. A related problem is CPU frequency for iowaits. You'd like to boost the frequency when you're stuck on iowait, to more quickly submit new requests and all that stuff. And then, I think at Plumbers this year, the typical pipeline workloads, the DAG workloads, came up: they have a similar problem, where every new thread waits on the work of the other one, and how do you do PI across that, or across the completion stuff? That is something people are exploring and looking at, but there are no real solutions. Yeah. That's all I had for troubles. Since I have 10 more minutes, I'll go into the extra tips I had at the end of my presentation. Check your kernel configs after a kernel upgrade. We hit this problem on ARM: for security reasons, this option got added, defaulting to yes. It increases, I guess, security; I'm not sure exactly what it does, but basically it adds code to all the uaccess macros to save and restore the domain registers, and as you can imagine, it creates a huge problem. You see this orange trace. What I plotted here is how long clock_gettime takes. The blue trace is before the change, the orange trace is after. You can see the whole thing shifted, and it's a 10% hit if all you're doing is clock_gettime. So we disabled that option, and we live with the risk of, I guess, less security.
What it does is related to CPU domains. Honestly, I'm not sure; I know it's implemented in hardware on ARMv8, but on the older ARMs it's done in software. Yeah. So it's something to keep an eye on if you're running on ARM chips that are v7 or lower. Check your clock sources on your hardware and kernel upgrades. We've hit multiple issues. For example, on one piece of hardware the TSC clock source got disabled: the only other clock source was the ACPI PM timer, which I believe is only 16-bit, and it rolled over, so the kernel thought the TSC was bad and disabled it. So make sure you check that you're using the correct clock source; pretty much the TSC is the only one that's RT friendly. We also had an interesting boot hang caused by some test code that got left in the BIOS, which left the TSC_ADJUST register set on the first core. This was your BIOS team doing that? Have you educated these people, please? Yes, they fixed it. Luckily we have a BIOS team that's in-house; other people are not that lucky. Yeah. So make sure you're using the correct clock source, and compare timer expirations against an external reference. A lot of times I find it useful to use just a plain oscilloscope, if you have a fast GPIO or something, to make sure timers are expiring at the time they say they're expiring. I also find it useful to drive ftrace from a clock source that I trust. That helps because a lot of times, after an upgrade, we're not sure if the tracing code changed and things just look slower, or if things are actually slower. So it helps to have an external reference for your clock source. And sorry. Yeah. So it looked like things were taking longer in a trace, but we weren't sure if it was the tracing code that changed, right. Trace points. There's a trace benchmark you can turn on that shows you the speed of the tracepoints. Okay. So when you're upgrading your kernels, if anything changes with the tracepoints, look at that; it's one of the options in the kernel config.
That at least will tell you if tracing slowed down. Awesome, yeah, that's good to know. And tip number three: run reboot tests. It's incredible how many issues we've discovered by just calling reboot: once your software stack is up, just call reboot. It has hit multiple issues. The most recent one was an oops on shutdown that Peter actually fixed in record time; there was a race between exit and futex unlock, I think. We hit ext4 data corruptions. NAND read disturbs, this is an interesting one: we had a boot partition that was fairly small, so we didn't have a lot of free blocks, and if you read it often enough, you eventually hit read disturb and your system stops booting. All kinds of things. So running reboot tests is very helpful; even a simple test is useful. We also do hard reboots: we expect data loss there, but data corruption is not okay. And our hardware team also runs this in a temperature-controlled chamber, which brings out other gremlins. So that's the end of the material I had. Thank you very much.