All right, let's get going. Thanks, everyone, for coming to my presentation on real-time tuning. I'm Gratian Crisan. To give you a bit of context about me: I work for NI here in Austin. We make hardware and software for test and measurement systems. I'm part of the real-time OS group, where we use PREEMPT_RT-based kernels and build our distro on OpenEmbedded and Yocto. We mainly work with 32-bit Arm systems and Intel systems, mostly Atom-class, some Xeons up to 20 cores max, so pretty small systems. I also happen to be the maintainer for the Linux kernel shipping on RT hardware at NI. In this presentation I'm going to cover, really quickly, just a slide or two on what real-time is, mention the tools you can use to work with real-time systems, and hopefully spend most of my time talking about some of the tuning knobs you can use to get better RT performance out of your system, some of the safety nets you might need to remove for the same reasons, and some of the gotchas we encountered along the way and things to avoid. I'm not going to have time to go into any implementation details about PREEMPT_RT or anything like that, or do an in-depth review of tools; there are a lot of good presentations out there on the tools for working with RT. At a very high level, all real-time is really about is having a deterministic response to a stimulus: if an event happens, we want the system as a whole to act in a bounded and predictable amount of time.
We call that latency. Events can be asynchronous, like a sensor exceeding a threshold, something happening in the real world, or they can be synchronous, clock-driven events, periodic tasks. But the main thing about it is this concept of latency and responding in a bounded and predictable amount of time, of having a maximum latency. Real-time systems today can get really complex. That's actually one of the reasons PREEMPT_RT is good at this: you can use the full Linux ecosystem to put really complex sensors, like cameras and ML, in your control loop and things like that. But the same idea of having a bounded latency in response to something happening holds. It can be catastrophic in some cases to miss your latency; that machine might run over somebody. So if it's all about measuring latency, what kind of tools do you use for that? Like I mentioned before, these systems can get really complex: you can have cameras and complex actuators, a bunch of I/O, a bunch of CPUs. Where do you start? My recommendation is to take a layered approach, and the first thing we do, and we do this on every kernel upgrade, is run cyclictest. At its core it's a pretty simple test: it spawns a bunch of threads, and each does a measurement loop where you take a timestamp, sleep for a period of time, take another timestamp, and from that you can compute the wake-up latency of that thread. And you can do more interesting statistics, like histograms and so on. It's also important to simulate the load you expect, or maybe even stress the system beyond the load you expect. You can use tools like iperf to saturate your network link, or fio to stress your disks. hackbench is a good tool to just stress the scheduler with background tasks. There's a lot more like this. I also like to plot the histogram that cyclictest produces; it's a text histogram.
I like to plot it; after a while you start to see patterns if you plot the histogram you get from cyclictest. It's a useful thing to plot, and it's just a simple Python script. One thing that cyclictest doesn't measure: cyclictest measures a lot of things, it can catch preemption being disabled, interrupts being disabled, and so on, but it's usually driven by a local CPU timer, so it's not going to include your I/O latency. So it's a good idea to write a set of tests that include your I/O latency in the measurement loop. You can loop back your I/O, your analog out to your analog in, and do the same kind of measurement loop where you read the inputs, do some processing, update the outputs, and see how long that takes, or measure how many of these loops you can do per second. We have a set of tests like these; they're really useful to find regressions when we do kernel upgrades and to check that our drivers still work and have good RT performance. I'll come back to this graph in the gotcha section. But ultimately you have to test the whole-system latency, and you have to do it for multiple days, weeks, months, and stress the system beyond its expected operating point, to have any kind of certainty that a complex system like this is going to behave correctly in all situations. There are a bunch of other tools. cyclictest and hackbench are part of rt-tests, and there's a lot more in there. rteval is another combination of cyclictest, a load, and some scripting. The Linux Test Project has a section for real-time. For RTLA, I'm going to plug Daniel's talk on Thursday; it takes a white-box approach to detecting where your latencies are coming from, using tracing for that. It's really interesting, and I encourage you to see it. So then, once you know you have some latencies in your system, how do you go about debugging them? I'm just going to mention the tools here.
The top three are my favorite tools: ftrace, trace-cmd, and KernelShark. I use them all the time. It's really convenient to interact with the tracing filesystem directly; for example, you can start the trace from your application just by writing to files. I use trace-cmd to extract the trace and do the post-filtering, and KernelShark just to see what's going on. That's my bread-and-butter tooling. There are other tools that are useful; I mentioned LTTng there, and there are a bunch of other tracing frameworks that I have no personal experience with, but you can find talks on them. perf is really useful for debugging the hot paths in your application and figuring out where the CPU is spending its time. It also has a really useful mode where you can diff results from previous versions, so you can figure out where your regressions are coming from. There are new tools now based on BPF, like bpftrace and the BPF Compiler Collection; they're really useful if you want to do custom instrumentation, custom tracing. And don't discount, if you are close to your hardware team and have access to a scope and some digital lines, that sometimes it can be very useful to have an external reference you can check against to figure out how long something took. Like I said, Steven, among other people, did a bunch of presentations on these tools; you can find the recordings from previous ELCs. So now, how do you go about tuning a system to get the best performance out of it?
The first step, of course, since I'm talking about PREEMPT_RT, is getting a PREEMPT_RT kernel installed. Most of the distros these days provide one. If you want to build one from source you can, for now, use the stable RT or the RT devel branches. I put "for now" in brackets there because the PREEMPT_RT patch set is on its way upstream, and maybe by the end of the year you won't need to use a separate branch. There's also a birds-of-a-feather session I want to plug that Steven is going to do on Friday, and I think there are going to be a lot more details there on the status of the PREEMPT_RT patch and where things stand as of now. One thing I wanted to mention on this slide: if this is all you do, if you just install a PREEMPT_RT kernel and expect better performance, you're going to be really disappointed. If you don't do any of the tuning, it's actually probably going to behave worse. Real-time systems have to be designed from the ground up to have real-time performance. So basically what you need to do is identify the RT workloads in your application, in your system, and then decide on a scheduling policy and priorities, or, if you're using SCHED_DEADLINE, things like runtime, deadline, and period. On the right-hand side of that slide I listed the various scheduling policies that are available and what preempts what: SCHED_DEADLINE will preempt SCHED_FIFO, and so on. My recommendation is to use SCHED_FIFO or SCHED_DEADLINE for your critical loads. For SCHED_FIFO you need to pick a priority; the range goes from 1 to 99. Don't use 99: it's used by the kernel migration threads and can get you in trouble if you use it. Pick something from 1 to 98. Or, if you're using SCHED_DEADLINE,
yeah, you need to provide those parameters for each of your threads. Don't forget to adjust the priorities of your interrupt threads. The ones that are part of your RT loop you're probably going to want at higher priority, and the rest you're going to want at lower priority; the same goes for kernel threads and so on. And for everything that runs in the background, use SCHED_OTHER or a lower priority, so it won't interfere with your RT loads. The other way you can isolate your RT workloads from everything else happening on the system is by partitioning your CPUs. It is a good idea to designate some CPUs as housekeeping CPUs, and that's where you're going to want to move all your background workloads and things like the kernel workqueue threads. The same goes for interrupts: most of them you can affinitize to a certain core, and it's a good idea to do that, depending on whether they're part of your real-time loop or not. And for really sensitive workloads, you might want to isolate the CPU even from the scheduler tick. For that you can enable full dynamic ticks, if you set the CONFIG_NO_HZ_FULL option in your kernel, and then by setting the isolcpus and nohz_full kernel parameters at boot you can isolate that CPU from the kernel scheduler. Then you can explicitly assign a thread to it, and that allows you to run a real-time thread at basically 100% CPU utilization without interference from anything else, and you can get some really good latency numbers by doing that. I forgot to mention this on a previous slide, but you can set the scheduling policy and priority using tools like chrt, or the sched_setscheduler family of syscalls. For partitioning CPUs, cpusets are a good tool. The cgroup version 1 interface actually works better for RT, because you have thread-level granularity compared to cgroup version 2. Yes, Steven? Yes, that's the next slide.
Thank you, perfect segue. Yes, so the other thing you need to do is delegate your RCU callbacks to threads, which you can then move to your housekeeping CPUs, by specifying the rcu_nocbs CPU list at boot. That's how you do it. Other things you need to worry about: don't do memory allocations from your real-time context; do all your allocations up front if possible. Allocations can cause page faults, and obviously that's a bad thing for your latency. Also consider resolving symbols at startup: if you have shared libraries, you want to resolve all the symbols at startup so you don't get dynamic-linker behavior later that hits your latency. And after you do all that, it's a good idea to lock all the pages of your real-time process in memory using mlockall; you can specify the current and future flags. That will basically keep your code and data pages from being paged out and introducing big latencies. Deferring the vmstat timer is a good idea if you're using the CPU isolation feature and you want to eliminate all latency sources. The other thing you need to check is your clock source, and you can check that under sysfs. For Intel-based CPUs you really want to use the TSC; all the other clock sources have big latencies by comparison. And it's also a good idea, if you're doing tracing, to check what trace clock is used, just to make sure your data makes sense. The other big knob that it's a good idea to turn off to improve your RT performance is power management. Going into sleep states or power states can introduce large latencies, so it's a good idea to disable CPU frequency scaling and the cpufreq driver; disable power management. You can do that at boot for P-states and C-states, and you can do things at runtime for C-states. By the way, all the slides are already uploaded.
So you can read them at leisure later. When it comes to firmware and BIOS settings on Intel, same idea: disable P-states and C-states. Hyperthreading: up until relatively recently the recommendation was to disable it, because you can get interference from the sibling core. With core scheduling, I believe it's possible now to leave it enabled, and with careful design you can make use of SMT and not pay the penalty. Turbo Boost can introduce variations in latency; it's debatable whether you can leave it on or not. Memory corrections: there's some deep-level memory correction that can add a bunch of latency, and it's a good idea to set it at the lowest functional level possible. It's also a good idea to disable peripherals you are not using, because they're just going to generate interrupts and increase your latency. Legacy hardware can sometimes actually be implemented in the BIOS via SMIs, system management interrupts; I'll get back to those in the gotcha section. And there might be other options in your BIOS that you need to tweak and test, to see what kind of impact they have, depending on who your BIOS vendor is. So let's talk a bit about safety nets. Probably the biggest one is RT throttling. That is the feature that reserves a certain configurable percentage of CPU for non-real-time tasks. If your real-time thread goes above a certain percentage, it will get throttled. Obviously this has a huge impact on latency. We actually disable that on our systems by default, and we tell people to rely on just a hardware watchdog to catch runaway RT tasks that might consume all your CPUs. And we try to educate our users from the get-go to design their applications so they allow headroom for other background tasks to run on a CPU. If you're on Intel and you've proven that your TSC clock is stable and you can use it, it's a good idea to disable the clocksource watchdog, because, again, it can interfere with things. The same goes for the soft and hard lockup detectors: after you've validated that your system is stable and runs well, you can disable those. And again, for machine check errors, you can basically ignore the corrected errors, because otherwise there's a kernel thread that runs in the background and adds noise, and especially if you're trying to do full dynamic ticks and CPU isolation, it's going to prevent the system from being able to disable the scheduler tick. Some safety nets for memory: you might consider disabling memory overcommit. That will make it so allocations simply fail, malloc will fail by returning NULL, and if your application is written correctly it can handle that, as opposed to allocating as much memory as you want and then finding out later that you exceeded the physical space available and the OOM killer gets involved. And speaking of the OOM killer, it's also a good idea to prioritize which processes you want killed and which processes you don't want killed. For example, if you echo -17 for whatever is your RT process, that will make it unkillable, and that's a good idea, right? You should also be deciding what to do in an out-of-memory situation, and the recommendation there is to actually make the system reboot, rather than relying on the OOM killer to kill a random process, after which the system will continue running, but you don't know in what degraded state it is. That can create problems; it's often better to reboot and get back to a good state. This one I hesitate to put up, but if your system is completely disconnected from the network and you've put epoxy in your USB slots, you can consider disabling the CPU vulnerability mitigations. They do have a pretty big performance impact on certain workloads, and there's a handy kernel parameter: you can just set mitigations=off and get that performance back.
But like I said, be really careful with mitigations=off. So now for some gotchas, things to avoid. I already mentioned system management interrupts. These are basically hardware interrupts that are handled by firmware, in the BIOS or UEFI on Intel, and they're highest priority and unmaskable. They're used for things like temperature management, legacy hardware emulation, and, in some cases, patching hardware bugs. The trouble with them is that the transition into system management mode via these system management interrupts is completely invisible: Linux doesn't even know about it; the OS is unaware. So you're just going to see a big latency spike, and even if you look at a trace, you're just going to see the timestamps jump, and you won't know what happened, but your system is going to see large latencies. SMIs and SMM are x86-specific, but there are similar privileged modes on other architectures, like the secure monitor mode on Arm. There's a good wiki page by the Real-Time Linux collaborative project that talks about this and shows you what tools you can use to try to detect these SMIs happening. And there's not much you can do. I mean, if you can figure out what situation triggered the SMI, you might be able to avoid it; worst case, you might need to select a different piece of hardware that doesn't suffer from a lot of these. Speaking of interrupts, this one is on the kernel side, and this is the graph I showed at the beginning, showing the degradation in that analog-in/analog-out loop test. When we upgraded from the 4.14 RT kernel to 5.10 RT, we discovered the drop in performance you see there, and we couldn't figure out at first what had happened. But basically, when the PREEMPT_RT patch was rewritten for upstream,
I believe, if I'm explaining this correctly, only a single softirq can now run per core, and there is the bottom-half lock: basically, both softirqs and force-threaded interrupts take this bottom-half lock, and this will cause latencies in your force-threaded interrupts if you don't request your interrupts specifically with request_threaded_irq, which doesn't take that bottom-half lock. So after we switched our driver to use request_threaded_irq directly, as opposed to just relying on PREEMPT_RT force-threading the interrupt handler, we gained back the performance. In fact, 5.10 is much better than 4.14 if you look at that graph: the latency spread, the jitter in the latency, is much smaller, and the max performance is about the same. Another interesting one is MMIO CPU stalls. This is a great cyclictest histogram that I took; it's a couple of years old now. In the background I was just accessing a TPM chip, and I was seeing this huge 400-microsecond added latency tail on the histogram. The reason for that: this happened to be an Apollo Lake system, and if you look at the CPU diagram, the system diagram for it, you can see the core where your code is executing up there on the left, and the TPM chip is all the way on the right, and there's this long path that goes through all these I/O fabric blocks.
They are all different bus widths and different frequencies, which means there's a bunch of buffering happening along the way as you transition between all these I/O fabrics, and at the end you go over a fast SPI bus to access that TPM chip. What happens is there's this common pattern where a bunch of register writes happen in a row to this TPM chip to configure it, and then there's a status register read. If the read comes right after all those writes, the writes will get buffered along the way in the fabric, but the read, because of the architectural ordering guarantees, has to wait for all those writes to actually propagate all the way to the hardware, for the writes to take effect, before you get your status register value back. The net result is that your CPU stalls, waiting for all those writes to go through all those buses, all the way to your TPM chip. And this can happen with a lot of peripherals. We've also seen it with an Ethernet PHY, and we discovered that by accident, by bumping an Ethernet cable while we were running cyclictest, and we saw the latency spike. It's kind of the same pattern: the PHY chip is trying to do the link negotiation for Ethernet, so there's a bunch of writes to that chip and then a read that stalls everything. Obviously this is bad, because you don't want your RT application to fail just because you bumped an Ethernet cable that's not even part of your RT loop; you might not even use network communication in the application, it's just background load. I don't have a good solution for this other than test for it and try to avoid it. We have a couple of ugly patches, really not upstreamable, where, in a couple of drivers where we saw a bunch of these writes and then a read happening, we basically added a delay to give those writes time to propagate to the device and not stall the CPU. Another big one, after you've sorted out all the tuning knobs and all that, and probably the most common one I see when I debug RT applications, is priority inversion. Just to explain this: say you have three threads, a high-priority thread H, a medium-priority thread M, and a low-priority thread L. Let's say the low-priority thread starts running first, and it acquires a lock, or some other mutually exclusive resource, and is doing its thing. Then the high-priority thread comes along and preempts the low-priority thread. That's all fine; that's what we expect if we're using the FIFO scheduler. But at some point H wants the same mutually exclusive resource, the lock, so it blocks, because L is holding that lock. The problem is that in the meantime this medium-priority thread, completely unrelated to any of this, can become runnable, and because it is the highest-priority runnable thread on the system after the H thread blocks, it will get scheduled in, and it can run as long as it wants. This creates unbounded latency in your high-priority thread, which cannot grab the lock because L is holding it, while L cannot run because the M thread is preempting it. So this is really bad, and it can lead to high latencies.
This is kind of how it looks in So in Colonel shark, especially if you write the test to the specific area produce this But you can see the low The order is inverted by the low priority thread on top there medium high And you can see how the high thread gets blocked and then your medium thread can run forever And only at the end of that the lock the low priority track and get scheduled in and release the lock and The high priority thread gets to run the the solution implementing preempt RT for this is called priority inheritance and The gist of it is basically at the point where age blocks on the lock in order to prevent this priority inversion the priority of the L thread the low priority thread gets boosted to the priority of the high thread and That makes it where then this Interfering medium priority threat cannot come in and and scrub your day So low priority thread gets to run a high priority for just the right amount of time to release that lock And then the high priority track can continue and this is how it looks in colonel shark You see there like the high priority thread running blocks on the lock the little priority track gets boosted runs for a bit releases the log and And so on and you can see it in the trace events too if you can't read it So it is a common problem and it's the it's Unfortunately, it's common because there's there's a lack of priority inheritance support for locking primitives with the leap it read Glee C library the only Primitive that that has priority inheritance is the p-tread mutex and you have to use attributes to Enable it all the other priority all the other primitives in leap p-tread don't have priority inheritance And even worse we discovered a couple weeks ago, and we're trying to do a quarterly release the p-tread reader writer lock will actually live lock in user space remember we disable RT throttling so The p-tread reader writer lock got rewritten to be more performant So it does a lot more in user space trying to 
acquire the lock But that what it ends up happening you end up with this like high priority RT Threads spinning in user space trying to compete for that lock and it lives live locks your system So that's really bad. We don't have a solution for it I think we have a reproducing case now that we can send upstream and we're we're gonna work on See if we can do anything about it, and there's also no way to set priority inheritance on the standard mutex There's no way to set the attribute on it. So I Talked about a partial solution a couple years ago at ELC we Darren Hart and me and the RT folks worked on Implementing a conditional variable that has priority inheritance support and there's also PI mutex I'm taking suggestions if you know about our libraries that work well with RT that implement POSIX locks Yeah, it's not great And to make matters worse you can actually have Priority inversions with interrupts So this is a real case we encountered we had the watchdog functionality that was implemented in a CPLD That part makes sense. It's the CPLD is doing power sequencing on that board anyway So why not do a watchdog in it? 
The problem is, it's on an I2C bus. Which is fine for configuring it, and if you just want the watchdog to reset your system, that all works great. But there's a use case where we wanted it to fire an interrupt, and the reason you might want that on a real-time system is that you want to put the I/O in a safe state: you might have huge machinery attached to your controller, and you don't want that thing doing crazy things while your controller is rebooting. So you fire the interrupt, and there's a sequence that puts the I/O in a safe state. The problem is that even though the watchdog interrupt was configured to have high priority, it requires an I2C transfer to acknowledge the interrupt, and the I2C interrupt at that time was low priority in our system. So some unrelated medium-priority interrupt can basically ruin your day. You need to watch out for this, and audit your peripherals: how they're connected, what buses they use, and what interrupt priorities you've set. And since I promised tricks in the presentation title, this is a trick that I use a lot. I talked about priority inversions; how do you go about finding locks in your application that don't have priority inheritance? A really convenient way to do this is to patch the futex syscall. There's a piece of code at the beginning that checks for all the priority-inheritance futex operations, and you can add a default case in there that just sends a segmentation fault to your process if an RT thread called into the futex syscall without a PI futex operation. This works great: if you run your application under gdb, you can see the exact spot the call was made from, get a nice stack trace, and see the pthread primitive that caused it. In this case it was a barrier; it used FUTEX_WAIT, and it was from an RT thread.
So it's going to suffer from priority inversions. So, in summary: I hopefully gave you at least a starting point on which real-time tools to use and what they're useful for, some of the tuning knobs, some of the safety nets you can remove, and hopefully some of the things to avoid, so you won't get trapped by them. That's all the content I have. Are there any questions I can help answer? There are two from the virtual session. All right, the first one is: "Bear with me, I am not in this field. I am interested to know if there's any build system, Buildroot, Yocto, etc., that integrates or makes it easy to apply most of these settings for you, as a template, instead of having to go through the checklist for any new system." I am not sure. I know Red Hat has real-time offerings, and so do other distro vendors; I don't remember seeing anything Yocto-related for this. I mean, it's a good idea in general, but I'm not aware of something that exists out of the box, unless, like I said, you contract with Red Hat or use one of their offerings. That's true. Sorry, I don't know your name. You were saying that people can check the RPM spec package Red Hat is using, and that can be a good starting point. And I should mention all our stuff is on GitHub; if you want to look at our stuff, you can take a look there, it's github.com/ni. There's one more virtual question: "Does musl have complete support for PI condvars and mutexes?" Yeah, I think it's musl; I remember those threads from the musl mailing list. We don't use musl, so I'm not sure I can answer the question, but I know they were cognizant of RT applications and were trying to make it work for RT. I have not used it, so I'm not sure what the status is. And we have five minutes; if anyone here has a question, I'm happy to pass on the mic. "Yeah, firstly, great presentation. I had a question:
Are there any multimedia pipelines that have been tried with real-time threads, drivers, and so on?" Basically, just from hearsay, from the RT users mailing list and the IRC channel: there are definitely people using it for audio applications and big stage sound setups, so at least for sound I believe that's true. I believe that's true for video too, because it's used in cameras and other things. Like I said, we're more in the test and measurement space, so I'm not sure how common that is in consumer products, but I believe the answer is yes. "The question is, what applications do you normally use this for? It really depends on the product, a lot of trade-offs. Probably the reason Yocto doesn't have this is that once you enable RT you suffer, for example with memory overcommit, compared to normal. So what applications do you normally use?" Yeah, so if I understood the question correctly: you're correct, the reason you cannot have just a standard set of tunings is that it's specific to your application. For us, a lot of our applications are things like hardware-in-the-loop testing. For example, if you have an electric vehicle inverter, and you're trying to simulate the car around it, because you don't want to test inside the car, right, you're just testing it on a production line. So you're simulating everything else the car is doing, and you're hooking up this inverter. The reason real-time is important there is that you have to feed it the information in real time, as otherwise it can blow up, in the most extreme cases, if you feed it the wrong control signals. So having a real-time response to that is important. Another application I know of from my company: emulating a 5G base station, for example. 5G has pretty tight time slots.
I believe they're like 250 microseconds or something, so if you're trying to emulate the protocol and send all the data, you have to do it within those time intervals. So it's important to have that real-time behavior. There's a lot more like that: production test and things like that. All right, I think we're out of time, and this is the last session of the day. And I think the next thing, in probably a little bit of time, like one minute: if you go back to the showcase where all the sponsors are, there is a booth crawl, if anyone wants to participate. Thank you so much.