This session is scheduled to be long enough that we can make it into a discussion, so please feel free, especially if I say something you think is wrong, which is entirely likely at some point during the talk, to interject and let's talk about it. So I'm going to talk, obviously, about interrupt handling in Linux, which is a gigantic topic; it just seemed to get bigger and bigger the more I looked into it. This talk was inspired originally by the fact that for my day job I work on real-time Linux in trucks, giant road trucks, for increased automation of controls. Trucks are obviously safety-critical systems, and we were having a problem where, if I ping-flooded our device, the control system stopped working, which for a real-time control system with safety criticality is clearly not acceptable. So that problem had to be solved. I realized, as I think all real-time developers do, that just trying a whole bunch of stuff and hoping it was going to work was not a strategy that would result in success in any finite amount of time, and that some real understanding was necessary. That experience inspired this presentation and the investigation that led up to it, about really coming to understand how interrupt handling behaves in real-time Linux. To discuss that, I'm going to talk a little bit about real-time Linux and compare it to the mainline kernel. I'm going to talk a lot about this i.MX6 Boundary Devices board that I have. It seems like what's online right now about interrupt handling is mostly about x86, so I'll correct that by talking mostly about ARM. I've also done some work with x86-64 because, thanks to some presentations by Brendan Gregg of Netflix and Brenden Blanco of PLUMgrid, I've gotten really fascinated with eBPF. How many people have heard of eBPF? So eBPF is really a superior way to trace problems in Linux. I can't use it on my 32-bit ARM board yet; I'm really looking forward to doing so, and I'd love to talk to anybody who's working on that. I want to present just a little bit about x86 because everybody who's interested in tracing on Linux should really try to start using it. So enough prologue, let's get on with the show now that everybody's here. Just to point out some upcoming talks that you may wish to attend if you're interested in this topic: there are two, and of course, for your convenience, they're at the same time. One will be in this room by Joel Fernandes, now of Google. Joel and I had a long discussion about each other's talks before the meeting, and I really appreciate the advice and feedback he gave me; it made this talk a lot better. I don't know anything about the talk by the gentleman from ARM, but that also sounds very intriguing. For those of you who came in after I got started, these slides are posted at my website, shedevelop.com. The slides I'll be showing today are significantly different from the ones on the Linux Foundation website; this version, of course, contains not fewer mistakes, but simply different ones. But let's start talking about interrupt handling instead of talk preparation. Let's start off by talking a little bit about what interrupts are and why we have them, just to think about the topic from a very high level. I'll go on to talk about the different types of hard interrupt handlers we have, and about the troublesome softirqs and tasklets, which are a big headache for real-time Linux.
I'll discuss the differences in interrupt handling between the mainline kernel and the RT kernel, which I would hazard to guess is, outside of the scheduler, the biggest difference still remaining in the RT patch set. I'll talk about how I tried to understand the problems we were having with different tracing tools, and where I found success and less success. And I'll talk about the example that really motivated this whole discussion, which is understanding NAPI, the "new API" for networking, which I think is now 10 years old, and when NAPI polling in Ethernet takes over from the handling of packets that is initiated by Ethernet interrupts. Just to motivate all this discussion of testing and tracing, I have this quote that I found yesterday when I visited the Bauhaus Archive. I've been studying German for a year, so I'll give it a shot: "Die Kunst lebt, aber sie muss wieder in der Werkstatt aufgehen." Which means, stop talking about the code all the time and let's run some tests, in German. So here are some of the particular topics I want to discuss. What is all the information in /proc/interrupts about, and how can we interpret it to understand what the hardware in our system is doing? What are inter-processor interrupts and non-maskable interrupts, and why do we care? Why are atomic operations more expensive on ARM than on x86? What are the differences in IRQ handling? What function is actually running inside the IRQ threads that real-time Linux has? And the demo part of the talk is mostly about the question of when the kernel actually switches from hard IRQ processing for Ethernet over to the polling interface. I think the thing you really learn by doing a lot of testing is that if you just read the code, you may get very concerned about the way packets, say, are handled in a particular section or a particular function, but then when you actually run the tests, you find out that function never executes in your system, ever. So doing testing really helps you focus on where the problems are. So this is a one-slide summary of interrupt handling; if you just want to know what it is, you can probably leave after this. There's a top half, which is the hard IRQ, then some time passes, and then later on we see the bottom half, which is the softirq. So if you understand that, you understand that we have a function that runs right away that services the interrupt controller, and then the deferred work happens some time later. And of course the whole problem is about the deferred work and when we schedule it. So why do we need interrupts at all? I think probably anybody who would come to a talk like this knows the answer: the alternative is polling, which wastes time and introduces large latencies. There's a lot of variety in the way that different real-time systems handle interrupts. Almost any system, no matter what it is, needs a timer interrupt. Really, interrupts are the way that devices inform the kernel that they need some type of attention. There are interrupts in U-Boot. Some of this material I'm going to skip through in the interest of time, but there's information here if anybody wants to go back and read it later on. A very interesting example of interrupt handling is in Xenomai: there's the famous I-pipe, which essentially allows the hypervisor to control the propagation of interrupts in the system and make sure that they get serviced quickly that way. In Linux, we still have quite a variety of interrupts.
We've got, as we mentioned, the fast and the slow handlers, hard and soft; level- and edge-triggered; local versus global, where local means per-CPU and global I think is obvious. We have the system management interrupts versus those raised by devices, the maskable and not, the shared and not, the chained and not, and of course multiple interrupt controllers per SoC. So you can see how this project kind of got away from me; there's an awful lot you could investigate. The simplest interfaces for looking at interrupts are /proc/interrupts and mpstat. So, lest I just drone on endlessly, I'll show you. Is this font large enough here on this left-hand screen? Is it good enough? Say no if it's not. Let us look at /proc/interrupts. So this is now 4.4-rt, and this is my 32-bit Boundary Devices i.MX6 Freescale ARM board, which was opened by airport security for your safety, so you can feel secure knowing that it's okay. I don't know what they thought when they looked at it, but they have looked at it. So this is the RT kernel, but you can't really see that looking at /proc/interrupts; it actually looks pretty much the same as non-RT. Here on the left we have interrupt numbers. We have hit counts showing which of the four cores of the i.MX6 the interrupts are hitting. You can see expressed in this column whether the interrupts are level or edge. We have a GPC, the general power controller, and a GIC, the generic interrupt controller, I think that is. And then we have some interrupts that are handled by GPIOs directly, and you can see a lot of those are the buttons. It should be obvious why the GPC, the power controller, needs to look at interrupts. Anybody want to guess? Just so that the system can wake up, right? The system doesn't have a way to wake up when a button is pressed unless the power controller is watching the interrupt line. Down here at the bottom, where we have this text and no interrupt numbers, these are the IPIs, the inter-processor interrupts. There are just a few for ARM; there are a whole lot of them for x86. A feature of PCI is that it always shares a lot of interrupt lines, and there are some other interrupt lines that are often shared too. And of course, for your convenience, these interrupt numbers are different from the PIDs that Linux uses to manage processes, since here in the real-time kernel almost all of these interrupts are running in threads. Now, the Ethernet that I'm going to spend most of the time talking about actually has two interrupts: one here on GPIO that gets hit all the time, and one that is connected to the generic interrupt controller that, in my months of playing with this board, has never taken a hit. Reading the manuals for the FEC, the Freescale Ethernet controller, the i.MX6, and the ARMv7 processor, I haven't been able to figure out what that second interrupt is. You would think it would be Wake-on-LAN, but then I'd think it should be connected to the GPC. So if anyone knows, I'd be happy to hear it. That's enough looking at that; let's return here to some thinking. Just a quick word about the inter-processor interrupts. This is a switch-case statement from the kernel's IPI dispatch code; basically, it just shows which functions the different IPIs correspond to. Most of these actually have to do with SMP: with CPU hotplug, stopping work, or transferring work between cores when the system is coming up or going down. One particularly interesting one that shows up a lot in the studies I've been doing is this call function interrupt.
That's part of something called Receive Packet Steering, which allows Ethernet packet processing to be moved from one core to another. I'll possibly talk about that if I don't run out of time. So, non-maskable interrupts are something that a lot of the IPIs are. Non-maskable interrupts are, as the name suggests, ones that the system cannot ignore. They very often have to do with errors in the system; they're also used by perf. I have a lot of material in here where I really drilled down and looked at how the different architectures implement these features, but I'm not going to present that; it's just here on the slides. A very famous non-maskable interrupt is the x86 system management interrupt. A lot of people have been worrying about that as a security problem, since it's kind of a backdoor into the system. It's completely invisible to Linux, except that it induces latency, so people who are running real-time systems on x86 are concerned about it. There's been a recent patch by Steven Rostedt, based on some work by Jon Masters, that at least allows the system to detect that an SMI has occurred. Of course, Rostedt, like all right-thinking people, is at the Real-Time Summit in a different room. On ARM, the important, distinct interrupt of this kind is the FIQ, the fast interrupt request. It's fast because the FIQ has its own set of registers which are dedicated solely to its own use. According at least to LWN, the people who know everything, the FIQ debugger created by Android is the only in-kernel user of the fast interrupt request, which is surprising, I think. So it seems, yes? Yeah, small correction: the FIQ is not an NMI. Okay. It has exactly the same priority as a normal interrupt. Okay. It's not that fast anymore; it could actually even be slightly slower than a normal IRQ. And when it's routed through the GIC, you get exactly the same priority masking. There's no way to use it to implement an NMI on ARM; if you want to have something that looks like an NMI, you need to use interrupt priorities. Very good. So, just because this is being recorded, I'll repeat the essence of what the speaker said, which is that, contrary to everything I just said, the fast interrupt is slow and can be masked. The great thing about giving talks like this is that anything I present, somebody here in the audience knows more about it than I do, I'm sure. I'd like to say one more thing about the FIQ, and then I'll shut up. The FIQ is only usable if you run on the secure side of the ARM CPU. That's the case in most ARMv7 implementations. On ARMv8, Linux is not designed to run on the secure side, so it will run, but it won't be able to have a FIQ. It'll be able to give virtual FIQs to a VM, but that's about all you'll be able to do. In the long run, the FIQ is not really something that will survive for Linux. So the upshot of that, I'll say for the recording, is that the FIQ is doomed. All right. There's something called CONFIG_IRQ_DOMAIN_DEBUG that you can turn on; I found it not that informative, so I'll skip it. As I already noted, a lot of the interrupts are actually connected to the general power controller, and my interpretation of that is that it allows devices that generate interrupts to wake the processor. And now let's get on to some of the more interesting part, I think, which is threaded IRQs in the RT kernel.
Most of what I'm about to say also applies to the mainline kernel if you provide the threadirqs command-line argument, but there are some interesting differences still, and I'll try to go through this quickly. So I guess I didn't actually show this: here was /proc/interrupts for the 4.4-rt kernel on the i.MX6. If I now look at which processes the IRQs are running in, you can see there are a lot of them. These are, of course, the same interrupts as in /proc/interrupts, just displayed a different way. We see ksoftirqd, which is a per-core thread that handles some, but not all, of the deferred work that happens in softirqs, and its parent is smpboot because these threads are started by the kernel during initialization. Then we have all the threads which are servicing the devices; those are the IRQ threads. You can see ksoftirqd runs at a relatively low priority, and the default priority of the IRQ threads is a real-time priority. Some of the real-time threads have slightly lower priority; that is because they are secondary interrupt handlers for particular interrupts, and I will discuss more about that in a little bit. So if I were to boot this board with the mainline kernel and choose threadirqs as a command-line parameter, I would get a very similar result to this, but under the covers the handling is different. Why does it keep doing that? Let's try this. No. All right, now I screwed it up. No. Slide pane. Okay. So why would you want to have interrupts in threads in the RT kernel to begin with? The real reason is that once interrupts are in their own threads, we can use tools like CPU affinity and real-time priority management to control the execution of the interrupts. So fundamentally, running the interrupts in threads gives the person who has to make the system work a few more tools to manipulate the system behavior, and also, really, a lot more rope to hang yourself with. Another important point is that now the scheduler is controlling these threads like any other threads in the kernel, and so the interrupts themselves can be interrupted or preempted. And that leads to lots of fun, as you might imagine. So here's an audience quiz: what will we see if we look at the process table on a non-real-time kernel and look for interrupts? Who has a guess? Nothing. Nothing? I'll check it out. The thing I love about this topic is that when I read the mailing list, all the really smart people disagree with each other; that's what made me think it was a good topic for a talk. So, in the spirit of Walter Gropius, let us not argue endlessly; let us just check and see. It should have been clear from what I showed before that on the non-real-time kernel, which here is just the Debian stable kernel of a great age now, but it works, these per-CPU threads, ksoftirqd, are always present; they are there in any modern Linux kernel. What's interesting is that there are a couple of... well, in this particular case, since I did modprobe -r on the Wi-Fi driver just so it wouldn't interfere with this Ethernet connection, that one would otherwise also be showing up here. There are a few interrupts that, even on non-real-time systems, show up in the process table as threads. There's actually a flag that you can set with request_irq() or devm_request_irq() to always get a threaded IRQ. The reason for doing that should be kind of clear: the interrupt handler itself may be slow, the person who's writing the device driver knows that, and so they want to give the kernel the opportunity to schedule the thread.
Apparently the most common users of this facility are people whose interrupts are coming from buses where the response of the bus to the hard interrupt handler may itself be slow. So... I'm sorry? Agreed, yeah. I mean, I2C is another bus that could be slow, where the kernel would like to schedule the work asynchronously and move on. Why precisely KVM is using it, I don't know; that sounds like an interesting topic. Yes? The KVM thing is actually to be able to inject virtual interrupts into a guest. It's not an interrupt by itself; it's a conduit for a virtual interrupt to be injected into a guest. It's mostly irrelevant. This MEI, as I understand it, has to do with reading the ACPI tables and servicing the system. No? This is great. What is the MEI for? The security controller. Okay, very good. So the MEI is the security controller. There's nothing like saying wrong things to learn a lot of stuff. So a really interesting question, which I discussed a lot with Joel when we were both preparing our talks, is what function is running inside the threaded IRQs that appear in the process table when you say ps on a real-time system. The answer, at least from my point of view, is less obvious than you might think. If you look at the function signature of request_threaded_irq(), it takes a bunch of arguments, including the flags I was just referring to, one of which forces the creation of a threaded interrupt even in non-RT. But the two primary function parameters of interest in request_threaded_irq() (the devm_ variant has the same feature) are a handler and a thread function. The handler is the hard interrupt handler that you would naturally expect. The thread function exists because of shared interrupts: the fast handler has to run and see which of the devices needs to be serviced on the particular interrupt line that has fired, whereas the thread function can be specific to the particular device on the shared line. And interestingly, even in the mainline kernel now, request_irq() is really a call to request_threaded_irq() with a NULL thread function. So I'm going to bravely skip to something later on in the slides. This is just a snippet out of the interrupt handler function for the Ethernet on the i.MX6 here. It's a function of type irqreturn_t, which is what all these handlers are; the thread functions are also of that type. So this is the type of function that would run in the first part of threaded IRQs. If you look... well, maybe I should go on to the next page. So we can have two cases for request_threaded_irq(): one where the handler function is defined and the thread function is NULL, and a second one where both the handler function and the thread function are defined. And maybe this is more interesting if I go back to the example on the i.MX6 here. Once again, if you look here at the MMC, there are the two interrupt threads: in the case of the MMC, the first one corresponds to the handler function and the second one to the thread function. So inside the IRQ thread that appears in the process table, if NULL is what was given for the thread function, then what runs inside the thread is the handler function, like the one I just showed you for the i.MX6. And what actually runs in interrupt context, since a thread is clearly not interrupt context as we've known it, is something called irq_default_primary_handler(), which is just a stub that wakes up the thread that runs the handler function. So I hope that's clear.
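To make that two-function case concrete, here is a minimal sketch of the request_threaded_irq() pattern just described. The mydev structure and its helper functions are hypothetical placeholders, not code from the i.MX6 driver or any real driver; it's just the shape of the API.

```c
/* Minimal sketch of the request_threaded_irq() pattern described above.
 * "mydev" and its helpers are hypothetical placeholders, not real driver code.
 */
#include <linux/interrupt.h>

struct mydev {
	int irq;
	/* ... device registers, buffers, etc. ... */
};

/* Primary handler: the quick check-and-acknowledge part. On a shared line it
 * decides whether this device raised the interrupt at all. */
static irqreturn_t mydev_hardirq(int irq, void *dev_id)
{
	struct mydev *dev = dev_id;

	if (!mydev_irq_pending(dev))	/* hypothetical: read device status */
		return IRQ_NONE;	/* shared line, not ours */

	mydev_mask_irq(dev);		/* hypothetical: quiet the device */
	return IRQ_WAKE_THREAD;		/* run mydev_thread_fn() in a thread */
}

/* Threaded handler: process context, may sleep, does the slow work. */
static irqreturn_t mydev_thread_fn(int irq, void *dev_id)
{
	struct mydev *dev = dev_id;

	mydev_process_events(dev);	/* hypothetical: e.g. talk to a slow bus */
	mydev_unmask_irq(dev);		/* hypothetical: re-enable at the device */
	return IRQ_HANDLED;
}

static int mydev_setup_irq(struct mydev *dev)
{
	/* With both functions supplied, the thread function shows up as an
	 * irq/NN-mydev thread in the process table, as described above. */
	return request_threaded_irq(dev->irq, mydev_hardirq, mydev_thread_fn,
				    IRQF_SHARED, "mydev", dev);
}
```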
This default primary handler is something in the RT patch set that I think is really clever; it's an example of how the folks who've crafted the RT patch set have allowed authors of drivers that were not designed for RT to run well on RT systems. We need to have something that runs in interrupt context in order to handle an interrupt, and what that is is this default primary handler that's provided: all it does is wake up the thread and return. Before I run out of water here... I guess the Linux Foundation people have left; I really could use more water, they just brought me this one little bottle. Then the handler runs in the IRQ thread. If a thread function is provided, that is, if request_threaded_irq() is called with both functions, then the handler itself runs as usual in interrupt context and the thread function runs in the thread. In RT this is accomplished by a function called irq_setup_forced_threading(). So there are kind of three cases for the threading of the IRQs, just to summarize this whole topic. In RT, any IRQ that doesn't mandate running in interrupt context by setting this particular flag will run in a thread. Unsurprisingly, timers and perf and other such very low-level features of the kernel can't tolerate being run by the scheduler, so they set this flag. In the mainline kernel, only IRQs like I2C and some other slow buses sometimes request a thread. Then there is a sort of intermediate case, and I don't know how many people would actually use this: does anybody actually use threadirqs on a real system? So it's just kind of a curiosity that you can also force threading of most IRQs with this parameter; it's sort of a weird in-between case. I already talked a little bit about the shared interrupts. On x86-64 you can actually see that ehci_hcd, one of the USB host controllers, actually shares the same line as the MMC. And of course, in that case, if you look at the MMC driver, you can see that both the handler function, sdhci_irq, and the thread function... Joel, you are such an angel, you take such good care of me. Thank you. Cheers. Much better. Okay, so for the registration of the MMC IRQ, in fact, both the handler and the thread function are populated, they then appear as those two different IRQ threads in the process table, and the default primary handler is the only thing running in IRQ context. Maybe I will skip past this, except to say that if you use atomic operations, like say atomic_inc, on ARM, you actually disable local IRQs: this raw_local_irq_save() function actually disables interrupts even in real-time. So in a real-time system, atomic operations are really expensive, and you would want to use them with care, only when you really need them. Yes, you're going to correct me again? So, that's true for rather old ARM cores, up to ARMv5. On ARMv6, v7, v8, and probably whatever comes after that, you don't need to disable interrupts. We have exclusive operations which are used to implement atomic operations, so you can take an interrupt in between; that will clear a flag on the CPU and say your atomic operation has been aborted, and you loop around and repeat the atomic operation that failed. It behaves like a small transaction. So that code is true, but only for something like the ARM926, which is something like 12 years old. Okay, but I mean, this is real kernel code from the current kernel. So you're using different operations, then?
On your i.MX6, you're not using that. What are we using, then? You're using the exclusives. So the exclusive operations that return your counter, for example if you do an atomic_inc, will be implemented as a small loop that does a load-exclusive, the add operation, and a store-exclusive, and the store-exclusive returns whether the operation succeeded or not. Which is basically what x86 is doing. So basically these functions that are in atomic.h are not used, is that what you're saying? For a modern CPU, no, they're not. Very good. Well, as I said, one of the things you learn by testing with all these tools is that you can get very concerned about some particular function, and then if you test, you find it's never called. If only I took my own advice, think of how wise I'd be. Let's go on and talk about softirqs instead, since we've worn out the hard IRQ topic, because softirqs are where the real excitement, in other words system misconfiguration and failure, can occur. The easiest way to see the softirqs is to look at the array they're listed in inside the kernel. There are three flavors. The flavor that is most obvious is the softirqs that are the bottom halves doing the deferred work of the devices, and the devices whose needs are most critical are typically the network and the block devices, so they are listed here in this array. The second kind of softirqs that are important are the timer, the scheduler, and RCU. These are basically the housekeeping done by the kernel, periodic tasks that aren't necessarily tied to a particular device. And then the third kind is the tasklets, which are the answer to the question: my device isn't a network device or a block device, so where does its deferred work get serviced? The answer is that anything that's not a network or a block device and requires deferred handling does so through the tasklet interface (there's a minimal sketch of that interface after this paragraph). A feature of this is that these softirqs are actually serviced in this order, and the system does have a budget that it allocates to servicing softirqs, so the high-priority tasklets really do run before RCU, which is last in line. So tasklets service devices: any other device that's not a network or block device gets serviced via the tasklets, and any driver writer is free to create a new tasklet. The number of softirqs, on the other hand, has been frozen by the prophets of yore, and none can be added by us ordinary mortals nowadays. You can see, because of the ordering of this array, that the high-priority tasklets get serviced first, and those tend to be drivers like sound, for example, that have more time criticality in their need for deferred work than some of the other devices. USB, DMA, and TTY drivers all use the tasklet interface. And you can use mpstat to... this is probably completely illegible. This is just output from a collection of kernels showing which softirqs have been run; mpstat is just a quick way to get an overview of which softirqs matter most on your system. The tasklets are, of course, all lumped together. So I found it initially very confusing to understand how the softirqs get processed on a real system, and I spent a lot of time reading the code and then tracing in order to figure it all out. There are two paths, and this is kind of a general statement, by which softirqs actually get to run in the kernel: in one case, they run right after the hard IRQ handler returns, and in the other case, they run in ksoftirqd, which I showed you before.
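Here is the promised sketch of the classic tasklet interface, as it looks in the 4.4-era kernels discussed in this talk (the API changed in 5.10). The mydev names are hypothetical; the point is only the top-half/bottom-half split that tasklets give a driver that isn't a network or block device.

```c
/* Minimal tasklet sketch using the classic (pre-5.10) API of the kernels
 * discussed here. "mydev" and its helpers are hypothetical placeholders.
 */
#include <linux/interrupt.h>

struct mydev {
	/* ... device state: registers, FIFO, locks ... */
};

static struct mydev mydev_instance;

/* Bottom half: runs in softirq context (TASKLET_SOFTIRQ, or HI_SOFTIRQ if
 * scheduled with tasklet_hi_schedule()), so it must not sleep. */
static void mydev_do_deferred_work(unsigned long data)
{
	struct mydev *dev = (struct mydev *)data;

	mydev_drain_fifo(dev);		/* hypothetical: the slow-ish work */
}

static DECLARE_TASKLET(mydev_tasklet, mydev_do_deferred_work,
		       (unsigned long)&mydev_instance);

/* Top half: acknowledge the device quickly and defer the rest. */
static irqreturn_t mydev_hardirq(int irq, void *dev_id)
{
	mydev_ack_irq(dev_id);		/* hypothetical: quiet the hardware */
	tasklet_schedule(&mydev_tasklet);
	return IRQ_HANDLED;
}
```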
So, the way a softirq runs after a hard IRQ is that the hard IRQ handler raises, or schedules, the softirq before it exits. Hard IRQ handlers, when they exit, call local_bh_enable(), and when the local_bh_enable() call finishes, if it's not preempted and so forth, it will actually start, in RT, do_current_softirqs(), or in non-RT, do_softirq(). If for some reason the softirq doesn't get serviced in this path, because of preemption, for example, or because this particular driver has a softirq further down the list that doesn't manage to run during the time slice that's allocated, then it will run in ksoftirqd later on, at the lower priority. Of course, the comments that always apply in RT about priority inheritance apply here too: if a bottom half holds a lock that a high-priority process wants, its priority should get elevated. RCU, for example, doesn't really have a hard IRQ handler; it actually always runs in ksoftirqd. And ksoftirqd can also exhaust its time slice, in which case it will reschedule the work again. So one of the tests that I thought was amusing was to show this here on the i.MX6, and I've actually done it with a kprobe. This is a little module that I made with maybe five lines of modification to the code in samples/kprobes in the kernel source tree. I point that out because anyone can use this facility; it's really easy. You have a function that you want to investigate, and you're not limited to the existing tracepoints with kprobes, which is great. I'm sure that if I were really a good kernel citizen and I found that the function I was interested in wasn't represented in a tracepoint, I would add a tracepoint. But of course, my boss wants to hear that I've gotten things working, not that I added a tracepoint. So I've given the module a parameter called chatty, and if I say modprobe on my kp do_current_softirqs module (we go into this field because we enjoy watching people type), the kernel says that it has planted my kprobe, and it says that do_current_softirqs() has been called in the context of the task irq/43, which is the Ethernet. So if I go over here, I can be a horrible, bad person and ping flood my board; that's something I've been doing a lot of here. If we come back over here, and let's keep my mouse on the terminal window, you can see that a whole bunch of instances of this are running. And now here's one where do_current_softirqs() has actually been invoked in the ksoftirqd path, and as the system gets busier and busier, that happens more and more frequently. If I come over here and do the same thing again, that should generate... basically, if you really start to load the system, you see that ksoftirqd appears more and more in this list. So I love this, and I should say: where is this task comm coming from? Basically, I've used the current macro in the kernel, which looks at the thread_info that's reachable from the processor registers, and had the probe print out the current task's comm. So this is a really simple way to get a very lightweight stack trace, essentially. Oops, if I could type that a bit better... This is a very simple way to get information about... it doesn't like it when you unload the module. Surprise. This is a relatively non-intrusive way to figure out what has called the function you're interested in. And kprobes really are simple; you do have to recompile them when you change them (the sketch after this paragraph shows how little there is to one).
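Since the module itself isn't on the slides, here is a reconstruction of the idea, following samples/kprobes/kprobe_example.c. The symbol name and the chatty parameter match the description above, but treat the details as illustrative rather than as the exact module from the demo.

```c
/* Sketch of a kprobe module in the spirit of samples/kprobes/kprobe_example.c:
 * report which task is current whenever do_current_softirqs() runs (RT kernel).
 * Illustrative reconstruction, not the exact module used in the demo.
 */
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/sched.h>

static int chatty;
module_param(chatty, int, 0644);
MODULE_PARM_DESC(chatty, "print a line on every hit");

static struct kprobe kp = {
	.symbol_name = "do_current_softirqs",
};

/* Runs just before the probed function; current identifies the calling task. */
static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
	if (chatty)
		pr_info("%s called in context of task %s (pid %d)\n",
			p->symbol_name, current->comm, current->pid);
	return 0;
}

static int __init kp_init(void)
{
	int ret;

	kp.pre_handler = handler_pre;
	ret = register_kprobe(&kp);
	if (ret < 0) {
		pr_err("register_kprobe failed: %d\n", ret);
		return ret;
	}
	pr_info("planted kprobe at %p\n", kp.addr);
	return 0;
}

static void __exit kp_exit(void)
{
	unregister_kprobe(&kp);
	pr_info("kprobe unregistered\n");
}

module_init(kp_init);
module_exit(kp_exit);
MODULE_LICENSE("GPL");
```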
But of course, if you have your kernel source tree, you can change your kprobe, copy it over to the system, insmod it, and you're good to go. The other nice thing about kprobes is that, since they're only loosely tied to the kernel version, a kprobe that works on one kernel version will usually work on another without any modification. So, let's see. I showed you a real-life example where the Ethernet IRQ was sometimes triggering do_current_softirqs() directly, and sometimes there were enough Ethernet hard IRQs that the work was getting deferred to ksoftirqd. The code that has this kprobe in it is on GitHub, but it's really simple stuff. So, do_current_softirqs(): here is the big headline difference between the RT kernel and the mainline kernel. The mainline kernel instead goes through a path that runs something called handle_pending_softirqs(). What this means is that when local_bh_enable() runs in RT, the softirq that is invoked is the one that actually corresponds to the device that raised it. In other words, typically, when a hard interrupt handler exits in real-time, only the softirqs that it has raised will run. In non-RT, any softirq can run: the highest-priority pending softirq in that list will run when the hard IRQ handler returns. This leads to some real weirdness in stack traces, where you'll see the Ethernet hard IRQ exit and then the block softirq run. Obviously, in real-time Linux, we couldn't tolerate this kind of behavior. You could use the priority inheritance system to elevate the softirq, but why not just call the softirq that you care about essentially directly when the hard IRQ exits? So, fundamentally, the softirqs raised in the current context run at the exit of the hard IRQ handler in RT, whereas in non-RT the softirqs that are most important from the point of view of the system will run. I think I have an example here: some tracing from x86 of the function that handles the buffer that comes in on receive in the driver. In this particular case, some virtual filesystem write is what ran before the Ethernet softirq, and here some type of memory management code ran before the Ethernet softirq. So this is a kind of strange situation, but apparently it works. I'm going to go back to the other place here. So, the ksoftirqd path, which is the second path where softirqs run, is much more similar between the two kernels than the path that starts with the hard IRQs; ksoftirqd's behavior is mostly the same in the two cases. I'll just skip over this; it's just two stack traces that show this behavior in more detail. So the mechanism by which softirqs run at the exit of a hard IRQ handler differs more between RT and mainline than the ksoftirqd path does. Here's just a little information about how to look up the current thread from the registers; I'll just skip past that. So, a remaining question is: when exactly do the system management softirqs get to run in RT, if only the current softirqs run? In other words, when would do_current_softirqs() get to invoke the RCU, scheduler, and timer softirqs, and any other ones that exist in the RT kernel? So, who can guess what current is when the RCU or scheduler softirqs are running? The answer is systemd-irqd. I thought that was a great joke, but no, that's not the right answer, and it's too late for anyone to think it's funny anymore. The right answer is that ksoftirqd itself is the current task when the handlers for the hrtimer, scheduler, and RCU softirqs are running.
The functions that invoke these softirqs are very interesting: they're fake hard IRQ handlers, essentially, so that they can be handled by the same code as the regular hard IRQs. Interesting reading. So, I already showed you this kprobe on do_current_softirqs() on the i.MX6, which showed that sometimes the softirqs run in ksoftirqd and sometimes right after the hard handler. One interesting thing is that sometimes the softirq doesn't run on the core where the hard IRQ was raised; that's a particular point I might discuss later. Softirqs can be nested, since IRQs can be preempted in RT; that's protected inside an #ifdef for RT in the regular kernel source. A delightful thing that you see if you start using a heavyweight testing tool on RT is these messages: "RT throttling activated" or "rcu_sched detected stall". The second one is what you see right when you notice that the system has hung; essentially it has become completely unresponsive and you have to hit the magic button. So, I really started to understand this problem when I heard an outstanding talk by Steven Rostedt at ELC last spring. I'm very happy the Linux Foundation put up the videos of talks that were not keynotes last time; I hope they'll do that again for this conference. I certainly watched a lot of them. The project I was working on at work, which doesn't use the i.MX6 but a different, very expensive board, was having a problem where our main event loop was missing some cycles. It wasn't missing badly enough that we were seeing these death messages, but for a real-time control system that's controlling a truck going down the highway, an event loop that misses cycles would be very, very bad, and not a product that anyone in their right mind would possibly ship. So the problem had to be solved. And the source of the difficulty turned out to be that the system timer, which was scheduled via ksoftirqd, wasn't running, and that was causing the event loop not to run, essentially. That was a big problem. To illustrate this concept, I searched my cabinets and found the oldest food I had, which was best before 2008; it has missed its real-time deadline. Some pasta that one of my former boyfriends liked. Come by my apartment; I'll give it to you if you dare to eat it. So, a partial solution to the problem I alluded to, of timers missing their deadlines because the scheduler wasn't running them, comes from Sebastian Siewior, who I'm sure is over in the Real-Time Summit this morning. He had a patch that came out this winter, right in time for me, that split the timer softirqs out of ksoftirqd. I'm sure you don't really want to read this commit message now, but there are some links here for reference if you want to look at it. Basically, the timer and hrtimer softirqs are moved into their own per-core thread. When I backported this patch and applied it, it immediately made the system work better. So you see now we have something called ktimersoftd; this is now where the softirqs for the hrtimer run in RT. And the great thing is that you can now set the real-time priority of this thread, so if you have a problem where the timers are missing, you can elevate this thread's priority and not elevate, say, the block softirqs, still represented in ksoftirqd, if those aren't critical to your system.
Having had this experience, you might wonder whether we'd want to be able to run all the softirqs in individual threads and manage all their priorities and CPU affinities. I'd be curious to hear other people's experiences, which is why we come to conferences in the first place, but it's already quite a project to manage. Between the affinities of the existing user space processes, the hard IRQs, ksoftirqd, and ktimersoftd, the system is already at the limit of developer comprehension, I would say, and I'm not sure that adding more knobs would make things better. So let's see. I'm going to go on now and talk more about tools. I really want to talk about eBPF before everybody falls asleep, because I think it's the coolest thing ever; I have purchased a 64-bit ARM board because I want to play with it on there. But the go-to tool when you first start working on RT and looking at problems is ftrace. Both the wonderful thing and the sometimes troubling thing about ftrace is that it produces really a lot of output. You could look at The Art of Computer Programming and say, how wonderful that Donald Knuth has preserved so much wisdom in such highly detailed form in this book; I should really read that book. And if you generate a lot of test output with ftrace, you start to feel the same way: you say, wow, my probe produced 30 megabytes of output, I should really read that. But instead, what I'd really recommend is using BPF. BPF combines the feature of kprobes that you can trace almost any arbitrary point in the kernel with the feature of kprobes that it's relatively fast to change the probe. It's got a very convenient user space interface that's a lot simpler to use than modprobing a driver. It's written largely in Python, with C snippets that get dynamically compiled and inserted into the kernel using Clang's just-in-time backend, which unfortunately, much to my dismay, is not available yet for 32-bit ARM. And BPF uses lightweight methods, because it has internal data stores, including even a hash table that it provides, and so it has very low overhead on the system. For all the power that ftrace has, with its event tracers that can be used without the heavyweight function tracing, BPF is really more lightweight, in that ftrace, despite my best efforts, very often produces the horrible RT throttling or RCU stall messages, which means I have to hit the power reset button on my system; I don't get any test results, and I certainly don't get any test results about RT performance that anyone would believe when that has happened. BPF is mostly permissively licensed; a lot of it is contributed by Brendan Gregg, who many of you will know from previous tools efforts. As I said, it has limitations. It apparently first appeared in the kernel, in the form I'm going to talk about, in kernel 4.1, and the full support is apparently there in 4.4. It's very actively developed on the user space side. Excuse me. The documentation for it is excellent. It does require newer versions of LLVM and Clang, and because I believe the toolchain support isn't there for ARM32, it's limited to PowerPC and x86-64, and sometimes it works on ARM64, which is the project I really want to investigate after this talk is done. So with that, I will, with great sadness, shut down my ARM board and connect the same cable to the x86.
My x86 laptop was not opened by airport security, since they have the illusion, when they see one of these cases, that they know what's in there. So if you want to take a bomb on a plane, don't take it in your i.MX6 case, take it inside your laptop; that's the moral of this story. Oh, it's true that the laptop has lithium batteries in it, so those have kind of made the news recently, but it's not the dangerous Galaxy Note, so I should be able to SSH in. Hopefully this will work. Ah, excellent. Okay, so this is now 4.6.2 running on this trusty older laptop. We can just look at the Hello World for BCC. BCC stands for BPF Compiler Collection, if I'm not mistaken, and BPF itself stands for Berkeley Packet Filter. Berkeley Packet Filter has been in the kernel for a long, long time, but it was used for packet filtering up till now, and the new BPF tools, which have been covered really well both on Brendan Gregg's blog and on LWN, are an extension of that packet filtering capability to tracing. One of the great things about BCC is that it comes with lots of examples, particularly in the tools directory; every time I pull the source, there are more and more of these examples that you can play with. I guess maybe it would be a good idea to show you one, just to get the flavor of it. So, as I said... is this font big enough or should I make it bigger? Okay. All right, well, this is just one of the examples, and one that I'll show you how to use in just a little bit. It's written by Brendan Gregg, it's permissively licensed, and basically there's some Python boilerplate that sets up BCC, and then in the middle of all these examples is a section called "define BPF program". Essentially, the BPF program that gets compiled by the JIT and inserted into the kernel sits as a comment-like string in the Python source code, which I think is in and of itself awesome. There are some helper macros and really just not that much code here inside that string. Oh, here we're back into Python again. So this really is pretty simple. Oops, I went past it, that's why. Yeah, so this text here that Emacs is showing in red is the whole of the BPF program; it's really quite short. And of course, because this is Python on the outside, if you can program Python, you can connect to all the wonderful analysis and graphical programs that Python has. But here, let me show you just the Hello World example. What this actually does is trace sys_clone. You can't see the other terminal window, but I'm just going to type ls, and you can see that the BPF tracing tool is finding that sys_clone is getting called. So this is just a dumb example, but it shows you how it works, and it's really that simple to use this tool. And as with kprobes, you can go in and slightly change the legion of examples and get the effect you want. I've got some information here about how to get started with this tool; I won't go over that. I want to skip ahead to my examples, since I think it's more interesting to see how a tool can actually be used than to hear the theory of it, in the spirit of Walter Gropius, who said: get out in the workshop and use the tools. Of course, I needed an excuse to look at cycling pictures while I was preparing these slides. So I'm going to show you an example of matching the bottom half that goes with the top half. The example I'm going to talk about is the new API for networking, in case anybody doesn't remember what that is.
I'll just go past it really quickly. Essentially, the new API exists to address the problem that a ping flood, like I was just showing you, is an event coming from outside your system, which you as an RT designer have no ability to control, and which can introduce an interrupt storm that affects system performance and stability. Obviously we can't ship real-time systems that can be brought down that way. This essentially answers the question: how does Linux deal with a denial-of-service attack? This is one of the ways. A high-speed network can create thousands of interrupts; we know that. The method we use for interrupt mitigation is essentially to disable the interrupt line when the number of Ethernet interrupts gets too high, and to switch to polling in the Ethernet driver. Why does this make sense? Well, you have a certain amount of bandwidth in the system that you're willing to allocate to servicing the Ethernet device, and if the Ethernet device wants more service than that and you can't afford to give that level of service, you're going to have to drop packets. The new API gives Linux a way to figure out when it should drop packets and to start doing so, or, at least if the situation isn't that extreme, to reduce the frequency with which interrupts disturb normal operation. Put differently, if you're not willing to service the Ethernet driver any more frequently than the polling loop will service it, it's less expensive to switch to polling. Hopefully that's clear; probably most people know that anyway. So, just to return to the Freescale NAPI-enabled interrupt handler: really all this handler does is see whether there's queued work on the FEC, the Freescale Ethernet controller, and if there is, it disables the particular Ethernet interrupt that corresponds to an incoming packet, masks it, and then starts using the other method of handling packets, which is polling, instead. I'm going to show you this behavior on x86, but the e1000e driver is spread across a huge number of files and is far less readable than the ARM one, which is really simple. Correspondingly, the NAPI receive-packet function keeps processing in the polling loop until the number of packets it finds it has received falls below a threshold, and then it calls this napi_complete function and turns the interrupt back on. So this is a good safety valve for the kernel to make sure that the interrupts don't overwhelm the system (I'll sketch the general shape of this pattern right after this paragraph). There are, of course, existing tracepoints in the kernel for looking at this behavior; there is, in particular, a napi_poll event tracepoint. For those of you who have experience with ftrace, event tracers are a lot more lightweight than the function tracers, so the event tracers can sometimes be used without hitting the RCU stall or RT throttling death messages. But I found this existing tracepoint didn't really give very much information; its behavior didn't obviously change between when the system was handling interrupts for Ethernet and when it was polling, so that wasn't very useful. I looked at this with kprobes on ARM, and I already talked a little bit about that. What I really want to get to is showing how to look at it with BCC on x86, so let's go right ahead and try that. Should make these different colors or something. Okay, so that's now the x86 part, and we're here in the examples directory; if we go up to tools...
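Here is the promised sketch of the general NAPI pattern: the hard IRQ handler masks the receive interrupt and schedules polling, and the poll routine turns the interrupt back on only when traffic drops below the budget. The mydev structure and the mask/unmask helpers are hypothetical; this is the shape of the mechanism, not the actual FEC or e1000e code.

```c
/* Condensed sketch of the NAPI pattern described above. "mydev" and the
 * mask/unmask helpers are hypothetical placeholders, not real driver code.
 */
#include <linux/interrupt.h>
#include <linux/netdevice.h>

struct mydev {
	struct napi_struct napi;
	/* ... rings, registers, etc. ... */
};

/* Hard IRQ: stop the interrupt storm and hand the work to the poll loop. */
static irqreturn_t mydev_eth_interrupt(int irq, void *dev_id)
{
	struct mydev *priv = dev_id;

	if (napi_schedule_prep(&priv->napi)) {
		mydev_mask_rx_irq(priv);	/* hypothetical: mask RX irq */
		__napi_schedule(&priv->napi);	/* raises NET_RX_SOFTIRQ     */
	}
	return IRQ_HANDLED;
}

/* Poll routine: called from the NET_RX softirq, handles up to 'budget'
 * packets per call. */
static int mydev_poll(struct napi_struct *napi, int budget)
{
	struct mydev *priv = container_of(napi, struct mydev, napi);
	int done = mydev_rx_packets(priv, budget); /* hypothetical RX work */

	if (done < budget) {
		/* Traffic has calmed down: stop polling, re-enable the IRQ. */
		napi_complete(napi);
		mydev_unmask_rx_irq(priv);	/* hypothetical: unmask RX irq */
	}
	return done;
}
```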
Back in the BCC tools directory, all these wonderful tools are provided, with suggestive names like oomkill.py and wakeuptime.py and so forth. There really is just so much useful stuff here, and it includes uprobes as well as kprobes, so you can trace user space events too; everyone who's interested in this area really should look into it. What I wanted to show you is that we can run stacksnoop, as it's called. So here is one of the functions of interest in this case, and I've done no preparation for this; I haven't modified anything in this directory. We could have audience participation where you shout out your favorite function, and I would just type stacksnoop with that function and you'd get it from this tool; it's really amazing. And so here we can see the path that's leading to do_current_softirqs. Clearly, this is a ridiculously large amount of output; like ftrace, this is abundantly generous if you stack-trace a frequently called function. Not a good idea. There are other things to do that are more sensible. Maybe now is a good time to return to the slides while it thinks about that. Yeah, so we can also look at the handling of the Ethernet packets using something more sensible like stackcount, which counts the occurrences of different stack traces, and we can see in fact how frequently the Ethernet packet handling is running from ksoftirqd and how frequently it's running after local_bh_enable(). Oh, seriously. All right, I'm going to have to come over here and kill it. There we go. Okay, so let's try stackcount, which is not so voluminous in its output; all I'm going to show you is the same example that was there. All right, so now it's counting the occurrences of different stacks related to the packet handling. So if I go now again, and I'll ping flood this laptop, and actually, let's do more: I've got this scp_forever script, you can guess what that does. I wanted to set this up so that you in the audience could attempt to DoS the laptop, but I figured the Wi-Fi probably wouldn't work, so that's been running for a while. So now we see, as predicted, marvelously enough, that under heavy load the system is running a certain fraction of the Ethernet packet handling from local_bh_enable(), so right after the Ethernet IRQ runs, and a certain fraction of the Ethernet packet handling in ksoftirqd. These are the two paths for the deferred work that I was talking about before, and you can see that on a real system they both happen. Still, even with this unreasonable amount of load on the system, the handling of the packets right after the IRQ predominates over the running in ksoftirqd. So is this sufficient to show that the NAPI polling is working and that the IRQs are disabled? No. And the reason it isn't took me a while to figure out, actually. We'll kill scp_forever, sadly. The reason this isn't enough to prove that NAPI is kicking in gets back to this: even if a softirq is raised at local_bh_enable(), because of the way preemption and interruption occur in the kernel, it may not get to run, particularly if a whole bunch of hard IRQs come in while one is running; obviously the system can't keep up. So what happens in a situation like this, even on x86, is that if the time slice for the softirqs gets exhausted, or if they get preempted and don't run, they end up running in ksoftirqd.
So it's not surprising that more of the packet handling occurs in ksoftirqd when the system is busy, but that's not sufficient to show that NAPI is really kicking in. So what would be sufficient? In order to show that, I'm going to look, for this kernel, at some of the network code, which is in net/core/dev.c. That would be a funny way for a demo to fail, for the kernel source to be missing. net/core/dev.c, that's fine. So if we look at the function... net_rx... is this big enough? People should yell out if it's not. The function that is doing all the work in network packet receipt is net_rx_action(); this is the real hot code path when the system is flooded by packets. And if we look at what's in this function: basically, while there are packets, it's going to try to process them. Maybe I don't have time to talk about Receive Packet Steering, but basically, if the system gets all the way through its main processing loop and has used up its budget, then at the end of net_rx_action() it calls a function called __raise_softirq_irqoff_ksoft(), which was introduced by Sebastian Siewior and moves the remaining work to the softirq that runs in ksoftirqd. So this is really where the NAPI mitigation is occurring. If things are really bad, we can use Receive Packet Steering and actually move some of the packet handling to another core, which is one of the inter-processor interrupts I was talking about before. Just to explain why this NAPI problem was hard for me to understand: if you look at the stack traces from ksoftirqd and from the path where the softirq is raised by the hard interrupt, you see they both have a function called napi_poll in them. Innocently, when I started on this project, I thought, oh, if the processor is running napi_poll, it must be using the new API. No, it calls napi_poll all the time, and this is true of a lot of drivers. I think people just don't want to have two completely separate code paths for processing a packet, so they wrote a napi_poll routine, which is efficient, and they call it all the time; they call it even when the softirq is running in the context of the Ethernet hard IRQ. But because the NAPI path and the hard-IRQ invocation of the softirq look so similar, it's actually a little bit subtle to tell when the switch happens. So, long story short, the way you can really, really tell whether the system is throwing up its hands and sending all the Ethernet packet handling to ksoftirqd is whether this function gets called. We can look at that in a similar way with stackcount, which I think is great, or you can do it with a kprobe, too; I've done the same test with a kprobe just to make sure that I got the same result as on ARM. So now, when this function runs, it's really the new API mitigation. Let's see what we can do here. I'll restart scp_forever. You can tell when you're getting close to the limit because you actually see that the system starts dropping packets. I don't know how many people know this about ping flood: for each packet, ping is actually writing out a dot and then erasing it if it gets a response. If it starts not seeing responses from the other side, you start seeing dot, dot, dot, which is useful if only because, if something is really hung, you see it there. It's still not dropping packets. So, since ping is free, from some point of view, I'll start another one running. This one immediately started dropping packets. So let's go ahead and quit this now. That's because you have the address on the ping. I think you're probably right. Oops. Oh, no, that's actually this one.
All right. This one is... all right, watch me comically type. Right, that is the i.MX6. How did you know that? All right. At any rate, this should probably be enough to generate the other behavior. So let's see. Stop that. Okay. So in fact, __raise_softirq_irqoff_ksoft() has been called, and it's mostly been called from the IRQ thread, because this now is the switch to polling. Hopefully that's clear. So, to summarize all this work: I really wanted to understand how the network packet handling was working in the truck embedded system, which I'm actually not talking about, in the interest of keeping my job, and I went down this long path to try to understand it. I basically thought that other people might be particularly interested in how to use this BPF, which is really so simple and so useful. I've done really the same kind of investigation on ARM with kprobes to look at the switch to NAPI polling, and I already showed you a little bit of that. So it's probably a good time, since everyone wants to go to lunch and stop listening to me yack, to just go to questions. I'd be happy to show people the other demos if you like, or the code is on GitHub. Both the BPF and the kprobes are really simple, so go ahead and play with them yourself. I'll just say the documentation says kprobes can be inserted anywhere; maybe not quite anywhere, in my experience, but they are in fact very useful. So the summary of this talk, if you've only learned one thing, which you probably entirely knew before, is that interrupt handling involves a hard, or fast, part, which basically communicates with the IRQ chip driver that talks to the interrupt controller and returns, and then a soft part, which actually does the servicing of the device and can run asynchronously. Hard IRQs have some architecture dependence; it turns out that the FIQ isn't very useful; thank you for that correction, interesting to know. In the mainline kernel, a given softirq can run after any hard interrupt, which is kind of weird. In the RT kernel, the RT patch set implements behavior where the softirq raised by a particular hard IRQ typically runs right after the hard IRQ exits, which seems a lot more sensible. The threaded and preemptible IRQs are really one of the primary features of the RT patch set; you can also see a few threaded IRQs in the mainline kernel, as I showed, or you can request them with the threadirqs kernel command-line parameter, although I think people probably mainly use that for testing. A main part of the work of a kernel engineer who runs an RT system is changing the priorities and CPU affinities of the IRQ threads, and perhaps the user space processes, to make the system stable. And that is, in fact, how we've solved our ping flood problem with our truck controller, I'm happy to say, and we'll soon have, I can hope, highly automated trucks on roads near you. The management and the study of the real behavior of the interrupts in these systems, as illustrated by this example, is challenging, but possible. The new BCC and eBPF tools that have emerged in the kernel over the last year or so are really powerful, and Brendan Gregg, who is responsible for so many of the great tracing tools we already have in Linux, is working on this as the latest thing. So with that, I'll just turn to questions, or we can all go and eat lunch. Yes? So if you wanted to have an interrupt which is not maskable...
The main problem is that on most ARM systems it's never going to be reachable by Linux; it's going to be owned by the secure side of the core. Well, you'd have to have control of the secure side. Of course. So that's part of the problem. If you really want to be able to actually simulate a non-maskable interrupt, then you'll have to introduce interrupt priorities at the interrupt controller level. We have that, at least in my opinion. Yes, but is there a way to selectively tell Linux that it should not mask one interrupt you're really interested in, so that it always gets through to you? Yes, so that's why, instead of masking the interrupts globally, you set the priority mask at the GIC level instead of just disabling them at the CPU level. Yeah, okay. You'd have to find everywhere the kernel masks all the interrupts, and that sounds like something really hard to change. That's what would have to change. Okay, great. We can take it offline if you want to; we may end up with a different answer, but I'm happy to talk about it. I've just pulled up...

You have been working with the thread priorities of the kernel interrupt threads, and what I was wondering while listening to your talk is: is it better to have your real-time application above the interrupt priority or below, or do you have any experience about which ordering is best?

So the question, which is an excellent one, is, I think, how do you set the user space priority of your application with respect to the interrupt threads? So at the end of the day we've ended up setting both the real-time priority and the CPU affinity of our most important user space application, the one that is the real control loop. On the real silicon that we're using we just have two cores, and so the design of the priorities for the interrupts and the CPU affinity has really been what I call an artisanal task. It just requires a ton of testing.
Fundamentally, what the real-time control loop is doing is reading messages from the devices; in our case we have both CAN devices and Ethernet devices that are sending important messages. And so you end up essentially having to put the IRQs, the hard IRQ handlers, at the highest priority; after all, they're fast, so there's kind of no reason not to. But you have to make sure that your user space task, which after all is actually running the control loop, gets to run. The way we've ended up doing it is having different hard IRQs at different priorities running on the two cores, and then pinning the user space task at a lower priority than the IRQ threads but a higher priority than, for example, ksoftirqd on one of the cores. And then, as I said, we've backported ktimersoftd, which Sebastian Siewior created this spring, to our kernel, and we've elevated ktimersoftd above ksoftirqd. And at the moment it's all working. The problem is that as we fix bugs in the other software, and as we add features that the system needs in order to ship as a product, every time we do this we have to test the whole thing again, and I suspect before we really release the product we may have to adjust the priorities again. So as far as I can tell this really is going to be an ongoing maintenance project. I don't think there's a way to pick affinities and priorities that is going to withstand changes to the other software. So, yes?

As a follow-up to this discussion, have you found a way to somehow automate this artisanal task of placing priorities and affinities on CPUs? Because it looks like a fairly complicated task, and I usually like to leave that to a machine to do.

The audience has asked an excellent question: if this is tedious and ongoing, why don't we automate it? And I couldn't agree more. I work for a startup, so unfortunately none of the schedules I have say "automate real-time testing" on them; the boss wants the product to work now, not for developers to entertain themselves by setting up board farms where testing can be done. But I think a number of us strongly agree that the right way to really address this problem is, with our nightly build, to run cyclictest, to try adjusting the priorities and affinities, and to run our actual user space application and make sure that it isn't missing its deadlines. I think even if we end up having to adjust things manually, we would benefit from making sure that some particular patch in our code isn't breaking the real-time behavior by running automated tests. So if we get enough headroom in our schedule we will set that up soon; we've already started on it, but it's not completely in operation. And of course the other problem with an embedded project is simply having enough boards to dedicate one to a test system, and that's a type of problem everyone has. Any other questions, or should we go and eat lunch? Lunch it shall be. Thank you all for coming.
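To make the priority-and-affinity recipe from these answers concrete, here is a minimal user-space sketch of what pinning the control loop looks like. The SCHED_FIFO priority of 45 (just below the default IRQ-thread priority of 50), the choice of CPU 1, and the 1 ms cycle are made-up illustrative values, not the numbers from the speaker's truck controller.

/*
 * Illustrative sketch only: pin a control-loop process to one core and give
 * it a SCHED_FIFO priority between the IRQ threads and ksoftirqd. The
 * priority (45), the CPU number (1), and the 1 ms cycle are invented values.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        cpu_set_t set;
        struct sched_param sp = { .sched_priority = 45 }; /* below the IRQ thread default of 50 */

        CPU_ZERO(&set);
        CPU_SET(1, &set);                                  /* run only on core 1 */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                perror("sched_setaffinity");
                return EXIT_FAILURE;
        }

        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) { /* needs root or CAP_SYS_NICE */
                perror("sched_setscheduler");
                return EXIT_FAILURE;
        }

        for (;;) {
                /* ...read CAN/Ethernet messages and run the control loop here... */
                usleep(1000);                              /* placeholder 1 ms cycle */
        }
}

The IRQ thread priorities themselves can be adjusted in the same spirit from an init script, for example with chrt, which is presumably how the priorities described in the answers are managed; the specific numbers are exactly the artisanal, test-driven choice the speaker describes.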