All right, good turnout. So it is now 11:35, which means the presentation is about to start. And being at a Linux conference, the next presentation is about Linux. So this is Rafael, and I'm sure he's going to give us an amazing presentation. So everyone, please welcome him.
Thank you. So hello everyone, and let me introduce myself briefly. My name is Rafael Wysocki. I work at the Intel Open Source Technology Center, which is part of the Software and Services Group at Intel. And I'm based in Warsaw, Poland, Europe, pretty much on the opposite side of the Earth from where we are now. So that's about as far as I can travel without leaving the Earth, actually. So it's good to be here, and thanks for coming and choosing my presentation over whatever interesting things you might be doing instead of listening to it.
Okay, so I've been working on Linux for around 10 years. I started to contribute to the Linux kernel in 2005. I've been maintaining the power management core of the kernel since 2009, and the ACPI core since 2012, so I've been around for quite a while.
This is the plan of my presentation today. I will be talking about system suspend. You might ask, why system suspend? It has been around for a while. It was already there before I started to contribute to the kernel. So it's not new, and it's probably not very exciting. It doesn't change a lot over time, because it is a fragile thing and it's easy to break, so we don't really change it very often. But in fact, it does change from time to time, and this is the main reason: it has actually changed over the last year. And I'd like to say something about those changes, what happened, why it happened, and so on. So first of all, I will say a few words about where system suspend is. What's the place of system suspend in the whole big picture of Linux power management? Then I'm going to talk about a new flavor of it, which I'm calling suspend-to-idle.
And finally, I'll say a few words about other developments related to system suspend, which are perhaps not the big items, but are also important.
So let's start with the big picture. We have two broad categories of power management frameworks in the Linux kernel. One of them is about individual components like CPU cores, CPU sockets, devices (I/O devices, I mean), and so on. The idea is that if a given component is idle and doesn't do anything useful, it should be put into a low-power state, an energy-saving state, so that it doesn't consume energy in vain, right? All those frameworks in the green field are for that. They try to track the usage of devices or components, and if those are not in use, they should be put into energy-saving states individually, not necessarily at the same time. So some devices may be in an energy-saving state and some others may not. Things happen asynchronously and concurrently with each other. So the green field is mostly about the state of the system in which work is being done. CPUs generally execute instructions, devices do I/O, and so on.
On the other hand, the brownish or orange field represents the frameworks that work on the entire system at once: they cause the entire system to go into an energy-saving state. And in that state, nothing happens. I mean, no useful work is done. Processors don't do any useful work. It's just there to save energy. So the things on the left-hand side are for a situation where work is done, and individual components may or may not be in low-power states. The things on the right-hand side operate on the system as a whole. And I'll focus on this yellow circle here, which is system suspend, and a particular flavor of it, which is suspend-to-idle, as I said before.
Oh, by the way, why is the CPU idle framework here in the intersection?
It is an interesting thing, because on some platforms we can actually cause the whole system to go into an energy-saving state through a CPU idle transition. I mean, if it turns out that there are no instructions to execute for any processor in the whole system, then the whole system can sometimes go into an energy-saving state through an idle transition. So that's why I put it here, and I'll get back to that later. All right, so I'm going to focus on that thing, and on this in particular, going forward.
So system-wide power management in Linux can be pictured as a pyramid like this one. The base of it is the working state, in which CPUs do the work, and everything above that red line is a sleep state, in which no work is actually done and which saves energy for us. That's the reason why I call those sleep states: they are there to save energy, and no work is done in them. So, from bottom to top, the first of them is suspend-to-idle. It used to be called "freeze", and this name appears in several places in the kernel. Those two are also suspend states; the difference between those two and suspend-to-idle is that these require some support from the platform, either hardware support or platform firmware support, to be accessible. On the top, there's hibernation, which is suspend-to-disk in other words, so we save the contents of memory to persistent storage and then turn everything off, hopefully.
The idea is that the lower the state is in the pyramid, the more energy it consumes, and the higher it is, the less energy is used in that state. So the width of each level roughly corresponds to the amount of energy consumed in that state.
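As a rough illustration of the pyramid, the sketch below orders the sleep states from the bottom level to the top one using the labels that the kernel exposes in /sys/power/state. The labels for the two middle states ("standby" and "mem") are not spoken in the talk; they are taken from that sysfs interface, and the helper names are hypothetical.

```python
# Sleep states ordered from the bottom of the pyramid (most energy used)
# to the top (least energy used). The strings are the labels accepted by
# /sys/power/state; "standby" and "mem" are not named in the talk itself.
SLEEP_STATES = ["freeze", "standby", "mem", "disk"]


def supported_states(sysfs_text):
    """Return the states listed in /sys/power/state, shallowest first."""
    present = set(sysfs_text.split())
    return [state for state in SLEEP_STATES if state in present]


def deepest_state(sysfs_text):
    """Return the most energy-saving state available, or None."""
    states = supported_states(sysfs_text)
    return states[-1] if states else None
```

On a real system the text would come from reading /sys/power/state; a platform without firmware or hardware support for the middle states would typically list fewer entries.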
On the right-hand side, there's a list of operations that have to be carried out to get to each level. For example, for suspend-to-idle, we need to freeze tasks and then suspend devices. So that's how you can picture all that, right? So first of all, those things have been around for quite a while. As I said, they were already present in the kernel when I started to contribute, so that's more than 10 years ago. And this one is quite new, so I'd like to say something about the differences between those states and this new one, but for that I need to say how these are supposed to work and then talk about the differences, right?
So this is the system suspend code flow, and this is the system resume code flow; these have to be complementary. And all of the operations here are carried out on the whole system, right? So when I say notifiers, it's kind of a global thing, and it goes through all the subsystems that have registered notifiers, and it's similar for the freezing of tasks. So first of all, the notifiers are for kernel subsystems. All of this is going to be about the Linux kernel; I'm not going to talk about user space, okay? So notifiers are for the subsystems that want to know when a system suspend is going to happen. They can register a notifier for that, and it is called before any of the other steps.
The next thing done is the freezing of tasks, and that's actually about taking user space out of the picture so that we don't have to worry about what user space may be doing during the next steps, because user space may mess things up in many different ways, and some of them are not really controllable in any way; they are very difficult to intercept.
If you think about things like mmap, for example, registers of a device can be memory-mapped so that user space can access them directly through memory addresses. That is very difficult to intercept, and it's better to just get rid of user space entirely and not worry about it anymore, and this is what that is for, right? So at this point, we don't need to worry about user space anymore; we are purely in the kernel.
That big box represents the operation that I'm calling device suspend, which is about putting individual I/O devices into low-power states. Each of the balloons here represents a loop that goes through all of the device objects registered in the kernel, and each of them transitions those objects, or the devices they represent, into deeper and deeper energy-saving states. After that, hopefully, no device should consume energy anymore, except for wake-up devices, right? Because obviously, when we put the system to sleep, we would like it to wake up at some point, for one reason or another. So we need wake-up devices to be able to signal something, and those have to be powered for that.
In the next step, we put the non-boot CPUs offline. That's because we don't want them to take any interrupts and we don't want things to be scheduled on them going forward. The next step is putting core system components into low-power states. These are things like interrupt controllers. We just turn them off. And finally, when we get here, we either call the platform firmware to turn the system core logic off for us, or we trigger a platform driver that will do the same, hopefully. So this is a step that requires platform support, and if we don't have that, it's not useful to do the previous two steps either. So this is the idea of suspend-to-idle, right?
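The full suspend and resume code flows described so far can be summarized in a small sketch. It is only a model that records the order of steps in a log. The per-device phase names are borrowed from the kernel's device PM callbacks rather than from the slide, and the order in which devices are visited within a phase is simplified here.

```python
# Device phases for system suspend and the complementary resume phases.
SUSPEND_PHASES = ["prepare", "suspend", "suspend_late", "suspend_noirq"]
RESUME_PHASES = ["resume_noirq", "resume_early", "resume", "complete"]


def full_suspend(devices, log):
    """Append the steps of a full system suspend to log, in order."""
    log.append("notifiers: suspend is about to start")
    log.append("freeze tasks")
    for phase in SUSPEND_PHASES:      # one loop over all devices per phase
        for dev in devices:
            log.append(f"{phase}({dev})")
    log.append("take non-boot CPUs offline")
    log.append("suspend core system components")
    log.append("platform: enter sleep state")


def full_resume(devices, log):
    """Append the complementary resume steps to log, in order."""
    log.append("platform: leave sleep state")
    log.append("resume core system components")
    log.append("bring non-boot CPUs online")
    for phase in RESUME_PHASES:
        for dev in devices:
            log.append(f"{phase}({dev})")
    log.append("thaw tasks")
    log.append("notifiers: resume has finished")
```

Calling `full_suspend(["disk", "nic"], log)` followed by `full_resume(["disk", "nic"], log)` yields a log in which every suspend step has a matching resume step in mirrored order.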
So we have a transition that looks like system suspend, but the final step is just waiting for a wake-up interrupt instead of putting all the non-boot CPUs offline, turning off system core components, and so on. So we just suspend devices. At this point, they should not consume too much energy, and then we wait for a wake-up interrupt, just as in an idle state of the system when the CPUs don't have any instructions to execute.
Yeah, that sounds straightforward enough. Unfortunately, there's a problem with it, which is related to wake-up interrupts, or wake-up events in general. We want the same wake-up events to work in the same way for these transitions and for suspend-to-idle. So if I have a wake-up device like an Ethernet controller and I want it to wake up my system from full suspend, which was the previous slide, I would also like it to be able to wake up the system from suspend-to-idle, so that from the user's perspective those states behave in the same way. That is not very straightforward to achieve, and to explain that, I need to go into the details a bit.
So that's how it works on an x86 PC. We have platform firmware here, which is a black box on purpose, because we don't really know what it is doing and we hope that it is going to do the right thing for us. But that's as far as we can go with it. So we have platform firmware that hopefully will put the entire platform into an energy-saving state in which the platform will consume a minimum of energy. Now, I use two different colors for those boxes here and for the balloon here and for those things above. The reason is that at this point we are carrying out a very special operation, which is to disable device interrupts. Originally, it was done as a hack to work around bugs in PCI drivers that were related to shared interrupts. I don't really want to go into details with that.
But this was done just to avoid system crashes caused by buggy PCI drivers. The bottom line is that from that point onwards, the device interrupts are not really handled. I mean, the interrupt handlers are not invoked by the IRQ subsystem. It just marks the interrupts as pending, and it will trigger the interrupt handlers at this point, when we get back from this state. OK, so device interrupts are off, meaning that the handlers will not be invoked. Then we do those two steps I was talking about before, and then we call the platform firmware to put the platform as a whole into a low-power state, an energy-saving state, for us. Then, when there is a wake-up event, like the user opening the laptop lid or pressing the power button or something like that, the platform firmware, again, is supposed to wake up the platform and then jump to the kernel. And then we can just continue by restoring the functionality from the ground up.
OK, that's when the platform firmware is available and can do the things I was talking about. Obviously, relying on platform firmware is not necessarily a good idea sometimes, as you know. And some platforms actually don't have it, but they still support the idea of putting the platform as a whole into a low-power state to save more energy than is otherwise possible. How this can work is illustrated here. So suppose that the platform has two interrupt controllers. One interrupt controller is used in the working state for signaling interrupts to the CPU, while the other, the secondary interrupt controller, is only used for signaling wake-up events. So when the system goes into a sleep state, the platform driver switches over from this primary interrupt controller to that secondary one, which is only used for signaling wake-up events.
And if we have a wake-up device that is supposed to signal system wake-up, then it has to say, hey, I'm a wake-up device and I'm going to use this secondary thing in a sleep state. We have the enable_irq_wake() function for that. This is to tell the IRQ subsystem that this device is going to signal wake-up through the secondary interrupt controller. And if we have that, we can implement the functionality that is carried out by the platform firmware on an x86 PC just by using that hardware support. In that step, the platform driver in the kernel actually switches over from using the primary interrupt controller to using the secondary one. Then, when a wake-up event happens, for example, a device gets a signal from the outside for one reason or another, the secondary controller will trigger, it will wake up the CPU from sleep, and it will trigger the system wake-up this way. And in this block, we just switch back from the secondary interrupt controller to the primary one. And then we can just restore the functionality of the whole system from here upwards.
All right. Now, we want suspend-to-idle to work like that too, because we want it to look like real suspend to users. And we really didn't want drivers and other components, subsystems, bus types, or other things like that to care about whether we are going to use suspend-to-idle or full suspend. So we needed to make wake-up interrupts work in the suspend-to-idle case in the same way as here. So now we have just this loop waiting for a wake-up interrupt instead of all the transitions from one interrupt controller to another. At this point, we are still using the primary interrupt controller, but we want things to happen as though we were using the secondary one. For that, we use the observation that we disable device interrupts at this point and enable them here.
So we kind of tapped into that. Now, instead of simply disabling device interrupts, we switch them over from working mode to suspend mode. And in suspend mode, the handling is this: if an interrupt is a wake-up one, it will trigger a system wake-up, and if it is not a wake-up one, it will just be ignored, or marked as pending here. All right, so from that perspective, suspend-to-idle actually works in the same way as full suspend. We have two modes of operation: one is the suspend mode and one is the working mode. In the suspend mode, wake-up interrupts trigger system wake-up, and in the working mode, they are just regular interrupts.
All right, so are we there yet? Do we have suspend-to-idle? No, we don't. And the reason why is this. What if we have a kernel timer that is periodic? This is what happens. The timer generates an interrupt and wakes the CPU from idle. We get the timer interrupt here. It is not a wake-up interrupt; timer interrupts cannot be wake-up interrupts. So we ignore that interrupt and go idle again. And when the timer triggers again, we get another wake-up, and so on. So in fact, that gray box I showed on the previous slide is actually not just waiting, but looping over timer interrupts like that, which really means that the system looks like it is suspended, but in fact it is still consuming energy for handling timer interrupts.
That is still not addressed in the mainline kernel. We have a patch for that, which basically works by turning off timers before going into the idle loop here. It just suspends the entire timekeeping subsystem of the kernel here, and then we don't get any timer interrupts anymore when we are in this gray box below, so we don't need to worry about this particular thing. That patch is waiting for a review from the relevant people, who haven't been able to do that yet. But it's in pretty good shape, in my opinion.
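The two interrupt modes and the timer problem described above can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the kernel's implementation: the class and the loop function are hypothetical, the method names merely echo kernel function names, and timer ticks are modeled as plain idle exits that never count as wake-up events.

```python
TIMER = "timer"   # periodic tick; never a wake-up source in this model


class IrqCore:
    """Toy model of device interrupts in working mode and suspend mode."""

    def __init__(self):
        self.handlers = {}        # irq name -> handler callable
        self.wake = set()         # irqs marked as wake-up sources
        self.pending = []         # irqs that arrived in suspend mode
        self.suspend_mode = False
        self.wakeup = False       # set when a wake-up irq fires in suspend mode

    def request_irq(self, irq, handler):
        self.handlers[irq] = handler

    def enable_irq_wake(self, irq):
        self.wake.add(irq)

    def suspend_device_irqs(self):
        self.suspend_mode = True

    def resume_device_irqs(self):
        self.suspend_mode = False
        pending, self.pending = self.pending, []
        for irq in pending:       # replay what was deferred during suspend
            self.handlers[irq]()

    def raise_irq(self, irq):
        if not self.suspend_mode:
            self.handlers[irq]()  # working mode: a regular interrupt
            return
        self.pending.append(irq)  # suspend mode: handler is not invoked now
        if irq in self.wake:
            self.wakeup = True    # wake-up interrupt: trigger system wake-up


def suspend_to_idle_loop(core, incoming):
    """Return how many times the CPU left idle before a wake-up event."""
    idle_exits = 0
    for irq in incoming:
        idle_exits += 1           # every interrupt brings the CPU out of idle
        if irq == TIMER:
            continue              # not a wake-up event: go idle again
        core.raise_irq(irq)
        if core.wakeup:
            break
    return idle_exits
```

With the sequence `[TIMER, "usb", TIMER, "eth0"]`, where only `eth0` was marked with `enable_irq_wake()`, the loop leaves idle four times before the system actually wakes up. Removing the timer entries from the sequence, which is what suspending timekeeping amounts to in this model, reduces that to two.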
So I hope we will be able to merge this particular patch pretty soon. So that's suspend-to-idle. This is a new thing, and the benefit of it is that no platform support is needed for it to be accessible and useful, so it should be available on all platforms. It may not save as much energy as full suspend, but then again, it should just work for everybody.
Well, there are other improvements related to system suspend that I'd like to talk about briefly. First of all, we started to profile system suspend and measure times. How much time does it take to suspend a system? How much time does it take to resume it? And how much time does it take to suspend or resume individual devices? There's a special tool for that, which is called analyze_suspend.py. It's available in the kernel scripts directory, so it is included in the kernel source. You can use it to profile your own system. It produces an HTML graph, and it is kind of interactive. You can zoom in, zoom out, see how much time it takes for a particular device driver to execute a specific callback, and so on. And you get a graph like that.
Well, does anyone see a problem here, particularly on that side? User space starts running at this point, right? But it has to wait for those two things to resume, and for that one thing to resume, although all of the other devices have already resumed way earlier. And if that particular piece of user space doesn't access those two or three things, it could actually start running already here, right? Well, this is another trace. That was for full suspend; this is for suspend-to-idle. There are differences, but they are not very big. Again, we have those two things. These actually are SCSI devices, which means storage. And this is a Wi-Fi driver that waits for everything else to resume for some reason. And only then do we start user space, right?
So this is not very efficient, let's just say. And where that becomes a problem is that Android is a big user of system suspend today, and Android really wants suspend and resume to work quickly, by which I mean that it wants user space to be able to run as quickly as possible after resume. When there's a signal somewhere here, a wake-up event, we really would like user space to be running as soon as possible, because it possibly has to act on that event, right? But the problem is that we can't really make those boxes here shorter. This is how much time it takes for a device to become available, because that device may be complicated. It may be a disk, and it may just need that much time to reset itself and become operational. So we can't really help that. But what we can do is cheat a little, and this is how we can do it.
The green code flow is for devices that are handled in the traditional way. The blue one is for devices, or drivers, that are cheating somewhat. What you can see here is that instead of waiting synchronously for the device to resume, they start a separate thread or queue up a separate work item that is going to wait for the device to show up, and that allows the other things to proceed in the meantime. So this way we can defer the resume of those two things and move that line somewhere over here, or perhaps not that line alone, but that box along with that line.
At this time, the individual drivers have to do that trick themselves, and we don't have framework support for it. We should probably provide it, because the trick is useful, potentially at least, in many cases. So we should probably be providing some framework support for that; it's work in progress. So this is one thing.
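The "cheating" resume described above can be sketched with a thread standing in for the separate kernel thread or work item. This is a user-space model under stated assumptions: the class and its method names are hypothetical, and an event object stands in for the hardware finishing its reset.

```python
import threading


class SlowDevice:
    """Toy device whose hardware needs time to become operational."""

    def __init__(self):
        self.hw_ready = threading.Event()      # set when the hardware is done
        self.operational = threading.Event()   # set when the driver is usable
        self._worker = None

    def resume_sync(self):
        """Traditional resume: block the caller until the device is ready."""
        self.hw_ready.wait()
        self.operational.set()

    def resume_deferred(self):
        """Deferred resume: start the wait elsewhere and return at once."""
        self._worker = threading.Thread(target=self.resume_sync, daemon=True)
        self._worker.start()

    def wait_until_operational(self, timeout=None):
        """Called by whoever actually needs the device later on."""
        return self.operational.wait(timeout)
```

With `resume_deferred()`, the rest of the resume sequence, and eventually user space, can proceed while the device is still resetting. Only code that really needs the device has to call `wait_until_operational()`.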
Another one is that sometimes we need to resume devices during system suspend. I said that we had those two categories of frameworks, where one category works on individual components in the working state and the other category is system-wide. So suppose we have suspended a device in the working state and now system suspend triggers. Then we have a situation where the device is already suspended at this point, and we may want to suspend it again. And now the question is what to do with it, right? Do we have to handle it in any way, or can we maybe just leave it alone? The approach one year ago and before was that we would resume all devices that had previously been suspended at runtime, which is not very efficient, right? Because if those devices are not wake-up devices and they don't really need to be reconfigured during system suspend, why resume them?
So last year we started to work on that problem, and now we are checking the state. The solution was that the device prepare callbacks can be used for telling the power management core that this particular device need not be touched during system suspend if it has already been suspended. And if that happens, the core will just leave it alone, right? So the callback checks whether the state of the device is right for system suspend. If not, the device will be resumed. And if the state is correct, the device will just stay runtime-suspended, and we are not going to touch it during the whole transition here. This diagram is for suspend-to-idle, but obviously it also applies to full suspend. So if we can keep the device in the runtime-suspended state, we'll do that. That requires support from the device driver and the subsystem, and it also requires the dependencies to be satisfied. But generally the idea is that if we have suspended devices when system suspend triggers, then we won't touch them. That is another improvement, and it also is work in progress.
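The prepare-time check described above might be modeled as follows. This is a sketch only: the data class, its fields, and both functions are hypothetical, a boolean stands in for whatever the prepare callback reports to the power management core, and the dependency checks mentioned in the talk are left out.

```python
from dataclasses import dataclass


@dataclass
class ModelDevice:
    name: str
    runtime_suspended: bool = False
    needs_reconfig: bool = False   # e.g. wake-up settings must change


def prepare(dev):
    """Return True if the device can be left alone during system suspend."""
    return dev.runtime_suspended and not dev.needs_reconfig


def system_suspend(devices):
    """Return the list of (action, device name) pairs that were carried out."""
    actions = []
    for dev in devices:
        if prepare(dev):
            continue               # stays runtime-suspended, never touched
        if dev.runtime_suspended:
            dev.runtime_suspended = False
            actions.append(("runtime_resume", dev.name))
        actions.append(("suspend", dev.name))
    return actions
```

In this model a runtime-suspended device that needs no reconfiguration produces no actions at all, while one that does need it is resumed first and then suspended again, which is the older, less efficient behavior.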
Some support for that has already been added for a couple of subsystems; I'm working on PCI today. And hopefully we'll have more of that going forward. So, work in progress again.
The last thing I wanted to talk about today is the reuse of callbacks, which also is a bit of a problem. Every stage of system suspend and resume requires, well, "requires" is probably not a good word for that, but potentially requires a separate callback, or even a set of callbacks, provided by the driver, the bus type, and, if the device is in a power domain, the power domain driver too. So potentially there's a hierarchy of callbacks for each of those steps. And we have up to four suspend callbacks for system suspend for every device. Well, it's a lot. Those two are probably mutually exclusive: if you implement suspend_noirq, you probably won't implement that one. So that makes it a little simpler. In addition to that, we have the runtime power management framework, which also requires its own callbacks. So the question here is, can we use the runtime suspend callbacks for system suspend? Or can we use runtime resume callbacks for system resume? And if so, then how?
Well, this illustrates that it's not that easy, because the conditions for running this are not necessarily the same as the conditions for running that or that or that or even that. First of all, the prepare callback requires no changes to be made to the device. Obviously, runtime suspend makes changes to the device, because it's supposed to put it into a low-power state. When this one is executed, the device may still be in use, while runtime suspend callbacks have to assume that the device is not in use at that point. This one assumes that the interrupt handler of the device driver is not going to be executed. So no interrupts, right? The runtime suspend callback cannot make such an assumption.
So it has to assume that the interrupt handler can run at any time, and it has to handle that situation. Well, the remaining one that can possibly be replaced, or possibly use the same code as this one, is the suspend_late callback. But the problem here is what happens if the device is already suspended when we execute it. So it should, in principle, check whether the device has been suspended. If it has, it should do nothing, and if it has not been suspended already, it should suspend it. Well, that would involve some code duplication between drivers, because every one of them would have to check whether the device has been suspended: if not, then we suspend it, and if it has been suspended, we do nothing at this point.
So people encountered that problem and introduced wrappers around runtime PM callbacks that can be pointed to from your suspend_late callback pointer. For system suspend, the wrapper function is called pm_runtime_force_suspend(). That may not be a very good name, but this is what the patch author used, and I didn't have a better name for it. So that's what it's called. That function will just do the check I was talking about. It will check if the device has been runtime-suspended, and if that's the case, it will skip everything. If the device has not been runtime-suspended, it will execute the runtime suspend callback for it. So that's how you can reuse the runtime PM infrastructure for system suspend in device drivers.
All right, so that's all I had for today. These are two presentations I gave on related things during Linux Foundation conferences. That is some source code documentation and other resources that you can use. In particular, that is a link to the suspend-to-idle patch for handling kernel timers I was talking about before. And that is where you can get the latest version of the analyze_suspend script. Everything else is in the kernel source tree.
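The check performed by the wrapper discussed earlier, pm_runtime_force_suspend(), can be modeled in a few lines. This is a user-space sketch of the logic as described in the talk, not the kernel function itself: the class and the callback signature are hypothetical, and return value 0 is assumed to mean success.

```python
class PmDevice:
    """Toy device with a driver-provided runtime suspend callback."""

    def __init__(self, name, runtime_suspend):
        self.name = name
        self.runtime_suspend = runtime_suspend   # callable(dev) -> int
        self.runtime_suspended = False


def force_suspend(dev):
    """Model of the wrapper used as a system suspend callback."""
    if dev.runtime_suspended:
        return 0                      # already suspended: skip everything
    ret = dev.runtime_suspend(dev)    # reuse the runtime PM callback
    if ret == 0:
        dev.runtime_suspended = True
    return ret
```

A driver in this model would point its late suspend hook at `force_suspend` instead of duplicating the "is it already suspended?" check in its own code.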
So I guess this is the time for questions. No questions? Oh, yeah, there is one.
With the new hardware P-state driver in kernel 3.19, will we be getting any user space tools to monitor the P-states?
In the i915 driver, you mean? I'm not sure I understand.
In kernel 3.19, I see you made some new hardware P-state drivers.
Hardware P-states, oh, yeah. You mean the hardware P-states thing that is about cpufreq, CPU frequency scaling. And the question is whether there are going to be any user space tools available for that? So I don't think it is related to my topic today. We can talk about it later. No one else has any other questions, so... Yeah. It is not really related to system suspend. It is about CPU frequency. It's about the per-component power management part, right?
Right, that's good. Thanks.
Are there any questions about that?
I'm not really a kernel hacker, but from time to time I have found the need to debug suspend because it just hangs and does nothing. I was googling around a bit, and I was somehow surprised not to find any good instructions on how to debug it when it hangs during suspend. Can you quickly give any pointers on how to do that?
Yeah, so there are a couple of things you can do. Basically, if you go to the Documentation/power directory of the kernel source, there is a basic PM debug file, the basic_pm_debug.txt file, I think. There are some instructions in that file. We have some test modes for system suspend. You can use those modes to figure out when it hangs, basically, at what stage of the transition it hangs, and then possibly identify the device driver that is responsible for that. It is described in the file I was talking about. The next thing you can use is the PM tracing mechanism that was introduced several years ago. It is also described in one of the files in the Documentation/power directory.
Actually, I don't remember the name, but I guess I can look it up for you. I'm not sure which one of them it is. But anyway, under Documentation/power, you have some instructions on debugging, including the description of the trace. The way it works is by using the RTC to store a mark of the last device driver that was called. Then, when the system hangs after that during a transition, you can read the information from the RTC. Obviously, it will overwrite the date and time information that was previously in there, but that's how you can use the RTC memory to store it. There is some work on using a persistent store for debugging suspend and resume. I don't know how far that has gone, honestly, at this point.
What else? Well, one of the techniques you can use is, if you suspect something, like a driver, then you can try to unload the driver before suspend and then try it. Sometimes, if graphics is involved, it may work in the console and not in graphics mode, or the other way around. It's worth trying. And yeah, there are other ways. I can't recall them at this point, but there are multiple different techniques you can use. The first one I always recommend is to start with the test modes, with which you can exercise the individual steps of system suspend and see where it actually fails. Right?
More questions?
Unfortunately, we can't do any more questions.
Oh, I see.
Yeah, we've run quite a bit over time. If you do have more questions, though, you can take them outside. You can have fun discussions all day long.
Yeah, I'll be available.
So on behalf of Linux, we've got a small gift for Rafael. Thanks a lot. And another big round of applause, and enjoy your lunch. Thank you.