My name is Rafael Wysocki. I work at Intel in the system software engineering division, mostly on the mainline Linux kernel, and within it mostly on the power management support and the ACPI support code. Today I will be talking about suspend-to-idle, which is a power management feature in the Linux kernel. But before getting into details, let me give a short introduction to the topic to get everyone on the same page.
Suspend-to-idle is a variant of system suspend, which is a feature by which the system as a whole can be put into a low-power state. In that low-power state it doesn't do anything except for basic maintenance, like battery monitoring and temperature monitoring. The state is referred to as a system sleep state because it is kind of analogous to animal sleep, in which the body does not appear to do anything and whatever it does is regarded as basic maintenance.
So why is it needed? Why is it there in the Linux kernel? There are use cases for it; obviously, that's the reason. Actually, there are more use cases than just the two listed here in the slide, but these two are the canonical use cases for system suspend. The first one is the laptop-in-a-bag case: the user closes the laptop lid and puts the laptop into a bag, and then the laptop is not expected to do anything except for the basic maintenance. Another use case is when the user walks away from the computer keyboard and, after a certain time, the system may decide to suspend itself on the assumption that it is not going to be used for some time going forward, or that it only needs to do basic maintenance. As I said, there are other use cases, but the two in the slide are enough of a reason for having it in the kernel. So that's why we have system suspend.
Now, a bit of terminology.
As I said before, the system state after suspend is referred to as a sleep state. In contrast to that, the state in which the system is operating, processing data and doing everything the user needs is referred to as the working state of the system. So you have two states, the working state and a sleep state, a suspend transition, which takes the system from the working state to a sleep state, and a resume transition, which takes the system in the opposite direction, from the sleep state to the working state. The suspend transition is triggered by user space through a special control interface; I will be talking about it a bit later. The resume transition is triggered by a so-called wakeup event, when one of the devices in the system signals that it needs to wake the system up from the sleep state.
And these are the code flows for system suspend in Linux. One of them is referred to as platform-based suspend and the other is suspend-to-idle, which I will be focusing on in the rest of my talk. Platform-based suspend has been there for several years. It was introduced around 2004, was stabilized around 2009, and has not changed a lot since then. Suspend-to-idle is newer. It was introduced in 2013 and it has evolved quite a bit since its introduction; I will be talking about that.
If you look at those code flows, you will notice that there are similarities between them and there are differences. The first three steps of the suspend transition are the same in both cases, and the last three steps of the resume transition are also the same in both cases. When the kernel is asked by user space to suspend the system, it first calls notifiers in order to notify all the interested subsystems about the upcoming transition.
Then tasks are frozen, which means that every user space task gets a signal that causes it to enter a function in which it doesn't participate in any mutual exclusion and waits for a notice to leave that function. So it is looping in that function until the corresponding thawing of tasks happens during system resume. Kernel threads may or may not participate in that; it is on an opt-in basis for them.
The next step is suspending devices, which is carried out in four phases. I'm not going to go into much detail about what those phases do in particular. There are four of them, in each phase all the devices can be accessed, and the phases happen one after another in the order given in the slide. During system resume there are also four phases of resuming devices, each of which is analogous to one of the suspend phases. After resuming devices, the tasks are thawed so they can continue running, and then all the interested subsystems are notified about the completion of the resume transition.
In the platform-based suspend case, there are more actions carried out in addition to the ones I was talking about, and they are not carried out in the suspend-to-idle case. Generally speaking, after suspending devices, platform-based suspend prepares the system to call the platform firmware to complete the suspend transition. So the last step is calling the platform firmware, the platform offline step. Then the platform firmware takes over and does something to turn the power off and finalize the transition. In that case, the platform firmware also handles the wakeup signaling.
Suspend-to-idle is kind of a simplification of that, based on the observation that the last steps of platform-based suspend are not actually necessary on contemporary systems.
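The two code flows can be sketched as ordered step lists. This is only an illustrative Python model: the step names are informal labels taken from the talk, not kernel symbols, and the function names are made up for the example. It assumes that resume simply undoes the suspend steps in reverse order.

```python
# Illustrative model only: step names are informal labels, not kernel symbols.
COMMON_SUSPEND_STEPS = [
    "call suspend notifiers",
    "freeze tasks",
    "suspend devices (four phases)",
]

# Extra steps that only the platform-based variant carries out.
PLATFORM_ONLY_STEPS = [
    "take non-boot CPUs offline",
    "call platform firmware",
]

# What suspend-to-idle does instead of the platform-specific steps.
S2IDLE_ONLY_STEPS = [
    "put all CPUs into idle states and wait for a wakeup event",
]


def suspend_flow(variant):
    """Return the ordered suspend steps for 'platform' or 's2idle'."""
    if variant == "platform":
        return COMMON_SUSPEND_STEPS + PLATFORM_ONLY_STEPS
    if variant == "s2idle":
        return COMMON_SUSPEND_STEPS + S2IDLE_ONLY_STEPS
    raise ValueError(f"unknown suspend variant: {variant!r}")


def resume_flow(variant):
    """Resume mirrors suspend: the same steps are undone in reverse order."""
    return [f"undo: {step}" for step in reversed(suspend_flow(variant))]


if __name__ == "__main__":
    for variant in ("platform", "s2idle"):
        print(variant)
        for step in suspend_flow(variant):
            print("  ", step)
```

Comparing `suspend_flow("platform")` with `suspend_flow("s2idle")` makes the shared prefix visible: only the tail after device suspend differs between the two variants.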
So those last steps may be skipped and, instead, the CPUs can go into idle states and wait in those states for a wakeup event. In that case, all of the wakeup events have to be handled by the kernel, by the operating system.
As I said before, suspend-to-idle is kind of a simplification of platform-based suspend, and that's the reason why it was introduced. The slide summarizes its most important properties. First of all, it is a system suspend variant. It is expected to draw less power than runtime idle, or active idle, where runtime idle is a combination of energy efficiency features that are available in the working state of the system. So if the system is suspended, it should draw less power than with any combination of power management features in the working state, at least theoretically. Suspend-to-idle is more lightweight than platform-based suspend, which is kind of obvious because it takes fewer steps in order to suspend the system. It relies on the CPU idle infrastructure, so if the CPU idle time management infrastructure is not available, suspend-to-idle is not actually supported. And it should be available on all platforms that have the CPU idle time management infrastructure, in contrast to platform-based suspend, which requires specific support on the platform side.
Here's the control interface for system suspend. There are two main files in sysfs for that. One is /sys/power/state, which can be used by user space to trigger a transition into a sleep state. "mem" is the string to write to this file to trigger system suspend, and the result can be either platform-based suspend or suspend-to-idle, depending on what is supported by the given system. The example system from which the files in this slide were taken supported both suspend-to-idle and platform-based suspend, which is referred to as "deep" in this interface.
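A minimal sketch of reading this interface from Python. It assumes the usual sysfs convention in which /sys/power/mem_sleep lists the available suspend variants and marks the selected one with square brackets, for example "[s2idle] deep", where "s2idle" is the string the kernel uses for suspend-to-idle. The helper names are made up for illustration. Writing "mem" to /sys/power/state requires root and really suspends the machine, so the sketch only reads the files.

```python
from pathlib import Path
from typing import List, Optional, Tuple

MEM_SLEEP = Path("/sys/power/mem_sleep")
STATE = Path("/sys/power/state")


def parse_mem_sleep(text: str) -> Tuple[Optional[str], List[str]]:
    """Split mem_sleep contents into (selected variant, all variants).

    The selected variant is the token wrapped in square brackets;
    None is returned for it if no token is bracketed.
    """
    tokens = text.split()
    available = [token.strip("[]") for token in tokens]
    selected = next(
        (t[1:-1] for t in tokens if t.startswith("[") and t.endswith("]")),
        None,
    )
    return selected, available


def parse_state(text: str) -> List[str]:
    """Return the sleep state strings accepted by /sys/power/state."""
    return text.split()


if __name__ == "__main__":
    if MEM_SLEEP.exists() and STATE.exists():
        selected, available = parse_mem_sleep(MEM_SLEEP.read_text())
        print("sleep states:", parse_state(STATE.read_text()))
        print("suspend variants:", available, "selected:", selected)
    else:
        print("this system does not expose /sys/power/mem_sleep")
```

Keeping the parsing separate from the file access means the same helpers can be tried on a sample string on a machine that has no suspend support at all.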
The /sys/power/mem_sleep file can be used to choose the variant of suspend that will be triggered by writing "mem" to /sys/power/state. In the example in the slide, this is suspend-to-idle. By writing "deep" to mem_sleep, the square brackets are moved to "deep", and then platform-based suspend will be triggered by writing "mem" to /sys/power/state.
In addition to those two files, there are multiple files in sysfs for controlling wakeup. Every device that can wake up the system from the sleep state entered through system suspend has a power/wakeup file, which can be written to with one of the strings "enabled" or "disabled". If it is "enabled", the device is expected to wake up the system from the sleep state, in particular from the suspend-to-idle sleep state, and if "disabled" is written to this file for a particular device, then that device is not expected to wake the system up.
Okay, so this is how it works and how to control it, but it is instructive to have a look at how it evolved, in order to recall how we got what we currently have in the kernel, and I'm going to focus on that in the rest of my talk.
First of all, suspend-to-idle was introduced in Linux 3.9 in 2013. It was actually developed earlier, during the development cycle of that kernel, but this is the first release in which it was present. It was very similar to what we have right now in Linux, except that it was much simpler than the current implementation. I would even call it oversimplified. The idea at that time was very simple: the system suspend infrastructure, which had been introduced for platform-based suspend, was considered to be sufficient for suspend-to-idle as it was, without any modifications.
The idea was that, instead of taking non-boot CPUs offline and so on, we could just kick all the CPUs into the idle loop and then make the suspend task, the one carrying out system suspend, simply go to sleep and wait for a certain thing to happen. That thing was specific data being written into a specific location in memory. Obviously, this had to be done by somebody, and the idea was that it would be done by an interrupt handler. So when there was a wakeup interrupt, a handler would run, and that handler would write into the memory location monitored by the suspend task. The suspend task would then see the write, wake up and carry out the resume of the system.
Well, that was straightforward enough, except that it didn't work as expected. The reason why is related to how interrupts are handled during system suspend and resume in Linux. In order to explain that, I need to say that there are two levels of interrupt handlers in Linux. There are so-called low-level handlers, which take care of the actions specific to the interrupt controller in the system. You can think of it this way: for every interrupt controller there is a low-level handler, which is called every time this interrupt controller signals an interrupt, and which handles the controller itself. On top of that, there are device-level handlers, which are referred to as action handlers. Device drivers can register those action handlers, and there may be multiple action handlers per interrupt line, or IRQ. They are all invoked when the interrupt triggers on that IRQ, after the low-level handler has handled the controller.
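The two levels can be pictured with a toy Python model. It is not kernel code: the class, method and driver names are hypothetical, and it only shows the ordering described above, with the low-level handler dealing with the controller first and every action handler registered for the IRQ running afterwards.

```python
from collections import defaultdict


class InterruptController:
    """Toy model of one interrupt controller (all names are hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.action_handlers = defaultdict(list)  # IRQ number -> handlers
        self.log = []

    def register_action(self, irq, handler):
        """Device drivers register action handlers; several may share an IRQ."""
        self.action_handlers[irq].append(handler)

    def low_level_handler(self, irq):
        """Handle the controller itself first, then run the action handlers."""
        self.log.append(f"{self.name}: acknowledge IRQ {irq}")
        for handler in self.action_handlers[irq]:
            self.log.append(handler(irq))
        return self.log


def keyboard_action(irq):
    return f"keyboard driver handled IRQ {irq}"


def touchpad_action(irq):
    return f"touchpad driver handled IRQ {irq}"


if __name__ == "__main__":
    controller = InterruptController("controller0")
    controller.register_action(1, keyboard_action)
    controller.register_action(1, touchpad_action)
    for line in controller.low_level_handler(1):
        print(line)
```

In this model the original suspend-to-idle idea corresponds to one of the action handlers doing the write that the suspend task is waiting for.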
The problem is that the action handlers were expected to kick the suspend task to trigger a system resume while the system was suspended to idle, but as a rule action handlers don't run while the system is suspended. During system suspend, there is an operation referred to as IRQ suspend, which prevents action handlers from running between the "suspend late" phase of system suspend and the "resume early" phase of system resume. So the orange and gray area in the slide is where the action handlers don't run, and so they cannot kick the suspend task to carry out the system resume.
In order to address this problem, the low-level interrupt handlers had to be modified, and the modification was actually done for all of them at the same time. Two new flags were added. One of them, IRQD_WAKEUP_STATE, is used to notify the IRQ subsystem that the IRQ for which it is set is expected to wake up the system from suspend. The other flag, IRQD_WAKEUP_ARMED, is set by the IRQ subsystem during system suspend if the first flag was set earlier for the given IRQ. If the armed flag is set and the IRQ triggers while the system is suspended to idle, that is, in the gray box shown in the previous slides, the IRQ subsystem will automatically trigger the suspend task to carry out the system resume. So this is automatic. It is done in such a way that the handling of this is not noticeable in the working state of the system. It works quite well, and the action handlers don't need to worry about system resume anymore.
All right, this change was made in Linux 3.19 in 2015, along with several other improvements listed in those blue balloons in the slide.
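The arming logic for wakeup interrupts can be modelled in a few lines of Python. This is a simplified sketch, not the kernel implementation: the two boolean attributes stand in for the two flags, and the class and method names are hypothetical.

```python
class Irq:
    """One interrupt line with the two wakeup-related flags."""

    def __init__(self, number, wakeup_source=False):
        self.number = number
        self.wakeup_source = wakeup_source  # stands in for the first flag
        self.armed = False                  # stands in for the "armed" flag
        self.actions = []                   # registered action handlers


class IrqSubsystem:
    """Toy IRQ subsystem that blocks action handlers during suspend."""

    def __init__(self):
        self.irqs = {}
        self.suspended = False
        self.resume_requested = False
        self.handled = []

    def add(self, irq):
        self.irqs[irq.number] = irq

    def suspend_irqs(self):
        """Block action handlers and arm the IRQs marked as wakeup sources."""
        self.suspended = True
        for irq in self.irqs.values():
            irq.armed = irq.wakeup_source

    def resume_irqs(self):
        """Let action handlers run again and disarm everything."""
        self.suspended = False
        for irq in self.irqs.values():
            irq.armed = False

    def trigger(self, number):
        irq = self.irqs[number]
        if self.suspended:
            # Action handlers do not run here; an armed IRQ kicks the
            # suspend task instead, any other IRQ is simply not acted on.
            if irq.armed:
                self.resume_requested = True
            return
        for action in irq.actions:
            self.handled.append(action(number))
```

With an armed IRQ 9 and an ordinary IRQ 10, calling `trigger(10)` after `suspend_irqs()` leaves `resume_requested` unset, while `trigger(9)` sets it without running any action handler.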
Among them, the analyze_suspend tool was introduced, which allows people to profile system suspend and find out how much time is taken by individual devices to suspend and so on. So you can profile a system with respect to system suspend and resume. Another interesting feature added at that time is direct-complete, which allows some devices that were already suspended in the working state of the system to stay suspended over system suspend and resume. Of course, there was the wakeup interrupt handling rework that I was talking about, and suspend-to-idle started to ask for the deepest idle state of CPUs after devices had been suspended. Also, device callbacks in all of the phases are now called asynchronously, which means that in each phase they can be called in parallel with each other, as long as there are no direct dependencies between the devices in question.
Okay, so improvements were made. But that was not all yet, because there are timers, and some of the timers are periodic. The way a periodic timer usually works is that the timer function programs the timer itself to trigger again after a while, and so on. The problem is that handlers for timer interrupts are actually allowed to run during the late suspend and early resume of the system, or the noirq phases, and also while the system is suspended to idle. If there was a periodic timer, it would cause at least one CPU in the system to wake up and see what was going on. Obviously, this was a timer, so it was not a wakeup interrupt for the whole system, and so the CPU would go idle again, and this was going on in a loop while the system was suspended to idle.
Obviously, it prevented the CPU in question, and with it the system, from saving energy, or at least from saving as much energy as it could have saved. So that needed to be dealt with, and the way to deal with it was to add the ability to suspend timekeeping to the suspend-to-idle code. That is kind of straightforward, because the suspend-to-idle code knows which CPU is going into an idle state as the last one. So if this is the last CPU going idle during suspend-to-idle, it will trigger the suspend of timekeeping, and the suspend of timekeeping basically prevents any timer from being used going forward. So it is sufficient to suspend timekeeping in order to get rid of the timer problem entirely. If the CPU going idle at the moment is not the last one, it only has to stop the scheduler tick on itself. During wakeup, the first CPU waking up will resume timekeeping, and the next CPUs going out of idle will start the local scheduler tick on themselves. The rest of the processing is as it was before. So the suspend of timekeeping allows us to avoid the timer issue.
That was in Linux 4.4 in 2016. In addition to that, in the same kernel, there was a new framework for managing IRQs dedicated to system wakeup. On some systems there are IRQs that are dedicated to waking up the system from sleep states and are not used for anything else, and the kernel got a new framework for managing those IRQs.
Okay, and this is the time when it looked like everything was done and the feature was complete. It was sort of like in human development, when the kid goes to university and he or she thinks that they know everything and can now manage on their own.
But it turns out that there are challenges, they need to be faced, and they are part of adult life. That happened to suspend-to-idle as well.
In 2017, two changes were made in Linux 4.10 which improved the feature. First of all, system suspend in general started to support device links. Device links are a feature that allows device dependencies to be taken into account during suspend and resume more precisely than was possible before. That allowed developers to fix many problems related to device-to-device interactions during suspend and resume of the system. Also, the control interface I was talking about was changed to get its final shape, as it is today. So the interface that I was describing was actually introduced at that time, in Linux 4.10.
And then it turned out that there were problems. They were related to the way system wakeup events were handled by embedded controllers in some PC laptops. The embedded controller is a small processor located in the system, which executes a program to control system features like the battery, fans, thermal monitoring and similar things. Usually, the embedded controller doesn't have any business signaling anything to the main CPU during system sleep; it should just take care of the basic maintenance by itself, without interrupting the main processor. However, in some systems the embedded controller is also responsible for handling system wakeup events, such as a power button press. If the power button is pressed, the signal from it goes to the embedded controller, and then the embedded controller is responsible for signaling the main CPU about that through the special IRQ associated with ACPI, which is referred to as the ACPI SCI, or System Control Interrupt.
The SCI is a multiplexed interrupt, which may trigger for many different reasons, like wakeup signaling from PCI devices, wakeup signaling from non-PCI devices and so on. The EC GPE signal shown in the slide is a signal line from the embedded controller to the ACPI SCI multiplexer that causes the SCI to be triggered. If the embedded controller is not responsible for handling system wakeup events, that EC GPE line can just be disabled while the system is sleeping or suspended to idle, and so the embedded controller will never trigger the SCI in those states. Unfortunately, if system wakeup events are to be handled by the embedded controller, that cannot be done. This means that all of the activities of the embedded controller will now cause the SCI to trigger during system sleep, and that will interrupt the main CPU. It is then necessary to distinguish wakeup events coming from the embedded controller from non-wakeup events coming from it, like a battery status change or some thermal events that it may want to notify the CPU about. And that is not easy.
There was an attempt in Linux 4.15 to handle that. Obviously, in order to distinguish between wakeup and non-wakeup events from the embedded controller, the embedded controller needs to be looked at. The events need to be processed, and only then does it turn out whether or not they are wakeup events. The attempt in Linux 4.15 was to do that after partially resuming devices, so between the "noirq resume" phase and the "resume early" phase during system resume. The idea was to do it after allowing the action handlers for interrupt lines to run again.
That way, if an interrupt triggered during the EC processing, it would be handled as usual and there would be no problems with the processing of events. Unfortunately, it turned out that this didn't work, because device drivers were confused by the additional loop between the "noirq" resume and "noirq" suspend of devices, and some of them didn't work correctly because of that. So that needed to be reverted, or rather changed, so that the whole flow looks like the initial one. There is no additional loop in this flow. All of the processing of wakeup events happens in the gray box at the bottom. But today it doesn't only include wakeup interrupt processing, but also things like the processing of embedded controller events. If there is a wakeup interrupt and it turns out to be the ACPI SCI, then the SCI interrupt handler is run inline, in the context of the suspend task, which is handling all of the suspend and resume of the system. That SCI interrupt handling covers the embedded controller processing. So if an SCI triggers, the embedded controller is looked at, and the events coming from it are processed and handled as appropriate, depending on whether or not they are system wakeup events. Most of them are not, but if there is a system wakeup event coming from the EC, it will trigger the suspend task to resume the system.
Okay, so this is the whole story, basically, and in my opinion it illustrates very nicely the virtuous cycle of kernel development. If you change the code in order to implement a feature or improve it, people will start to use it, experiment with it and test it. Then there will be problems. The problems will be reported, and obviously they need to be investigated.
So tools are developed and diagnostic features are added, which allows developers to understand the issues better. That in turn allows them to write better code and make code changes, and the cycle closes this way. This is what happened in this case, actually a few times in a row. Some changes caused people to try the new code, they found out that it didn't work as expected and reported problems, that caused the problems to be analyzed and understood, and new code was written. It happened at least three or four times in the development of suspend-to-idle. So in my opinion it is a good illustration of how kernel development at large works.
All right, so that is basically all I had to say, but let me mention two things. First of all, I have references to earlier presentations related to power management, and to system suspend and resume in particular, given mostly during Linux Foundation conferences. In the references part of my slide deck there are a few of them, and the links should be active, so you can use them. The links are mostly to my own presentations; one of them is to a presentation from Len Brown. There are more details in those presentations, so you are welcome to look at them and see what they are about.
Also, I have to show this slide. That's the disclaimer required by Intel.
All right, so that's it. If you have any questions, comments or concerns, please let me know.