We'll go ahead and get started. Hi, I'm Russ Dill. I'm doing a talk on extending the swsusp hibernation framework to ARM. A little background: I've worked on kernel patches and kernel work for about 10 or 12 years, mostly as a hobby, but for the past year and a half I've been working for Texas Instruments, mainly on Sitara. It's the processor in the BeagleBone. The core of the talk is getting the swsusp hibernation framework ported to ARM, but I'm also going to talk about restoring a swsusp image from U-Boot. I've got code and an eLinux wiki page at the links below.

The clearest motivation for this is that you get a zero-power-consumption sleep. You can't really get any better than that, but there are also some additional benefits from doing the work. There's this cool PM state called off plus DDR self-refresh, where you turn everything on the board off except the DDR. You leave the DDR in self-refresh, start up with U-Boot, jump to a magic address within your kernel, and restore from there. It's a little more complicated than regular suspend/resume because you don't have all the things in your CPU that you normally have saved. The other neat thing you can do is snapshot boot, which is where you make a hibernation image and then just keep reloading it every boot rather than doing an actual boot.

So, a little background on swsusp. It's been the mainline hibernation implementation since about 2.6. There's another implementation out there called TuxOnIce, or suspend2, that supposedly will get merged one day, but not yet. swsusp uses the swap device in the system to store the image. It's got this requirement that 50% of your system RAM has to be free, which kind of sucks. But at least you have a swap device, so you can swap pages out while you're trying to make your image.
There's also this cool tool called uswsusp, which is a user-space component. The kernel makes its image and passes it off to this user-space component, and then the user-space component can encrypt it, store it on the network, show graphical progress, a whole bunch of stuff. So that's a pretty cool advantage of using swsusp.

For anyone that's not familiar, this is kind of a convoluted diagram of basic suspend/resume without hibernation. The first step is to suspend all the devices, and then there's this neat function called cpu_suspend, which isn't really linear. It saves off all the CPU state, does a dance with the stack, saves off the stack, and then calls your architecture-specific function that turns your hardware off to the best degree that it can. When you hit resume, there's nothing stored in the CPU anymore, so your bootloader or ROM usually starts and calls the physical address of the cpu_resume function. Sorry? Oh, go ahead. It pulls a stack pointer from this special sleep_save_sp location, and when it does that and does a return, it's like it's returning directly from cpu_suspend. So it looks like a straight code path, but it isn't. The cpu_resume function restores all the state to do with the CPU, like the MMU and things like that. Once that's done, you can just resume your devices and be done.

swsusp reuses that same functionality, but in an even more convoluted way. Instead of calling some architecture-specific function that does the actual power management, it calls this swsusp_save function that snapshots all the pages in the system, which is why you need 50% of your memory free. Once that's done, it calls into cpu_resume, which is kind of weird because you've just snapshotted all of your pages. That returns from cpu_suspend, and it checks this in_suspend value so it knows whether to go to thaw or restore.
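As a rough user-space analogy for the "returns twice" behavior described above (this is not the kernel code), setjmp can stand in for cpu_suspend saving state and longjmp for cpu_resume jumping back through it. The names pick_resume_path, PATH_THAW, and PATH_RESTORE are made up for this sketch, and the snapshot_taken_this_boot argument is an assumption standing in for whether the snapshot step ran in the current boot.

```c
#include <setjmp.h>

enum resume_path { PATH_THAW, PATH_RESTORE };

static jmp_buf saved_cpu_state;  /* stands in for the saved CPU context */
static int in_suspend;           /* stands in for the flag kept out of the image */

/*
 * Analogy only: the code after setjmp is reached twice. The first time the
 * state has just been saved; the second time we arrive via the "resume" jump.
 */
enum resume_path pick_resume_path(int snapshot_taken_this_boot)
{
    in_suspend = 0;

    if (setjmp(saved_cpu_state) == 0) {
        /* First return: "CPU state" saved. Only the snapshotting boot sets the flag. */
        if (snapshot_taken_this_boot)
            in_suspend = 1;
        /* Stand-in for calling cpu_resume: jump back through the saved state. */
        longjmp(saved_cpu_state, 1);
    }

    /* Second return: the flag tells the two cases apart. */
    return in_suspend ? PATH_THAW : PATH_RESTORE;
}
```

Calling pick_resume_path(1) models the kernel that just made the image and goes on to thaw, while pick_resume_path(0) models the freshly booted kernel whose flag was never set and so goes on to restore.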
Hibernation renames suspend and resume to all these other things, because it's got a lot more going on. Freeze is kind of like suspend, but it's not turning hardware off. Thaw is the opposite of freeze. So before you make your image, you want to make sure your hardware is not doing anything. Then, after you make your image in memory, you thaw your hardware out so you can write out your image, and then power off your system.

When you boot up the kernel that reads in the hibernation image, it finds the image, reads it into memory, and then puts the hardware in the exact same state it was in when the image was made. Then it calls a special swsusp resume function that takes the pages it loaded into memory and puts them in the right place. Then it does the same thing with cpu_resume, and it looks exactly the same as when it was making the image, except the in_suspend variable is zero. That's because it's saved in a special area of memory that it instructs the snapshotting not to save, so it never gets set to one. Then it calls restore, which is a little different from thaw because it's able to bring the hardware back up fully, and it's done.

So a generic ARM hibernation implementation actually turns out to be the fun and easy part. The generic implementation gives you these function pointers to implement. The arch suspend function is responsible for saving the CPU state, snapshotting the memory, and then just returning control back. The resume function kind of does the opposite: it takes the pages that were loaded into memory and relocates them to their original locations, restores the CPU state, and then has to return control back to the suspend caller, so that the original kernel calls into suspend and then returns from suspend. There are a few other functions that do some miscellaneous things, like telling the kernel which pages not to snapshot. The ARM work is forward-ported from original work by Frank Hofmann.
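The "which pages not to snapshot" check mentioned above boils down to a page-frame range test. Here is a minimal stand-alone sketch of that idea. The struct, the function name pfn_in_nosave, and the 4 KiB page size are assumptions for illustration, not the kernel's actual definitions.

```c
#define NOSAVE_PAGE_SHIFT 12UL  /* assumption: 4 KiB pages */

/* Hypothetical physical bounds of the region that must not be snapshotted. */
struct nosave_region {
    unsigned long begin;  /* first byte of the region */
    unsigned long end;    /* one past the last byte of the region */
};

/* Returns 1 if the page frame number falls inside the nosave region. */
int pfn_in_nosave(const struct nosave_region *r, unsigned long pfn)
{
    unsigned long page_size = 1UL << NOSAVE_PAGE_SHIFT;
    unsigned long first_pfn = r->begin >> NOSAVE_PAGE_SHIFT;
    /* Round the end up so a partially covered last page still counts. */
    unsigned long end_pfn = (r->end + page_size - 1) >> NOSAVE_PAGE_SHIFT;

    return pfn >= first_pfn && pfn < end_pfn;
}
```

The snapshot code would skip any page for which this returns 1, which is how a variable such as in_suspend, or a stack placed in that region, stays out of the image.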
So with this cpu_suspend and cpu_resume functionality, implementing this actually becomes pretty easy, because we can use cpu_suspend to save the CPU state exactly. Once we're done with that, we're running beyond the saved CPU state, so we'll return from cpu_suspend when we resume. The first call back into cpu_resume happens here. Sorry. That goes back and snapshots the system. So there are two returns from arch suspend.

Arch resume does the opposite. This one's a little convoluted because we're having to copy memory over what we're running. So we put our stack in the area that we told the kernel not to snapshot, so it doesn't get overwritten. The actual restore-image function does get overwritten by itself, which turns out not to be a problem, because hibernation has a requirement that your resume kernel be identical to your snapshotting kernel. So it relocates all the pages that need to be relocated, and again uses soft_restart and cpu_resume to return the CPU to where it was.

And this is where it gets not very fun: the OMAP PM. There are a lot of complications here. There's this module called the power, reset, and clock module that controls everything, and it gets put in a domain called wakeup that never gets turned off. So the kernel has all this great support for suspending, resuming, and restoring all the other peripherals, but not for the wakeup domain. The wakeup domain gets used for just about everything that you wouldn't want to lose: your pinmux, your clocks, stuff that shouldn't stop when you're suspended, things that need to wake you up. So after hibernation we have to restore that state, and usually restore it quickly, because things like OMAP DM timer 0 drive schedule_timeout, and if you call a delay it never returns. That doesn't get you very far.
These slides are just some basic information on how it's done on OMAP for the individual portions of the power, reset, and clock manager. There's some power domain save and restore. Each IP block on an OMAP processor is a hardware module, so there's a save and restore for those. Clock domains, too. And then we do something cool. There's the whole clock tree that manages clocks on the system. We leverage that to restore all the clock data, which is a lot easier than the other areas of the PRCM. We just add a save context and restore context generally, and then individual clocks can add their own save context and restore context. Since it's a tree, we can just call in and have it save the context when we go down and restore all of it when we come up, which is a nice advantage of the clock tree and the clock framework.

I'm guessing a lot of people are somewhat familiar with pin control. It basically controls which signals in the chip go to which pins on the chip, so again, you won't get very far without it. On Sitara there's this erratum that complicates things: some of the registers are in one power domain and some are in another. So when you come up out of suspend you lose one domain, and when you come up out of hibernate you lose both. The pin control subsystem therefore needs to know which registers are in which domain. I don't know what a final solution would look like, but I just used the pinconf function as a way to store which ones are in which domain. That's not normally what a pinconf function is used for, so there's some work that needs to get done there, but I was able to list which pins are in the wakeup domain and which pins are in the peripheral domain. Pin control also needed a way to save the context of all the pin control registers and restore it. So that's code that needs to be added in as well. It's done per pin control function group, so individual function groups can be saved and restored.
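The clock tree idea described above, one call at the root that saves or restores every clock, can be sketched as a simple recursive walk. This is an illustrative stand-alone model: struct clk_node and both function names are hypothetical, and a plain integer stands in for each clock's hardware register.

```c
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical clock node: one "register" plus a saved copy of it. */
struct clk_node {
    unsigned int reg;    /* stands in for the live hardware register */
    unsigned int saved;  /* context copy kept in memory */
    struct clk_node *children[MAX_CHILDREN];
    size_t nchildren;
};

/* Walk down the tree, copying every clock's register into its saved slot. */
void clk_tree_save_context(struct clk_node *clk)
{
    clk->saved = clk->reg;
    for (size_t i = 0; i < clk->nchildren; i++)
        clk_tree_save_context(clk->children[i]);
}

/* Restore parents before children, so a child's parent is already set up. */
void clk_tree_restore_context(struct clk_node *clk)
{
    clk->reg = clk->saved;
    for (size_t i = 0; i < clk->nchildren; i++)
        clk_tree_restore_context(clk->children[i]);
}
```

In a real clock framework each node would call its own save and restore callbacks instead of copying one integer, but the traversal is the part that makes the tree convenient.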
And of course the current solution is a bit of a hack, likely not upstreamable, especially since it changes some DT bindings around a little bit. So there really needs to be a new type of pin control register grouping. It would need to contain a reference to which power domain it's in, so that when a power domain loses context and comes back, it knows to restore those registers. But of course the problem on OMAP right now is that power domains are architecture-specific, so that information isn't really exported to pin control.

So you can say, okay, we don't know whether we lost it or not during power off, so we just assume that we lost everything and restore it. But the problem could be in hardware: if you restore something that is already restored, then maybe you don't restore it at the correct time. So this is the problem.

In this case, the pin control register list is pretty long, and the PER domain turning off and turning back on can happen pretty regularly. So restoring all those registers every single time would add latency to that operation and make it less desirable. And there's a whole bunch of other registers and hardware blocks and things like that that you'd also end up saving and restoring every time.

There is a chain of these people that the hardware will follow and restore some part of the other context like USB and so on. So this is done automatically, even if you don't want it.

Right, for those types of peripherals it's not a problem, but on this particular processor the wakeup domain is something where the hardware design and the kernel design assume that these registers will always be there, that they'll never be lost. So there's really no hardware method, or currently any kernel method, for restoring those registers. So USB, peripherals, not a problem. They all already have context save and context restore. They know when they lose power, they know when they reset, that sort of thing. Clock events is one of those that's already handled.
Clock source would be handled, but it's assumed to be always running, in a domain that doesn't lose power. It's used for all the kernel delays, so it has to be restored especially early. So there's a bit of a hacked-up function there to do that, to call in directly and restore it early, before we would normally restore it, so that we don't get stuck in delays. SRAM is another one of those that doesn't usually lose power, so it needs specialized functionality to restore it.

Other devices in the system that already have save and restore context functionality just need to know that they lost context. So I've added something into the power domains here that, when there's a hibernation, goes through each power domain and, if it's not already off, tells it that it's been turned off, so that when drivers call into that, they can see that they've lost context.

This is a major problem on ARM right now: the hibernation callbacks are different from the suspend/resume callbacks. If you use dev_pm_ops in your driver and you only fill in suspend and resume, it'll assume that you don't want anything for hibernation, and that won't work. So there's a special macro called SET_SYSTEM_SLEEP_PM_OPS that'll set them all to your suspend and resume functions. This is an example with the OMAP MMC driver, which needs individual functions for hibernation and suspend/resume.

Then there's also hardware that really doesn't expect you to do what you're doing. The watchdog is one of those. If the kernel that's restoring your image had the watchdog running, then the kernel that starts running needs to be able to handle that properly. So the restore function for hibernation, which is called after we restore our image, can do something special. In this case, we just double-ping the watchdog to make sure that we don't wait around until the next ping time is scheduled. This is just the collection of all the save and restore context calls.
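To make the callback problem above concrete, here is a stand-alone mock-up, not kernel code. The struct is a cut-down stand-in for dev_pm_ops with only the six system sleep callbacks, and SKETCH_SET_SYSTEM_SLEEP_PM_OPS imitates what the kernel's SET_SYSTEM_SLEEP_PM_OPS macro does: point the hibernation callbacks at the same two functions as suspend and resume. The demo_ names are made up.

```c
#include <stddef.h>

struct device;  /* opaque in this sketch */

/* Cut-down stand-in for the kernel's struct dev_pm_ops. */
struct dev_pm_ops_sketch {
    int (*suspend)(struct device *dev);
    int (*resume)(struct device *dev);
    int (*freeze)(struct device *dev);
    int (*thaw)(struct device *dev);
    int (*poweroff)(struct device *dev);
    int (*restore)(struct device *dev);
};

/* Imitation of the macro: reuse one pair of functions for every callback. */
#define SKETCH_SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
    .suspend = suspend_fn, .resume = resume_fn,               \
    .freeze = suspend_fn, .thaw = resume_fn,                  \
    .poweroff = suspend_fn, .restore = resume_fn,

static int demo_suspend(struct device *dev) { (void)dev; return 0; }
static int demo_resume(struct device *dev) { (void)dev; return 0; }

/* Only suspend/resume filled in: the hibernation callbacks stay NULL. */
const struct dev_pm_ops_sketch suspend_only_ops = {
    .suspend = demo_suspend,
    .resume = demo_resume,
};

/* Using the macro: hibernation gets the same handlers. */
const struct dev_pm_ops_sketch all_sleep_ops = {
    SKETCH_SET_SYSTEM_SLEEP_PM_OPS(demo_suspend, demo_resume)
};
```

With suspend_only_ops the freeze, thaw, poweroff, and restore pointers are NULL, which is the "it assumes you don't want anything for hibernation" case. A driver that needs different behavior for hibernation would fill those four in by hand instead of using the macro.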
Right now it's pretty ugly because it's calling them all individually. It would certainly be an improvement to have these in the suspend/resume core, but right now in the suspend/resume core there's no way to communicate which devices need to have their context saved and which devices just need to be powered off or put into a low-power state, or, on resume, which devices have lost their state and need to be restored. There are actually some drivers that will go in and read the registers and ask, does this register value look like something I recognize, or is it the reset value? That doesn't work very well with hibernation, because the kernel that reads in your image will program that register to a value you recognize.

So the architecture-specific hibernation code is pretty basic. Then there are the individual callbacks for the pre-snapshot, for the leaving, and for the finishing.

Debugging methods. This was the most fun part of the project. Whenever you hibernate or do anything with suspend/resume and something doesn't work, you're often presented with something that is completely dead, the memory is scrambled, and there's not much you can see. Especially on ARM processors, suspend/resume turns off a lot of clocks, a lot of power domains, and things like that. So stuff you'd normally use to debug with is disabled, and if you try to write to it, it'll cause an abort. GPIOs end up being pretty handy, because there's usually one that is always on that you can use. You can just binary search by turning that GPIO on and off in your code to see whether you made it to this point or that point. Of course, printk is handy if you have serial up and running, but a lot of times that's something that has to be turned off. JTAG with OpenOCD and GDB is really great when it works, and it can retrieve your kernel log buffer. So if your serial port isn't on, you can still see what your oops was when you died.
Strangely, while (1) is a really useful debugging method. A lot of times the things you do will cause a reset. Yeah, go ahead.

Is there any strategy in order to have this kind of clock that are instead the first to be enabled in order to have ATM working in order to do something at least? I imagine the GPIO, we have all the proper debugging. I mean, for me, the debugging is the while (1). I'm sure it's while (1).

Yeah, a lot of the things you do end up resetting the system or going into an abort handler that isn't properly configured, because it was the one for the other kernel, or the one for U-Boot, in overwritten memory. So while (1) actually ends up being really useful, because you can say... Yeah, I've already used this at times.

The other thing that's important is register map comparisons, because a lot of things can get swept under the rug: the kernel that reads in your image can program all your registers properly, and you can miss restoring certain registers because they were already what you expected them to be, or near what you expected them to be. So using something that can mmap your registers into user space, read in the values, and store them in a database, so you can look at them later under different conditions, is really helpful for tracking down buggy restore code.

Let's see, I have a quick demo if I can get out of this presentation mode. I'm going to do this the hard way. It won't even let me quit. Okay, control plus is not working. Okay, so this is the kernel that's being hibernated. It goes through all the freeze steps, the snapshotting, the creating of the pages and all that, and then powers the system off. Then you get a restore kernel that starts up whose only job is to put that stuff back into memory and call resume. It goes through the steps of putting the hardware into a state where it's sane and won't screw stuff up. Loads the pages in.
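The register map comparison described above comes down to diffing two dumps of the same register block taken under different conditions, for example after a cold boot and after a hibernation restore. Here is a minimal sketch of that comparison step only. The function name regmap_diff and the idea of passing the block's base address just for printing are assumptions of this example, and how the dumps are captured is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Compare two dumps of the same register block and print every register
 * whose value differs. Returns the number of differing registers.
 * 'base' is only used to print a readable address for each register.
 */
size_t regmap_diff(const uint32_t *before, const uint32_t *after,
                   size_t nregs, unsigned long base)
{
    size_t ndiff = 0;

    for (size_t i = 0; i < nregs; i++) {
        if (before[i] != after[i]) {
            printf("0x%08lx: 0x%08lx -> 0x%08lx\n",
                   base + (unsigned long)(i * sizeof(uint32_t)),
                   (unsigned long)before[i], (unsigned long)after[i]);
            ndiff++;
        }
    }
    return ndiff;
}
```

A register that differs between the two dumps is a candidate for something the restore path forgot, which is exactly the kind of bug that stays hidden when the restoring kernel happens to program a plausible value.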
And then promptly crashes because it hates me. It knows I'm doing a demo. This is a BeagleBone. Okay.

Okay, now for the other fun part: restoring from U-Boot. The restore kernel that reads in the image is really just copying pages into memory and jumping to an address, and U-Boot is really good at that. So we can get U-Boot to do the same thing. And because U-Boot doesn't actually have to boot up a full kernel, it can be faster to use U-Boot to do that. But there are some problems. U-Boot has no idea what address to jump to. And the nosave pages aren't supposed to be overwritten, but U-Boot has no idea what's in those pages, because when you're using a kernel to load the hibernation image, that's stored in that kernel.

So there are a few little changes that have to be made. I've added an architecture-specific function here that goes through and saves off the start and end of the nosave pages so it can make a copy of them. So for the pages that aren't supposed to be saved, the kernel stores two copies: one to write to and one to overwrite those with. Then, instead of calling cpu_resume, we call cpu_resume_restore_nosave, which copies those nosave pages back over again. A lot of saving for nosave. And then we can just store the address of that cpu_resume_restore_nosave in the hibernation image, so when you look at the hibernation image, you know where to jump to.

This is what a hibernation image actually looks like. It looks a little bit complicated, but it's actually pretty simple. The main area down here is the hibernation image. The area up here just provides metadata and points to where the hibernation image is stored on disk. The head of the hibernation image tells you where the pages in the image go in memory, as page frame numbers. For each data page, it says this page goes at this page frame number. So when you're restoring, you just go through each data page, copy it to the right place, and you're done, supposedly.
But of course, you're running, so you can't copy pages over what you're running, especially in U-Boot, which doesn't look anything like what you're running. I'll go back here really quick. There's a cool operation we can do with U-Boot. A hibernation image has a signature that says I'm a swap file or I'm a hibernation image. When you restore a hibernation image, you overwrite the signature to say, again, I'm a swap device. In U-Boot, we can choose not to do that, so every time you boot U-Boot, it'll go and boot the same hibernation image. That's snapshot boot.

The basic functionality is to read in all the page frame mappings. We have a bitmap that says which pages in memory are used and which aren't. So when we're copying pages, first, we know whether we can copy a page to its destination, and second, if we can't, we know where there's a free page we can put it in. Fortunately, U-Boot's memory map on ARM is really simple. At the very base of memory there are the CPU vectors, and at the very top is all its memory, with the stack growing down. In this case, the zero address is at the top, so it looks like it's growing up. So it's really easy to keep track in U-Boot of which pages are free to use.

This is the section of code that reads in the mappings and marks them as used. And this is the piece of code that actually reads those pages into memory. For pages that can't be copied into place because their destination is currently in use, it has a pair of remapping pages where it keeps a tally of where each page is and where it should be. Then we allocate a single page for some code and keep track of where those remappings are.

There's a really bad error in this code, and I haven't had a chance to update the slide. I guess it's kind of a lot to read, but the stack address there, I put it way beyond that page, which is really bad. But it usually actually worked, which is also really bad.
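The load-then-remap scheme described above can be modeled in a few lines. This is a toy sketch under stated assumptions, not the U-Boot code from the slides: memory is a small array of tiny pages, one byte per page stands in for the used bitmap, and all names (restore_ctx, mark_image_page, load_page, finish_copy) are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 16  /* tiny pages keep the model easy to inspect */
#define SKETCH_NPAGES    8

struct remap { size_t stash_pfn; size_t dest_pfn; };

struct restore_ctx {
    unsigned char mem[SKETCH_NPAGES][SKETCH_PAGE_SIZE]; /* stand-in for RAM */
    unsigned char loader_used[SKETCH_NPAGES]; /* pages the loader occupies */
    unsigned char image_used[SKETCH_NPAGES];  /* pages the image will occupy */
    struct remap remaps[SKETCH_NPAGES];       /* where a page is vs. should be */
    size_t nremaps;
};

/* First pass: record every page frame the image will eventually occupy. */
void mark_image_page(struct restore_ctx *c, size_t pfn)
{
    c->image_used[pfn] = 1;
}

/* A page is free only if neither the loader nor the image needs it. */
static int find_free_page(const struct restore_ctx *c)
{
    for (size_t i = 0; i < SKETCH_NPAGES; i++)
        if (!c->loader_used[i] && !c->image_used[i])
            return (int)i;
    return -1;
}

/* Second pass: copy straight to the destination, or stash and remember. */
int load_page(struct restore_ctx *c, size_t dest_pfn, const unsigned char *data)
{
    int stash;

    if (!c->loader_used[dest_pfn]) {
        memcpy(c->mem[dest_pfn], data, SKETCH_PAGE_SIZE);
        return 0;
    }
    stash = find_free_page(c);
    if (stash < 0)
        return -1;
    memcpy(c->mem[stash], data, SKETCH_PAGE_SIZE);
    c->loader_used[stash] = 1;  /* reserve it so it is not handed out again */
    c->remaps[c->nremaps].stash_pfn = (size_t)stash;
    c->remaps[c->nremaps].dest_pfn = dest_pfn;
    c->nremaps++;
    return 0;
}

/* Final step: move stashed pages into place, overwriting the loader. */
void finish_copy(struct restore_ctx *c)
{
    for (size_t i = 0; i < c->nremaps; i++)
        memcpy(c->mem[c->remaps[i].dest_pfn],
               c->mem[c->remaps[i].stash_pfn], SKETCH_PAGE_SIZE);
}
```

In the real thing, finish_copy has to run from a page that is itself free in both senses and with its own stack, since it overwrites the memory the loader was running from. The toy model ignores that part.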
So the very last thing we do in U-Boot on the restore is call this finish-copy function with its own stack, which is in the wrong place in this example. And this is the finish function. It just goes through those remap pages and remap indexes and copies the page data from place to place. It won't copy over itself in this case, because we determined that this page is free both in terms of U-Boot and in terms of the kernel we're restoring. At the end, we just call the cpu_resume function, and it's exactly as if we were a kernel restoring a hibernation image, except we're U-Boot. I was going to show a demo with... Oh, that's not it. Maybe I'll save that for the end.

So, open issues. Saving and restoring context on devices that are part of the SoC in power domains is kind of complicated, and it goes along with a reset framework too. It's made even worse by things like, well, do you want to support people who open up one of your devices and use /dev/mem to read and write it? Because from a driver's point of view you're going to have no clue that they've done that, so when you hibernate and come back up, all those registers will be lost. Yeah, don't do that. You're on your own. Then there are registers that are initialized at boot but never read by the driver, or that have no driver at all, so sometimes when you restore, it's not the same value. And pin control needs to be properly integrated.

Some things I learned that are really important: set up your debug framework early. Don't try to limp along on guessing. As much as possible, have one step to deploy whatever it is you're building and testing. Even though you're going through really quick cycles, the fewer steps there are between hitting make and having it run on your device, the fewer mistakes you can make, so you don't find out five or six iterations of moving things around later that, oh crap, I didn't even do this right. And it goes quicker if you have one step to deploy.
And keep notes on what worked and what didn't. It sucks having to go back and say, what did I change? Git's really good for that. You can keep all your little teeny changes in source control, even though they're not really important changes, and blow them away later.

Let's see. I'll really quickly try to do the demo. There we go. It will hopefully actually work and not have my while (1) stuck somewhere in there. This will be the U-Boot restore, not the kernel restore. So there we are in U-Boot. We've got a command to restore a hibernation image. Actually, it's set up so it's a no-op if there's no hibernation image, so you can just put it in your U-Boot script. Let me tell it we want the MMC card, the first one, the third partition. Unfortunately, U-Boot's a little bit slow right now at loading pages from MMC. It's not really optimized for loading 20 or 30 megabytes, so there's some work to be done there. And my while (1) is still in there. Yeah.

But I think OpenOCD and JTAG are really underappreciated. It's actually not that hard to run. This is OpenOCD. As long as you have a configuration file for your board and for your JTAG adapter, it's really easy to run. The BeagleBone White is nice because it has a JTAG adapter right on the board. And then with GDB... I don't know why you can't see this. It's as simple as saying you have a file like vmlinux and those are your symbols, and you have a target that's remote and a port number, and you're good to go. You can look at your code, look at your variables, set breakpoints, all those sorts of things. JTAG is usually kind of intimidating, especially doing JTAG with a kernel, but it really ends up being not that different from running GDB on a user process.

My last slide was questions, but I'm not going to try to get that going again. Okay. So, questions. Yeah. Right. So on ARM, there are a lot of frameworks that really don't lend themselves to hibernation.
The clock tree is something that really does lend itself well to hibernation. But a lot of very architecture-specific things, like OMAP power domains, don't lend themselves very well to hibernation. Each processor and SoC probably has its own idea of what a power domain is and things like that. So stuff like that really needs to be made more generic, so it's easier to integrate into an actual hibernation framework and doesn't have to be re-implemented every time for every SoC. Especially with power domains, once that's made generic, it's a lot easier for drivers to tie into it to see whether they've lost context rather than trying to guess. So basically everything from the OMAP PM slides on was specific to this SoC, the types of things that have to be re-implemented again and again until they're in generic frameworks.

Which to spend? Oh, okay, yeah. Yeah, from what I understand the 50% limitation is there, but it's very rarely hit, because it only has to relocate pages that it's not marking as read-only before it starts the snapshotting process.

Okay, I see. Yeah, the code in here that implements the hibernation restore could actually be changed so that it copies code into the nosave region. As long as it's running code from the nosave region, it's not going to get overwritten, and technically you could resume a kernel that is not the snapshot kernel. The kernel versions can differ. Yeah, that's the safe pages.

As far as I remember, there's just one variable in the nosave region and that's the variable about this seems to be the only user left or the only user we've ever had. Well, like the... It makes it easier if I want to show it some other way to get rid of this.

Well, the nosave region is something that's actually in the mainline generic hibernation framework. But it's not really used, as far as I know. So what are they doing for the stack, for instance? I don't know.
You don't need it on the boot kernel, right? You might get away with all zero something like that. Well, in this case, the boot kernel is the one that's using that nosave region stack. So its function that relocates those pages actually has a stack. Now, if that were written in assembly, you could do it without a stack. Take a look at the users. Instead of implementing the nosave stuff within U-Boot and... Yes. It might be easier and better to just bring up the kernel and do it some other way. Right. Because it might have been a mistake in U-Boot. At the time that I... Oh, go ahead.

Right, rebooting can be really fast. It really is for specialized types of applications. For systems that maybe have a long-running data collection process, the snapshot boot is a really good application. I think the fact that it's really for niche applications is a reason you don't see as much push for hibernation, because people say, well, I can boot up in five seconds.

You mean like with the containers framework? Is that what the containers framework people are doing now, where they're stopping entire groups of user processes, storing their state, and restoring them somewhere else with the same PID numbers and all that? Yes. It ends up... Yeah, freezing and restoring a user process ends up being pretty complicated too, because whatever state they know about the kernel, whatever functions they're sleeping in right now, all has to come back the same. And like I said, there are the other use cases that are also advantageous. They use this work almost completely out of the box: turning the system off except for the DDR, and doing the snapshot boot.

There are a lot of race conditions out there with suspend and resume frameworks and that sort of thing. That's always fun. I'm pretty sure PRUs are in the peripheral domain. I could be wrong, but I'm pretty sure. So they can get turned off opportunistically.
Well, no, no, peripheral is one of the first things you turn off. The peripheral domain is tied to a whole bunch of stuff that you really want to turn off. There's actually a Cortex-M3 within this processor that's the PM core, and it runs its own firmware. It takes a major role in regular suspend/resume, but it's not used at all in the hibernation framework. It has access to all the peripherals and to the same memory space, and it chooses what to power down and power up. The only problem is that programming it is more complicated, because it's much more difficult to debug when things go wrong. So, yeah, there are actually a lot of ideas along the lines of, well, we can use firmware to transparently save and restore these things, so that the kernel only needs to tell the firmware, hey, suspend this device, and the firmware copies the registers, does everything it needs to do, and configures the clocks. And right, that would assume the firmware is correct, and firmware bugs are more difficult to debug, and it's more difficult to push firmware out, and all those things.

So, the hibernation image, I think it said on there that it was 10,000 pages. Anyone want to do the math on that? It's, what, 40 megabytes? Out of around 256 megabytes. But there's not very much stuff running on there. So, anything else? Yeah.

Really, to get the generic framework bits in, it kind of needs a user. It's difficult to go and say, hey, I have this hibernation framework, but you can't use it, you can't test it. The core suspend/resume for AM335x is probably going to be in 3.13, which opens up this type of work to get merged after that. Also, I think there are some people working on either OMAP4 or OMAP5 automotive with this same patch set. So depending on what gets finished first, it would be a good opportunity to get this upstream. From my point of view, it would be time to send a register in. Right, but...
Because I know that you have it working, you have it working upstream, and the generic parts are really useful for other people too. People already know that they work because you are using them. So, I believe in the core part, the artist picks up a few lines to show it should be measured now. Yeah, it's really difficult. Yeah, really, without the approval concert, the core is at least in one architecture. It was.

Well, I'm blocking on... I'm using internal patches that add suspend/resume support to AM335x, and those have to get upstreamed first because this depends on them. Then I'm making changes in device tree, so that has to get approved. These patches aren't right.

It would be useful to at least RFC it, so somebody can say, hey, you're using this nosave thing, x86_64 isn't doing that anymore, there's a much better way to do it. And so they get cleaned up and get more interest.

So, anything else? All right, thank you.