My name is Russ Dill. I work with Texas Instruments, and I'm going to talk about the implementation of hibernation on ARM. I'm going to talk a little bit about the swsusp implementation in the kernel, some of the challenges in bringing that to ARM, an actual implementation, what work remains to be done, and some of the debugging techniques I used. I'm also going to talk about a swsusp restore implementation that I wrote for U-Boot. There's code at these links here, and there's also an elinux.org wiki page linked from there.

Why would we want to do this? The simplest reason is that hibernation provides a zero-power-consumption sleep mode. It also allows you to configure a snapshot boot, where you boot the same hibernation image over and over again for a faster boot. It also shares a lot of the same implementation requirements with self-refresh-only sleep modes like the AM33xx RTC-only plus DDR self-refresh mode, where you actually turn the entire CPU off but leave the DDR memory in self-refresh. So when you come out of that, it's just like a power-up, and you have to reconfigure everything again.

swsusp has been the mainline hibernation implementation since 2.6.0. Eventually, it seems like it will be superseded by TuxOnIce. That hasn't happened yet; I don't know what the status of that is. It uses the swap device to store your hibernation image, and it has a requirement that half your system RAM be free so that it can snapshot your system. If you guys have any questions or anything, go ahead and stop me anytime. That's fine. It can be used with the user-space uswsusp component to support additional features. You can encrypt your image, you can store it anywhere on the network, wherever you want, and you can also get graphical progress on that snapshot.

This is a slightly convoluted diagram of the basic procedure that swsusp uses for doing suspend to disk.
It freezes your devices, which is analogous to the device suspend in regular suspend/resume. In our implementation, on ARM at least, it's going to call cpu_suspend, which saves your CPU state and saves your stack pointer off in memory. And then we're going to jump over to swsusp_save, which will snapshot all our pages. And then we're going to do kind of a weird thing: we're going to directly call cpu_resume, which reloads the stack pointer and returns from cpu_suspend. There's this in_suspend variable that we set to one. We need that variable because it indicates to us that we haven't actually woken up; we've just returned from cpu_suspend, and we have to carry on with the snapshot. That variable is stored in a special region called the nosave region that doesn't get snapshotted. So when we restore, it's not going to be one. So we return our devices to the running state, we write out our image, and we power off.

The restore kernel looks like this. It just reads the pages into memory, quiesces all the devices, and calls your resume code, which takes the loaded pages and returns them to their original locations in memory. It calls into the cpu_resume function just like what happened when we snapshotted the system, reloads the stack pointer, and returns from the same point that we did before, but now the in_suspend variable is zero. So we'll call the restore functions for the devices and return to the running state.

So how does OMAP PM complicate this? On OMAP, we have a lot of different configuration options for clocks and power. It's managed by the PRCM across the SoC. And this diagram here shows a little more how that looks. It shows the voltage domains and power domains. Normally, during suspend, we're going to lose power to the PER, MPU, and graphics domains. But the current PM core code assumes that a lot of these things in the wake-up domain are always going to have power.
We use it for things like power, reset, and clock management, turning power domains on and off during sleep, and the pinmux configuration. Any modules that have to wake the processor from suspend are in that power domain, and any modules that should continue running while the processor is in suspend are in that power domain. So after hibernation, we have to restore that state. And some of it we have to do really early, because things like the clocksource timer are used by a lot of things like schedule_timeout and even the printk timestamps.

Okay, so we'll go ahead and quickly go through, step by step, how we did this on AM33xx.

Power domains are pretty basic. There's just a single register that stores what power state the domain is in. So we have to restore the power state and wait for that transition to finish.

Hardware modules are represented by omap_hwmod. We can leverage that to restore their state: whether they're in reset, whether they're enabled or disabled.

Clock domains: we also have to restore their state. And again, we can leverage existing code and put in a save-context and a restore-context function.

Clocks end up being a little more complex. This is the whole clock tree. This initial cut just walks the entire clock tree and saves the context, and restore walks the entire clock tree again and restores the context of each clock. And then for each clock type, we have to put in a save-context and a restore-context function. There is some room for improvement here; I'll talk about it a bit later.

Pin control, of course, controls how internal signals are routed to external pins. For OMAP, we use pinctrl-single, which just has a memory map of the register area where those pin control registers are, but there's no complete description of the registers. So on processors where only certain registers are accessible, because those are the only registers that are actually there, you have to know which registers are actually there.
An erratum on the AM33xx complicates that situation because certain registers also lose context when the PER domain powers down during suspend. The current code handles that by just saving those in the PM code and restoring them again in the PM code, but I wanted to work towards a more general solution. So I tried to give the pin control subsystem knowledge of which registers are available and which power domain they're in. And this is a first cut at that. I just listed the pins as pin function groups by power domain. Once I did that, I could go into the pin control core and add a save-context for a particular function group and a restore-context for a function group. And then in pinctrl-single, we just walk through that function group to save and restore. Of course, the current solution is a bit of a hack. It's probably not upstreamable. So what's the way forward? Maybe a new type of pin control register grouping, where we can associate a pin control register group with a specific power domain. There's a problem with that, though: the OMAP2 power domains are all currently managed in mach-omap2, so it's a little difficult to get that information out to pin control.

Clocksource and clockevent. Clockevent is actually already handled. Clocksource fortunately already has a suspend/resume hook. So on suspend we can get our context loss count and our cycle counter, and on resume, since we're resuming really early, we directly restore the context of the functional clock there, reset it, and reload the context.

SRAM on OMAP generally gets used for suspend code that needs to put the memory into self-refresh mode. OMAP3 already has a function for saving and restoring this context because they lose it. I kind of generalized that, because their solution was to just re-call all the functions that load code into SRAM. I didn't want to turn that into an "if CPU is this, if CPU is that" chain.
So I just made a general function there that copies it out and copies it back in.

Other devices are already set up to handle context loss, but they need to know that their power domain lost power or lost context. Usually that's handled by the power domain code keeping track of power states, but when we hibernate, the power domain code doesn't know about that. So we just add this function here that walks through each power domain and increments its counters if it isn't already off.

Right now, with the device tree conversion, a lot of the drivers that depend on the OMAP context-loss function and get it through platform data don't get that anymore on DT-based systems. So just to get this up and going I put in a hack fix: I just put in an extern for omap_pm_get_dev_context_loss_count and, in the OF load functions, just assign to that. So there's clearly a need for a generic framework for informing drivers when their device has lost power.

Some drivers are already set up properly to work when they lose context, but they're misconfigured and won't work for hibernation. The suspend and resume callbacks in platform_driver will correctly call all the hibernation functions, but if you put your suspend and resume callbacks in dev_pm_ops, the driver core assumes that you don't want to run any of the hibernation callbacks. So there's a SET_SYSTEM_SLEEP_PM_OPS macro you can put in there. The correction here is just to put that macro in with the suspend and resume pointers, and all the hibernation callbacks will get assigned correctly.

Some devices of course do need special hibernation callbacks, like the watchdog timer. It gets initialized and run by the kernel that's running at restore time, and then the restored kernel doesn't know what state the watchdog timer is in, and the watchdog timer works by writing an alternating pattern. So we just double-ping it in the restore function to make sure we're alternating the pattern and not writing the same pattern twice.
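
As a rough illustration of why the double ping works, here is a toy Python model. It is not the OMAP watchdog driver: `ToyWatchdog`, `ToyDriver`, and `accepted_after` are made-up names, and the only behavior assumed is the one described above, that the hardware accepts a ping only when the written pattern differs from the previous one.

```python
MASK = 0xFFFFFFFF


class ToyWatchdog:
    """Hypothetical hardware model: a write counts only if it differs from the last write."""

    def __init__(self, last_written=0):
        self.last_written = last_written
        self.accepted_pings = 0

    def write_trigger(self, value):
        if value != self.last_written:
            self.accepted_pings += 1
        self.last_written = value


class ToyDriver:
    """Hypothetical driver model: remembers its own pattern, which may be stale after restore."""

    def __init__(self, hw, pattern):
        self.hw = hw
        self.pattern = pattern

    def ping(self):
        # Flip the pattern and write it, as an alternating-pattern watchdog expects.
        self.pattern = ~self.pattern & MASK
        self.hw.write_trigger(self.pattern)

    def restore(self):
        # Double ping: whatever the hardware last saw, one of the two writes must differ.
        self.ping()
        self.ping()


def accepted_after(resume_action, stale_pattern=0x1234):
    """Worst case: the other kernel last wrote exactly what the restored driver writes next."""
    hw = ToyWatchdog(last_written=~stale_pattern & MASK)
    driver = ToyDriver(hw, stale_pattern)
    getattr(driver, resume_action)()
    return hw.accepted_pings
```

In this worst case a single `ping` after restore is ignored by the model hardware, while `restore` always lands one accepted ping.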
So we put all this together in the AM33xx PM code. You can see our pin control save-context there. We get the interrupt controller save and restore from OMAP3 for free, and the other functions are all functions that we've added.

So what does it take for generic ARM hibernation support? There's a patch set originally from Frank Hofmann, but fortunately, with the kexec implementation in the kernel, it could be compressed down quite a bit into a very small implementation. You have to provide four basic things: an arch suspend function, an arch resume function, a function to tell the hibernation code whether or not a page should be snapshotted, and the save_processor_state and restore_processor_state pair.

The arch suspend function actually ends up being really simple, again thanks to the kexec code that's already there. Our job is just to save the current CPU state and call the swsusp_save code. So we use cpu_suspend to save our CPU state, and cpu_suspend takes a handy function pointer: it will call that function once the state is saved. When we're done snapshotting all the pages, we just jump back into cpu_resume, which will return from cpu_suspend.

On resume, there's a bunch of leftover pages that the framework couldn't copy on its own because it would be overwriting kernel pages that are in use by the current kernel. So it gives us a list of pages to copy. We have to be a little bit careful with this, because we'll be overwriting our own pages, so we put our stack in that nosave region. It doesn't get snapshotted, so it won't get restored. We use the kexec call_with_stack function there, which lets us easily call a function with a new stack. Our function itself will actually get overwritten, but the hibernation code ensures that the resume kernel is the same as the snapshot kernel. It'll refuse to load a mismatch. Go ahead. Okay. Okay.
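
The way one return point serves both the snapshot path and the restore path can be sketched with a toy Python model. This is only an illustration under simplifying assumptions: `ToyMachine`, `resume_point`, `hibernate`, and `restore` are hypothetical names, memory is a pair of dictionaries, and the only property modeled is that the nosave region is neither written to nor loaded from the image.

```python
import copy


class ToyMachine:
    """Hypothetical machine: ordinary pages go into the image, nosave pages never do."""

    def __init__(self):
        self.ram = {"counter": 0}
        self.nosave = {"in_suspend": 0}


def resume_point(machine):
    """Stands in for the code right after cpu_suspend() returns; reached on both paths."""
    if machine.nosave["in_suspend"]:
        return "snapshot"   # we only just took the image: thaw, write it out, power off
    return "restore"        # we came back from an image: restore devices, keep running


def hibernate(machine):
    machine.nosave["in_suspend"] = 1
    image = copy.deepcopy(machine.ram)      # the snapshot covers ordinary pages only
    return resume_point(machine), image


def restore(image):
    boot = ToyMachine()                     # restore kernel: its own nosave, in_suspend == 0
    boot.ram = copy.deepcopy(image)         # image pages overwrite ordinary memory
    return resume_point(boot), boot
```

Because `in_suspend` lives outside the image, the restored system sees the restore kernel's value of zero even though the flag was one when the pages were captured.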
So again we have some kexec functionality there, the soft_restart function, to call back into cpu_resume, and that'll again return from the swsusp arch suspend code, but our in_suspend variable in the nosave region will tell us whether we're supposed to snapshot the system or restore the system to the running state.

And then with all that groundwork, the actual implementation for AM33xx is really easy. There are a couple of wrapper functions around all of the hibernation code. In this case we just disable PM idle. There's an enter function, where we just want to power the system off. There are some callbacks for the restore kernel where we put the hardware into a sane state before we restore the image, and then if that fails we want to put the hardware back into a running state. And then for the kernel that actually gets imaged, our pre-snapshot is going to be the one that saves all our extra state, so we want to save the PER domain context and the wake-up domain context. And then there are the leave and finish function pointers. These didn't fit in as well as I would have hoped, but it's pretty basic. Leave is called immediately after restoring from an image. So we tell all the power domains that we lost power and we restore the basic wake-up context. Finish is called both after this leave function and after snapshotting the system, so its job is to put the hardware back into a sane state. On the snapshotting path, of course, we don't need the PER restore-context, but it ends up being a no-op anyway.

Debugging methods. What can you do? Because it's not going to work the first time, the second time, the third time, a bunch of different times. It's a little difficult because the hardware is in some unknown state. On OMAP you're not going to be able to talk to a lot of the registers because their functional clocks aren't enabled, and if you try, the kernel is just going to hang. So one easy way is to debug using GPIOs.
It's pretty easy to configure the functional clock for those, to configure one as an output, and to have a piece of assembly code that you can move around to do a binary search for where the kernel is failing. If you're up far enough for it, you can use printk. Usually, if you're up far enough for serial output, though, you're in pretty good shape. OpenOCD is pretty good too, except if you're just looking at the PC, it's usually down in the weeds, maybe waiting on some spinlock or something like that. So you really need a backtrace. I never had the time to get OpenOCD plus GDB running, which probably would have been helpful. And then once it's up and running again, you want to make sure that you didn't miss restoring any register state. So I used devmem2 and a database of registers to store all the registers before suspending and then look at them again after restore to see what differences there were.

Another thing I did was doing the swsusp restore directly from U-Boot. If we look at the restore process, it involves copying pages from a backing store on disk, putting them into memory, and jumping to an address. U-Boot's really good at that. So restoring from U-Boot can be faster than booting an entire kernel just to copy pages, but there are a couple of pieces missing. U-Boot has no idea what address to jump to, and it doesn't know the contents of the nosave pages, so it can't restore those to what they originally were. So our first step is to save the nosave pages. Some irony there. We do that in a late initcall so that they'll basically look like they would coming from a restoring kernel, and then we store some physical pointers to them to make it really easy for a special cpu_resume function to restore those pages. And when that's done, we just call the regular cpu_resume function. And that new cpu_resume function, we have to pass into U-Boot. There's this swsusp_info page within the hibernation image.
So I just added an architecture-specific function there where we can tack on extra data.

This is what an actual hibernation image looks like. We have a page of your swap partition there, and we have a signature that tells us yes, this is a hibernation image, and then it points to a linked list of pages that have sector pointers into our entire hibernation image. Our entire hibernation image, across the bottom here, can be compressed with LZO, and fortunately U-Boot has LZO support, so we support that. There's an info page that contains our cpu_resume pointer, and then the first part of the image is just metadata. It tells us where each one of the data pages needs to be loaded into memory.

So our U-Boot modifications are just to add a swsusp command that functions like the restore functionality in the kernel. If there's no signature, it's just a no-op and boot continues on. If it finds a signature, it replaces it with the original signature in that page, so that if there's a failure you'll get a normal boot next time. For snapshot booting, you can just replace the original signature with S1SUSPEND so the overwrite will just be a no-op.

We have to read in all the metadata pages with the PFN mappings, and at the same time we're going to populate a bitmap of which pages are actually used, to make it easy to get free pages. We have to do all the copies of data pages to memory. We have to be careful not to overwrite U-Boot when we do that, because for any pages that we have to copy to where U-Boot is, we have to have a finishing function to do that copy.

This is what the U-Boot memory map looks like. Fortunately it's very simple. If we're above the vectors and far enough below the stack, then it's free memory. Any memory we allocate within U-Boot goes into that malloc space there.
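
The used-page bitmap and the "is this page free" test described above can be sketched in a few lines of Python. This is a toy model, not the U-Boot code: `PageBitmap` and `free_pages` are hypothetical names, page frame numbers are taken relative to the start of RAM, and the region above the vectors and safely below the stack is assumed to be given as a simple PFN range.

```python
class PageBitmap:
    """One bit per page frame; a set bit means the image will be loaded into that page."""

    def __init__(self, num_pages):
        self.bits = bytearray((num_pages + 7) // 8)

    def mark(self, pfn):
        self.bits[pfn // 8] |= 1 << (pfn % 8)

    def is_set(self, pfn):
        return bool(self.bits[pfn // 8] & (1 << (pfn % 8)))


def free_pages(bitmap, num_pages, free_lo, free_hi):
    """Yield PFNs that are neither image destinations nor part of the bootloader.

    free_lo and free_hi (hypothetical parameters) bound the region above the
    vectors and far enough below the bootloader's stack.
    """
    for pfn in range(num_pages):
        if free_lo <= pfn < free_hi and not bitmap.is_set(pfn):
            yield pfn
```

For example, with 16 pages, image destinations at PFNs 2, 3, and 5, and a usable range of PFNs 1 through 7, the generator yields 1, 4, 6, and 7 as scratch pages.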
So our first step is to load those metadata pages. We have an image-page get-next function that abstracts away the compressed image: it reads the pages, decompresses them, and that sort of thing. And then we have a bitmap that we mark when we find pages. We can then use that bitmap to get free pages that aren't in U-Boot and aren't in the kernel. We build a remap index using those pages, and then we go through each page, and if its destination is in the U-Boot memory space, we get another free page and put it there instead, update our remapping indexes, and then read the data into memory.

When we're done, we have to cap off our remap lists and get a special page for actually doing the final copy. We put our finishing function in that page, we put our context in that page, and we put a stack in that page. I added a call-with-stack function to U-Boot for this; that's actually the only architecture-specific part of this swsusp restore code. And then that's the finish function: it just has its own simple memcpy that copies the remapped pages to where they should actually go and calls cpu_resume, which in our case is actually cpu_resume_copy_nosave, and then it's just as if it were a restore kernel.

What are some of the things I learned, and what could be improved?
The save and restore context functions seem like a bit of a hack. It'd be much better to keep the context as we go and then just look at the loss count on resume. A lot of drivers already do that. There are a few problems. What about people who bit-bang, use devmem, and modify registers? You're going to miss those. And then any registers that are initialized at boot but never actually read by the driver: you're not going to restore the state the bootloader programmed in. So there needs to be some careful looking at the individual drivers to make sure they actually have the context.

Some of the restore code is currently overzealous. We probably don't need to restore the clocks whose parents are disabled. If a clock's parent is disabled, then it should also not be running, just because it doesn't have a clock. But we've got to go through an audit to make sure that when a clock driver enables a clock, it does program the state on resume.

One thing that would be helpful for doing things like this: set up your debug framework early. You're going to do it anyway. You're going to have a debug framework, so you might as well do it early so you're not guessing. And of course pin control: we've got to properly integrate that with the system, giving it knowledge of where the different registers are, when it needs to restore the state, and that sort of thing.

Do you have any questions?
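
The "keep the context as we go and check the loss count on resume" idea from the talk, along with the bootloader-register pitfall, can be sketched with a toy Python model. Nothing here is a kernel API: `ToyPowerDomain` and `ToyDeviceDriver` are hypothetical names, and hardware registers are just a dictionary.

```python
class ToyPowerDomain:
    """Hypothetical power domain that counts how often its context was lost."""

    def __init__(self):
        self.context_loss_count = 0

    def lost_context(self):
        self.context_loss_count += 1


class ToyDeviceDriver:
    """Hypothetical driver that shadows every register it writes."""

    def __init__(self, domain):
        self.domain = domain
        self.hw = {}        # stands in for the device registers
        self.shadow = {}    # software copy, kept up to date on every write
        self.seen_count = domain.context_loss_count

    def write(self, reg, value):
        self.shadow[reg] = value
        self.hw[reg] = value

    def resume(self):
        """Reprogram the device only if the domain reports a context loss."""
        if self.domain.context_loss_count == self.seen_count:
            return False
        self.hw = dict(self.shadow)
        self.seen_count = self.domain.context_loss_count
        return True
```

A register written behind the driver's back, for example by the bootloader or through devmem, never reaches `shadow`, so in this model it is silently lost after a context loss, which is the audit problem mentioned above.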
Yeah, go ahead. When you were setting up your clocking, some state over to SRAM? Yeah, it's actually the reverse. I'm copying state out of the SRAM because we're going to lose the SRAM contents, but on AM33xx the code assumes that that's saved, so there's no current... Well, we do the resume for SRAM after cpu_resume, and cpu_resume is going to reload all our MMU entries for us. Is that what you're... Right. That's okay. In initial boot the SRAM memory does get mapped with ioremap, so that's already mapped. In the save context we're running with the MMU enabled, so that memory is mapped. On restore we've already run cpu_resume, so the MMU is enabled and that memory is already mapped. Normally when you're running from SRAM you're in suspend/resume context, so you can't take a lot of things for granted, but here we're above that level, so we don't have to worry.

What would be a feasible way of generating that initial image, and then prevent it from going that way at the core? The core code is going to need to restore the hardware, so it's going to need to have the state of all the hardware for all the resume functions. So if your hardware doesn't match exactly what QEMU looks like, it's not going to work. With uswsusp, it's actually pretty easy to make a hibernation image and then save that off somewhere and do something else with it. The only thing with snapshotting is you can't have any writable file systems at the time of the snapshot, and you can't modify any of the read-only file systems between restores. It all has to be the same.

Go ahead. I'm not sure which stage you're talking about, but all these are modifications to the kernel, because none of this context save/resume support is currently there. The only context save/resume for pin control on AM335x is that it manually goes and saves each one of the registers that get lost when we lose power to the PER power domain. For pin control, is that... Go ahead. Yeah, that would really help restore time, but unfortunately the swsusp
framework doesn't really have that functionality. When it saves pages in the hibernation image, I don't think they're really swap pages; they just happen to be in the swap backing store, so it doesn't really have that support. I don't know. But yeah, it does take a few seconds to restore all that memory.

Go ahead. Right now, unfortunately, they're about the same, because the U-Boot functions for reading from MMC and decompressing are not running as fast as the ones in the kernel, whereas the kernel right now boots very fast. So it's actually pretty close right now, but if you were to improve the U-Boot functionality for copying pages and decompressing, then it should match what the kernel can do and be faster: you don't have to boot a kernel, and you don't have to read the kernel from disk either. That's about it.

I was thinking a little bit more about the page-on-demand idea. I think actually, if before you suspend you evict a lot of the memory to swap, that memory will stay in swap, and then when you make your image, it won't need to include those pages because they're in swap, and then when you resume, it will load those pages on demand.

I'm thinking only Linux for this. I have no idea about other OSes' hibernation formats. You're going to need to do a lot more for these other OSes. Yeah, I really don't know anything about the hibernation or suspend-to-disk frameworks for other OSes. My point with this is... Oh, I see.

For parts I'm going to need to add, if this was just a demo board, that's great, but all the device drivers, I have to go and modify the kernel in order for this to work? Right. This kind of went through what the implementation entailed for AM33xx and fixing some of the drivers that AM33xx uses. So for your particular application, you can go through and do similar steps for your own SoC to get that going. Hopefully, if I'm able to get this merged, a lot of drivers will be improved to support hibernation if they don't already. So if I'm able to get this upstreamed, it'll be a lot easier to add support to your own
SoC. Any other questions? Go ahead.

We talk about the suspending of drivers. I don't really know a lot about what that means. So you're talking about platform drivers? You're talking about device drivers. So if you have a device driver, all of the state and all that is on the entire level up, right?

Depending on the device, some devices assume that when they come back out of suspend, they're not going to have any of their registers preserved, so they're going to go and reprogram all those registers. A lot of other devices, particularly platform devices, if they're in a power domain that doesn't lose power and there aren't versions of that same device in power domains that do lose power, everything's going to be fine. The registers are just going to be there; they just have to bring the thing back up to a running state. So some suspend/resume functions are just putting the device into a sleep state and bringing it back; some are also saving and restoring state. Does that answer that? It depends on how the driver's written, what assumptions it has, those sorts of things.

Typically when you do power management, there's probably some area where, on shutdown, the driver has to figure out what to do. Are you planning on adding that kind of stuff? The hibernation framework actually already has those services. There are these PM ops callbacks, thaw, freeze, restore, and poweroff, that are hibernation specific. So when you run restore, you're telling the device: we just restored a kernel image to memory, we don't know what the state of the device is, we don't know what the previous kernel did, but we need you to restore the state of the device and bring it back to a running state. So those callbacks are already in there, and there's also the poweroff callback, so when you're making a snapshot image and you're done making it, each device gets a poweroff callback call. So those are already in there for device driver writers.

Is that it? Okay, thank you very much.