So my name is Michael, Michael Ulbricht. Michael Ulbricht, if you want to say it the German way, but Michael is okay too. I'll be talking a bit about updating embedded systems.
Let's start with a bit about me. I've been working on embedded Linux for quite some time, at Pengutronix the whole time, and it's mostly support kind of tasks: getting more hardware to run Linux, helping with bugs, basically allowing our customers to do their own applications, right? We do the base Linux operating system, and they focus on their own applications, the actual stuff they want to do with their hardware. And updating is not something they want to care about. It's something that's supposed to just work.
I've seen that in quite a few projects over the last few years, and it's getting more. In the beginning it was more like: updating, do we have to do that? Maybe we don't need to update our software, it's okay. Nowadays, updating is becoming more and more important.
As I said, I've worked with a lot of different projects, so my focus is less on how the updating process itself works. What I want to focus on today is what you have to do to integrate an update tool with your own application, your own product. We had a talk about updating yesterday, and if you look at the videos from other conferences, you'll find several talks that describe the update software itself. That's important, and it's not easy to get right, but we're getting there. And that's the stuff that can be shared. But there are a lot of parts of the update process on a particular device that are device-specific, where you have to add your own features to make it really work. Those are the parts I want to focus on. So, a bit about updating first.
I'd like to focus mostly on the standard, simple A/B kind of update, where you have one running system and a separate partition into which you can write the new version of your operating system while your system is still running. When all of that is done, you say: okay, now I'm finished and I can start the other operating system, the new version. And if that fails, that's the important part. Maybe there's some bug in there. Maybe writing the operating system failed in some way and it wasn't detected while it was flashed. Or maybe, in this particular use case, there are some outside circumstances that make the startup fail. In that case we want to go back to the previous version, the one that worked, right?
If you look at that, writing the update itself is mostly generic. There are a few ways to do it: I write a file system image to the disk, or I format a new file system and extract a tarball into it, these kinds of things. But this is not really project-specific. It's something you can implement once, and it can be tested really well.
But when it comes to checking whether my system is starting: what does it mean that my new system is not starting? There are a lot of ways it can fail, and many of those are project-specific. So that's part of where I want to go. We need to detect those failures, and then we need to say: okay, something happened here and I want to go back to the old version. And then maybe there's some user interaction. Does someone say explicitly: something's wrong, can we go back? It really depends on the device. Sometimes it's an embedded IoT device that has no user interaction, and it has to be automatic. But often there are devices with an operator in front of them who says: this is not working. I've just applied an update and now it's not working anymore. Can we go back to the last one?
These kinds of things are where you need to integrate the update mechanism with your own system.
So, let's start with the failures. I'd like to classify first what different categories there are and, depending on those, how we can react and how difficult they are to deal with.
First there are startup failures. These are the simple ones, and that's typically what you see on slides. I remember from yesterday there was this: okay, we have installed a new image, and then there were three lines on one slide that said we start the new image, then we talk to our update server again, and then you can hook in something else to check whether the new system is okay. That's the startup. A lot of stuff can be detected on the first startup, if the system comes up and works.
But some errors come later, either because some user interaction comes afterwards and things only start to fail with that interaction, or because there is memory corruption or there are memory leaks that take some time to show up. So maybe the system starts, but it begins to break down afterwards.
And then there is what I call update preparation errors: you want to install a new update and you don't get to the point where you actually write something. You can try again, or if it still doesn't work, maybe you can go back to the old system.
And then there's what I call update errors, and if they are permanent, that's a problem: I have a system that is running, but when I try to write an update it fails because the update process itself is broken, and I cannot go back to the previous system anymore. So we have to be really careful to make sure that doesn't happen, or that there's even more fallback. But at the end of the day, at some point the effort to work around or handle those issues gets pretty high, and the probability that they occur gets pretty low.
So at some point you need to stop and say: okay, this is enough, these are the things I check for, and for everything else the device is bricked. Then we have 0.001% of boards that fail, we can replace those, and that's okay.
So, failure detection on startup. The most important thing there is, first, the hardware watchdog. If you're doing these kinds of things, it's pretty much mandatory to have a hardware watchdog that can detect problems during booting, because at some point it might be the Linux kernel that's just not starting, or that gets stuck somewhere in hardware initialization, and the only way to recover from that is for a hardware watchdog to trigger and reboot the system, so we know something went wrong here. So we start the hardware watchdog in the bootloader. That's important too: the bootloader is the first step here.
Then keep it running. Many drivers in the kernel will by default reset the watchdog hardware when the driver initializes, and will not activate the watchdog until user space activates it again. There are ways to tell the driver not to touch the state of the watchdog at the beginning, but you need to make sure that this actually happens.
And then systemd comes up. That's one part I'm always using, actually; systemd has really great infrastructure for this. So systemd starts, opens the watchdog and starts triggering it. As long as systemd itself is in a useful state, it will continue to trigger the watchdog, and as long as that happens we say: okay, systemd is fine, and systemd will handle all the other errors and propagate them correctly.
That brings me to the next part, systemd. Don't forget to handle all cases, because systemd will happily trigger the watchdog while it's stuck in a situation where basically nothing works. For example, the emergency and rescue targets.
systemd by default is configured for a desktop user, or for a server where an administrator can log in and say: hey, what's going on, what went wrong here? So if the fsck of the root file system or of other critical file systems fails, or if you cannot reach your default target, systemd will start these rescue or emergency targets, and then you may get a shell there. But on an embedded system the shell helps nothing, because there is no one in front of it who can actually do something with it. So you need to put a service there that will actually do something useful, like restart your system.
And then we come to the actual applications, and that's where most of the work usually is for the individual devices, the individual projects. You need to figure out what to check to make sure that, yes, my system started, and yes, it's still working and doing its job.
systemd gives you a lot of tools for that. There's a watchdog per application, basically the sd_notify hooks, where you can say: yes, I've started. And later on you start pinging: yes, I'm still here, I'm still here, I'm still here. If you stop, systemd will either restart your application, or kill it completely, or act on that depending on the options. There's the failure action, or there are the start limit settings, where you can say: okay, if the service restarted three times in five minutes, then maybe it's not going to start properly again, so let's do something, maybe reboot the system. Or if you prefer to fail immediately, there's the failure action to say: this service failed in some way, let's immediately reboot the system.
These kinds of things help to propagate errors from the individual applications to systemd, and from there you can restart your system, so that the bootloader is in charge again to make the next decision: which system is going to start.
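As a minimal sketch of the per-application watchdog mentioned above, the following Python code speaks the sd_notify datagram protocol directly: it sends `READY=1` once and then `WATCHDOG=1` for as long as a health check passes. The names `sd_notify`, `watchdog_interval_seconds`, `run` and the `is_healthy` callback are made up for this illustration and are not from the talk; the sketch assumes the service is started by systemd with `Type=notify` and `WatchdogSec=` set, so that `NOTIFY_SOCKET` and `WATCHDOG_USEC` are present in the environment.

```python
import os
import socket
import time


def sd_notify(state: str) -> bool:
    """Send a state string such as "READY=1" or "WATCHDOG=1" to systemd.

    Returns False when NOTIFY_SOCKET is unset, i.e. not started by systemd.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading "@" denotes a Linux abstract-namespace socket.
    address = "\0" + path[1:] if path.startswith("@") else path
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), address)
    return True


def watchdog_interval_seconds(default: float = 10.0) -> float:
    """Ping at half the timeout that systemd announces in WATCHDOG_USEC."""
    usec = os.environ.get("WATCHDOG_USEC")
    return int(usec) / 2_000_000 if usec else default


def run(is_healthy, sleep=time.sleep) -> None:
    """Hypothetical main loop: is_healthy() is the project-specific check."""
    sd_notify("READY=1")  # "yes, I've started"
    interval = watchdog_interval_seconds()
    while is_healthy():
        sd_notify("WATCHDOG=1")  # "I'm still here"
        sleep(interval)
    # Falling out of the loop stops the pings; systemd then acts on the
    # missed watchdog according to the unit's configuration.
```

What happens after a missed ping is decided in the unit file, not in the application, which is what lets the error propagate from the application to systemd and on to the bootloader.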
And that decision is pretty tricky, actually. What do we actually want? When do we say the current system is not working out for us, so let's go back to the last one? From a high-level perspective, if the error is transient, well, let's try again, right? Maybe it works out afterwards. If it's persistent, we can try again and again, but at some point we have to say: okay, it's still not working, so let's go back to the previous system.
But that really depends on your individual requirements. Maybe for one part of the system it's: okay, it fails, too bad, but let's stay running. And another part is really, really important, so we need to reboot into the old version. And those can be conflicting requirements. Say I cannot reach my update server, but the actual application is working: I want to keep my new version running, but I need to go back to be able to do an update later on. So it's really a decision process where you have to look at all the possible failures and decide: this is where I go back, this is where I stay with the current system, and this is how often I try to restart individual services or reboot the whole system.
To actually implement that, I'm using barebox here as the bootloader example. That's basically because it's what I'm working with most of the time, and it has this bootchooser.
Basically, what it does is this: you have slots. One version in your A/B scenario is one slot, so A is one slot and B is one slot. Each of them has a priority and a boot counter. Then barebox comes and says: okay, the priority of this one is higher, so I try this one. It looks at the boot counter; if the boot counter is greater than zero, it subtracts one and starts this slot. Then, in user space, you have to reset that boot counter. There are various ways you can use that to implement different kinds of scenarios, where you say: I try this multiple times, or a failure at startup is more important than a failure later on, these kinds of things.
The important thing is that we can reset those boot counters again. Let's say there's an outside effect that causes the failure. Say the power is really low, the system starts, and as soon as you start powering on additional components, it breaks down and resets the system. You do that in a loop and say: okay, my system failed three times, four times, so let's fail over to the previous version. Then you try that one three times, four times as well, and then, yeah, both boot counters are zero, and what do we do now?
We have no system that's seemingly working. Then we can either just get stuck here and not do anything, or we can say: maybe there was some outside effect. So if all systems have a boot counter of zero and we can't do anything, we start again from the beginning: reset the boot counters to some value, three or five or whatever you want, and try again. Maybe the power is better now and it works again.
Also important: atomic updates. This is something I like that's built into the barebox bootloader, and it's something where a lot of people spend time fixing things. Writing multiple values on an embedded system in an atomic way, so that whenever the power fails the data is still there in a consistent state, is pretty tricky. If you have a NAND flash, for example, then doing "I need to erase this block, and I need to write here and there" in a way that the power can fail at any time is pretty hard. So there's basically a framework that says: set these values, and now sync them to whatever backend storage we have, and then it's always an atomic update.
So now let's look at how we actually use that. Just a few examples of what you can do. This first one is basically one end of the range of options; you probably want something in the middle, but this is okay for many cases, and it's actually what many people do right now. It's this: well, it started. Maybe things can fail afterwards, but any errors that come there are just transient, and it works again after we reboot. That's probably true in many cases, because the system will work for a time. Say the error is a memory leak, right? A memory leak that accumulates memory over time, so it runs for three days and then runs out of memory, we reset the system, and it runs again for three days.
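The slot selection described above, including the reset when every boot counter has reached zero, can be sketched as a small simulation. This is not barebox code and does not use the bootchooser's real variable names; `Slot`, `choose_slot`, `mark_good` and `DEFAULT_ATTEMPTS` are names invented for this illustration, and treating priority 0 as "disabled" is a convention of the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional

DEFAULT_ATTEMPTS = 3  # assumed value ("three or five or whatever you want")


@dataclass
class Slot:
    name: str
    priority: int            # higher wins; 0 means disabled in this sketch
    remaining_attempts: int  # the boot counter


def choose_slot(slots: List[Slot]) -> Optional[Slot]:
    """Pick the enabled slot with the highest priority and count it down."""
    enabled = [s for s in slots if s.priority > 0]
    if not enabled:
        return None
    candidates = [s for s in enabled if s.remaining_attempts > 0]
    if not candidates:
        # Every counter is zero: assume an outside cause (e.g. bad power),
        # reset all counters and start again from the beginning.
        for s in enabled:
            s.remaining_attempts = DEFAULT_ATTEMPTS
        candidates = enabled
    chosen = max(candidates, key=lambda s: s.priority)
    chosen.remaining_attempts -= 1
    return chosen


def mark_good(slot: Slot) -> None:
    """What user space does once it decides the running system is fine."""
    slot.remaining_attempts = DEFAULT_ATTEMPTS


if __name__ == "__main__":
    slots = [Slot("A", 20, DEFAULT_ATTEMPTS), Slot("B", 10, DEFAULT_ATTEMPTS)]
    # Eight boots in a row that all fail before user space marks them good.
    print([choose_slot(slots).name for _ in range(8)])
```

With two slots and no successful boot, the simulation tries A three times, then B three times, then resets both counters and starts over with A. On a real device the counters must of course live in storage that is updated atomically, which the in-memory sketch does not model.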
A system like that is okay under some circumstances, because it runs for three days, and in that time we can actually install an update, a new system. So we can write a new update in there, and that's always the most important thing: that you're always in a situation where you can write a new update, so you can fix things.
In this case we just reset the boot counter when the system has successfully booted. systemd will help us with that, because we can just pick a target and say: any application, any service, is required by this target. So if any of those fail, we never reach this target. Then we have one service that waits for this target, the next service in the chain, and when we reach the target this one will simply reset the boot counter for the running system.
If you're a bit more paranoid, you say: well, any error may be permanent, so we don't actually reset the boot counter after booting. So the system starts, the bootloader counts down the boot counter, and then it's running. Only if we do an explicit reboot of the system, if we trigger a normal reboot, for example after an update or for whatever other reason, do we reset the boot counter. We say: okay, it's been running for some time, it's good and everything works, and now we need to reboot for whatever reason. Any failure later on triggers a reset, the boot counter was not reset, and so it's counting down.
However, if you're developing or something, it's often: well, let's switch the device off and on again, do some work, and retry. Doing things that way, you never actually reset the boot counter. So it happens pretty easily that you accidentally start the old system, because you never reached the situation where the boot counters were reset.
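To make the difference between these two policies concrete, here is a small simulation of a perfectly healthy new system that is repeatedly power-cycled by hand, as happens during development. The function name and the policy strings `"reset_after_boot"` and `"reset_on_clean_reboot"` are invented for this sketch, and the starting value of three attempts is an assumption.

```python
from typing import List


def simulate_power_cycles(policy: str, cycles: int, attempts: int = 3) -> List[str]:
    """Return which system ("new" or "old") starts on each power cycle."""
    if policy not in ("reset_after_boot", "reset_on_clean_reboot"):
        raise ValueError("unknown policy: " + policy)
    counter = attempts
    booted = []
    for _ in range(cycles):
        if counter == 0:
            booted.append("old")  # bootloader falls back to the previous slot
            continue
        counter -= 1              # bootloader counts down before starting
        booted.append("new")
        if policy == "reset_after_boot":
            counter = attempts    # user space resets once the target is reached
        # With "reset_on_clean_reboot" a hard power-off never reaches the
        # reset, so the counter keeps going down although nothing failed.
    return booted


if __name__ == "__main__":
    print(simulate_power_cycles("reset_after_boot", 5))
    print(simulate_power_cycles("reset_on_clean_reboot", 5))
```

Under the first policy the new system starts every time. Under the second, the fourth power cycle already lands in the old system even though the new one never misbehaved, which is the accidental fallback described above.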
So that is the other extreme, never resetting the boot counter after booting. In most cases there should probably be something in between, where you say: maybe it's been running for five days and now it looks good. Or you have an outside operator actually monitoring your devices who says: hey, this device has been running for 10 days and it looks good, so let's reset the boot counter. It's probably working properly now.
It's really application-specific and device-specific what kind of choices you make there. But what I like about the system we have here, with systemd doing the monitoring and barebox with its state relatively isolated, is that you can modify the state with any tool. You can choose when you update the state, when you reset the counters, and you can adapt how it reacts to your requirements.
So let's go to a nicer topic; we're always talking about failures. Let's talk a bit about user interaction, because updating has a lot to do with users as soon as there is actually someone watching. It can be someone local. More and more devices are getting displays, where I can show some info and see the progress of an update, these kinds of things. Or it can be some remote operator who says: okay, I installed this update here and I want to see something.
And having a user actually helps, because if the user can decide to turn the device completely off and on again, you can use that to handle errors as well. If you can't detect a problem in your software, maybe because it's in the high-level user interaction stuff, the user can still trigger a reset of the whole system by turning the power off and on. And you can use that: the power was turned off, we didn't reset the boot counter, so it's one less, and if the user tries that multiple times, then we start the old system, either at some point or immediately. Or we can reset the boot counters after a power-off, so if all counters are zero, after one power cycle we start trying again. It really depends on the situation. And you can also have something in your UI to say: switch to the old system, because this is not working. These kinds of things.
I'll go a bit faster, I think I don't have that much time. What I think is really important is that updates take time, even if they run more or less in the background while you can still do the actual job. If it's a device where a user is standing in front of it, plugs in the USB stick, says run my update, and then it says nothing for 10 minutes, the reaction is going to be: yeah, the update is not working, let's power off and try again. So adding information there, about the update progress, is really important, I think.
The same goes for error handling. Your customer says: it's not working. You ask why, and the customer says: I don't know why. If you can't pull out some information, integrated with your application so that it ends up in your log files, that's pretty hard to debug afterwards. So it shouldn't be just a black box. I've put RAUC here; it's one update system I know, because my colleagues are working on it, and it has a D-Bus API that says: hey, I'm doing this step, and this one, and this one.
So yeah, that's it. A few links there, and now questions.
It's getting pretty late, no more questions? Okay, then I think that's it for today.