Hello everyone, and welcome to my presentation "Facing the Challenges of Updating Complex Systems", with the subtitle "Putting it all together". Well, shortly about me: my name is Enrico Jörns, I'm an embedded software engineer working at Pengutronix. We are basically an embedded Linux consulting company, and I am the co-maintainer of the RAUC updating framework. So, coming from that topic, I am a bit close to updating, and this is what the talk today will be about.

A bit of background on the motivation: in the past years we've seen a growing zoo of embedded update tools like SWUpdate, Mender, RAUC and so on, so you could think that updating an embedded system is, well, a solved topic. But if you take a closer look at your system, there are many different aspects, many different components, that not only live separately but are connected together. You have your init system that has to start the application, you maybe have a watchdog, you have to care for your data during updating and handle the bootloader, then you have a deployment server where your update comes from, and in the end you also want to test this entire chain. So there are a lot of building blocks that cannot be covered by such an update framework and that are very custom to your hardware, to your use case and to what your application needs.

When we deal with updates, the bootloader is a very critical point, because in most systems the bootloader is a single point of failure. Now, when you start with U-Boot or GRUB or something like this, implementing a redundant system often means that you have to write a script for the bootloader that takes some variables and does a selection of the proper boot target. This is exactly what we just removed in user space with those update frameworks, so why don't we use such frameworks in the bootloader too? Here's one example of an already existing framework, from the barebox bootloader.
It's called the bootchooser framework. Here's a short overview of its structure: it has a basic algorithm that does the actual boot selection, some configuration to adapt it to your system's needs, and a persistent status which stores information about the current state of the boot targets.

If we take a deeper look, the basic, simplified algorithm looks like this. For each boot target we maintain two variables: a priority and a remaining-attempts counter. When we power on our system, we first check whether the remaining-attempts counter for the individual boot target is larger than zero. If so, we choose the boot target with the highest priority, decrease its remaining-attempts counter, and then boot. If booting fails, either because we can't load the kernel image or because we hit a watchdog reset during startup, we end up in the bootloader again, and it does this again and again. Doing this, the attempts counter decreases, and when the attempts counter for one of these boot targets reaches zero, we switch to the other one. In the normal case, we decrease the attempts counter, boot into our running system, and once we have assured that we've booted successfully, we reset the remaining-attempts counter to its default value. This is basically what the framework provides, together with some configuration: for example, you can configure that on power-on the remaining-attempts counters for all targets are reset to their default values, so that you can recover your targets just by switching the power off and on.

Another example, where we don't have to implement the switching ourselves at all, is x86. On x86 you normally have UEFI, and there you can have a boot entry for each target you want to boot. With these boot entries it is possible to load a kernel without an additional bootloader, just by referencing the kernel we want to boot and giving a rootfs argument that references the root file system to boot. Then, by setting the UEFI BootOrder variable, we are able to automatically switch the order of the targets to boot. There is also the BootNext variable, which is a sort of temporary setting for the boot target we want to try: if it fails to boot, the setting is not persisted, so the next time the target reboots we have the original order back. So there is no initramfs required and no bootloader — just a kernel and rootfs boot. So this is another option.

Well, when updating the bootloader itself, it often is a single point of failure; you don't have a redundant one. But depending on the hardware you use, there are some exceptions. One example is if you use an eMMC: it has two dedicated boot partitions, boot0 and boot1. Take this example: we booted our bootloader from boot1 and started the system. If we now want to update the bootloader, we simply write it to the boot0 partition, without yet having switched anything in the ext_csd register, which is responsible for telling the ROM code which boot partition to load. Then, at the very end, when we are sure we've written the new bootloader successfully, we do the switch in the ext_csd — and we have an updated bootloader, installed in an atomic way.

Another common issue when starting our updated system is that something goes wrong and the system hangs, maybe in the kernel or during execution of our application. A common way to solve that is, of course, using watchdogs. A watchdog is normally started during booting, and then the kernel and the init system have to trigger the watchdog in regular intervals.
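As a small illustration of the pattern just described — a counter that counts down and resets the board unless it is regularly fed — here is a self-contained Python simulation; the tick granularity, timeout and API are invented for the sketch and do not correspond to any real watchdog driver interface:

```python
# Simulation of the watchdog pattern: the running system must reset
# ("feed") a hardware counter in regular intervals; if the system
# hangs and the counter reaches zero, the board is reset.

class Watchdog:
    def __init__(self, timeout_ticks=5):
        self.timeout = timeout_ticks
        self.counter = timeout_ticks

    def feed(self):
        """Called by the kernel/init system while it is healthy."""
        self.counter = self.timeout

    def tick(self):
        """One hardware clock tick; returns True when the board resets."""
        self.counter -= 1
        return self.counter <= 0

wdg = Watchdog(timeout_ticks=5)

# Healthy system: feeding every 3 ticks never lets the counter expire.
resets = 0
for t in range(30):
    if t % 3 == 0:
        wdg.feed()
    resets += wdg.tick()
assert resets == 0

# Hung system: nobody feeds the watchdog, so it fires after 5 ticks.
wdg.feed()
ticks_until_reset = 0
while not wdg.tick():
    ticks_until_reset += 1
assert ticks_until_reset == 4  # the reset fires on the 5th tick
```

The same countdown idea is what the bootchooser's remaining-attempts counter relies on: a hang triggers the watchdog, the watchdog reboots into the bootloader, and the failed attempt is accounted for.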
If the watchdog is not triggered in time — as in the case shown, where the system hangs — its counter expires, and the watchdog resets and reboots the board, so that we don't end up with a misbehaving system.

As with the basic system, our applications might also hang. We want to be able to detect that and to recover from it, and a nice tool for this is systemd, the init system, which provides a watchdog multiplexer interface. systemd itself triggers the hardware watchdog, so that you can be sure systemd is running. Also, when rebooting the device it sets a watchdog, so that you don't get stuck during reboot. And then it provides a software watchdog for the individual applications that you run: for each application you can configure a proper interval, and your application has to notify systemd within that interval. If it does not notify systemd, you can select what you want to do with it: restart it, skip it, or even reboot the entire system — depending on how critical the application that just crashed is.

In general, using systemd on an embedded system that you want to update in a robust way is a good idea, because it is a very central point: it has a central view of the entire system and all the components running on it, and it provides fine-grained control over the behavior when an application or a service fails — whether you want to restart it, when you want to restart it, with which intervals between multiple tries, and when you want to abort trying to restart a service and reboot the entire system instead, and so on. It has the watchdog multiplexer I already described, and, for example, the system-update mechanism, which is a way of bootstrapping configuration data on the initial boot of a system: by just placing a system-update file in the root file system, systemd boots into a special target, performs some actions, and then reboots — so you can, for example, bootstrap the initial data on your system.

What you also have to care about when designing your system is how and where you store your data, and how and where you migrate it. A simple example is that you store your data in the root file system, if you have a writable one. Then your update tool has to take care to first write the image to the secondary root partition and after that copy all the data that you need from the currently running partition to the second one. This is good because if you fall back, you still have data that is accessible by the old system — but it may be outdated. Another approach is to have a single data partition that is mounted into the root file system. There, no copying is required, because both the current and the updated system use the same data. Migration is also possible, but if you fall back after having migrated the data, it might get tricky, because the old application might not be able to read the data
you just migrated with the new application.

So there is a third approach: if you have redundant root file systems, you can also use redundant data file systems. Again, the copying of the data can be done by the bootloader; the migration should be done by the application, because it has the best view of what data it requires. Mounting is a bit more tricky, because in the system you have to find out which of the data partitions belongs to your actual root file system, and not mix them up. And falling back is again simpler, but, depending on your use case, the fallback system then accesses old data. So you have to make clear whether it is a valid use case for your application to access old data, or whether you don't want to allow fallbacks at all.

Also, I guess I am frequently asked how updating performs with verified or trusted boot. Just a short note about this, because in many cases it is orthogonal to updating. Taking dm-verity as an example: there you create your root file system image and the hash tree — the Merkle hash tree — for the root file system on your build system, and for the update system itself it's just like writing data to your block device, so it is transparent to us. If you're using dm-integrity, then the handling is covered in your currently running system, and you can use, for example, a tar extraction to update your system. The running system will handle all the mapping itself, because it's like simply extracting a tar to an ext4 file system, and the dm-integrity layer below handles the journaling and the tag creation and so on.

Well, as I already said, it is important for robust updating that you test your updates — and that is not only testing software, it is testing the entire updating chain.
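To make that concrete, here is a sketch of such a full-chain test in pytest style. The `SimulatedTarget` class and its commands are stand-ins invented purely for this illustration — in a real setup you would talk to the actual hardware, for example through labgrid's drivers, instead of a simulation:

```python
# Sketch of a full-chain update test, loosely modeled on the concepts
# of a board-test framework (shell access, power control). All names
# and commands here are assumptions for the illustration.

class SimulatedTarget:
    """Stand-in for real hardware: two slots, A running, update goes to B."""
    def __init__(self):
        self.booted_slot = "A"
        self.installed = {"A": "app-1.0", "B": None}

    def run(self, command):            # takes the role of a shell driver
        if command == "rauc install update.raucb":
            self.installed["B"] = "app-2.0"
        elif command == "cat /proc/cmdline":
            return f"root=/dev/slot{self.booted_slot}"
        elif command == "app --version":
            return self.installed[self.booted_slot]

    def power_cycle(self):             # takes the role of a power driver
        # The bootloader selects the freshly installed slot.
        if self.installed["B"] is not None:
            self.booted_slot = "B"

def test_update_chain():
    target = SimulatedTarget()
    target.run("rauc install update.raucb")   # install the update
    target.power_cycle()                      # reboot into the new slot
    # Are we running from the other root file system?
    assert target.run("cat /proc/cmdline") == "root=/dev/slotB"
    # Is the new application version up and running?
    assert target.run("app --version") == "app-2.0"

test_update_chain()
```

The point is the shape of the test — install, power-cycle, verify bootloader choice, verify the running system — not the simulation itself.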
Testing the chain means: from the start of the system, through the update cycle and the reboot, to the newly running system. This is quite tricky, because you have to interact with the hardware, and a good framework that provides this functionality is labgrid. It is based on pytest, and it has abstractions for interacting with shells over serial lines: a shell driver that knows how to deal with a Linux shell; a barebox driver that knows how to communicate with barebox — how to type something into the shell, get variable values and so on; and a power driver that allows you to power-cycle the system. A simple test case for your device could then be: the test gets a target, triggers an update to install, power-cycles the device, and then tests in the bootloader — did it select the other root file system partition? — and tests in Linux: am I running the right root file system, is the new application there, is every service up and running, and so on.

Well, normally, when we want to perform updates over the network — remote updates — there are two major issues we often come across. One is that our updates are too large: we often have devices with only a constrained connection, so an update takes very long and wastes a lot of data. The other is that we require temporary storage for the update image: somewhere on the target where we can download it and store it before actually installing it. What helps against both is delta updates. This is an example taken from the RAUC update framework: we wanted to have delta update functionality over the network, and we didn't want to reinvent the wheel — and then suddenly casync popped up. According to Wikipedia, casync is a Linux software utility designed to distribute frequently updated file system images over the Internet. Sounds like something we could use.
So let's have a look at what it actually does. casync is basically a chunking algorithm: you take your block device, or a full directory tree, and casync creates a serialized stream of it and then splits that stream into small chunks — in a reproducible way, and in a way that lets you later compare the chunks against those of similar images. It then creates hashes of the different chunks and stores them, in order, in an index file, while the chunk data itself is compressed and stored under the name of its hash in a chunk store. And now we can reverse this and extract it: if we extract a casync index file to either a block device or a directory tree, casync scans through the index file, looks up each chunk in the chunk store, fetches it over the network, and directly deserializes it and writes it to the device. So there is no temporary storage required; we have remote access, because casync brings all the remote functionality required to work over HTTPS, HTTP, SFTP and so on; and we can write sequentially to the device.

Now, how do we currently use this in RAUC? When we want to update from the running slot A to slot B, we perform the same chunking algorithm that was performed on the update image on the slot itself, and store the information about where to find the data for each chunk in a so-called seed store on our target. When installing an update — which here is basically a casync index file — we scan through all its entries and first take a look at our seed store to see whether we can fetch the data locally. If it is there, we get the data from slot A and write it to slot B; only for the chunks that differ from the currently running system do we make a remote access and fetch that small chunk from the chunk store server.
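A minimal sketch of this seed-store idea in Python — note that real casync uses content-defined chunking and compressed chunk files on disk, while this illustration uses fixed-size chunks and in-memory dictionaries to keep it short:

```python
# Sketch of the seed-store idea: re-chunk the currently running slot
# locally, then fetch only the chunks that are missing from it.
import hashlib

CHUNK_SIZE = 64  # illustrative only; casync chunks are content-defined

def chunk_index(data: bytes):
    """Return the list of chunk hashes (the "index") and a hash->chunk map."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    store = {hashlib.sha256(c).hexdigest(): c for c in chunks}
    return [hashlib.sha256(c).hexdigest() for c in chunks], store

old_image = b"A" * 64 + b"B" * 64 + b"C" * 64   # slot we run from (the seed)
new_image = b"A" * 64 + b"X" * 64 + b"C" * 64   # update to install

index, remote_store = chunk_index(new_image)    # served by the chunk server
_, seed_store = chunk_index(old_image)          # built locally from the seed

fetched = 0
output = b""
for h in index:
    if h in seed_store:          # chunk unchanged: reuse local data
        output += seed_store[h]
    else:                        # only this goes over the network
        output += remote_store[h]
        fetched += 1

assert output == new_image
assert fetched == 1              # only the one changed chunk was fetched
```

Because identical chunks hash identically, only the middle chunk — the one that actually changed between the two images — has to cross the network.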
So this is basically how it works.

Another topic that you care about when you have a lot of devices is field deployment. If you have your update and deploy it to your large field of devices, and there is a bug in it that you didn't discover during testing, then once you have updated all your systems you might say: oh damn, I bricked almost half of them. So it is good to have a proper deployment strategy, and for this a good open source tool exists too, called hawkBit. hawkBit is basically a deployment server. It provides a web UI for the user, where you can manually configure updates or configure complete rollouts — we'll see that in a moment — and a management API that allows you to do the same automatically from some other backend. On the other side it has a device integration API: this is where the devices register with hawkBit and poll at a fixed interval to check whether new updates are available for them.

Taking our scenario, hawkBit supports deployment strategies: for example, I could split the entire set of targets into three groups and say, OK, I set an error threshold of 50 percent, which means: if at least 50 percent of the devices come back alive, I continue updating. What hawkBit then does is start updating the first group; after being updated, the targets report their status back, and OK, we see more than half of them came back successfully, so hawkBit starts to schedule the next group of updates. In that group we see: oh, two devices failed — the failure rate is above the threshold — so we say: stop, something went wrong. We stop our rollout, and we will not kill the remaining targets with a broken update.

So, what have we seen? We have these update frameworks; they solve many challenges we had in the past, but they don't solve all challenges.
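Coming back to the staged rollout: the threshold logic described above can be sketched like this (the group sizes, the threshold and the simulated fleet are invented for illustration):

```python
# Sketch of a staged rollout with an error threshold, as described for
# hawkBit above: update group by group, and stop as soon as a group's
# failure rate exceeds the threshold.

def staged_rollout(groups, error_threshold, update_device):
    """update_device(d) returns True if device d came back alive."""
    updated = []
    for group in groups:
        results = [update_device(d) for d in group]
        updated.extend(group)
        failure_rate = results.count(False) / len(results)
        if failure_rate > error_threshold:
            return updated, "stopped"   # don't brick the remaining groups
    return updated, "finished"

# Simulated fleet: devices named "okN" succeed, "badN" fail to come back.
groups = [["ok1", "ok2", "ok3"],
          ["ok4", "bad1", "bad2"],      # 2 of 3 fail -> above 50 percent
          ["ok5", "ok6", "ok7"]]

updated, state = staged_rollout(groups, 0.5, lambda d: d.startswith("ok"))
assert state == "stopped"
assert "ok5" not in updated             # the last group was never touched
```

A broken update thus reaches at most the groups updated before the threshold trips, instead of the whole fleet.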
You still have to know what your system requires, how your application behaves, which parts of the system you want to monitor, what you have running on the system, how to properly configure watchdogs and rebooting on failures, how to interact with your hardware, and so on. So it is not just stacking different components — an update tool, a deployment server, casync and so on — you have to properly configure all these components so that they work together and give you a fully robust update system.

So, coming to the end: are there any questions? I think we have a few minutes left. Otherwise, if you want to discuss a bit more: tomorrow at half past one I'll be at the OpenEmbedded stand. There is a small updating demo that shows RAUC in interaction with hawkBit. So if you'd like to come and see it, and if you want to be sure that you can discuss with us, just visit us tomorrow at 1:30 p.m.

Yes, well, it depends on how you actually build the system — on what you mean by a secure system. The question was whether someone could change something in the system. Well, the updating itself — if you use RAUC, and most other tools too — is verified. So only those who have the right key for signing the image will be able to create images that will be accepted by RAUC. That is the update-system part; on the target, you can then prevent somebody from accessing it by using trusted boot and all the mechanisms that I just briefly came across. Does that answer your question, basically? OK, great.

Yeah, can you speak up a bit? I didn't get it. So the question was about the smallest system it works on. A bit about dependencies: RAUC uses GLib, so it has to be a system that is able to run GLib, and it has to be a system running Linux. RAUC itself is quite small. If you do the full A/B setup, that needs more storage; you can also have a recovery setup or something like this, which needs less storage. I think GLib is about seven megabytes or something like that, and RAUC itself is a few kilobytes.

Do we still have time for questions? Otherwise, thanks for attending my talk.