All right, can everybody hear me all right? We got sound this time. We got sound in the stream. Very good. All right, I guess we're good to get started here.

Welcome, and thank you all for coming out. I know your time is valuable, and I appreciate you coming to hear what I have to say. I'm here to talk about implementing power-safe atomic over-the-air updates, as the slide says. I want to thank my company, Toradex, for sending me here, because they're covering the tab. It's my first time in Japan, and I'm happy to be here actually seeing people in person. It's nice to pick things back up after being on lockdown for so long. And I definitely want to thank the organizing committee for putting this together and getting us all out here to hopefully learn something.

All right, so just real briefly, because I'm kind of obligated to mention it: we do have a booth here. I work for Toradex. We sell hardware. A lot of this talk is based on some of the architecture that we use in our Torizon operating system and platform. That's a full end-to-end system that's intended to be used in embedded, IoT, industrial controls, that kind of thing. And the architecture that we use for our over-the-air updates is what this talk is based on. It's built on a third-party open-source package called OSTree. So just to give you the background, that's where the meat of this topic has come from: our work with the Torizon operating system and using the open-source OSTree infrastructure to develop the secure atomic over-the-air update system that we use in our platform.

Real briefly, since it seems we all have to kind of introduce ourselves, so you know a little bit about me and maybe why you should listen to what I have to say: I've been in the embedded space, and embedded Linux specifically, for about 15 years, and embedded in general much longer than that. Like I say, I work for Toradex. I'm currently a solutions architect helping customers implement our solution in their design.
So we sell them the hardware. We also have software. And my role is to figure out what portions of our system are appropriate for their environment and help them make best use of all the features that are in our system.

Briefly, what we're going to talk about today is, obviously, a little bit of background on what OSTree is and how it's used in the over-the-air update space. We're going to define what it means to be power safe and how OSTree implements that. Then we'll talk a little bit about the architecture in OSTree and how, specifically, it allows you to get these over-the-air updates in a safe manner. And then I've got a short demo, which is really just a video that I had to take. The timing on most of these things is pretty tricky; the system will finish and reboot before I can actually demo it live. So I've got some video that can show a few of the components of this system to help you understand how it's useful for this environment.

So, just some background to motivate this. The simple fact is that software updates are no good. We all know this. We've all stared at these screens for many, many hours. And while you might be able to get away with this on your cell phone or your desktop or laptop system, it really doesn't scale well when you're talking about an industrial IoT system where you might be deploying hundreds, thousands, or, if you're lucky, tens of thousands of devices. You need something that is much more robust, much more reproducible, guaranteed to be atomic, and that kind of thing. We do need a better way to handle updates. There's been a lot of talk, and a lot of different software packages implementing updates, over the last five or six years. It's a hot topic, certainly, in the industrial and IoT spaces. And the simple fact is we need to do better.

So why do we care about updates? I think we all understand the basic motivation for needing updates in the field.
There are patches, not just to the kernel, but really to all software that's on your system. We know there are lots of critical vulnerabilities that are found all the time. We need to add additional drivers. Sometimes we need to add new hardware support. Our users might have a USB device that they add, and they want a device driver for a specific device that may not have been tested at the time you produced the system. Custom firmware binaries are constantly getting updated. So for your Wi-Fi chip, there's a custom firmware binary that the manufacturer might push an update for, and you want to make sure that your end users get access to that. And just other vendor updates: think Spectre, Meltdown, all those kinds of things that resulted in lots of changes at the firmware level and in the kernel. And then, obviously, user-space stuff: Heartbleed. There have been quite a few documented cases in the last few years that make it all that much more obvious.

One thing that a lot of people don't think about when they're thinking about updates, though, is that you can actually use updates to deploy new features and therefore generate more revenue. Tesla has been a good example of this with some of their add-ons that you can add after the fact. You can buy them when you buy the car, but if you decide you don't want it then and you come back later, they'll sell it to you for even more than it would have cost you to buy it in the first place. And they're happy to do that. And because they've got a full-fledged OTA system, they're able to actually deploy that. So it's very desirable from a business perspective to be able to deploy these updates, because you can make more money.

And just in general, all bug fixes, obviously. That's the main impetus for adding OTA. The first step to securing a system is being able to update it, because the simple fact is all software has bugs. And I think we all know that.

So what is OSTree?
So OSTree is an infrastructure piece of code that we use in Torizon. And it's used in a number of other implementations of similar systems. This quote you see on the screen is directly from their documentation. There's a lot of words here: it's both a shared library and a suite of command-line tools, yada, yada, yada. But there are a few important key points that I want to point out.

It uses a Git-like model. So if you're familiar with how Git stores the source control information for your source code using a content-addressable file store, OSTree does a very similar thing. Each file is essentially associated with a name that corresponds to its hash, so that if I then try to upload that same exact file again, the system is able to very easily detect that it's the same exact content as another file. It doesn't need to deploy that object. It's able to easily locate it by the hash of the file. So if you're familiar with Git, you'll be well on your way to understanding how OSTree does things.

Where it differs from Git is that OSTree is really meant to manage bootable file system trees. Git is managing just an arbitrary collection of files, whereas OSTree is really designed around bootable file system trees. And one result of that is you don't really have the concept of branching and merging in OSTree. You're not gonna go work on another branch and then come back and merge it together at the OSTree level. You might do that back at the source code level that generates the OSTree commit, but for actually managing the bootable file system trees on the device, it's pretty much going to be a straight, linear history of the device. Sometimes you might have different versions that you put in there that may not be completely linear, but you're never gonna be merging back and forth within OSTree.

And it also manages the bootloader configuration.
That's kind of a big one: because it is managing bootable file system trees, all of the configuration needed to select the appropriate bootable file system is handled through the bootloader configuration, as part of the Boot Loader Specification that you see here, which is part of freedesktop.org's definition.

So, a little bit more detail here. I mentioned the content-addressable file store, and so all the files, or as OSTree calls them, objects, are stored indexed by their checksum. Like I say, the system will take the checksum of the file and use that to generate a file name, with some level of directories so that you don't have all the files in one directory. And then they are checked out via hard links. That's the key here. When you have a bootable file system tree, each of the files in that tree is a hard link to the appropriate object in the repository. So you have the repository, which contains all of these checksummed, content-addressable files just in a directory. And then what they call a hard link farm is generated to reference one specific release, and each of the files in that hard link farm is a hard link to the object in the repository itself.

By necessity, one of the characteristics of an OSTree bootable file system tree is that it is immutable. It has to be read-only. It has to be managed by OSTree. So you're not gonna go in and just modify files directly in one of these file system trees. OSTree does have mechanisms for handling things like configuration files in the /etc directory, and there are other things that are ignored as part of the OSTree-managed files. But for any file that is managed directly by OSTree, the only way you will update it is to deploy a new version of your bootable file system. As I said, it does provide a mechanism to commit and check out branches. Occasionally you'll get branches, but you're never gonna get a merge back.
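As a rough illustration of the idea just described, here is a minimal Python sketch of a content-addressed object store with hard-link checkouts. It is not OSTree's actual code or on-disk format: the function names `store_object` and `checkout`, the use of SHA-256 over the raw file bytes, and the two-character directory fan-out are assumptions made for the example, and real OSTree objects also carry metadata that is ignored here.

```python
import hashlib
import os
import tempfile


def store_object(repo_dir, data):
    """Store bytes under a name derived from their checksum (hypothetical layout)."""
    checksum = hashlib.sha256(data).hexdigest()
    # Fan out on the first two hex digits so one directory does not hold every object.
    obj_dir = os.path.join(repo_dir, "objects", checksum[:2])
    obj_path = os.path.join(obj_dir, checksum[2:])
    if not os.path.exists(obj_path):  # identical content is stored only once
        os.makedirs(obj_dir, exist_ok=True)
        with open(obj_path, "wb") as f:
            f.write(data)
    return obj_path


def checkout(repo_dir, tree, deploy_dir):
    """Build a hard link farm: each file in the tree is a hard link to its object."""
    for rel_path, data in tree.items():
        target = os.path.join(deploy_dir, rel_path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        os.link(store_object(repo_dir, data), target)


root = tempfile.mkdtemp()
repo = os.path.join(root, "repo")

# Two releases that differ in one file and share the other.
v1 = {"usr/bin/app": b"app v1", "usr/lib/libfoo.so": b"lib"}
v2 = {"usr/bin/app": b"app v2", "usr/lib/libfoo.so": b"lib"}
checkout(repo, v1, os.path.join(root, "deploy", "v1"))
checkout(repo, v2, os.path.join(root, "deploy", "v2"))

object_count = sum(len(files) for _, _, files in os.walk(os.path.join(repo, "objects")))
shared = os.path.samefile(
    os.path.join(root, "deploy", "v1", "usr/lib/libfoo.so"),
    os.path.join(root, "deploy", "v2", "usr/lib/libfoo.so"),
)
print(object_count, shared)
```

Four files across two releases end up as three stored objects, and the unchanged library in both trees is the same inode on disk. That sharing is also why the checked-out trees have to be treated as read-only: writing through one hard link would change the file in every release that references it.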
The Boot Loader Specification that's available at that freedesktop.org link basically tells the system how to automatically discover what versions are available on a system, where the kernels are located, where the device trees are located for a specific version, and that kind of thing. It does support both GRUB and U-Boot. For all of the systems that we support, we use U-Boot on all our Arm-based systems, but it can also support GRUB, and there are a number of desktop distros that are using it in that fashion.

And it's important to know that it is a set of user-space tools that run at the command line. There is not an active runtime that is associated with OSTree. It's all based on the hard links and standard file system features, so you don't have an active runtime that's slowing down your system or anything. It can run on any file system that supports hard links, and because of the way that hard link farm is set up, your performance is just the same as if you weren't using OSTree. There's no performance hit to using it, except maybe a little bit in initialization time at boot, when it's processing and doing the auto-detection of which versions are available and selecting the appropriate bootable file system tree.

So this kind of repeats a lot of what I just said. Just a couple more points, though. Normally the root of your file system is mounted as /, the root. In an OSTree-based system, you actually bind mount / to somewhere else in the actual file system. It's kind of an Inception-style indirection that's a little bit awkward to get your mind around, but once you've started playing with it, it's really not that complicated. To actually be able to do all this processing, it does require an initramfs. So the system will boot and launch the initramfs. The logic in the initramfs is what does all the manipulation of the bootable file system trees and sets up that bind mount for the root file system.
So I think this picture kind of helps explain it. The diagram on the left is the actual raw files that are in the system. You've got the files under /boot, and this is where all the bootloader configuration is stored. And then you have /ostree, and that's what I referred to earlier as the repository. Underneath /ostree, under repo, is where all the objects are stored. So if you look down inside /ostree/repo, you'll see directories whose names are just hexadecimal numbers, and then the files are just longer hexadecimal strings that refer to the checksum of each file.

And then under the deploy directory, that's where we start getting into these bootable file system trees. So there's deploy, then a specific OS identifier, then another deploy directory, and underneath there you will have multiple commits, and each of those commits is essentially the entire bootable file system tree.

The runtime view on the right is where that bind mount comes in. You've got the files on the left, but the runtime view is that / is actually looking down at one of the specific commits inside that OSTree repository.

And then, just so you're aware, there's also the /var directory. Note that in the repository view, var is actually outside of the commit object. So in this case, /var is persistent data that's essentially unmanaged by OSTree, so anything you put in that portion of the directory tree will show up and persist across versions and across updates. For instance, this is where we store our container volumes and configuration and that kind of thing. Any user data that's part of the system goes under /var.

And then I will mention, I don't think there's actually a slide on it, but there's also some special handling for /etc.
So you see here, we've got /etc in the managed portion of the repository, but what actually happens is that there's a read-only version of /etc and then there's the runtime version, which is read-write. On any update, the system will do a three-way merge between the previous version, the new version that you're installing, and what's currently on the device. So any changes that your system makes in the field will be maintained via that three-way merge of the /etc directory.

And just briefly, some users of this. I mentioned that there are a number of desktop distros that are using it: Fedora Silverblue and Fedora IoT, notably. I've not personally used them. I know some people who have and like them a lot. And then there are a number of others that I have not played with. I come from the embedded Linux space, so we're focused mostly on Yocto-based distributions. You've got the meta-updater layer that can be included in a Yocto configuration to provide all the classes and configuration needed to add OSTree support to your build. And then there are some other interesting uses that go up to the package management level, things like rpm-ostree and Flatpak. They use OSTree under the hood to essentially reimplement the standard RPM view that we have of the world today. Again, I've never played with them, and I'm not sure how they work, but the intent there is to give you much more atomic updates and rollback, which has always been an issue when you're dealing with package-based updates.

And there are some alternatives to something like OSTree, at least as far as implementing over-the-air updates. The one that most people have heard of is the dual A/B, dual-bank redundant partition setup that a lot of the OTA systems out there use today. Obviously, the downside of that is it takes up additional disk space. Then there are package managers: there are systems where you can just go in and do an apt update and apt dist-upgrade.
That doesn't scale real well, because it is hard to manage atomically. It's hard to guarantee that the exact same set of packages you tested in your lab is exactly what is on all your devices in the field. And then you can do some things with containers. systemd has some container infrastructure that's part of systemd itself. Again, I'm not terribly familiar with it, so I can't give you the details.

But to give you some of what we see as the advantages of OSTree, the first one is space saving. The fact that it's got automatic deduplication based on the checksums makes it very space-efficient, both in terms of the space on the disk and the amount of bandwidth you need to download any new update. OSTree is very capable of figuring out exactly which objects you need and transferring only those files that you don't already have or that have changed. Whereas when you're doing dual A/B, you might be able to do a binary delta to transfer less data, but you still have to write the entire partition. So OSTree is gonna save you significant time both in download and in updating the flash.

Another very important thing is that integrity can be verified. The checksums and cryptographic validation of all these objects are integrated into the system. You know that if your system has downloaded completely and the checksums check out, you have exactly what you're expecting to have. You don't have to worry about man-in-the-middle attacks and things like that. That's all integrated as part of the system.

And yes, I'm sorry?

[Audience] Are you signing the list of changes and verifying whether something has been altered that way?

As far as I know, yes. I can't tell you all the details of it, because it's a little bit outside of my scope. I haven't delved into the details of that. But I know, for instance, on the Git side of things... let me repeat the question. I forgot to do that.
So the question was: when you're dealing with the cryptographic checksums and that kind of thing, what kind of guarantees does the system have that they can't be modified? And it's very similar to the way Git handles things, so that if the file is changed, the checksum is gonna change. And then there are signatures based on those checksums and that kind of thing. So they all feed in to be able to guarantee that the files are unmodified. I know I didn't really give a thorough answer, but that's essentially it.

Another thing that we find very valuable, the last thing on this slide, is the fact that it is immutable. The system is read-only. We know that our devices are running the exact version of all the files that we have tested in our lab. We don't have to worry about package drift, where somebody modified something or installed a package. They either install everything from a new update or they install nothing from a new update. There's no way for them to just do a half update and get two packages out of six or whatever. So that gives us kind of a rudimentary revision control of the bootable file system tree on the device, which helps, especially when you're dealing with large numbers of devices.

So what does power safety mean? The issue with a lot of these devices is that they are generally deployed in environments that you as a system developer don't have control over. Our customers range from people putting things in public spaces all the way to very tightly controlled lab environments and industrial controls environments. So the range of customers we have runs the gamut, but in general you wanna think of these devices as pretty much out of your control. They could be attacked at any time by network attackers. Somebody could reach over and grab the plug and yank it out of the wall at the wrong time. And it might even be very expensive for your team, for your operations staff, to access them.
They're generally not on site with your team, so it could be very expensive. And the idea is you wanna make sure that when you are deploying updates, any of these kinds of situations that can interrupt the update do not leave you with a bricked system. The simple fact is the cost is just too high. So you wanna make sure that the update is either completely installed or not installed at all. No component outside of the update system even really needs to be aware that an update is being installed until the update has downloaded and been properly propagated through the file system.

So, a couple of things that the OTA system has to do. It has to detect these failures and know when they happen, and it has to be able to clean up after itself. It has to be able to handle OS changes to that bootable file system tree as a completely atomic operation. And it has to be able to handle automatic rollback. That's also very important, right? Say I deploy a bum update with a kernel that oopses on boot. I don't wanna leave the device sitting there just waiting, so you wanna use watchdogs and things like that, and have logic in that initramfs that's going to detect that and be able to roll back to the previously known good configuration.

So, looking at the OSTree update states: these are the main states, greatly simplified, of an OSTree update. On the left in blue you have anything that happens in the context of the old deployment, and on the right in green is anything that happens in the context of the new deployment, okay? When you're getting ready to do an update, obviously the first thing you have to do is fetch. Then you assemble that deployment directory that we looked at inside of the repository, then you do that three-way merge with /etc that I mentioned, and then you'll switch over to the new deployment and run from there.
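To make the three-way merge step in that sequence a bit more concrete, here is a minimal Python sketch over in-memory dictionaries that map a path under /etc to its contents. It is an illustration only, not OSTree's implementation: the function name `merge_etc`, the sample file names, and the policy that any local change (edit, addition, or deletion) wins over the new release's default are assumptions made for the example.

```python
def merge_etc(old_default, new_default, current):
    """Three-way merge of /etc, modeled as dicts of path -> contents.

    old_default: pristine /etc shipped with the old deployment
    new_default: pristine /etc shipped with the new deployment
    current:     live /etc on the device, including field changes
    """
    merged = dict(new_default)  # start from what the new release ships
    for path in set(old_default) | set(current):
        base = old_default.get(path)
        local = current.get(path)
        if local == base:
            continue  # untouched in the field: keep the new default
        if local is None:
            merged.pop(path, None)  # deleted in the field: keep it deleted
        else:
            merged[path] = local  # edited or added in the field: keep it
    return merged


old_default = {"hostname": "device", "ntp.conf": "pool a", "motd": "hi"}
new_default = {"hostname": "device", "ntp.conf": "pool b", "motd": "hi", "new.conf": "x"}
current = {"hostname": "line-7", "ntp.conf": "pool a", "wifi.conf": "ssid"}

merged = merge_etc(old_default, new_default, current)
print(sorted(merged.items()))
```

In this example the device keeps its locally set hostname and its added `wifi.conf`, picks up the new release's `ntp.conf` and `new.conf`, and the locally deleted `motd` stays deleted.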
So let's talk about each state individually, just so we can understand how OSTree is able to detect failures at any of these states.

Fetching is pretty straightforward. All new objects are fetched over HTTP or HTTPS. There's compression built in. Any existing objects don't need to be redownloaded. Fairly straightforward stuff, everything you would expect. Checksums are verified, and then the objects at this point are stored raw inside your repository, under that /ostree/repo/objects directory. At this point they're just objects getting added to the repository. You haven't done any configuration. You haven't actually built up the bootable file system tree yet.

And there is an extra feature: you can enable a binary delta per file. So depending on what files are in your root file system, you might even be able to get a little bit more savings in download time by using per-file deltas. Out of the box, OSTree generally does not enable that. Since it already downloads only the needed objects, you get quite a bit of savings there, and so the additional savings from implementing per-file binary deltas are minimal.

If a power failure occurs during this stage, there are a couple of things that happen. First, any objects that are downloaded are fetched to a staging location, and that location is based on the current boot ID. Each time the system boots, a unique boot ID is generated, and the objects that are downloaded are gonna be indexed based on that. If the power fails during that fetch to the staging location, and this is one of the things we'll see an example of, the system is able to clean that up. Once all the objects have been successfully fetched, they can be moved out of staging and the staging area cleaned up.
And then when the system boots, it can inspect the staging area, and if there is something associated with a boot ID that is not the current boot ID, it's able to say that there must have been some kind of power interruption, and it is able to clean up. Then any active deployments that are ongoing from the server at that point will be redownloaded and the objects retransmitted.

The next step is assembling that deployment directory. This is where we create the hard links that make up the bootable file system tree. So we're going to create a new deployment directory based on a checksum that is created cryptographically, tying all the components together. This is where all the hard links are created for each file, pointing to the specific object in the repository. And the ability to detect failures in this case basically all comes down to a single symbolic link that is created at the very end. If that symbolic link has been created, then this phase has completed and we're able to switch to the new deployment. If that symbolic link does not exist but some of this deployment directory stuff exists, then the system is able to detect that a power failure has occurred. So it's all tied to the atomic creation of that symbolic link at the very end of this phase.

And then, after we've created a complete new deployment directory, we have that three-way merge of the /etc directory. The system is able to look at the unmodified /etc configuration files from the old deployment, the unmodified /etc configuration files from the new deployment, and any files that have been modified in the active running system, and it is able to do a proper three-way merge to make sure that any new changes get pulled in, any changes that have been made on your devices in the field get pulled in, and all those changes persist properly. Similar to the previous phase, a .origin symlink is used to determine when this phase has completed.
When the three-way merge starts, that file doesn't exist. Once the three-way merge completes, the system is able to create that symlink, and so it is able to detect on which side of the power failure you were sitting at this point.

And so now, once that has completed, we're ready to move into the new system. At this point we're gonna update the boot configuration. Now we're in the context of the new system, even though we haven't booted yet. We've got everything we need downloaded to the system, and we've got everything staged properly so that we can boot into the new system. So now we're just going to modify the boot environment so that on the next boot we'll select the new version. A new boot directory will be created in this case, with an extension for the index. In the case of Torizon we only ever have two: the current and a new. OSTree supports an arbitrary number, but for us that index is essentially zero or one, and it's just gonna ping-pong back and forth between them. And then, similar to the other phases, there is a symlink created once all of the configuration is updated. That symlink is just ostree/boot, and it will point at whichever index we're going to be booting the next time around. So until that symlink is created at the very end of the process, the system will stay in the old version. If it gets past the point where it is able to create that symlink, then when the system reboots it will be in the new version.

And then the next step is the reboot. That's pretty straightforward. Of course, the reboot could actually happen anywhere in here. This is only showing the successful reboot, but if the reboot happens at any of the other phases, the system has ways, with those symlinks and with the boot IDs, of detecting where in this process it was and being able to clean up and retransmit as appropriate.

So with that, I've got just a few minutes left for demos. Let me just see if I can pull this up here.
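The "commit point is a single symlink" idea described above can be sketched in a few lines of Python. This is a generic POSIX pattern, not OSTree's actual code: the helper name `switch_symlink` and the `.tmp` suffix are hypothetical, and the `loader.0` / `loader.1` directory names are borrowed from the demo purely for illustration. The sketch relies on `rename()` replacing the destination atomically on POSIX file systems, so a reader sees either the old link or the new one, never a half-written state.

```python
import os
import tempfile


def switch_symlink(link_path, new_target):
    """Atomically repoint link_path at new_target (POSIX sketch)."""
    tmp = link_path + ".tmp"  # hypothetical temporary name
    if os.path.lexists(tmp):
        os.remove(tmp)  # leftover from an attempt interrupted before the rename
    os.symlink(new_target, tmp)
    os.replace(tmp, link_path)  # rename(): the single atomic commit point
    # Flush the directory entry so the switch survives a power cut (assumes POSIX).
    dir_fd = os.open(os.path.dirname(link_path), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


boot = tempfile.mkdtemp()
os.mkdir(os.path.join(boot, "loader.0"))
os.mkdir(os.path.join(boot, "loader.1"))
link = os.path.join(boot, "loader")

switch_symlink(link, "loader.1")  # currently booting index 1
before = os.readlink(link)
switch_symlink(link, "loader.0")  # new configuration prepared in index 0
after = os.readlink(link)
print(before, "->", after)
```

If power is lost before the `os.replace` call, the old link is still in place and only a stray temporary link needs cleaning up; if it is lost after, the new target is already selected for the next boot.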
Yeah, so this is the first one. I hope you can see it in the back; it's a little bit small. What I'm running here is a Torizon system with OSTree in it, and all I've got displayed here is the staging directory. So this is just periodically viewing the staging directory during a download of a new version. I wanna say it takes a while before anything happens. Thank you.

So at some point here we will start to see things showing up in the staging directory, and what I've done in this particular instance is, once the staging has started, I just yank the power on the device. We see now that in the staging directory our files are based on the boot ID, which is that big long string. I know it's kind of small and you probably can't read it in the back, but you can probably tell there's a long hexadecimal string there, and that's the current boot ID.

If we let this run a little bit more, at some point I reboot, so we start to see the reboot here. And then when the system finally gets booted, I log in and we look at the staging directory, and in a moment we'll see that the staging directory has gotten cleaned up. It knows that the staging was from our previous boot ID, or rather, all it knows is that it's not from the current boot ID, and so it just deletes it and we'll start again. So we'll see another staging directory show up here in a minute, if I didn't cut off the video before that happened. It cleans everything up, but the server-side component in this case is still deploying. And I guess I killed it before it actually got done.

So that's the basics of power safety during a download.
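The cleanup behaviour seen in this first demo can be sketched as follows. This is a simplified illustration, not OSTree's implementation: the `staging-<boot id>` directory naming and the helper names `staging_dir` and `clean_stale_staging` are assumptions, and boot IDs are simulated with random UUIDs. On a real Linux system the kernel exposes the current boot ID at `/proc/sys/kernel/random/boot_id`.

```python
import os
import shutil
import tempfile
import uuid


def staging_dir(staging_root, boot_id):
    """Staging location keyed by the boot ID of the boot that started the fetch."""
    return os.path.join(staging_root, "staging-" + boot_id)


def clean_stale_staging(staging_root, current_boot_id):
    """Delete staging left behind by any boot other than the current one."""
    removed = []
    for name in sorted(os.listdir(staging_root)):
        if name.startswith("staging-") and name != "staging-" + current_boot_id:
            shutil.rmtree(os.path.join(staging_root, name))
            removed.append(name)
    return removed


staging_root = tempfile.mkdtemp()

# First boot: a download starts, then power is cut mid-fetch.
old_boot = str(uuid.uuid4())
os.makedirs(staging_dir(staging_root, old_boot))
with open(os.path.join(staging_dir(staging_root, old_boot), "partial-object"), "wb") as f:
    f.write(b"incomplete")

# Next boot: a new boot ID, so the old staging directory is known to be stale.
new_boot = str(uuid.uuid4())
removed = clean_stale_staging(staging_root, new_boot)
os.makedirs(staging_dir(staging_root, new_boot))  # the retried download stages here
remaining = os.listdir(staging_root)
print(removed, remaining)
```

The check needs no record of what went wrong: a staging directory that does not carry the current boot ID can only have come from an interrupted earlier boot, so it is safe to delete and start the download again.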
Now the next one is the slightly more complicated one: the power cycle that happens during the deploy and activate phase. Let me just pause right there so you can see a number of different things. You see the symlinks that I was talking about, which point to loader ending in .1, and then you see we have a couple of different boot-ID-based repositories, and we start to see new things getting created here if I come forward a little bit.

Okay, now we're starting to see that the system has downloaded all the objects and is creating that new bootable file system tree. And then we also see that we've got loader.0, but the loader symlink still points to loader.1, so that means the system is actively setting up the boot configuration. And then at some point, yeah, now we see loader points to loader.0 and the system reboots. So the system is then able to clean up and retry the downloads as needed.

So with that, I think we have just a few minutes for questions. Let me get to the last slide here, because there are a number of resources and links. The slides are up on the conference website, and this last slide, if I can get through all of this, has a number of documentation links and things that are potentially very useful. A lot of this talk came out of the next-to-the-bottom link; there are a lot more details about this in that blog post.

So with that, we've got, I guess, just five more minutes for questions. Anybody have any questions? Yes.

[Audience] After deploying onto the edge device and then rebooting into the new deployment, do the old parts remain on the device?

Yeah, so the question is: when you're installing a new update, what is done to clean up files that are no longer needed?
So the way that repository works is it's just a bag of bits. You've got a bunch of files in there: for one release you've got one set of files, for another release you've got another set of files, and some of them may be the same. In general OSTree won't clean up at that point, any more than Git would clean up an old version unless you explicitly tell it to. But there is a mechanism within OSTree to prune old releases, and so with our system, once we have detected that the new update is installed and all the user-installed post-install checks have completed, we will prune out any unneeded versions to help save that disk space. But we obviously keep the old version around until then, because if we have to roll back, we've got to make sure that we've got it. So that prune only happens once we have completed all the post-install and sanity checks. Yes?

[Audience] Sometimes the system requires file system upgrades, for example from FAT32 to NTFS, that kind of stuff. So how does OSTree deal with those things?

Simple answer is it doesn't. In our system, anything that requires manipulating the partition table is pretty much unsupported. We are getting ready to roll out a bootloader update feature, but that uses eMMC boot blocks and a few other things that are part of the hardware. Once these devices are deployed in the field, changing anything at the partition level is essentially impossible to do completely safely, so we don't allow that. But I will say that this is one of the benefits of using OSTree versus the dual A/B mechanism: with the dual-partition mechanism, you have to define the sizes of those partitions at the very beginning of the system's lifetime, whereas with OSTree you have a single partition, so you don't have to worry about the sizes changing. But yes, if you had to do something like move from ext4 to ZFS for whatever reason, you're gonna have to find an external method to do that, because you just can't do it robustly.
All right, and I think we've got just another minute or two for questions.

[Audience] Is there a plan for something like dm-integrity?

I'm sorry, a plan for what?

[Audience] Is there a plan for something like dm-integrity, where you have a runtime check on your data that nothing has been either maliciously or unintentionally damaged?

Yeah, so that's a great question, and it's something I didn't point out. All these safety mechanisms built into OSTree and the OTA system only protect at update time. dm-verity is intended to protect at normal boot time, right? It uses essentially checksums of the kernel blocks and all the blocks of the root file system. So it's really orthogonal to OSTree. Specifically, are there plans to include it in Torizon? Yes, there are. It's not included today. I don't know all the details, because Arm has TrustZone and all these other things, and there are a lot of components that will play into that, but that's completely orthogonal to the OSTree mechanism. Yes?

[Audience] As a developer, I'm interested in making a quick update to the file system, for example deploying a new binary of my own application. I understand the system is read-only, but how does it work in reality for me as a developer? What do I have to do?

Yeah, so as a developer, there are ways you can temporarily remove that read-only state and just force a binary in. It will break the ability to then do updates with OSTree later. So that's really just a do-it-at-development-time thing, knowing that you're gonna have to wipe the system later. The alternative would be to store those binaries in a portion of the file system not managed by OSTree. So under that /var directory, you could just scp a file in, or even just put it under /etc. We do have IDE integrations that actually do that. They copy files over while you're developing your code.
They just copy a whole bunch of files into the user's root directory, which is not actively managed by OSTree, so that you don't have to deal with doing a full OS update every time you want to deploy a new version of your application for testing.

[Audience] Okay, thank you.

Yep. Anybody else? Very good. Well, thank you so much. And if you have any questions, I'm around all week.