All righty, welcome everybody. It's so nice to be here after three years and to finally meet everybody in person; I missed it so much. My name is Antonio, I'm an engineer at Red Hat, where I've been working for the past eight years taking care of containers, OpenShift, and now RHEL for Edge. So today we're going to talk about resiliency at the edge. Who we are: I wanted to put a team picture there, but I basically just met my team now. You know, we've been working together for three years, but we had no picture together in real life. We are the team building RHEL for Edge; we also maintain Fedora IoT along with the upstream community, and we develop the technologies for the edge, like FDO and greenboot, which is the topic of this presentation, and all the parts in Image Builder that make it possible to actually build RHEL for Edge. I lied, this is my team; you can see a couple of funny faces, like Micas there, who gave permission to share this. So this is my amazing team. As I said, we build the operating system for the edge, or, you know, that's what we call it. RHEL for Edge is built using Image Builder; it can be Fedora or RHEL, the team focuses on both in the day to day, with a push to do upstream first and be more upstream in general. All of the RHEL for Edge system is based on OSTree and rpm-ostree, and it runs on those tiny devices like the Fitlet2 or the Intel NUC. It's made of a curated package set: only the things that we need at the edge, and the things that we don't need, we just remove. That's the last line: it's lean and secure. You don't want a huge attack surface; you want the system to run with a minimal amount of RAM and CPU. So that's how we've designed it. Before talking about resiliency: the way we build RHEL for Edge is that we create an OSTree commit with Image Builder, which basically just builds the OSTree layout, and then we serve that as a remote repository back to Image Builder. With that simple blueprint that creates the OSTree commit, once we serve the commit, we can build an installer, like Anaconda or the simplified provisioner, we can build a raw image, we can build pretty much whatever we want. Kudos to the OSBuild folks here, I've seen a couple of you around. After you have this artifact, and it can be a raw image or the Anaconda installer itself, what we're going to do is basically flash the OS to a device. That's the provisioning part, which I have another talk after this one to cover. As part of the flashing and provisioning itself, there is some sort of tagging that you want to do to the device in order to later identify it, things like that. And then, with these tiny devices, the edge use case is: you provision the device with an operating system. The analogy here is when you go and buy a Windows laptop: that thing goes out of the manufacturer with Windows pre-installed, but not activated. This is pretty much the same. You have a bunch of devices, all provisioned with RHEL for Edge, and at the end you're going to ship them anywhere in the world. This is IoT, so this thing can be a light bulb, a sensor at the top of the Everest, anything that is tiny, or not really tiny, that we don't have easy access to. Exactly, yeah, it can go to space too, that's what Peter is saying. And then, once it's at the Everest, somebody is just going to power it on and it does onboarding; some of my team members are going to cover the onboarding part of a RHEL for Edge system, so make sure you stick around for the rest of the talks. At that point this device is at the top of Mount Everest, up and running, all is good, sending sensor data back to the main data centers, or crunching data right there on the mountain. So whoever shipped the device to the Everest, and actually walked up there, just powers it on and goes back to the office. That is easy, right? Now, there is a second operation: the device needs an upgrade. So that is the blueprint that creates an upgrade.
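As a rough illustration, an Image Builder blueprint like the ones mentioned here is a small TOML file. This sketch is only illustrative: the name, description, and version values are made up, and only the nano package comes from the demo described later.

```toml
# Hypothetical blueprint for the upgrade commit; metadata values are made up.
name = "edge-upgrade"
description = "RHEL for Edge commit with nano added"
version = "0.0.2"

[[packages]]
name = "nano"
version = "*"
```

Pushed to Image Builder (for instance with `composer-cli blueprints push`) and composed as an edge commit image type, something like this produces the OSTree commit that gets served to the devices.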
And the upgrade is still an OSTree commit that we build using Image Builder; in that example I just added nano to the blueprint. Then we're going to serve this commit, build the upgraded OSTree commit, and then what we're going to do is basically just run rpm-ostree upgrade and reboot. Once we run rpm-ostree upgrade we get this kind of output, and it says: just run systemctl reboot and everything is going to be fine, you're going to be booting into the next version of the operating system. This is cool, right? So, what can possibly go wrong with this scenario? Basically, you know, upgrades don't always go as planned. Probably the worst nightmare is something like this: you have a device at the top of Mount Everest and it's like, oh my god, I can't connect to it anymore, there may be an issue. So there are a couple of solutions for this. The very silly and simple one is to send somebody to the Everest; it's probably expensive and tiring. For those who know, that's not me in the picture, that's my partner, and that's not even the Everest of course, it's super sunny there. So somebody has to physically go there and debug the device, which, as you can imagine, is probably not going to happen; it's really unlikely. You have these tiny devices everywhere in the world, or in space; you want them to run smoothly, to upgrade, to continue working, to run the applications that have been deployed on them. Everything has to be smooth, so this is not going to work. That's when greenboot came into the picture. Some history here. Greenboot is basically a tool that makes it possible, on an rpm-ostree and OSTree system, to go back to the previous deployment if there is any issue in the upgraded deployment. I'm assuming some familiarity with rpm-ostree: basically you have two deployments at all times, and when you upgrade, after you reboot you go into the new one. With greenboot, if you add some checks and the new one doesn't work or doesn't behave the way you want, it goes back to the working deployment, so for any upgrade that goes wrong you have a chance to revert it, and then maybe fix the buggy part or ship another upgrade, things like that. Greenboot was created as part of Google Summer of Code, in 2018; 2018, yeah. The whole idea was created and led by Peter Robinson here; Christian Glombek was the intern working on greenboot at the time, and Dusty Mabe and Jonathan Lebon helped as mentors too. And again, this has been part of Fedora IoT, because Peter saw the need to have unattended rollbacks on these devices; otherwise, again, you need to send somebody back to the Everest. The whole project has been designed around rpm-ostree, because of course on a normal system it's not as easy to ship an update and roll it back without any disruption; that's also unlikely. And it has been made to work with GRUB: right now we just support GRUB, but any UEFI boot system, or whatever, could be made to work; if you have ideas, if you want to contribute, feel free. So, with these two pieces in mind, rpm-ostree and GRUB, Christian created the very first implementation in 2018. And how greenboot works is pretty easy. Somebody runs rpm-ostree upgrade and systemctl reboot. When rpm-ostree upgrade is called, greenboot sets some variables in GRUB, which I'm going to explain in a little bit, and then we reboot. At the next boot, greenboot starts a couple of health checks, what we call health checks, and based on those it's going to run a couple of boot status scripts, meaning: okay, the deployment is okay, now I'm going to send an email to an administrator maybe, saying everything is okay; or, worst case, the upgrade doesn't work and we need to roll back. Yeah, the last one is rollback. You can also have transient failures, so you may want to retry a couple of times, or as many times as you want, with the new deployment; maybe it's just a network hiccup or something.
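To make that flow concrete, here is a small bash sketch of the retry-or-rollback decision just described. This is only an illustration of the logic, not greenboot's actual code; the function name and message strings are made up.

```shell
#!/bin/bash
# Illustrative sketch of greenboot's per-boot decision; not the real code.
#   boot_counter : GRUB env variable set before the upgrade reboot, and
#                  decremented on every failed boot attempt
#   checks       : "pass" if every required health check exited 0
decide_boot() {
    local boot_counter="$1" checks="$2"
    if [ "$checks" = "pass" ]; then
        echo "set boot_success=1, run green.d scripts"
    elif [ "$boot_counter" -ge 0 ]; then
        echo "reboot and retry the new deployment"
    else
        echo "run red.d scripts, roll back"
    fi
}

decide_boot 2 pass    # healthy first boot
decide_boot 0 fail    # transient failure, one retry left
decide_boot -1 fail   # counter went negative: out of retries
```

The key detail is the asymmetry: retries happen as long as the counter is non-negative, and only when it goes below zero does the rollback path run.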
So, in GRUB: this is how Christian made it work and how it is today. (There's Christian, he's just outside.) It's basically leveraging a couple of GRUB functionalities, like setting variables. There are two main variables that drive the whole greenboot process. The first one is boot_counter. boot_counter is set when the upgrade is done, but before the reboot happens; once boot_counter is set and we reboot, every time the deployment fails we're going to decrease it, until it's negative. That's the signal that we need to roll back; hopefully it's super easy. The other variable is boot_success: before the upgrade we set that to zero, and then, if the new deployment works, we're going to set it to one. Those two variables basically drive all the greenboot operations. We ship some templates in grub.d just to make all of this logic work as I've explained it, hopefully it's clear. There are two main ones: one is the fallback counting, which basically just decreases boot_counter, and the other is for boot_success, which is zero before rebooting and is reset to one once the upgrade goes well. This is all greenboot: greenboot is a couple of scripts and, at this point, many systemd services. It has a state-machine-like operational model: every time, it checks the variables and where it is at that moment, and based on that it acts. The first service that we have is the greenboot-grub2-set-counter service. This is the one that is directly wired to rpm-ostree, so that when rpm-ostree runs an upgrade, well, you run an upgrade and rpm-ostree takes care of it, it's going to fire that service; greenboot intercepts it and says, okay, it's time to set boot_success to zero and boot_counter to however many retries you actually want for that specific boot. Then there is greenboot-healthcheck, which is the main service. It's ordered before the boot-complete.target, and what it does is run the custom health checks that everybody can add to greenboot. These say: this deployment is good, or this deployment is not good. If it's not clear, I'm going to show an example later with a demo, so that it becomes really clear. If the health check service passes, boot-complete is reached as a target, of course, and then there is another service setting the success variable in GRUB and unsetting the boot_counter variable. And then, if everything goes well with the new deployment, there is a further option to run additional scripts, the ones in the green.d directory; again, it can be an email to Peter saying, okay, the deployment is good, all good, deploy it to space. If it fails instead, there is the whole rollback mechanism, run by the redboot-auto-reboot service, and it does, as the slide says, a series of checks to determine whether there is a need for manual intervention, because that can happen too; otherwise it just reboots the system and tries again, if greenboot has been configured to retry. And when the health checks fail for good, the red.d scripts run, and at that point you can still send an email to Peter, but this time with a failure, saying: okay, no, this deployment doesn't really work, you don't want to boot there, maybe create another one and I'll send it to you. So this is the very basic directory structure for greenboot itself. As I said, the required.d scripts under /etc/greenboot/check are the ones that must pass: it means that if you boot into a new deployment and any of these scripts fails, the whole boot is marked as not good, and what greenboot does is either retry or just rpm-ostree rollback to the previous deployment. The wanted.d directory holds scripts that may fail; these can be failures that are okay in a new deployment. Network, for instance, is probably better in required.d, and then in wanted.d you'd have something else that, assuming network has been working, checks a bit further. And, as I explained, green.d
and red.d are just boot-status script directories: if the boot is green, run these; if it's red, run the others. Yeah. This is the configuration; I know the slide is pretty slim, but these are the configuration variables that we implemented in greenboot. GREENBOOT_MAX_BOOT_ATTEMPTS is the one that takes care of rebooting up to that number of attempts, and the other two are for the watchdog, but I'm really not going to get into that, because the watchdog is probably a whole new topic; those are the ones that we support today. The first one is probably the most important, and in my demo it's configured to three, so you will see the virtual machine rebooting three times. How much time do I have? Ten, fifteen minutes, okay. All the health checks are scripts, bash scripts. I know, it's bad, we're working on that; not necessarily bad. This is what you actually write in order to have something run at the next boot and then either mark the boot as a good boot and keep the deployment, or as something that has to be rolled back. This is bash, pretty easy: if that file is there, we're going to fail, and we're not booting into that deployment; if it's not there, everything is good and we stay in the new deployment. Can you see? Yes? Awesome. So this demo is basically demoing what I've been saying. I've already built the upgrade, and what we're going to do together is run the upgrade and reboot. I didn't trust the network to do all of that live, so I did this in advance. You can see here that right now I'm at this version, 5.7; this is normal rpm-ostree mechanics, so we are booted into this deployment. I've built an upgrade instead, and you can see that I've added that file, which is the one that is going to fail. It's really silly as a demo, but it explains how this works. So the new upgrade is going to contain that file, and that is going to make the new deployment fail. I built it using Image Builder.
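The demo's check isn't shown verbatim in the talk, but a required health check along those lines might look like this. The marker path is made up, and it's wrapped in a function here for readability; a real check would be a standalone script under /etc/greenboot/check/required.d/ whose non-zero exit marks the boot as unhealthy.

```shell
#!/bin/bash
# Sketch of the demo's required health check: fail if a marker file exists.
# The marker path is hypothetical.
check_no_marker() {
    local marker="${1:-/etc/greenboot-demo-fail}"
    if [ -f "$marker" ]; then
        echo "FAIL: $marker exists, marking this boot unhealthy" >&2
        return 1    # non-zero exit makes greenboot count the boot as red
    fi
    echo "PASS: no marker file, boot is healthy"
    return 0
}

check_no_marker "$@"
```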
If we follow the usual rpm-ostree commands, I can ask for the upgrade, and you can see we're going to upgrade to 6.7; there is a new package added. This is how OSTree works; it's just nano, if you remember the blueprint. So now what is left to do is just reboot the system. Does it actually work? It works. So right now it's upgrading, and you can see we added the nano package. This could be a normal upgrade during normal operation at Mount Everest: somebody needs something, a new customer application, a new sensor, or, you know, a program to send back sensor data; it can be anything. At this point what we're going to do is just systemctl reboot, and what this does is basically go and try to boot into the new OSTree deployment. Hopefully it's visible, but it's not really important: there were two boot entries shown there. There is some issue there, yeah, so I'll start from here, since this is not going to work anymore; I prepared for this, but this can happen. After I rebooted, what was really going to happen, and I'm sorry for this, is that two boot entries were shown: those were the old deployment and the new deployment, and we were trying to boot into the new deployment. As soon as you are in the new deployment, greenboot runs the health check that I've shown you before, basically checking whether that file is there, and since it is there, redboot is going to run and reboot the system. At the next boot, and for three times, we're still going to have the two boot entries, where we retry the new deployment. After that is done and we are at minus one with the boot_counter variable, we're going to just roll back to the old deployment. It's super silly, and the live demo didn't work, but yeah, that was basically the demo I wanted to show; I'm sorry for this. All right, so what's next for greenboot? We have a couple of points that we want to enhance in the future. Not everybody likes bash scripts, so what we're doing right now is rewriting the core in Rust, and then perhaps enhancing the
API, so that anybody can drop in any executable and base it, maybe, on exit codes, 0 or 1, things like that; anybody can write their own health check the way they want, in the programming language that they want. What we're going to do next, too, is better rpm-ostree integration. Right now we're basically hijacking the way rpm-ostree handles the different boots, but we want to do better there, starting in Fedora IoT; we already started talking with the CoreOS folks, you know, the folks maintaining rpm-ostree, to actually do that. There are some hiccups with greenboot itself and rpm-ostree. The biggest one, to me, is that /etc is writable. That means that if somebody messes something up under /etc while the system is running, in a way that makes the greenboot health checks fail, and then runs an upgrade, then, since /etc is carried over into the new deployment, greenboot is going to be fooled by that, saying: okay, this is actually not booting, it's not green, so it's just going to roll back. And this opens the door to all sorts of attacks: maybe you are upgrading because there is a CVE, but again greenboot can be fooled, and that's not good for security either. And lastly, we have some actual live users of greenboot; the most notable is MicroShift. They are implementing health checks to make sure that they can go back and forth with MicroShift itself, which is, you know, OpenShift at the edge, things like that. So this is the future of greenboot; if you have any ideas, again, feel free. A couple of links here, starting from the actual repo on GitHub, then the whole explanation by Christian Glombek, who created all of this, and then I dropped a link also to osbuild-composer, since that's how we actually build all the RHEL for Edge artifacts. Okay, questions. Are we handling that? You know, we're not handling that, basically. So, to repeat Peter's question; right, so these are the questions where Peter already knows the answers, the things that we haven't fixed yet. Yeah, the question is: how do we prevent somebody from forcefully rolling back to something that isn't patched? Maybe we are upgrading to something that fixes a CVE, but then, with something like this, we could be forced to go back, and that of course reintroduces a vulnerability. We're working on this; there have been a couple of ideas too, and probably the rpm-ostree integration is one of the things that we first have to take care of in order to address it. It's a complex topic. Sorry, the machine? To repeat the question: how do we deal with machine-specific failures? So that's unavoidable in general. The way we deal with it is that you run your workload in a container on top of the base system. There are scenarios, I actually had one, where a new version of a container can affect hardware, or affect the machine in other ways; so we would go back on the container and deal with that. That would be independent of the system, and we can deal with it within the whole check suite, yeah. Yeah, that just made me realize: is there room for a check where, before doing the reboot for the upgrade, you know whether the rollback would work? Because maybe a configuration change doesn't break the current state, and then you're in a sort of limbo state. The same thing we used to do when I worked on OpenShift with the MCO was that any change on /etc, and of course we can do that in OpenShift because the hardware is more or less free, basically every change in /etc triggers a reboot, so that at the next reboot you make sure the change itself is working, exactly. But we don't have the authority currently to work like that. Yeah, and we're looking at how it could work, like a config layer, and whether we'd hit the same problem again. Because, the example I always use is: if you need to change, say, the network provider out on the edge, that is like a three-step process, where you may need to change the network interface to be able to connect
to the router, and other things, and if any of those steps goes wrong, you can't upgrade again to fix the problem, because you can't connect to the back-end servers; you don't have that connectivity anymore. So how do we deal with that? We don't deal with it very well, and that's complicated. But, besides rpm-ostree and Fedora, there have been a couple of ideas, like using filesystem snapshots to say: okay, this is just a config layer, as Peter said, which is different from rpm-ostree config layers, OSTree config layers; but the solution probably lies there. Composefs, yeah. I'm not sure if it's just a matter of convenience, but it does strengthen the verification, yeah. Yes, so the comment was around composefs, I don't know if everybody heard; it can be a potential solution to this. In that case, unfortunately, I think the machine that I was using runs a RHEL version that had a bug, so it was basically just retrying forever, because that's how the Ignition service is made to work on first boot; but it wasn't first boot, so that's a bug. Otherwise, yes, if there is anything that hangs, at some point redboot kicks in and the reboot will still work; in this case the Ignition fetch routine was running forever, so that never ends. Yeah, exactly; I mean, that's a good point, perhaps something to consider for new features. Maybe, if you're running rpm-ostree, like, I don't know, Silverblue? Yes, it's very bound to how rpm-ostree works, that's it. I know, yes: the only integration point between greenboot and rpm-ostree is when the GRUB variables are set. Yes. Right now, what's next is better rpm-ostree integration; that means we're going to make that link more robust and not just rely on the ostree-finalize-staged service that runs after somebody runs rpm-ostree upgrade, because that can be flaky too; sometimes, rarely, you can miss that. Perhaps, yes. Anything else? We're on time.