OK, welcome. My name is Greg Waines, and this talk is basically to share the StarlingX team's experience in evaluating OSTree for parallel, atomic software upgrades on our StarlingX cloud servers. To paraphrase the abstract: OSTree is an upgrade system for Linux-based operating systems and deployments. I believe it has typically been used more in embedded operating systems, but in STX-8 of StarlingX we introduced OSTree and we're using it for patching, and now we're starting to evaluate it for full software release upgrades of StarlingX. I'm not an OSTree expert, but in these slides I'm going to give an overview of OSTree: what it is, how it works, and how upgrades are done with it. Then I'm going to look at how we're proposing to use it in StarlingX, and the positive impact on the elapsed time and outage times for upgrades in StarlingX. About myself: I'm a principal architect at Wind River for Wind River Cloud Platform, the commercial version of StarlingX that Wind River has. I've got 25 years of platform experience dating back to the Nortel days, mostly in the telecom space. I'm a founding member of StarlingX; I was here five years ago, in 2018, when we announced StarlingX. I've worked in various areas of StarlingX: some security, some OpenStack, high availability, software management now, even docs. For the last couple of years I've been on the Technical Steering Committee for StarlingX; that's been fun. Before StarlingX I made some very minor contributions to Masakari and Horizon. So the agenda: I'm going to give a brief overview of StarlingX, a single slide, then introduce OSTree, and then talk about how we're actually going to use it in StarlingX upgrades.
So, StarlingX. I'm sure you've heard a lot about it this week. As you probably know, StarlingX is an open source project in the Open Infrastructure Foundation. It provides complete, ready-to-deploy, fully integrated private cloud infrastructure management. On the infrastructure management side, it manages bare metal servers, the OS resources, and device configs; it manages all the infrastructure software running on the servers, the biggest piece of which is Kubernetes; and it manages the containerized application services we have in StarlingX, most of which are other open source projects or services that we've packaged and integrated into StarlingX. As for deployments, StarlingX supports a wide range. It can be deployed as a standalone cloud, even on a single server, or as multiple nodes in a standalone cloud with the classic controller, worker, and storage nodes. But by far our most popular deployment option is distributed cloud, where you've got geographically distributed sub-clouds. In the telco space those sub-clouds are typically single-server clouds, and whether single-server or multi-node, they're autonomous clouds on their own, for reliability. The central cloud in the distributed cloud environment is really there for automation and orchestration of managing the infrastructure across all the remote sub-clouds. So that's StarlingX; now, OSTree. As I mentioned, OSTree is an upgrade system for Linux-based deployments. It basically does two things. It manages versions of bootable Linux file systems, i.e. rootfs images, in a very Git-like fashion. And it manages bootloader configurations and defines a file system layout in Linux such that the system will boot and run a rootfs that is effectively a checkout of a rootfs commit from an OSTree repo.
With those two things, OSTree lets you manage software and software upgrades such that the bulk of the steps happen while you're still providing service and still running the active rootfs, so the outage time for an upgrade on the server is reduced. When you first look at OSTree, you might think it's a package manager. It's not a package manager; in embedded systems it's basically used instead of a package manager. But there have been developments towards hybrid approaches that use OSTree together with a package manager, and I'll talk more about that later. As I said, the first thing OSTree does is version management of rootfs images, just like Git. I thought the easiest way to show you that was just to show you some commands, and you'll quickly realize: oh, it's like Git. There's an ostree init command that initially creates an OSTree repo; if I dump out the subdirectories under there, it looks very much like a Git repo. If I wanted to create a first commit for my OSTree repo, I could make a temporary rootfs, populate it with a rootfs for a Linux deployment such as StarlingX, and then commit it into OSTree. In that commit I'm specifying a branch, stx-9 (if the branch doesn't exist, it gets created), putting a subject tag on it saying it's the GA release of stx-9, and specifying the rootfs content I just created. So that creates the commit in OSTree. I can do an ostree refs to list the branches, and you can see there's an stx-9 branch now. Then I can do an ostree log to list the commits, and the GA commit is there. I can then remove the temporary rootfs because it's checked into OSTree, and, just like Git, I can check it out if I need it again. The rootfs is back; I can make a change, I can fix a bug.
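The command sequence just described might look like the following. This is an illustrative sketch, not the exact slide content: the repo path, rootfs path, branch name, and repo mode are assumptions; see the ostree man pages for full options.

```shell
# Create a new OSTree repo (archive mode is typical for a server-side repo)
ostree --repo=/path/to/repo init --mode=archive

# Build up a temporary rootfs and commit it as the stx-9 GA release
mkdir -p /tmp/rootfs
# ... populate /tmp/rootfs with the Linux deployment's files ...
ostree --repo=/path/to/repo commit --branch=stx-9 \
       --subject="stx-9 GA release" /tmp/rootfs

# List branches and commits, analogous to git branch / git log
ostree --repo=/path/to/repo refs
ostree --repo=/path/to/repo log stx-9

# The temporary rootfs can be removed and checked out again later
rm -rf /tmp/rootfs
ostree --repo=/path/to/repo checkout stx-9 /tmp/rootfs
```

As in Git, the commit is content-addressed, so committing an unchanged file twice stores its data only once.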
Then I can commit the whole rootfs back into OSTree. Notice that in this example I'm committing it back into the same branch, stx-9, and tagging it as patch one for stx-9. So I still have the same one release branch, but now there are two commits in my OSTree branch: the stx-9 GA and the patch. Also like Git, in a local OSTree repo you can specify remote OSTree repos that you can pull from. In my little example here I've got an STX build server, so I create a remote for this OSTree repo, call it my STX build server, specify the URL, and specify the branch I want to pull from. Once I've done that, I can pull on my local OSTree repo, and it pulls all the commits in that branch down into my local repo. Now when I do ostree refs for the branches, I've got the stx-9 branch I just created locally, and I've also got a remote branch from my STX build server, stx-10, and when I list the commits there I see an stx-10 GA commit. So you get the idea: it's exactly like Git. OK, so that was the easy part; now the hard part. As I said, there's an OSTree deployment layer that effectively defines a file system layout so that a Linux deployment can boot and run from what is effectively an OSTree checkout of a rootfs. I'll go through the key concepts. The first key concept is that most OSTree-based Linux deployments have a sysroot partition, and in that partition, under /sysroot/ostree/repo, is my local OSTree repo. That's the main OSTree repo for this OSTree-based Linux deployment, and it's got one or more commits in it: basically all the releases I've loaded onto this system at one time or another.
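The remote setup described here could be sketched like this; the remote name, URL, and branch are hypothetical, and a production repo would normally be GPG-signed rather than using --no-gpg-verify.

```shell
# Register the STX build server as a remote of the local repo
# (--no-gpg-verify is for illustration only; sign real repos)
ostree --repo=/path/to/repo remote add --no-gpg-verify \
       stx-build http://stx-build.example.com/ostree/repo stx-10

# Pull the stx-10 branch; its commits land in the local repo
ostree --repo=/path/to/repo pull stx-build stx-10

# Branches now include the local stx-9 and the remote stx-build:stx-10
ostree --repo=/path/to/repo refs
ostree --repo=/path/to/repo log stx-build:stx-10
```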
Then under /sysroot/ostree/deploy I have one or more deployments, which are actually checkouts of particular commits in OSTree. One of those is the actively running rootfs, and others could be ones I'm in the process of upgrading to. And I forgot to mention: when you do OSTree checkouts, it's very similar to Git in the sense that it uses hard links to conserve disk space. OK, the next thing is that there are three top-level directories in Linux you need to understand in order to understand how OSTree manages the software: /usr, /etc, and /var. /usr in an OSTree-based system contains the entire rootfs; it's exactly what you committed into OSTree as your rootfs. It has all the software being managed by OSTree, and OSTree completely manages it. It's actually read-only: the /usr top-level directory is read-only bind mounted to the checked-out rootfs commit under the deploy directory. So a side benefit of OSTree is that the software managed by OSTree is immutable, in the sense that even a user with root permissions couldn't accidentally change the software; all the software is changed through OSTree operations. Now, obviously in a normal Linux deployment not all the managed rootfs software is under /usr; there are other top-level directories like /bin and /lib that contain managed rootfs software. Those all end up being symbolic links to the analogous directories under /usr. Next, /etc. /etc is partially managed by OSTree. It contains the configuration of the services in the OS whose software is being managed in /usr, and it's obviously configuration data that the user will configure for a particular deployment.
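The hard-link trick mentioned above for checkouts can be demonstrated with plain coreutils. This toy example is not OSTree itself; it just shows two directory "checkouts" sharing one copy of a file's data on disk:

```shell
# Two "deployments" share the same on-disk data for identical files,
# so duplicating a checkout costs almost no extra space.
mkdir -p deploy-A deploy-B
echo "shared library contents" > deploy-A/libfoo.so
ln deploy-A/libfoo.so deploy-B/libfoo.so   # hard link, not a copy

# Both paths resolve to the same inode, i.e. the same physical data
inode_a=$(stat -c %i deploy-A/libfoo.so)
inode_b=$(stat -c %i deploy-B/libfoo.so)
[ "$inode_a" = "$inode_b" ] && echo "same inode: data stored once"
```

This is also why the checked-out trees must be treated as read-only: writing through one hard link would silently change every deployment sharing that object.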
So OSTree can only partially manage /etc; it has to preserve the user's configuration. What OSTree does when it deploys from one rootfs commit to another is a somewhat funky three-way merge of: the old default /etc config in the rootfs I'm migrating from, the active system's /etc that has the user's configuration in it, and the new default /etc configuration of the rootfs I'm upgrading to. And finally, /var. In an OSTree-based system, /var is used for storing all the runtime persistent data for the OS and for all the platform applications in your Linux deployment; in the case of StarlingX, that would be the StarlingX infrastructure applications. This is not managed by OSTree at all; OSTree does not touch it. And again, not all the runtime persistent data in a typical Linux deployment is necessarily under /var; there can be top-level directories with persistent data, and those are just symbolic links to locations under /var. Oh, and finally there's the /boot partition. This is like a typical /boot partition for Linux: it has the bootloader configuration indicating what kernel to run, and in the case of an OSTree system it also refers to which deployment under /sysroot/ostree/deploy I want to boot into the next time the system boots up. So when an ostree deploy happens as part of an upgrade, it knows how to update the boot configuration so the system will boot into the right rootfs. OK, so how does a vanilla OSTree upgrade actually work? I've started with an initial condition here where I'm running an OSTree-based Linux deployment with one commit, say commit 05, under /sysroot/ostree/deploy. That's the starting point.
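The three-way /etc merge described above operates at file granularity: roughly, a file the user modified keeps the user's copy, while an untouched file picks up the new release's default. This is a deliberately simplified toy sketch of that idea (real OSTree also handles added and deleted files, permissions, and more); the directory and file names are made up:

```shell
# old-etc = old release defaults, cur-etc = running system's /etc,
# new-etc = new release defaults, merged = result for the new deployment
mkdir -p old-etc cur-etc new-etc merged
echo "timeout=30" > old-etc/app.conf
echo "timeout=99" > cur-etc/app.conf   # user changed this file
echo "timeout=60" > new-etc/app.conf
echo "retries=5"  > old-etc/db.conf
echo "retries=5"  > cur-etc/db.conf    # untouched by the user
echo "retries=8"  > new-etc/db.conf

for f in new-etc/*; do
  name=$(basename "$f")
  if ! cmp -s "old-etc/$name" "cur-etc/$name"; then
    cp "cur-etc/$name" "merged/$name"   # user modified it: keep user's copy
  else
    cp "new-etc/$name" "merged/$name"   # untouched: take the new default
  fi
done
```

After the loop, merged/app.conf keeps the user's timeout=99 while merged/db.conf picks up the new default retries=8.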
The first thing I do is an ostree pull to pull down my new software release: an ostree pull from a remote external OSTree repo into my local OSTree repo. Obviously there's no impact to the rootfs I'm running at the time or the services I'm supporting. The next step, which is a multi-step operation, is the ostree deploy. At this point a new deployment directory gets created under deploy: an OSTree checkout is done to check out the new rootfs commit, which is the new release of software; into that checkout, OSTree does the three-way merge to manage the /etc directory; and then ostree deploy also updates the bootloader configuration to indicate that the next time you reboot, you will boot into that deployment under /sysroot/ostree/deploy. Again, all of that can be done without impacting the running rootfs. Finally, you reboot the server into the new rootfs. The updated boot configuration will run the updated kernel, and the updates to the bootloader configuration and scripts will change the read-only bind mounts to point at the right rootfs that you want to come up in. So that's the general idea. The last thing on OSTree: I mentioned it's not a package manager, and it's typically used instead of one. A classic package manager, with RPM packages or deb packages, delivers partial file trees containing the software for a particular package, along with metadata and install scripts, and typically it installs those packages directly onto the running rootfs. OSTree, as you've seen, deals with complete bootable file systems, the complete rootfs.
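On a running OSTree system, the pull / deploy / reboot sequence just described is typically driven with ostree and ostree admin commands, roughly as follows. The remote and ref names are the hypothetical ones from the earlier examples, not StarlingX's actual tooling.

```shell
# Step 1: fetch the new release into the system repo under /sysroot;
# the running rootfs is untouched
ostree --repo=/ostree/repo pull stx-build stx-10

# Step 2: stage a new deployment -- checks out the commit under
# /sysroot/ostree/deploy, performs the /etc three-way merge, and
# rewrites the bootloader entries. Still no service impact.
ostree admin deploy stx-build:stx-10

# Inspect the booted vs. pending deployments
ostree admin status

# Step 3: the only service-impacting step -- reboot into the new rootfs
systemctl reboot
```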
It has no idea how the rootfs got built; it just knows how to save rootfs images in the OSTree repo and how to deploy them. I mentioned that there is now a hybrid approach, and there are a number of tools that use it; rpm-ostree is a good example of a hybrid tool that can be used on the server whose software you're managing. It's a combination of the two: software still gets delivered as packages, but then you use OSTree to actually deploy it. What that means is that the hybrid package manager has to effectively check out the current version of the rootfs on the target, install the packages in a chroot environment, and then commit the updated rootfs back into OSTree, so that it has the staged new rootfs ready for an ostree deploy. So why would I do a hybrid? Basically to get the best of both worlds. You get package delivery, which, certainly in the case of patching, gives you small, much more manageable software updates; in an air-gapped scenario I don't have to do OSTree pulls, I can deliver the files; and I can deliver different package sets to different deployment sites if that's a requirement. So I can do all that with the flexibility of package delivery, and I can still leverage OSTree to build and stage my rootfs in parallel with running my active rootfs, as well as do the atomic upgrade on reboot. So how does this translate into StarlingX? The proposal we're looking at is that we're going to use a hybrid package manager. The diagram I show here is a multi-node StarlingX cluster. Each of the nodes has a local /sysroot OSTree repo; that's the main OSTree repo it boots from. But we're also going to have a central remote OSTree repo on the controllers.
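The hybrid checkout / chroot-install / commit cycle described above could be sketched roughly like this. This is a hand-written illustration of the idea only; tools like rpm-ostree and StarlingX's own tooling automate these steps, and the paths, branch, and package name here are hypothetical. The apt-get call assumes a Debian-based rootfs.

```shell
# 1. Check out the current rootfs commit into a scratch directory
ostree --repo=/path/to/repo checkout stx-9 /tmp/build-rootfs

# 2. Install the patch packages inside a chroot of that checkout,
#    leaving the running system untouched
cp some-patched-package.deb /tmp/build-rootfs/tmp/
chroot /tmp/build-rootfs apt-get install -y /tmp/some-patched-package.deb

# 3. Commit the updated tree back as a new patch commit; it is now
#    staged and ready for an ostree deploy on the next maintenance window
ostree --repo=/path/to/repo commit --branch=stx-9 \
       --subject="stx-9 patch 2" /tmp/build-rootfs
```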
It's in that central repo that our hybrid package manager will stage things: install the packages into a rootfs and store the commit back into OSTree, basically staging the new release upgrade in that central OSTree repo. Then, when StarlingX software upgrade orchestration kicks off its rolling upgrade of the different nodes in the cluster, whenever it gets to the point for a particular node where it says "install the software", instead of wiping the disk and using a package manager to install to the root partition, it will just do an ostree pull from the central OSTree repo to bring in the new software as a rootfs commit, and then do an ostree deploy and reboot, like we saw before. So how does this help StarlingX software upgrades? What I wanted to show on this slide are the major steps of a StarlingX software upgrade, specifically for a single-server StarlingX deployment, because that's where the impact is greatest: I can't do a rolling upgrade on a single server. So what I've got here are the major steps for an upgrade of a single server, with some time estimates for each step just to get a ballpark view of the times and what the improvements might be. To walk through it: when you start an upgrade on a single-server system, we first have to back up the StarlingX platform system data that's on the root partition, because today (this is today's model, without OSTree) I'm going to wipe the root partition. Then I administratively lock the host to get it into an upgrade mode. I wipe the disk, I wipe the rootfs, because I'm going to install the new load on it. Obviously I don't wipe the partition that has the backup I just did, and I don't wipe the Ceph disk, because it's holding the persistent data of my guest applications.
Then I run the installer ISO for my new release to do the package installations on the wiped root partition, and I reboot into the M+1 software. But now I've got a fresh M+1 installation, which I have to bootstrap using the backup data in order to restore the system data associated with the StarlingX deployment. Then, more than likely, I have to do some migration of StarlingX data due to data model changes. Finally, I unlock the host and upversion the containerized system apps. So there are a lot of steps in our current upgrade, and most of them are in the outage window, when I've disabled my hosted guest services. How does this change with OSTree? Two things change. First, there are fewer steps: I don't have to back up data, because I'm not going to wipe the root partition; and because I don't back up the data and I don't wipe, I don't have to bootstrap and restore the system data from the backup. Those steps are gone. Second, other steps move outside the outage window. For example, the whole installation of packages happens potentially even outside the maintenance window, because my hybrid package manager is going to create the rootfs in a chroot environment and install the packages there; that can all happen well before even the maintenance window. And I can even migrate the data in that chroot environment, so that's done outside the outage window as well. So with the combination of fewer steps overall and some steps moving out of the outage window, you can see that the elapsed time and outage times could be significantly improved with the OSTree solution.
So, to summarize: I gave a brief overview of OSTree as an upgrade system for Linux deployments, and talked about how it manages versions of the rootfs just like Git. More importantly, it has this deployment layer with a file system layout so that you can boot and run effectively checked-out versions of the rootfs, and that allows you to do a lot of the staging of the upgrade and the upgraded rootfs in parallel with continuing to run your active rootfs, and then atomically switch over to it on reboot. I also talked about the fact that OSTree can be used with a hybrid package manager, and that's the proposal we're looking at for StarlingX. And finally I walked through the positive impact on the elapsed time and outage times for upgrades, from reducing the steps, not having to wipe the disk, and moving some of the steps outside the outage window, especially in that single-server scenario. I didn't mention it before, but as I said at the beginning, in a lot of the telco 5G deployments that we have, many of those remote sub-clouds are single-server solutions, just because that has enough horsepower to run the remote 5G apps. And that's really it. If this sounds like something cool you'd want to work on, you can get in touch with the StarlingX community through our mailing list, or check our website for when our community meetings are; you can come to the main community meeting, and if you're interested, we can connect you into the different sub-projects within StarlingX to collaborate with whoever is working on this. And that's it. I don't know if there are any questions? Yes. That's a good question. I suspect it just fails; the three-way merge happens before you've committed to doing that deploy yet.
So I suspect it just fails and you have to figure out what happened, and likely supply a new commit in order to do it. Yeah, like I say, it's Git-like, but it doesn't have all the Git capabilities; you just get an error on that merge and you'll have to fix it manually.

Q: If patches keep getting applied through OSTree like this, would the actual OSTree repo explode in size over time, even assuming each package-manager upgrade is stored as a delta?

A: Yeah, we haven't gotten into the details of that, but there are cleanup requirements in two areas. Under /sysroot/ostree/deploy there's cleanup required of the different deployments you have there, and there's also cleanup of the OSTree repo itself. OSTree does support pruning of the tree, and definitely, in a lot of product deployments you would have frequent patches, you would get a lot of disk space used, and cleanup would be required, so that's something that has to be looked at.

Q: What about falling back in case of errors during the upgrade? Is it possible to restore the previous version?

A: You can. You can effectively do an ostree deploy to any commit; it doesn't have to be the next commit, so you can really go back to a particular commit. Certainly there are going to be details with respect to migration of data. In the first iteration of doing this, we're only supporting going back in a rollback scenario, where we can roll back the data easily to what we were just previously running as far as data migration goes. And for some things it's actually better than that: you can just select your previous tree in your bootloader, just like you used to select your previous kernel, and that works through command line parameters.
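For reference, stock OSTree does ship commands for both concerns raised here, cleanup and rollback. Roughly (the deployment index, depth, and ref are illustrative; see the ostree man pages):

```shell
# Cleanup: remove an old deployment by index (0 is the most recent)
ostree admin undeploy 2

# Cleanup: prune unreachable repo objects, keeping only recent history
ostree --repo=/ostree/repo prune --refs-only --depth=2

# Rollback: redeploy any earlier commit and reboot into it
ostree admin deploy stx-build:stx-9
```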
Q: It's better if you make sure that you version some of your data, if you're doing data changes.

A: Yeah, for the migration that we're doing for the StarlingX data (I didn't go into the details here), we're managing that data in /var and basically versioning the data in there, so that when we migrate the data we can still get the old data back when we go back. But the reason we're only doing it for rollback is that we still have the data, we can go backwards and get it, and we know that on a rollback nothing has changed, as opposed to waiting three months and then trying to do a downgrade. Typically in StarlingX we don't write downgrade data model scripts; we've been asked to do that, but typically we only write upgrade scripts, so that's why we're trying to limit the scope to rollback.

Q: How do you generate the first commit, the base image?

A: Yeah, for the base image we'll deliver, like today, an installer ISO that basically has all the deb packages in it, and it'll really be, again, I don't know the specific details about this, but it'll really be like a debootstrap-type setup of the initial rootfs, and then using chroot to install all the other packages on top of that. All right, thank you.