So, thank you for coming to this session. My name's Eddie Renneri and I'm with the StarlingX project. Today I want to talk to you about how StarlingX is solving upgrades and updates with our architecture: both how we solve it for the customer, and some of the challenges we encounter as a project because of what we are. So here's an overview of what I'm planning to talk about. I don't know how many of you out there know what StarlingX is, but I thought we'd level set on it. Are there people out there who know StarlingX? Okay, that's pretty good. That's actually more than I thought, so that's good to know. Then I'll go into how upgrading works, the challenges, and how we solve those challenges in StarlingX. And then I'll leave you with ways you can get involved, ways you can kick the tires on the project, things you might want to do next, and how you can contact me or anybody else in the StarlingX organization to help you with the project. So, real quick, I won't spend a lot of time on this, but what is StarlingX? It is an OpenInfra Foundation (OIF) confirmed project, so it's an open source project. It's a complete infrastructure for running clouds at the edge, and that includes Kubernetes, OpenStack, and all the components around them. The project is all about scalability, reliability, geographically distributed solutions, and distributed architecture of the clouds, as well as how to maintain those clouds and make them reliable. Lastly, we are proven: we are in 5G networks today across the world, so we are a proven commodity. And we're not just a telecom solution, we're an edge solution: if you run low latency applications, if you have a low latency need, you want to put your cloud at the edge and manage Kubernetes or OpenStack functions at the edge as well.
There are a lot of other use cases; it just so happens that telecom is one of the leading candidates right now. To give you an overview of the platform: it is a single ISO distribution that includes everything from the Linux kernel and the services the kernel uses, to the pieces in the middle that you see marked fault management, host management, software management, and so on. We call those our flock services, and they're the bread and butter of StarlingX: the services that allow us to manage the individual nodes within the cloud, configure them, and make sure no node is down or in a fault state. Those are the configuration pieces of our flock services. Above that stack is your virtualization stack. It is a Kubernetes-native solution first, and then we allow you to run other applications, whether they're your applications, a customer's applications, or OpenStack; the OpenStack management services are fully containerized as well. I mentioned scalability, and at the edge, scalability is very important. We scale from a single cloud server: take a single server and it's a full stack, a full cloud. The Kubernetes master runs on it, and it's fully managed and independent on its own. From there we scale to our first HA model, which we call a duplex; that has two servers. And then it scales up to a more traditional rack solution where everything runs on its own server. The one thing about StarlingX is that scaling can occur in service. You can deploy it as a simplex, a single-server solution, and grow it based on your needs out there in the deployment: you just put more servers out there and scale the solution as you want.
One of the obvious challenges with edge solutions is how you manage them, keep track of what's going on, and distribute things down to them that need to be synchronized across the platform. So StarlingX has something called the system controller, or the distributed cloud manager. I think of it as a manager of clouds. It is a StarlingX cloud as well: it's running Kubernetes, and it can run workloads too if you want. What it has is an additional feature that allows it to manage edge clouds, orchestrate the functions that edge clouds need orchestrated, and handle anything that needs to be synchronized between the edge clouds. That can all be done from the system controller level. So, like I said, think of it as a manager of the clouds. All right, that's my overview; hopefully it wasn't too long. Let's talk about upgrading. Upgrading is of course different from other operations because you're trying to do it while you're in service, right? So there's a lot to think about when you're in service. Today, the way StarlingX works is that it performs these steps to upgrade the system. First, the system controller, the manager of the clouds, performs an OS and services upgrade. It keeps the Kubernetes version and keeps the Kubernetes system running, but performs the OS and services upgrade. Once the system controller is upgraded, it asks the edge clouds to perform their upgrades. It orchestrates; it's not doing the actual upgrades of the edge clouds, it's telling the edge clouds to do their own upgrades. That's the difference with the distributed way it handles that. Okay.
After that, StarlingX prepares for the upgrading of Kubernetes, and there are challenges around that, because in order to stay in service your workload may not be able to go from one version of Kubernetes to the next without its applications being updated. So StarlingX updates the applications. Those of you who raised your hands on StarlingX know that StarlingX also has applications you can deploy through it, and I'll get into that a little; we have our own applications that are part of StarlingX, and those components are upgraded as well during that sequence. Lastly, after all that is done and everything is good, Kubernetes can be upgraded along with its plugins. All of that is handled in an orchestrated way: the system controller orchestrates the edge clouds to perform those upgrades, and the edge clouds perform the upgrades on their own and report success or failure back up to the system controller. This diagram depicts what I was trying to say. From the system controller's point of view, the system controller says, hey, I need to do this upgrade or update; it does itself first, and then it tells the sub-clouds to go do the orchestration of the upgrades of those clouds. One of the other things the system controller supports is what we call edge cloud grouping, so you can create specialized groups that allow you to upgrade only certain groups at a time during a particular orchestrated upgrade event. Down at each cloud level, depending on the deployment, the cloud then does its orchestration independently on its own, and this is the process it goes through. I won't spend a ton of time on this, but essentially we upgrade the standby controller first.
Once the standby controller has been upgraded, the system switches over to it and it becomes active. Then it upgrades the old controller, so now you've got two controllers running with the new software and the new update. Once you have that, and you know your HA is still fully available, you go ahead and start orchestrating the workers. The workers can be done in parallel or sequentially, based on what needs to happen in that cloud. If the cloud has an HA application running on it, you would probably not want to do them all at the same time; you may want to do them sequentially. So that summarizes what upgrades and updates are about for us and what StarlingX thinks about. StarlingX supports upgrading with minimum to no service disruption. I say minimum to no because, in order to be no, certain things have to happen. In some cases a server reboot is required, and that's going to cause that server to go offline for a while. If your application is running on that server and doesn't have an HA component, then of course you're going to lose a little bit of service; if your application is HA, it won't be affected. Redeployment of an application during an upgrade is not required: we maintain all the state, the deployment scenarios, and the Kubernetes information, so when the upgrade is complete we restart the services and everything else. It's just a matter of the services being able to run with the new version. We also minimize pod restarting and stopping as much as possible. And lastly, as I already touched on, for zero loss the applications obviously have to handle some things themselves for HA. So let's talk about the challenges.
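The node-by-node sequence just described (standby controller first, switch over, then the old active controller, then the workers) can be sketched roughly like this. This is a minimal illustration in Python, not StarlingX code; the node names and the function are invented for the example.

```python
# Minimal sketch of the in-service upgrade order described above.
# Not StarlingX code: names and structure are invented for illustration.

def upgrade_cloud(controllers, workers):
    """Return the order nodes get upgraded in a duplex cloud.

    controllers: (active, standby) pair; workers: remaining nodes.
    """
    active, standby = controllers
    order = [standby]        # 1. standby controller gets the new software
    # 2. swact: the upgraded standby becomes active, so HA is kept
    order.append(active)     # 3. old active is upgraded as the new standby
    # 4. workers follow, serially here; they could also run in parallel,
    #    but an HA application may require one node at a time
    order.extend(workers)
    return order

print(upgrade_cloud(("controller-0", "controller-1"),
                    ["worker-0", "worker-1"]))
# ['controller-1', 'controller-0', 'worker-0', 'worker-1']
```

The point of the ordering is that one controller is always active and running a consistent software load, so the control plane never fully disappears mid-upgrade.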
Some of these challenges are at the deployment level, the ones the project is solving for edge deployments, and some are from a project perspective: challenges we have because we're an overall integration project pulling in upstream open source components like Kubernetes. At the distributed cloud level, when you start comparing edge clouds to traditional clouds in a data center, you're talking about thousands or hundreds of thousands of clouds, and managing just that volume of clouds is enormous, right? The way we've solved that is by using the distributed cloud and the system controllers to push as much functionality down to those clouds as possible, while having an overarching monitor or manager looking after them, so you can easily tell those clouds what to do. The second problem with the edge is that its network is spotty. It could be microwave, it could be fiber, it could be shared with other traffic that has higher precedence over it, so there's a quality of service concern, and that affects manageability. So when you're talking about edge clouds, it's important that they be independent and that you distribute the appropriate control to the right level. StarlingX has done that: they're standalone clouds at the edge. They can scale their applications, restart things if they have to, and report failures, but they are at the edge. So one of the things we make sure we do is push that control down where it makes sense. The other thing we do is prepare the updates on the edge cloud first, in a staging area, rather than performing the updates directly over the network.
So we stage it in a repo, and once we know that repo has been successfully replicated down to each edge cloud, we initiate the upgrade. That staging is outside the orchestration; it's a separate orchestration step that takes place independently. It can be done over time, prior to you actually executing the upgrade. So when you execute the upgrade, the controller on the sub-cloud performs the upgrade using the data it needs locally; it's not going across that network. The last one is, again, you have all these edge clouds. Edge clouds sometimes overlap each other, and operators use that for redundancy or capacity, or they need at least a few of the edge clouds in a particular region up and running at all times. So you can't just hit the button and say, hey, I want you to upgrade everything, because we would simultaneously start upgrading them all and you could bring down service at a larger scale. With StarlingX, we have a grouping capability, so you can create multiple levels of groups and run the upgrades on those groups, and that prevents that kind of outage. Next: upgrading the OS and the flock services. Some of what's happening with the StarlingX project today is causing challenges that we've brought on ourselves, just because of what we needed to do. These are some examples. We are moving from CentOS to Debian; that's one of the things this version is doing, which means there are obviously differences between CentOS and Debian in the configuration, the files, the layouts, the package manager, all kinds of things we have to account for. And in an upgrade, you want that to be seamless, right?
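The edge cloud grouping described a moment ago can be sketched like this, assuming a simple one-wave-per-group policy. This is a hedged illustration; the function, the group names, and the stop-on-failure rule are my own, not the actual StarlingX orchestration logic.

```python
# Hedged sketch of group-by-group orchestration: subclouds that provide
# overlapping coverage go in different groups, so one wave never takes
# them all down at once. Not the StarlingX implementation.

def orchestrate_by_group(groups, apply_upgrade):
    """groups: {name: [subcloud, ...]}; upgrade one group per wave.

    Stops before the next group if anything in the current wave failed.
    """
    results = {}
    for name, subclouds in groups.items():
        wave = {sc: apply_upgrade(sc) for sc in subclouds}
        results[name] = wave
        if not all(wave.values()):
            break  # don't risk the remaining groups' coverage
    return results

groups = {
    "region-east-a": ["subcloud-1", "subcloud-2"],
    "region-east-b": ["subcloud-3"],  # overlaps coverage with group a
}
print(orchestrate_by_group(groups, lambda sc: True))
```

The design point is that at most one group's worth of clouds is mid-upgrade at any time, so a region with overlapping coverage always has at least one group still serving traffic.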
You don't want that to be something that causes any service outage. So what StarlingX is doing is, as part of the upgrade process, taking extra care to make sure we've captured all the data so we can seamlessly migrate between the two OSs. The second thing is the move to OSTree. StarlingX is moving to OSTree, and OSTree is great; there are a lot of good things about it. But from an architectural viewpoint of StarlingX, there are challenges around how we deliver upgrades and patches versus how we used to do it. Again, that's something the project has gotten involved in and is taking care of: we're making sure that the same process used in older StarlingX versions still works, and that the differences between the OSTree and non-OSTree versions are transparent. StarlingX also uses something called Armada, another upstream open source project we integrate, and that project went end of life. Armada manages our StarlingX applications: we have a certificate manager, a Windows Active Directory service, and different storage functionality services, all part of the StarlingX project. So the end of life of a component StarlingX uses required us to find a suitable substitute upstream, and we think we found a really good one in Flux CD. Then, obviously, part of the upgrade is migrating the metadata and all of the pieces from Armada over to Flux CD. And then there are air-gapped installs. It's an edge deployment, with all the networking constraints we've talked about: it doesn't necessarily have connectivity to the internet, and it has spotty connectivity back to the core. So StarlingX has to handle air-gapped installs.
I kind of alluded to that before, but we have an orchestration mechanism that pushes the cache down to the edge clouds before anything starts, validates it, and makes sure it works. And lastly, StarlingX tries to stay as current as possible. One of the things about staying current is that a lot of times the Linux kernel distribution we've chosen doesn't have the latest drivers for some of the latest hardware: network cards, accelerator cards, things like that that are cutting edge. So the StarlingX project actually pulls in those drivers and the firmware and integrates those components into the project. As part of that integration effort, the project has to take into account these external drivers that we need to make sure are there and available. Now, from a Kubernetes perspective, the challenges in the field and the challenges of doing upgrades. Again, it's a little self-inflicted, but StarlingX is moving to OSTree, and OSTree has a mechanism that protects certain areas of the disk and makes them read-only. So the old way of doing things, using symlinks when you want new versions of Kubernetes, doesn't work with OSTree. The project has moved to a bind-mount scenario, similar to others, where we install the new versions of Kubernetes side by side and then use a bind mount to move to the new version, or back to the old version, as needed. The reason we have the application step I talked about earlier is that as you move through Kubernetes or any other project, things get deprecated and things are new, so in order to stay in service during an upgrade or an update, the application has to have a chance either to be updated or to handle the situation where Kubernetes is upgraded and it can still run.
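The bind-mount idea for Kubernetes versions can be sketched like this. The paths below are invented for illustration and are not StarlingX's actual filesystem layout; the point is only that staged versions live side by side and a bind mount, rather than a symlink, selects the active one on a read-only OSTree filesystem.

```python
# Illustrative only: several Kubernetes versions staged side by side,
# with a bind mount selecting the active one. Paths are hypothetical.

def kube_bind_mount_cmd(version,
                        staged_root="/opt/kubernetes-versions",
                        active_path="/var/lib/kubernetes/current"):
    """Return the mount command that would switch to `version`."""
    source = f"{staged_root}/{version}"
    return ["mount", "--bind", source, active_path]

print(kube_bind_mount_cmd("1.24.4"))
```

Rolling back under this scheme is just unmounting and bind-mounting the previously staged version, which is what makes side-by-side installs attractive on an immutable root filesystem.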
So that's why we have to be careful in our orchestration and in the way we lay things out, to make sure service continues. Then the big one is the release gap, and this is pretty much what the rest of the presentation touches on. StarlingX releases twice a year; Kubernetes releases every four months, and then there are the plugin releases specific to Kubernetes. What that means is that StarlingX wants to maintain the most current version of Kubernetes it can for every release, and most likely we're going to have to upgrade two versions of Kubernetes at a time in one upgrade, just by virtue of six months versus four. So StarlingX has to solve that problem, and we actually upgrade multiple times: if there are two versions of Kubernetes, we do two upgrades of Kubernetes, but we get the service up and running and then upgrade again to the next version with it up and running. That's because of the next challenge, which is that today Kubernetes doesn't support skipping, right? You have to go from one minor release to the next minor release, and it doesn't have a concept like OpenStack does where you can take the control plane, do a two-step upgrade, and then bring it back online. And lastly, StarlingX does some things around the kubelet in order to differentiate between platform services and application services; it's all about low latency and giving the maximum performance to the applications. So StarlingX does have to modify the kubelet right now, and we have to build the kubelet as part of our release, along with the matching Go toolchain. As we do upgrades with Kubernetes, we've got to worry about the kubelet mapping as well as Go. The plugins are more of the same, but essentially what we see in the community right now is that Calico, the CNI we use, doesn't really have a formal release cadence; it supports three versions.
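The release-gap arithmetic above works out like this: because Kubernetes minor versions can't be skipped, a six-month StarlingX release tracking a four-month Kubernetes cadence may have to step through two minors in one upgrade, coming back into service between the hops. A small sketch (my own helper, not a StarlingX API):

```python
# Sketch: compute the sequence of Kubernetes minor upgrades, since
# minors can't be skipped. Helper invented for illustration.

def k8s_upgrade_path(current, target):
    """Minor versions to apply in order, e.g. (1, 22) -> (1, 24)."""
    major, minor = current
    t_major, t_minor = target
    assert major == t_major, "sketch handles same-major upgrades only"
    return [(major, m) for m in range(minor + 1, t_minor + 1)]

# Two Kubernetes minors behind means two in-service hops:
print(k8s_upgrade_path((1, 22), (1, 24)))  # [(1, 23), (1, 24)]
```

Each hop in that list is a full upgrade cycle, which is why the orchestration brings the cluster back to a healthy, in-service state before starting the next one.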
We want to use the latest as much as possible, so we're challenged with pulling the right Calico version. Multus has no formal statement that the project has seen, and it's the same challenge. So as we pull in these new plugins, we've got to take a look at them, figure out which version of Kubernetes they work with, and how well they function with an in-service upgrade. Some, like NetApp's, actually restrict you: you can't upgrade to an unsupported version, and some of the plugins won't even allow themselves to load. So it's a concern that the project takes care of, to make sure it's all working. In the end, Kubernetes gets upgraded with the latest and greatest supported plugins as well. So, hopefully that piqued your interest enough to go figure it out on your own. Here are methods for you to get involved and ways you can learn more. These are the events StarlingX is doing this week: come visit us, come to more sessions. If you just want to see it or install it, if you're a hands-on type of person, CENGN mirrors the ISO so you don't have to build anything, and you can download it immediately. docs.starlingx.io has an install guide that covers installing on bare metal as well as virtual, so you can install it virtually just to see what's going on. And then there are the traditional mailing lists and the other things you need to know. If that really excites you, we have ways to contribute. These are all the links to our places: where the bugs are; where we have thought leadership, new ideas, and early, early thoughts; our storyboards, where we're actually designing implementations; and the last one is the wiki page for meeting the groups, meeting the leads of the groups, and reaching out to them.
Maybe there's an interest in only one or two things for you, so that's where that goes. And that's what I've got. Thank you. I've got a few stickers; I don't have enough for the room, but you can meet us at the stand. If you're interested in stickers you can come get some; I'll leave them right here. So, rollbacks for updates: that's fully handled within the project itself today. Rollbacks for Kubernetes upgrades: there's a path within StarlingX by which you can still roll back, but there is a point where you've proceeded too far. To answer your question specifically, once both controllers have been upgraded to the new Kubernetes, rollback isn't there yet. We do agree that's something that needs to be worked on. It probably belongs outside of StarlingX so that others can utilize that capability as well. But yeah, absolutely. Yep. So the question was, with the move to Debian, will it still support the real-time kernel it has today? And the answer is yes, it will; it'll have both. It's optional, and you get to choose as part of your installation method what you want to use. OpenStack upgrades? Yeah, so today, I believe OpenStack upgrades are all done through Armada, and it's going to be through Flux CD, so it would be an upgrade of their pieces, their pods. Because it's Kubernetes, we can update each one and work our way through it. I don't believe we have live migration on StarlingX today; it's not supported, I don't believe, but I can confirm with one of our guys out here. Yeah, okay. I don't know about KubeEdge; I'd be interested to talk to you about it, there could very much be similarities. The system controller does know a little bit about StarlingX, right?
It's all API based, certainly, and the flock services I talked about in the middle are all REST API based components as well, so there are pieces of the system controller that know about that. There are also dependencies: for instance, in some cases we can manage usernames from the system controller and synchronize those down to the sub-clouds, so the same usernames and passwords work across them if you want to deploy it that way. Certificates too: there are certain certificates that are shared across the whole deployment, so there may be some overlap. I don't know; happy to talk to you about it for sure. Control plane? No. But I'm not sure I fully... you have a reason for asking this question. So remember that the edge clouds are all independent; they're all independent Kubernetes masters and OpenStack scheduling systems. The only thing we're doing from the system controllers down across the control plane is initiating commands that we want the sub-cloud to execute and getting information back from the sub-clouds on what they're doing. Yeah, there are pieces of that that get pushed back up into the system controllers, absolutely, but it would be an audit trail based only on each individual cloud; it wouldn't be an audit trail that connects all the other clouds, right now, today. Yeah, thank you for being my stand-up guy. No, it doesn't mean that. Where we started, when StarlingX was first brought out and the concept was being developed, was that edge computing is something different than core, different than something in a data center. It can run in a data center; we have the same scaling capabilities. You may not even run distributed cloud or the system controllers in that case; you would just stand up a cloud like you would at any level. I think where the project has gone is that that's table stakes, because even at the edge you might run an edge site with 200 nodes, right?
So what we've migrated to solving as well goes beyond just a single cloud: let's get them out to the edge, let's get them so they can run on a single server and scale, and then solve the challenges associated with that. A lot of those you wouldn't run into if you put it in a data center, for sure. And it is a single ISO, so all you do when you install it is provision it a certain way. You provision it as a regular cloud and it won't be an edge cloud; you just say it's a regular cloud, standalone all by itself. You provision it to be a system controller so it can do the system controller functions, and then you provision it to be an edge cloud, where the difference between an edge cloud and a normal cloud is that it knows this guy up at the top is going to ask it to do stuff, manage it, and send it stuff. The system controller afterwards? Switch to it? Oh yes, absolutely. Yeah, that's a good point. We call it re-homing. So the question is, can you move a sub-cloud? You can stand up an edge cloud under system controller A, stand up system controller B, and say, hey, I want to move that sub-cloud over. It works, yes. We're also working on newer things around this: you might have edge clouds and want to combine edge clouds, so there are features coming out around that as well with the project. Yes? We can take the last couple of questions. Yeah, absolutely. So the question is, how do we handle the old versus the new when we're upgrading? That is a challenge, because in some cases you may be changing the API or changing things, but we handle N minus one, pretty much for the life cycle, until you get everything upgraded.
It could be months, it could be years; it's not that you're necessarily going to stay at different versions forever, but you do have time, and there's no urgency to do it. Can we skip that? No, absolutely not. I mean, we even have to deal with that internally to the node itself: we upgrade the standby controller, and the standby controller needs to know how to deal with the old version. We've got many versions, so there is an N minus one upgrade concept that we do support. Yeah, we've tested it to a thousand. We've tested it to a thousand sub-clouds, and we've tested 500 simultaneous updates at a time. But in the real world it'll go further; there'll be more. There are bandwidth and other limitations that we've got to take into account but, you know, correct. That's correct. Yep: if you had a 2,000-cloud edge deployment scenario, you'd instantiate two system controllers today.