Okay, sound check. Can you all hear me? All good? Loud and clear? Thank you. Okay. Hello, everyone. Thank you so much for joining today. In this session, we are going to talk about StarlingX at scale: specifically, some of the challenges we encountered, how we solved them, and some of the other lessons we have learned.

Before we get into the slides, we'll do our introductions. My name is Ramaswamy Subramanian. A lot of people butcher my name, so to keep it simple, I go by Ram. I'm from Wind River. From the StarlingX community perspective, I'm a Technical Steering Committee member, and I'm also the project lead for Distributed Cloud as well as the Flock Services project. Thank you.

Thanks, Ram. My name is John Kung. I'm also from Wind River, and I'm the StarlingX Flock Services tech lead. I've also been an active contributor to StarlingX since its inception.

Okay. So before we get into the slides, a quick show of hands: how many of you have used StarlingX? One. Okay, that's good. Hopefully after this session you'll be excited to try StarlingX, play around with it, and hopefully contribute.

So StarlingX, if you don't know it, is an infrastructure software stack predominantly used for cloud-native workloads. StarlingX is a generic platform capable of supporting multiple domains. Communications, or telecommunications, is the primary domain where StarlingX is deployed in production at scale, and a lot of its functionality and capabilities are used in that telecommunications environment. From those production deployments we have gained quite a lot of learning and experience, and we have implemented quite a bit of software enhancement to make sure the software is highly scalable and capable of supporting lots of different capabilities. But the solution is not limited to telecommunication networks or environments. It is suitable for energy, enterprises, emergency services, automotive, industrial, healthcare, agriculture, robotics, aerospace, manufacturing, and retail. The solution is very generic and capable of supporting many different use cases.

Since we are trying to evolve towards an environment that involves multiple different domains, the solution obviously needs to be scalable. The simplest way to deploy StarlingX is what we call an AIO (all-in-one) Simplex environment, which essentially means you bring up a server with some CPU, some RAM, and some storage, you install StarlingX, and you get a cloud-native environment on top of which you can run a whole bunch of cloud-native applications. If you want some type of redundancy, some type of high-availability architecture, the next model is the two-server model (AIO Duplex), where the majority of the control services are highly available. What that essentially means is that if one server goes down for whatever reason, the services will still be operational: the second server will take over and provide the necessary services.
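As a quick aside before the other configurations: whichever deployment you choose, what you end up with is a standard Kubernetes cluster. Here is a minimal sketch of a post-install sanity check, assuming a kubeconfig is already in place and using the standard Kubernetes Python client; this is illustrative, not part of StarlingX itself.

```python
# Minimal post-install sanity check for an AIO-Simplex node: confirm the
# Kubernetes control plane answers and the node reports Ready.
# Assumes a kubeconfig is already set up (e.g. copied from the controller).
from kubernetes import client, config

def node_ready() -> bool:
    config.load_kube_config()          # reads ~/.kube/config by default
    v1 = client.CoreV1Api()
    ready = True
    for node in v1.list_node().items:
        for cond in node.status.conditions or []:
            if cond.type == "Ready":
                print(f"{node.metadata.name}: Ready={cond.status}")
                if cond.status != "True":
                    ready = False
    return ready

if __name__ == "__main__":
    print("cluster healthy" if node_ready() else "cluster not ready")
```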
And depending on the application requirements, if there is a need for additional compute resources for deploying more resource-intensive applications, that is where the standard configuration with controller storage comes into the picture. The difference between this and the previous model is that your controller services are still highly available and redundant, but the compute function is distributed across multiple hosts identified as workers. The storage, however, is still co-located on the same servers as the controller services. And there could be use cases where customers, depending on their application workloads, say: I need more storage, I want more compute capacity, so let's isolate the controller infrastructure services to a couple of servers. That is where the configuration we call the standard configuration with dedicated storage comes into the picture.

All of these configurations are more or less located in one single environment, what we would call a data-center type of environment. But given the evolution we are seeing, with distributed computing being requested for many different use cases, StarlingX provides what we call the distributed cloud capability. What we mean by that is there is a central cloud, or system controller, which manages a fleet of edge clouds, and the majority of the application workloads run on those edge clouds. For the rest of the session, we are going to focus on distributed cloud. We have quite a lot of detail related to distributed cloud, and that is where we will spend most of our time.

The first thing we want to start with is the distributed cloud architecture. The important thing to notice here is the edge clouds. These edge clouds are geographically distributed across multiple different locations, and they are not just remote workers: they are full clouds. What we mean by that is the edge clouds have full autonomy over all the decisions they need to make for their cloud-native applications to execute successfully. They have a local control plane, which is highlighted here. In terms of what kind of edge clouds can be deployed, we do not impose a limitation. In the previous slide, we spoke about multiple different configurations: deploying on one single server, high availability, the standard configuration, and the one with dedicated storage. All of those configurations can be deployed at an edge cloud site. At the end of the day, it depends on the use cases and the resource requirements for running the applications. Based on those requirements, whoever is deploying the network can decide how far they want to scale it and what type of edge clouds they want to deploy.

So once you have multiple edge clouds distributed across multiple different sites, the next important question is: how do I know the status of my edge clouds? That is where the central cloud comes into the picture. The first thing I want to highlight is the dashboard. The dashboard essentially gives you a single pane of glass view of your entire distributed network (a small status-polling sketch follows below).
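To make the single pane of glass idea concrete, here is a hedged sketch of polling the distributed cloud manager (dcmanager) REST API on the central cloud for subcloud status. The endpoint path, port, and JSON field names below are assumptions modeled on typical dcmanager deployments; look them up in the Keystone service catalog and API reference for your installation.

```python
# Sketch: poll the dcmanager REST API on the system controller for a
# fleet-wide status summary. Address, port, and field names are assumptions.
import requests

DCMANAGER_URL = "http://<system-controller>:8119/v1.0"  # hypothetical address

def list_subclouds(token: str) -> None:
    resp = requests.get(f"{DCMANAGER_URL}/subclouds",
                        headers={"X-Auth-Token": token}, timeout=30)
    resp.raise_for_status()
    # One line per subcloud: management state and availability status.
    for sc in resp.json().get("subclouds", []):
        print(f'{sc["name"]}: mgmt={sc["management-state"]}, '
              f'avail={sc["availability-status"]}')
```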
So using the dashboard, you have full visibility into the status of the edge clouds. Are they operational? Are there any alarms? Are there any events? Are there any updates that need to be applied? Everything is visible at the central cloud level. Once you have that visibility, and assuming you have deployed successfully, the next set of activities needed is lifecycle management. What I mean by lifecycle management is: how do you manage software updates, how do you upgrade the systems, and how do you make sure that any updates which need to be applied can be applied? Those capabilities, software upgrades and updates, and firmware orchestration as well, are all centrally managed from the central cloud. The central cloud has all the intelligence it needs: it knows which sub-cloud it is talking to and which operation is being executed, it monitors the status of all the operations, and it provides feedback so the user knows how the system is behaving.

Once the edge clouds are deployed and we want to deploy applications on them, say containerized applications, we provide a centralized container image registry as a location where all the images are stored, and the sub-clouds pull those images from that registry. Obviously, if you want to deploy new sub-clouds, the deployment capabilities are provided in the central cloud as well. From a StarlingX perspective, we want to be highly secure and make sure everything is managed from a central location. That is where certificate management and identity management come into the picture. In addition to all of this, we provide system-wide infrastructure orchestration, used from the central cloud to manage and orchestrate all the different operations that get executed on the various edge clouds.

In terms of the cloud-native applications that can be deployed: obviously Kubernetes provides the environment for running containerized applications. In addition to that, we also support OpenStack. OpenStack runs as a cloud-native application on top of Kubernetes, and using it you can deploy the VMs needed for OpenStack-based workloads.

Okay, so the next question we get asked is: this looks fine, it looks cool, but how do I go about installing it? If I want to deploy an edge cloud, what are the steps involved, and how easy is it for me to deploy the edge cloud and bring the system operational? That is where zero-touch provisioning comes into the picture, and John will take over from here.

Okay, thank you. So as Ram mentioned, in order to deploy at scale, we need to start at the beginning: how do we deploy a bare-metal edge server and turn it into an edge cloud? This is what we're calling zero-touch provisioning. Perhaps I should contrast that with an experience I had about 10 years ago on a different product.
In that case, when we wanted to deploy a new geographically distributed system, we had to actually go onsite after the network planning had been done and everything was wired in, and we would have to work in probably a cool, windowless room for a day or two just to bring up the initial access point, so that we could then move to a windowed room and do the provisioning. That would probably take a couple of days just to bring up an edge site. With StarlingX, we now have capabilities that let us perform this in a much more automated way: hours and minutes instead of days.

Essentially, the capability starts with an edge cloud server with a management device that supports the Redfish management protocol. Basically, it exposes a REST interface once it's connected and wired to the network. There are some issues, of course, that we have to overcome when deploying this type of system. It's geographically remote, so do we actually need to go physically onsite after the initial exposure of that REST interface? The network itself, a Layer 3 network, could have latency, bandwidth limitations, and even error rates because of the geographical distances involved. And operationally, as alluded to earlier, the steps to bring up an edge cloud can be complex, and you'd have to check for a certain state before proceeding to the next one. All of that can be very time-consuming, even if you have all the steps correct.

So what Distributed Cloud in StarlingX offers is this: when we want to add an edge server, all we need to do, to distill it down, is issue a subcloud add command. The DC Manager services on the central cloud issue a Redfish command, and that Redfish command is issued with a special boot image that the central controller creates. Basically, that gives the server enough of the installer to set up the proper interface for bootstrap. The image is pushed through the Redfish API and mounted, and the server pulls the packages required for installation. In the install phase, it pulls down the packages from an OSTree repository to bring in the rest of the software, because at the early stage we don't need a very large boot image, maybe 80 megabytes or so, whereas here maybe gigabytes of data are being pulled across. Back at the system controller in the central cloud, it detects the completion of the install phase. It's all coordinated by the central controller; there's no manual intervention at this point. The distributed cloud central controller can observe the progression, and once it detects that the interface is up, it moves on to the bootstrap phase. At that point, it brings up the essential services and pulls down container images; those by themselves can be several gigabytes pulled across a Layer 3 network. And at that moment, our system is deploy-ready. In an example with, say, 50 milliseconds of latency, that can take about an hour and a half. So that's pretty good, but we added features thereafter to improve it even further.

And to account for one particular use case, for the very initial install, one option is to start with a factory-installed, pre-staged ISO with container images stored on a persistent partition, alongside the OSTree repo: the repo can be pulled locally during the install, and the container images can be pulled locally during the bootstrap phase.
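Going back to the remote Redfish install for a moment: to make the mechanism concrete, here is a rough sketch of the kind of sequence a remote installer drives with standard Redfish actions: attach a boot ISO as virtual media, set a one-time boot override, and power-cycle. The BMC address, credentials, and the manager/system IDs ("1" here) are placeholders that vary by hardware vendor; this is illustrative, not the actual DC Manager code.

```python
# Sketch of a Redfish-driven remote install: attach a bootstrap ISO as
# virtual media, override the next boot to use it, then power-cycle.
# BMC address, credentials, and resource IDs below are placeholders.
import requests

BMC = "https://<bmc-address>"   # hypothetical BMC endpoint
AUTH = ("admin", "password")    # placeholder credentials

def boot_from_iso(iso_url: str) -> None:
    s = requests.Session()
    s.auth, s.verify = AUTH, False
    # 1. Attach the small bootstrap ISO as virtual CD media.
    s.post(f"{BMC}/redfish/v1/Managers/1/VirtualMedia/CD"
           f"/Actions/VirtualMedia.InsertMedia",
           json={"Image": iso_url, "Inserted": True,
                 "WriteProtected": True}).raise_for_status()
    # 2. Override the next boot, once, to come from that virtual CD.
    s.patch(f"{BMC}/redfish/v1/Systems/1",
            json={"Boot": {"BootSourceOverrideTarget": "Cd",
                           "BootSourceOverrideEnabled": "Once"}}
            ).raise_for_status()
    # 3. Power-cycle; the server boots the installer and proceeds unattended.
    s.post(f"{BMC}/redfish/v1/Systems/1/Actions/ComputerSystem.Reset",
           json={"ResetType": "ForceRestart"}).raise_for_status()
```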
With the pre-staged approach, those gigabytes of data are pulled locally for the initial install, and only in the optional fallback scenario would we need to pull from the central controller to get to deploy-ready. Comparing the pre-staged install to the remote install case, it's about three times faster: about 22 minutes versus 75, I'd say. In either case, it's much, much better than what we were doing 10 years ago on a different product.

Okay, so since we are dealing with lots and lots of servers, there is a need for orchestration, to eliminate as many manual operations as possible. From a StarlingX perspective, we call it multi-level orchestration, because the orchestration is not limited to the central cloud; it extends all the way down into the sub-clouds. There are different capabilities here. To start with, zero-touch edge: StarlingX provides a fairly simple mechanism for deploying a distributed cloud, logically and intuitively. It's multi-cloud: all the different lifecycle operations executed from the central cloud have knowledge of the geographically distributed environment and can be orchestrated across it. The biggest capability is the single pane of glass: since you are managing a distributed cloud environment, having a single view of how the network looks and how the different orchestrated operations are progressing is crucial, so we provide that single pane of glass view. Security is obviously very important, to make sure all the different operations are secure and there is no breach anywhere. And it is edge-enabled: from the central cloud itself, you are able to manage all these resources remotely and orchestrate all the different operations. We have a walk-through, more like a flow, of how an orchestration would happen within an edge cloud.

Okay, so now that we've brought up the edge cloud, in terms of day-two operations: a typical day-two operation might be a software update (we call it day two, but it could be many, many days later) or a firmware upgrade, for example. So how do we do this, again without needing to be physically at the site? When the system is brought up after the bootstrap and deployment phases, StarlingX Flock Services are running at the edge. Flock Services is an aggregation of multiple services. They include such things as configuration management; host management, through which, for example, we can reset and power down hosts within the edge; services management, for high availability within the edge; and fault management, for alerts and alarms. All those services are lumped together here as a Flock Services API, though in fact they are separate APIs. In this example, an update request comes in. Let's say it's a firmware upgrade. Now, a firmware upgrade can be very complex: it can involve multiple device images that need to be applied in a certain sequence. That sequence is managed within StarlingX, by the VIM orchestrator.
From there, it's able to reach out to service agents and also coordinate with Kubernetes, for the multi-host case, as necessary: to drain the node so that it's ready for an upgrade, in case we need to take it out of service; and finally, to request the service agent to actually perform the firmware upgrade or software update. All of this is coordinated by the orchestrator. It knows what order to apply the device images in once they've been uploaded by the update request. That is all within the edge; the box on the right is a single edge cloud. By exposing that REST interface, the system controller has command and control of the edge.

This next illustration is at the system controller, the distributed cloud level. The distinguishing service here is the DC Manager service. It's responsible for managing the states and services of the sub-clouds. In this example, the user back here wants to run an orchestration strategy. They might have a group of sub-clouds; there can be many, many edge clouds at the edge, up to 1,000 as supported by StarlingX today. In this example, our canonical use case is 5G radio towers, but as Jeff mentioned in the keynote presentation yesterday, there is a very broad set of use cases this could be applied to. Say the administrator wants to orchestrate a set of sub-clouds, let's say just 100 of them. We can select a group of 100 through an orchestration strategy. That goes in through APIs on the central controller. Depending on the type of update, it may need to persist content in an HA-managed file system; the reason it needs to persist it is that these could be, as in the previous example, firmware images it needs to send to however many sub-clouds it is orchestrating. That in turn goes to the DC Manager service, which reaches out to the sub-clouds through the update mechanism illustrated previously, orchestrates the operation, and monitors the progress as each sub-cloud moves through the states. Each sub-cloud proceeds in parallel based on the orchestration model: we can select serial or parallel apply, and we can even select the number to be applied in parallel, depending on, typically, network bandwidth limitations and things like that (see the sketch below). As the update proceeds, the edge clouds may pull certain information back from the shared services on the central controller.

So that's that; all that's good. The single pane of glass concept that Ram mentioned means we're able to provide, from the StarlingX Horizon GUI, a summary overview of each sub-cloud's state. In this example, we can see that for every sub-cloud the availability state is green, along with the deployment states; we track all of this within the DC Manager. There is one here showing a degraded state, and if we click on this button, we get the alarm details. In this case, we can see there's a minor alarm at that sub-cloud, and we can even drill in to see exactly what the alarm is. Within the sub-cloud view, there are also all the different resources orchestrated by the DC Manager, which we can see as well.
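Here is the promised sketch of creating and applying an orchestration strategy across a set of sub-clouds with bounded parallelism, mirroring the parallel-apply option described above. The endpoint paths and body fields are assumptions modeled on the dcmanager API; verify them against the API reference for your release.

```python
# Hedged sketch: create a firmware-update orchestration strategy on the
# system controller and apply it to subclouds in parallel, with a cap.
# Endpoint paths and payload fields are assumptions; check your release.
import requests

DCMANAGER_URL = "http://<system-controller>:8119/v1.0"  # hypothetical

def firmware_update_in_parallel(token: str, max_parallel: int = 100) -> None:
    headers = {"X-Auth-Token": token}
    # Create the strategy: firmware update, applied in parallel.
    requests.post(f"{DCMANAGER_URL}/sw-update-strategy", headers=headers,
                  json={"type": "firmware",
                        "subcloud-apply-type": "parallel",
                        "max-parallel-subclouds": max_parallel},
                  timeout=30).raise_for_status()
    # Kick off the apply; dcmanager then walks each subcloud through
    # its states and monitors progress.
    requests.post(f"{DCMANAGER_URL}/sw-update-strategy/actions",
                  headers=headers, json={"action": "apply"},
                  timeout=30).raise_for_status()
```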
So at the edge, we can threshold a lot of the incoming data, and it can dynamically learn the available sensors to decide what to backhaul, that is, what to send back up to the DC Manager. So that's an overview of some of the capabilities.

Okay, thanks, John. So in terms of scale: the latest StarlingX release is 8.0, and with a single system controller we are able to manage and support up to 1,000 edge clouds. That's the scale we support today. And obviously, with so many sub-clouds there are a whole bunch of lifecycle management operations, so over the years, across the different releases, we have consistently worked on improving the scalability of those operations. As an example, take edge cloud install: in release 4.0 we supported only two parallel edge cloud installs; we increased that to 50 in 5.0, maintained the same capability in 6.0, increased it to 100 in 7.0, and again support 100 in 8.0. If you look across all the different operations, we have consistently worked on improving the scalability of the system as we scaled the number of edge clouds supported by a single system controller, or central cloud. In release 8.0, we introduced two new capabilities: backup and restore. For backup, we are able to run 250 parallel edge cloud backups from a single system controller, and for restore, we support 100 parallel restores from the system controller, the central cloud. So if you notice, there is a pattern here. Just a question: is it remote backup and restore? Yes. We have the capability to store the backup at the sub-cloud level, or we can transfer it back to the central system controller. Okay, so this could also be used for sub-cloud restoration after, let's say, a software failure? Yes, after whatever failure happens. But these numbers are not near a thousand. That's true. These are parallel operations: in a single instance, when you run the command, you get up to 250 in parallel, for example for backup.

Okay, quick time check: I think we have three minutes left, and we still have a lot of content to go through, so we'll go through that, and if you have any questions we'll be more than happy to answer them here or afterwards. So I'm not sure if you noticed the pattern: there are features planned for 9.0 and the future where we want to scale those numbers as well. Once you have a fairly distributed network with lots and lots of edge clouds, we want to shrink, as much as possible, the amount of time the operator or user has to spend managing and maintaining the network.

Okay. So here there is quite a bit of detail on the challenges we encountered and the solutions we implemented. The most common way of solving a scalability challenge is to add resources: more CPU, more RAM, more storage. With StarlingX, we went in the opposite direction. What I mean by that is: what can we do in the software itself? How can we use the existing resources available to the system in a much more optimal manner to scale the number of operations that are supported? So there is a common theme here.
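As an aside, here is a minimal generic-Python illustration of the bounded-parallelism idea behind the numbers above (this is not StarlingX code): each lifecycle operation runs once per edge cloud, but never more than a fixed number run at once, so a single controller's CPU, RAM, and network are not overwhelmed by a 1,000-cloud fleet.

```python
# Illustration of bounded parallelism: run one operation per edge cloud,
# capped at max_workers concurrent operations at any moment.
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_bounded(operation, subclouds, max_workers=100):
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(operation, sc): sc for sc in subclouds}
        for fut in as_completed(futures):
            sc = futures[fut]
            try:
                results[sc] = fut.result()
            except Exception as exc:
                # One failed subcloud must not stop the rest of the fleet.
                results[sc] = f"failed: {exc}"
    return results
```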
One of the themes is optimizing the algorithms, fine-tuning the file system and the storage space that is needed, and fine-tuning the different resources used by a specific service through refactoring and redesign. Those are some of the things we did to scale the number of parallel operations that can be executed. And we have a few more examples for other lifecycle operations. Even there it's much the same theme: optimizing algorithms, implementing caching, fine-tuning the thread pools, eliminating thread contention, and those types of things. Those are some of the things we implemented to scale the system to support quite a high number of parallel operations.

Okay. So here in this example, we have two distributed clouds which are geographically separated: there is one distributed cloud here, and another distributed cloud here. In StarlingX, we support a capability we call re-homing an edge cloud. For whatever reason, say as part of disaster recovery, let's suppose this central cloud goes into some kind of failure condition. Based on the StarlingX architecture, the edge clouds are fairly autonomous: since they have a local control plane, even if the central cloud is gone, the applications running on the edge cloud are still operational and able to provide whatever service is needed. So, to accommodate these types of failure conditions, and for the users or the operator to retain visibility into what is going on in the edge cloud, we support a functionality called re-homing an edge cloud. What we mean by that is: take this edge cloud, and move the management aspects of that particular edge cloud to, let's say, this other central cloud. I know in the picture it looks very simple, but under the hood a lot of changes happen as part of the re-homing operation. That includes changing the network configuration on the edge cloud so that it knows which new central cloud it is talking to, and changing the secrets and certificates needed for communicating with the central cloud and for accessing the registry, so that the central cloud can successfully communicate with the edge cloud and provide that operational view of how the edge cloud is operating. All of these operations are transparent to the application workloads. All the configuration changes are managed at the StarlingX infrastructure layer, and from the application perspective there is the exact same uptime; it's fairly transparent, and everything works as expected.

Okay, I think we are pretty much at the end of the presentation. So if there is one thing you take away from this slide, all you have to remember is: StarlingX is one cloud, any scale. StarlingX is extremely scalable. Across the different verticals where you might deploy StarlingX, for the different use cases, it can operate on as little as one single server; if you need high availability, we support that; or if you need more sophisticated storage or compute requirements, we have a distributed environment that can be hosted in a data center or at an edge cloud as well.
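For the curious, one plausible shape of triggering a re-home from the new central cloud is sketched below. The exact command name and flags here are assumptions on my part; consult the re-homing procedure in the documentation for your StarlingX release before relying on anything like this.

```python
# Hedged sketch of kicking off a subcloud re-home from the *new* system
# controller. The CLI flags are assumptions; check your release's docs.
import subprocess

def rehome_subcloud(bootstrap_address: str, bootstrap_values: str) -> None:
    # Run on the new system controller. The edge cloud keeps serving its
    # workloads throughout; only its management plane is re-pointed.
    subprocess.run(
        ["dcmanager", "subcloud", "add", "--migrate",
         "--bootstrap-address", bootstrap_address,
         "--bootstrap-values", bootstrap_values],
        check=True,
    )
```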
So if anybody is working in any other verticals where you have use cases or functionality in mind, the chances are StarlingX already supports it, so we encourage you to explore and try StarlingX. If you want to have more discussion, please do get engaged in the StarlingX community discussions, and any feedback, suggestions, or contributions to the StarlingX project are welcome from everyone. Okay, so with that we'll wrap up. Any questions? Maybe I'll start at the back.

Definitely we depend on the switches. Based on the StarlingX deployment, there is a certain level of networking requirements that we prescribe, so we definitely expect those configurations to be set up on the switches and routers to make sure there is proper connectivity, proper routing, and everything working as expected. No. No. Okay, go ahead.

If you schedule sub-clouds but only a fraction is online at the moment, will they update whenever they come back online? For us, it's a requirement for the sub-cloud to be managed and online. There's an administrative state: as long as it's managed, it will be allowed to be updated. But furthermore, it needs to be online; when it's offline, that's an error state in which the sub-cloud is considered unmanageable, actually unreachable. Okay, so also, for example, if you schedule a software update to a newer version of software, then you have to redo it when the actual site is online? Yes, it should be online when that strategy is created. Go ahead.

Are you able to check the physical integrity of the servers you rolled out at the edge, at the hardware level? At the software level, we manage things; at the hardware level, apart from the alarming on the different things we monitor, we do not have full capability to see the state of the hardware. So there is a certain amount of monitoring we do as part of StarlingX; beyond that, if there are any advanced scenarios of advanced error checking, we don't have that capability. Yeah, for example, we monitor memory, CPU, disks, the utilization of all the different resources, and if they go beyond a threshold we raise an alarm. Optionally, we can enable sensors, so depending on what the IPMI on that board supports, all those sensors can be thresholded and sent back northbound. Okay, sorry, go ahead.

Do you have secure boot with a TPM root of trust in the central cloud to confirm that the edge hasn't been tampered with, from the operating system standpoint? Yes, during the install we can pass in a certificate that ensures the image we're sending across is correct; furthermore, the image that we create and deploy is a signed image. So during operation, the central cloud is able to confirm that the certificate isn't stored in plain text, or... So when you say tampering, just to make sure we understand properly: are you saying somebody is logging into the edge cloud and playing around with the configuration, changing files, those types of things? Okay. So from the central cloud perspective, we do not allow manual editing of the files or anything like that. We store a record, a copy, of the configuration the edge cloud was deployed with, and if for whatever reason something has changed, on the next reboot we go over all of that and bring it back to the golden copy, the golden record, of how the edge cloud was deployed. That's how we manage it.
Let's take this offline; I think we need to discuss more and understand what we are referring to, and then we can try to provide more details. Okay, thank you. Sure. Sorry, I can't hear you. So if I go back here, to where we spoke about the standard configuration with dedicated storage: depending on the amount of storage resources you need, as part of the network dimensioning or the planning of your deployment, you can add additional storage servers, and those can give you whatever amount of storage is needed for the applications you are trying to run. Okay, sorry, you had a question. So what we do is follow the Redfish standards; as long as the hardware vendor is capable of supporting the Redfish standards, we are compliant with that hardware vendor's needs. If a vendor requires, let's say, some custom APIs or a custom response, then that would be more like additional development in StarlingX to support or integrate with that hardware. Yeah, our Redfish interaction runs in a containerized image, so from that we can determine the version we need to support as well, and possibly release a separate containerized image. Okay, I think we are running out of time, so thank you so much, everyone, for all your questions and comments, and for your patience in listening to us. If you still have any questions, we will be around at the Wind River booth, where we can talk more about functionality, clarifications, or anything else you need to know about StarlingX. So with that we will wrap up. Thank you so much, everyone. Have a good day, and we hope you have a good time.