Hi everyone, I'm Nitesh Malhotra and I'm a senior software engineer on the Azure for Operators team at Microsoft. Along with me we have Jonathan Innis, who is a software engineer on the Azure Arc-enabled Kubernetes team at Microsoft and is also a core maintainer of Orkestra on GitHub. So let's start with the definition of release orchestration. Similar to how Kubernetes is an orchestration system that manages the lifecycle of containers, release orchestration applies the same concepts to applications. An application may be defined as a collection of containers working in unison to implement a set of features. What does application lifecycle management, or LCM for short, mean? LCM includes reliable rollouts of applications. This applies to installing the application, updating the application as new releases are made available, and also deleting the application in a robust and predictable manner. A reliable lifecycle management system must make provisions for auto-remediation on failures. This means that the system must not require the user to interact with it when failures occur. It should be capable of rectifying these failures on its own without human intervention, and for failures that cannot be reversed by the system, it must alert the end user or operator. Other nice-to-have features may be to provide visibility into the state of the application release process for the entire lifecycle of the application and to follow safe deployment practices. So let's begin with a case study which should be of interest to our current audience: the management of network functions, or NFs for short, that run on a service-provider-operated Kubernetes cluster running somewhere in their data centers. Network functions are not always deployed, operated and managed in isolation from each other. Network functions implementing parts of a 3GPP release-based 5G core often operate in conjunction with other network functions implementing other parts.
For example, deploying a single session management function might depend on other network functions being in place and running, and maybe some other foundational platform elements to support the network function. In short, operating a single network function depends on the presence of other applications. Think of the dependencies as layers that support the operation of one or more network functions. For instance, one or more CNFs may depend on infrastructure or platform components like a security layer made up of Open Policy Agent; a networking or traffic management layer, for instance using a service mesh; observability and telemetry systems like Prometheus, Grafana and Jaeger; a storage layer made up of databases; and other miscellaneous components that are used to operate the network function. So let's talk about Helm, which is a cloud-native application management system and in a way implements the features for application orchestration. Helm uses the concept of subcharts to group dependencies for an application by packaging everything as a singular unit. Helm parses the parent chart and subcharts and creates a common pool of Kubernetes objects by resource type across all the charts. It then renders all resources by type from all charts and installs them in a particular order by resource type. When a chart is uninstalled, the order by resource type is reversed. So a Helm release, for Helm, is a true atomic unit of deployment, in the sense that whatever is in the Helm package dependency tree gets flattened by resource type and is not treated as a node in a dependency graph at all. While Helm can model a dependency graph of a parent chart and dependencies in subcharts, this relation at Helm release time is not very sophisticated at all. The main issue is the lack of control when a particular order of subchart installations is required or runtime conditions need to be met during a release.
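The subchart packaging just described is declared in the parent chart's Chart.yaml. As a minimal sketch (the chart names and repository URL are illustrative, not from the talk):

```yaml
# Chart.yaml of a hypothetical parent chart for a network function.
# Helm pulls these subcharts into the package, then flattens all
# rendered resources into one pool ordered by resource type.
apiVersion: v2
name: smf                # illustrative parent chart name
version: 1.0.0
dependencies:
  - name: opa            # security-layer subchart (illustrative)
    version: "0.9.x"
    repository: "https://charts.example.com"
  - name: prometheus     # observability subchart (illustrative)
    version: "14.x"
    repository: "https://charts.example.com"
    condition: prometheus.enabled   # toggled via values
```

At install time Helm treats everything declared here as one release; the dependency list influences packaging, not deployment order.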
Why did Helm choose to implement package dependency resolution this way? There are a few possible explanations for this: flattening the dependency tree by type and then creating resources by type across all charts is efficient, fast and reliable. Also, in the ideal world, pods and their replica sets are either perfectly stateless and don't care about the release state of other components to come up correctly, and/or they employ other mechanisms that are supported in Kubernetes and Helm to address some of the scenarios not addressed by the default execution order in Helm. Despite its deficiencies, both Helm and Kubernetes provide workarounds to address the challenges. Using Helm hooks, Kubernetes jobs and init containers, you might end up with a carefully crafted and working Helm release for a specific combination of components and conditions. It is not easy, almost impossible, to generalize from such a crafted Helm chart of various components to accommodate a different permutation of components and conditions, as is required for various deployment scenarios of a complex network function. Other possible alternatives include popular general-purpose frameworks like Spinnaker, Terraform, Ansible or even custom scripts to deploy applications. However, these frameworks are missing all context into the Kubernetes cluster as they run externally. None of these strategies provides a holistic orchestrated view in terms of a dependency graph. There are different states wherein successes and failures are difficult to track in a unified way. Additionally, they might have their own dependencies on other Kubernetes resource types that are not easily mapped. So you can see that using the first set of mitigations requires chart modifications, while the general-purpose frameworks are not cloud native. Now let's take a look at Orkestra, which is designed to address these shortcomings in Helm and in Kubernetes and is a release orchestration system for a group of applications.
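As an aside, the Helm-hooks workaround mentioned above typically looks like this: a Job annotated so Helm runs it before the main manifests, baked into the chart itself (the Job name, target service and port are illustrative):

```yaml
# A pre-install/pre-upgrade hook Job; this is the kind of
# chart modification the workaround approach requires.
apiVersion: batch/v1
kind: Job
metadata:
  name: wait-for-database          # illustrative name
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "-5"    # lower weights run first
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: wait
          image: busybox
          # crude readiness gate: block until the database answers
          command: ["sh", "-c", "until nc -z my-database 5432; do sleep 2; done"]
```

This works for one fixed combination of components, but every new permutation means editing the chart again, which is exactly the generalization problem described above.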
Orkestra is a cloud-native system to manage the lifecycle of a group of applications, or Helm charts, by building on top of Helm, which is capable of managing the lifecycle of a single application. The Orkestra controller is a Kubernetes operator that acts on an application group custom resource type. This resource serves as a declarative manifest containing information for each application in the application group and the dependencies among those applications and their subcharts. Orkestra leverages some popular CNCF projects to achieve the goal of release orchestration: Argo Workflows as its workflow engine to manage dependencies, the Flux CD Helm controller to automate the Helm operations, ChartMuseum as a staging Helm registry for decomposed application charts and subcharts, and Keptn for continuous evaluation and quality gates. An application group resource is a powerful construct that provides a unified view and definition of the intent and the status of orchestrated releases. It is possible to orchestrate a set of unrelated Helm packages without making the changes to these packages that would be required when using Helm hooks, Kubernetes jobs or init containers. The unit of deployment for Orkestra-based Helm releases is not based on a single parent chart but on a workflow definition with a custom resource type that models the relationship between the individual Helm releases making up the whole. An application group allows structuring an orchestrated set of releases by grouping the releases, either through defining a sequence of unrelated charts and/or charts with subcharts, where subcharts are not merged into a single release but are executed as a release of their own inside a workflow step. Rather than executing a Helm release from a pool of resources ordered by resource type, as is done by Helm while losing all context of an actual dependency graph, Argo enables a DAG-based dependency graph with defined workflow steps and conditions to transition through the graph.
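The DAG-based workflow that Argo provides is expressed roughly like this; the task names, template name and executor image are illustrative, not Orkestra's actual generated output:

```yaml
# A two-node Argo Workflow: bookinfo runs only after ambassador succeeds.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: application-group-   # illustrative
spec:
  entrypoint: application-group
  templates:
    - name: application-group
      dag:
        tasks:
          - name: ambassador
            template: helmrelease-executor
          - name: bookinfo
            dependencies: [ambassador]   # edge in the dependency graph
            template: helmrelease-executor
    - name: helmrelease-executor
      container:
        image: example/helmrelease-executor:latest   # illustrative executor image
```

Each `dependencies` entry is an edge in the DAG, which is exactly the ordering context that Helm's flatten-by-resource-type approach discards.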
Argo also provides detailed insights into the graph and its state through a web-based dashboard. Helm releases matching the transitions in the graph are executed by the Helm controller shipped as part of the Orkestra system. Keptn, an optional component that ships with Orkestra, can be leveraged to perform continuous evaluation and quality-gate-based promotion while transitioning through the workflow DAG. Mission-critical applications like 5G network functions warrant reliable, zero-downtime, in-service upgrades, in view of the fact that some of these mission-critical applications may be deployed in an air-gapped cluster operated by a third-party provider. The vendor of the application must ensure that these applications are fully automated and self-managed, freeing the service provider from having to learn how to manage each of these applications and their associated components. On top of fulfilling the requirements of an orchestration system, Orkestra implements defense in layers. The first layer, which may not necessarily be part of Orkestra itself, is to leverage existing rollout strategies like the standard Kubernetes recreate or rolling upgrade strategies, or canary and blue-green deployments by leveraging the traffic management features of service meshes. By incorporating Keptn into Orkestra's ecosystem, the next layer of defense depends on the continuous evaluation of the application being deployed. This is similar to the rollout strategies, but rather than limiting it to the health or functioning of a single application, we can evaluate the performance at a system level. This matters since changes in one application in the application group can have direct or indirect impacts on the other applications. Quality gates in addition provide a mechanism to hand over manual control for promoting an application to the end user or operator as and when required. An application group custom resource is the atomic unit that Orkestra acts upon.
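For reference, the first defense layer mentioned above relies on stock Kubernetes rollout strategies; a standard rolling-update configuration looks like this (the Deployment name, labels and image are illustrative):

```yaml
# A Deployment using the built-in RollingUpdate strategy:
# new pods are added one at a time, and the old ones are only
# removed once their replacements are available.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nf              # illustrative
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most one extra pod during the rollout
      maxUnavailable: 0    # never drop below the desired replica count
  selector:
    matchLabels: {app: my-nf}
  template:
    metadata:
      labels: {app: my-nf}
    spec:
      containers:
        - name: app
          image: example/my-nf:1.0.0   # illustrative
```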
An application group spec contains a list of Helm release specifications for each application that makes up the group, like the location of the chart in an upstream Helm registry, the overlay values that must be applied to this specific release, and other such details. Each application in the application group declares a set of dependencies on other applications in the group. The defined dependency order is parsed and used to generate the workflow DAG. On reconciling the resource, Orkestra submits an Argo Workflow resource containing the application group graph. Orkestra stages the required Helm charts in a local repository, for which it uses ChartMuseum. This involves decomposing the parent chart and its subcharts into their own releases, or Helm charts. The actual Helm releases, as per the workflow steps triggered by Argo, are executed through a Helm controller which is part of Orkestra. Through a series of animations, let's demonstrate what occurs during the execution of a single workflow node. Since Orkestra operates on a Kubernetes custom resource type, it can easily plug into any CD system or be deployed directly into the cluster using kubectl or Kustomize. Let's zoom into a single step of the workflow DAG, which could execute a parent release or one of the subcharts. Each step of the DAG is executed by an executor template run as a Kubernetes pod by the Argo workflow controller. The first executor is responsible for deploying the application. Here we show the Helm release executor, which on execution applies a HelmRelease custom resource by parsing the application spec derived from the application group manifest. The applied HelmRelease resource is then picked up by the Flux CD Helm controller. The Helm controller is responsible for carrying out operations like install, delete or update based on the HelmRelease spec. Once the HelmRelease resource is successfully reconciled, all resources associated with the Helm release are deployed.
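The HelmRelease the executor applies is reconciled by the Flux CD Helm controller. Sketched here against the Flux v2 API shape; the resource Orkestra actually generates may differ in detail, and the names and values are illustrative:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: bookinfo            # illustrative
  namespace: bookinfo
spec:
  interval: 1m              # how often the controller re-checks the release
  chart:
    spec:
      chart: bookinfo
      version: "1.0.0"      # illustrative
      sourceRef:
        kind: HelmRepository
        name: chartmuseum   # e.g. the staging ChartMuseum registry
  values:
    replicaCount: 2         # overlay values carried over from the application group spec
```

The executor pod's job is essentially to apply this resource and then watch its status conditions until reconciliation succeeds or fails.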
As the Helm release transitions into a Ready state, the pods may start exporting metrics to Prometheus. These metrics come in handy when performing continuous evaluation using Keptn. Once the HelmRelease custom resource moves into a status condition of success, the Helm release executor pod exits and the workflow progresses to the next chained executor. Here the Keptn executor triggers an evaluation to be performed by sending a CloudEvent to the Keptn control plane. The Keptn controller in turn triggers a preconfigured test harness to initiate sending traffic to the application pods that were deployed as part of the Helm release. With Prometheus configured as the metric source, the Keptn control plane starts evaluating the user-defined SLOs against results from the testing harness. Once all SLOs are satisfied, the Keptn executor pod returns success and the workflow transitions to the next application node. With that, I'm going to hand off the controls to Jonathan, who will walk us through a demo of using Orkestra with a simple set of applications. Thanks, Nitesh. As Nitesh mentioned, my name is Jonathan Innis and I am an engineer on the Azure Arc-enabled Kubernetes team at Microsoft, as well as a maintainer of the Orkestra project. For this portion of the presentation we're going to take a look at a demo of how you can actually go about configuring all of your microservices and all of your complex application logic, describe the dependent relationships between those things using Orkestra, and get the lifecycle management of when you're doing upgrades: if things fail while the upgrade is happening, how we go about rolling back those applications that you deployed so that we make sure that things are always running in a smooth way.
We'll also take a quick look at how we're planning on enabling advanced monitoring of a rollout for some of the application logic in the future, and what the roadmap looks like going forward for Orkestra. So with that, let's take a look at the demo. I'm going to pull up the example Bookinfo application group that exists within the Orkestra project. As I mentioned, application groups are the Orkestra idea of how we group applications together and describe the dependent relationships between them. So here we have a Bookinfo application group and it has two applications within it: the ambassador application, which is the Emissary-ingress Ambassador application, and the Bookinfo Istio example chart application. And here we have described that the Bookinfo application depends on the ambassador application rolling out successfully. Additionally, within the Bookinfo application we have various subcharts, and we're able to describe the dependent relationships between those subcharts as well, not just at the application level but at the subchart level. So we've broken up the Bookinfo application into four different subcharts as well as the parent Bookinfo chart. These subcharts depend on one another as well: the product page is dependent on the reviews rolling out successfully, and reviews is dependent on details and ratings rolling out successfully. Then, additionally, within the spec you can specify things like the target namespace and values as well. This is essentially how you specify your application group. So if we now move to look at the cluster itself, let's first take a look at what gets deployed when we deploy the Orkestra chart. When we deploy the Orkestra chart and onboard to Orkestra, we get the Orkestra controller as well as two Argo pods that monitor workflow containers.
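Paraphrasing the manifest being described, the application group looks roughly like this. The API group, version and field names here are assumptions sketched from the demo narration, not a verified CRD schema, and the chart URLs and versions are illustrative:

```yaml
apiVersion: orkestra.azure.microsoft.com/v1alpha1   # assumed API group/version
kind: ApplicationGroup
metadata:
  name: bookinfo
spec:
  applications:
    - name: ambassador
      spec:
        chart:
          url: https://charts.example.com           # illustrative registry
          name: ambassador
          version: "6.6.0"                          # illustrative
    - name: bookinfo
      dependencies: [ambassador]      # bookinfo waits for ambassador
      spec:
        chart:
          url: https://charts.example.com           # illustrative registry
          name: bookinfo
          version: "0.1.6"                          # illustrative
        subcharts:                    # each subchart becomes its own release
          - name: details
          - name: ratings
          - name: reviews
            dependencies: [details, ratings]
          - name: productpage
            dependencies: [reviews]
        release:
          targetNamespace: bookinfo
          values: {}                  # per-release overlay values
```

The two levels of `dependencies` lists, across applications and across subcharts, are what the controller turns into the workflow DAG.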
So under the hood Orkestra is using Argo Workflows to actually monitor the rollout of these charts, and so we need Argo to do that, as well as a ChartMuseum to stage charts and the Flux CD Helm controller and source controller to actually reconcile the Helm releases that we'll deploy as part of the workflow. With that, we can apply this example to the cluster and show what happens when we apply these application groups. So with that we've created the Bookinfo application group. If we pull up the application group on the cluster, we can see that the workflow is currently reconciling; it's in a progressing state and it's currently rolling out. Like I said, under the hood a workflow is running, and one of the nice things about using Argo Workflows is that we get the Argo UI packaged with the features that come with Orkestra, so we can actually go over and take a look at the workflow as it's rolling out and the dependent relationship logic that exists here. Here we see that the ambassador application is currently reconciling; it's rolling out, and after this it'll roll out the Bookinfo application. We can see the same logic occur on the cluster itself if we take a look at the pods: we'll see that the ambassador application is currently in a running state as the charts have been deployed, and if we go back and look at the DAG we see that the ambassador application is completed and now it's moving on to roll out the Bookinfo application. And so it'll do that on the cluster, completing each of the steps, rolling out the subcharts, and once it's completed we'll see that the workflow has completed and that the application group is in a ready state. This is going to take a couple of minutes for everything to roll out successfully. We can monitor things as they roll out.
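The status surface examined next follows the usual Kubernetes conditions convention. A purely hypothetical sketch of its shape, since these field names are not taken from the actual CRD schema:

```yaml
# Hypothetical application group status while the workflow is progressing.
status:
  conditions:
    - type: Ready
      status: "False"
      reason: Progressing            # workflow still reconciling
      message: "workflow in progress"
  applications:                      # per-application Helm release status
    - name: ambassador
      conditions:
        - type: Ready
          status: "True"
          reason: ReconciliationSucceeded
```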
Additionally, if we take a look at the application group status fields, we see that we get the information of the Helm release reconciliation status. So for this application, the ambassador application, we also package the Helm release reconciliation statuses within the condition object, and we'll see the same for Bookinfo once it succeeds. Taking a look back at the pods in the cluster, we're still rolling out the product page. The product page has completed, so now we're doing the Bookinfo parent chart, and once the parent chart completes and runs, and it looks like it's completed, we'll see that here. So we see that the workflow has succeeded; if we check the workflow in the cluster we'll see the same state, so we see succeeded here, and looking at the application group itself we'll also see that the workflow and reconciliation have succeeded and everything is ready. So from the state of the cluster, this is kind of your classic day-one operations: you're rolling out all your charts, you're laying down your infrastructure and then you're laying out your applications, and you kind of have this dependent relationship between your charts. But let's say we want some reliability in our lifecycle management, and we want to be able to actually make sure that when we're doing upgrades and rollouts, things are upgraded successfully and they don't break in between. We also offer that with Orkestra. So first we'll look at the success state. Let's take a look at an upgrade scenario. If we take a look at an upgraded Bookinfo application group here, we'll see that we're adding an application in this case. So let's say we needed more application logic and we need to add a new application on top of what we already have.
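The day-two change applied next could be sketched as appending one more entry to the group's applications list. Same caveat as before: the field names are assumptions, though the podinfo repository and the versions come up later in the demo:

```yaml
# Hypothetical entry appended to spec.applications of the bookinfo group:
- name: podinfo
  dependencies: [bookinfo]     # podinfo rolls out after bookinfo
  spec:
    chart:
      url: https://stefanprodan.github.io/podinfo   # the Flux maintainers' podinfo chart repo
      name: podinfo
      version: "5.2.1"         # later bumped to 6.0.0 in the failure demo
    release:
      targetNamespace: podinfo # illustrative
```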
Here we're adding the podinfo application; the podinfo application is a chart that's made by the Flux maintainers, and it's dependent on the Bookinfo chart rolling out. In this case we've removed the dependency on ambassador for Bookinfo, so Bookinfo is actually going to roll out first, then podinfo is going to roll out, and then we'll see ambassador roll out as an update. In the case of Bookinfo and ambassador, there are no updates happening, so the rollout will occur pretty quickly, and podinfo will get added to the set of applications in the cluster. So let's take a look at this happening. We apply this sample and we'll see that the workflow is kicking off here and the containers are creating. If we take a look again at the UI, we'll see that the workflow has updated to the newer application group workflow. It's rolling out the Bookinfo application first, and it's just going to validate that these things are at the correct version and that they're in a ready state, so again this will move forward and complete pretty quickly; these are in a container-creating state.
The parent chart is rolling out, and we see that now the podinfo step is running, so podinfo is getting created on the cluster here at the bottom. Once that one has succeeded in reconciling and is in a running state, that step will complete and then it'll move to the ambassador piece. The ambassador piece will also do some checking of the version, but it's all going to be fairly consistent, so it should also complete fairly quickly; and we see that that step completed, and so this workflow again succeeded on the cluster. And again, if we check here, we'll see the workflow is in a succeeded state, and then the application group should be updated to a succeeded state as well. So we've just added an application as kind of a day-two operation, added this application to the cluster, and described the dependent relationship between that application and some of the older applications that we've deployed. Finally, we want to see what happens if things fail on rollout: what happens if something breaks, the chart doesn't work appropriately, and we want to roll back to the previous state. So here we have an invalid Bookinfo application group. This one is valid from the perspective of the actual spec itself; however, something is going to break as the application group rolls out. In this case we have the same set of applications, and actually here, just for the sake of showing, we want to upgrade the podinfo application to a newer version. In the same order as before, Bookinfo is going to roll out; actually, in this case podinfo doesn't have any dependencies, so these two are going to roll out simultaneously, and then ambassador is dependent on this podinfo app. However, in this case we're also trying to upgrade ambassador, and upgrading between these two versions is not going to work because of some immutable fields that exist between the two ambassador charts. So here, when we try to upgrade, it's going to fail and not get into a ready state. We've also reduced the release timeout here, which is the time that the release has to get into a ready state before Orkestra abandons it and rolls back. So within a minute this thing will not upgrade appropriately; it will fail, and it should roll back to the previous state, where the podinfo application should be at the previous version, which in this case was 5.2.1 (we're trying to upgrade to 6.0.0). So let's go ahead and apply this and watch that in action. Again, we're going to apply this to the cluster. If we take a look at the workflow again, we see that these are now running simultaneously because the dependency relationships are different. It's going to take the time to check the Bookinfo versions and the podinfo versions; in the case of podinfo it's going to upgrade, so if we look at podinfo, it actually already upgraded, because this container is newer, this pod is newer, so it's actually already completed, and it's going to try and roll out the ambassador application now. As it rolls out the ambassador application, if we look at the Helm releases on the cluster, we see that this one is actually in a failed state; the upgrade retries have been exhausted in this case because it has failed to upgrade. So ambassador eventually is going to fail this step; again, we're going to have to wait around a minute for that to happen, and we're currently at 44 seconds. Once the timeout hits, we should see this fail with red X's. So we see this errored out; this has failed with red X's, so the workflow itself has failed. And actually, if we take a look at the workflows now, we see that we've kicked off a Bookinfo rollback workflow, which is going to essentially just deploy the previously succeeded version. In this case we see that the dependent relationships in the current application group spec that we had applied were that Bookinfo and podinfo roll out in unison;
however, in the previous spec they roll out in order, and so here we're going to redeploy Bookinfo at the older version. And actually, if we look at podinfo here, we'll see that the current version is 6.0.0; however, as soon as the rollback occurs, we'll go back to the previous version that succeeded. So we're going to have to wait for Bookinfo to roll out and complete. The product page rolls out, and now we're doing the Bookinfo parent chart; we'll wait for the Bookinfo parent chart to complete, and now we're going to look at podinfo and roll out the older version of podinfo. On the cluster we should see that podinfo has terminated and a new pod with the older version has rolled out, and if we look at the Helm release, we'll see that this Helm release should be on the older version. Looking specifically at the Helm release, we see that we're now on version 5.2.1. So that's kind of the story for Orkestra: we offer you the ability to describe the dependent relationships between your applications at the high-level microservice application level, and if you have subcharts that you need to describe dependent relationships between, we offer you the ability to do that as well, while also giving you rollback scenarios if things break as your application groups roll out. If there's some failure at the application or at the chart level, we'll roll back to the previous version. What we're looking at going forward, as I mentioned, is describing more steps within a workflow stage so that you can actually monitor things as they roll out. So, for instance, in this case the Helm release is the actual rollout of the chart, while on top of that we'll give you the ability, for instance with Keptn, to monitor SLAs and SLOs of the Helm release chart as it rolls out, so that we consider something failed if the application itself is unhealthy because it doesn't meet some SLA that you defined. So we're giving you these kinds of complex lifecycle scenarios so that you can ensure that things are safely rolling out, especially when things are already in production in day-two scenarios. So that is the demo. Once again, we want to thank you for listening to our talk. If you're interested in the project at all and you're interested in understanding how we have put everything together, you can feel free to look at any of the issues that we have on GitHub at github.com/Azure/orkestra. If you're interested in looking at the examples or trying out some of the things I just described, you can reference our documentation on GitHub; the documentation is at the link there. Once again, thank you, and we appreciate your time.