Hello, everyone. Welcome to our session on challenges and solutions for orchestrating applications to multiple data center and edge clusters. My name is Kathy Zhang. I'm a senior principal engineer at Intel, leading the company-wide contributions to the Cloud Native Computing Foundation. I have over 10 years of experience in the cloud area. I co-authored several CNCF white papers and specifications, and I was recently elected to the CNCF Technical Oversight Committee. Along with me is Ritu. Ritu, want to introduce yourself? Hello, everyone. My name is Ritu Sood. I'm a cloud engineer and architect working at Intel. I've been working in the cloud area for the last seven years, and I have contributed to a few open source projects related to cloud.

Here's what we will cover in today's presentation. First, we will touch a little bit on the current industry trend toward geo-distributed computing and multi-cloud. Then we will go through several use cases. After that, we will talk about the requirements and challenges of supporting multi-cloud. Last, we will deep dive into some existing solutions, sharing their architecture and functionality, as well as analyzing their pros and cons.

The industry is moving from a centralized cloud computing model to a geo-distributed cloud computing model, with many workloads running on edge clusters. There are several drivers behind the geo-distributed computing trend. One is the need for low-latency response times, which requires the application to run closer to its data sources. Another is the need to reduce the bandwidth cost associated with information passing between the public cloud data center and the edge device. To avoid network bandwidth cost and network congestion, more and more applications run locally at the edge sites. In addition, some legal and privacy considerations also require applications to not leave the enterprise premises. As shown in the diagram, it is now very common to have some applications that require a lot of computing resources run in the public cloud, while other applications that are latency-sensitive or privacy-sensitive run at the edge site, and dedicated network channels can be set up between the public cloud and the edge cloud.

This slide shows three use cases for geo-distributed computing; of course, there are many other use cases. The first one is a 5G use case. In the 5G scenario, there are tens of thousands of 5G distributed units (DUs) that need to run at the edge sites, and there are thousands of 5G centralized units (CUs) that need to run at the central office sites. Then there are the 5G control-plane components and the UPFs (user plane functions) that need to run at the regional data center, and there are also some compute-intensive 5G applications, such as analytics, that can run in the public cloud. As you can see, 5G components need to be deployed in a geo-distributed way. The second use case is the telco uCPE use case. In this use case, the container network functions need to run at the far-edge universal CPE (uCPE) platform, the traffic hub and the SASE services need to run at the network edge, and other compute-intensive applications, such as network analytics, can run in the public cloud. The third one is a retail enterprise use case. Similar to the second use case, some CNFs, which means container network functions, and the SASE services need to run at the branch office, while the traffic hub and some other SASE services need to run at the enterprise data center, and analytics applications can run in the public cloud.
As you can see, all these use cases require geo-distributed computing. To support this geo-distributed computing, we need a multi-cluster orchestrator that can automatically deploy the components of a complicated application to one or more edge clusters. Now I'm going to hand over to Ritu to walk you through the challenges and the available solutions.

Thank you, Kathy. So let's walk through the requirements and challenges for a multi-cloud orchestrator. The obvious requirement is to deploy these microservices with one click across a large number of clusters. Some of these microservices may require customization based on which cluster they are being deployed to. Another role that the orchestrator must play here is to make decisions about the best cluster to run the workloads, based on requirements like hardware, latency, cost, localization, etc. Some of the clusters are push-based and some are pull-based, also known as GitOps-based clusters. Another aspect is tenant management and deploying these microservices in a multi-tenant-aware environment. Another very important aspect is monitoring these microservices once they are deployed, and also monitoring the cluster resources where they are installed.

Deploying the microservices is just the start. These microservices require multiple infrastructure services, and these services need to be configured on the cluster based on the microservice being deployed there. For example, there can be service meshes that need to be configured, like Istio, Linkerd, etc. Another example is the configuration required for service discovery across these clusters. All Kubernetes clusters are not equal with respect to resources, so some of the microservices may need to be brought up with different profiles on different clusters. For example, a 5G UPF may need to be assigned 16 cores if it is being deployed on one kind of edge and two cores if it is being deployed on a different kind of edge. Deployment of these microservices is just one part; some of these microservices may require configuration after they are deployed, and the challenge is to apply that configuration as an atomic operation, since many microservices across many Kubernetes clusters are to be configured. In some cases, microservices are expected to be configured directly, and in some cases they may be configured via RESTful APIs or other means. Some of these configurations may need to be done over unreliable networks, so a workflow-based approach is required to help with retries, timeouts, etc. During this presentation, we will keep these requirements in mind and then see how the different solutions do against them.

Now I'm going to break up these requirements a little bit more and talk about each of them one by one. In the first stage, before any microservices are deployed, the orchestrator needs to onboard clusters and tenants. Some of these clusters can be shared across tenants, and some clusters are GitOps-based. The orchestrator should be able to onboard different types of clusters, for example, Azure Arc, Anthos, Flux, or any on-prem clusters. Each tenant requires the ability to use tenant-specific OAuth2 servers. The tenants are required to be created across clusters, and we call that a logical cloud. The tenants are required to be created with certain RBAC rules and permissions. The next step is to onboard the microservices with the orchestrator, and along with each microservice, some profiles also need to be onboarded, which help deploy the microservice on different types of clusters.
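To make the profile idea concrete, here is a minimal sketch of what two deployment profiles for the same UPF workload might look like, expressed as Helm-style values files. The file names, keys, replica counts, and resource numbers are illustrative assumptions, not taken from any specific product.

```yaml
# profile-regional-edge.yaml -- hypothetical profile for a well-provisioned regional edge
upf:
  replicaCount: 2
  resources:
    requests:
      cpu: "16"
      memory: 32Gi
---
# profile-far-edge.yaml -- hypothetical profile for a resource-constrained far edge
upf:
  replicaCount: 1
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
```

The orchestrator would then pick the right profile for each target cluster at deployment time.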
Placement policies also need to be defined so that the orchestrator can make the best decision about where to place the microservices, on the best possible clusters. Some of the microservices may have placement constraints like platform capabilities, latency, or cost, and the orchestrator should be able to take these into account to find the best cluster for the microservice. Some microservices may need to be replicated to many clusters, while others may need to be distributed across different clusters, and all of this needs to be taken care of by the orchestrator. As an example, let's say there is an application that needs to be deployed on all the clusters with the label US West and on only one cluster with the label US East; we sketch what such an intent could look like at the end of this section. In cases where there is a choice of clusters, the orchestrator should be able to decide on the best possible cluster to deploy the microservice.

The next stage is workload customization, and this needs to be done based on the type of cluster a microservice is deployed on. For example, the image repository to be used, or the replicas, or the storage class may need to be configured based on the cluster where the microservice is being deployed. Enabling inter-microservice communication within or across clusters is another important function; this may need to work with or without mutual TLS and also involves the management of multi-cluster DNS. Once all these configurations are done, lifecycle management of the application is also required, where an application can be started, terminated, updated, or rolled back, and all those functions need to be handled by the orchestrator.

Another very important requirement for the orchestrator is that it should be extensible, because different technologies may be used, for example for connectivity: there can be different kinds of service meshes being used on different clusters, or there can be other infrastructure elements that need to be automated for these microservices. So the orchestrator should be extensible so that it can work with any of these new technologies, and new requirements can be easily met.

Once the applications are deployed, there is a need to continuously monitor them across these clusters, and there is a requirement to get a comprehensive report of the status of these applications across the clusters. There is also a need for collecting metrics on the clusters, and based on the metrics, some closed-loop actions may need to be taken. For example, if the load on one cluster is too high, then some microservice may need to be moved to another cluster. Also, like I mentioned before, some microservices require configuration after they are brought up, so the orchestrator should be capable of taking care of that as well.
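Before moving on to the projects, here is a purely hypothetical sketch of the placement intent from the example above (every cluster labeled US West, any one cluster labeled US East). The PlacementIntent kind, the example.org API group, and all field names are invented for illustration only and do not belong to any of the projects discussed next.

```yaml
# Hypothetical placement intent -- illustrative only, not a real CRD from any project.
# "Run the analytics app on every cluster labeled region=us-west,
#  and on exactly one cluster labeled region=us-east, chosen by the orchestrator."
apiVersion: example.org/v1alpha1
kind: PlacementIntent
metadata:
  name: analytics-placement
spec:
  app: analytics
  rules:
  - clusterSelector:
      matchLabels:
        region: us-west
    mode: AllMatching   # deploy to every matching cluster
  - clusterSelector:
      matchLabels:
        region: us-east
    mode: AnyOne        # orchestrator picks the single best matching cluster
```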
Now that we have looked at the requirements and challenges that an orchestrator needs to address to fulfill the use cases we were looking at, next we are going to look at some of the projects that are trying to solve these challenges. We will go through their architectures, look at their pros and cons, and keep these requirements in mind to see how each project does against them. The first project that we are going to take a look at is KubeEdge.

KubeEdge is an edge solution that was initially built mainly to manage smart IoT devices at the edge and to run microservices on top of them. These devices generate a huge amount of data, and it is not practical to send all of that data to the cloud for analysis and then send the results back to the edge; doing that makes the latency very high, which is unacceptable for managing some of these devices. Security is also a concern, and these network edges may have connectivity challenges and also limited resources. KubeEdge mainly tries to solve these issues. It has a remote-node architecture, which means that the remote edges are treated as nodes of the Kubernetes cluster. As you can see from the picture on the right-hand side, the edge node is treated as a node of a Kubernetes cluster, and KubeEdge has three distinct layers: a cloud layer, an edge layer, and a device layer. EdgeCore, which runs on the edge node, manages the lifecycle of the pods; it is essentially a lightweight kubelet, and it helps deploy the containerized workloads or microservices on the edge node. EdgeCore is optimized to run with a very small memory and CPU footprint, on the order of one vCPU and about 128 MB of memory. As we can see, we can have a mix of cloud nodes and edge nodes in the same Kubernetes cluster. There is WebSocket-based communication between the control plane and the edge nodes. On the edge node, there can be a number of device models supported using KubeEdge. There is also the EdgeMesh technology, which provides the ability to access microservices edge-to-edge, edge-to-cloud, and cloud-to-edge across different networks; it provides simple service discovery and traffic proxy functions for microservices.

This slide was taken from KubeCon 2020 Europe. As you see, CloudCore here watches for any changes to pods; the pods need to be labeled in a certain way, and CloudCore watches for these pod changes in etcd. At that point, CloudCore sends a WebSocket-based notification to EdgeCore, and the pod specs are sent to the edge, where edged, which is a lightweight kubelet, creates the pods and runs them on the edge. The pod state is also persisted on the node, and this is required so that the edge can continue to operate even if connectivity with the control plane is lost. As we can see, this is a very specialized implementation, and it is suitable mostly for IoT use cases.

The next project that we are going to look at is the Argo CD ApplicationSet. Argo CD is a declarative GitOps continuous delivery tool for Kubernetes. It is implemented as a Kubernetes controller that continuously monitors running microservices and compares the current live state against the desired state in a Git repo. If a deployed application deviates from the target state, it is considered out of sync, and Argo CD monitors these differences and tries to sync the live state back automatically so that what is running on the cluster matches what is in the Git location. These are the regular GitOps principles. Argo CD also has multi-tenancy and single sign-on support built in. But what is interesting from a multi-cluster perspective in Argo CD is ApplicationSets and the ApplicationSet controller. The ApplicationSet controller allows you to automatically generate what are known as Argo CD Applications. With ApplicationSets, you can define multiple clusters on which you want to deploy your application, and you can define a template that will be used to create these Applications. The ApplicationSet is responsible for reconciling the Applications on these various clusters; the controller generates one or more Applications based on the contents of the template field of the ApplicationSet. There is also dependency management within an application available in Argo CD. This meets the first requirement we had, the single-click deployment of applications or microservices across clusters. So Argo CD meets that requirement very well, but it stops there.
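As an illustration, here is a minimal ApplicationSet sketch that uses a list generator to stamp out one Argo CD Application per target cluster; the cluster names, API server URLs, repository, and path are placeholder assumptions.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: analytics
  namespace: argocd
spec:
  generators:
  - list:
      elements:
      - cluster: edge-us-west-1        # placeholder cluster names and URLs
        url: https://edge-us-west-1.example.com:6443
      - cluster: edge-us-east-1
        url: https://edge-us-east-1.example.com:6443
  template:
    metadata:
      name: 'analytics-{{cluster}}'    # one Application is generated per list element
    spec:
      project: default
      source:
        repoURL: https://github.com/example/apps.git   # placeholder Git repo
        targetRevision: HEAD
        path: analytics
      destination:
        server: '{{url}}'
        namespace: analytics
```

The controller reconciles this into two Argo CD Applications, one per cluster, and keeps each of them in sync with the Git repo.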
The next project that we are going to take a look at is Karmada. Karmada is a continuation of the Kubernetes Federation project, which used to be known as KubeFed, and it is an open multi-cloud, multi-cluster Kubernetes orchestrator. It supports deploying multiple applications to multiple clusters. For cluster mapping, it has the concept of propagation policies, where the user can specify how they want their application to be propagated to the different clusters, and it has the concept of override policies, where a user can define how they want their application to be customized for a particular cluster. Placement is based on cluster affinity and on cluster taints and tolerations, similar to node and pod taints and tolerations. Karmada also supports multi-cluster application connectivity, and in upcoming releases they will also be adding Istio-based multi-cluster connectivity. From the original five requirements that we saw, Karmada meets the first requirement very well and provides a service mesh, namely Istio, for cross-cluster communication. But if some other service mesh technology is being used, then that support will not be possible with Karmada.

This slide shows what we were just talking about: Karmada has these propagation policies; based on the propagation policy, the target clusters are decided, and once the clusters are decided, overrides are applied based on where the application is being deployed. In the end, a Work resource is created, with a one-to-one mapping to each member cluster, and this Work resource is what deploys the actual resources to that cluster.
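To give a feel for these policies, here is a minimal sketch of a Karmada PropagationPolicy and a matching OverridePolicy; the Deployment name, member cluster names, and registry value are placeholder assumptions.

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: analytics-propagation
spec:
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    name: analytics              # placeholder workload name
  placement:
    clusterAffinity:
      clusterNames:              # placeholder member cluster names
      - edge-us-west-1
      - edge-us-east-1
---
apiVersion: policy.karmada.io/v1alpha1
kind: OverridePolicy
metadata:
  name: analytics-override
spec:
  resourceSelectors:
  - apiVersion: apps/v1
    kind: Deployment
    name: analytics
  overrideRules:
  - targetCluster:
      clusterNames:
      - edge-us-west-1
    overriders:
      imageOverrider:
      - component: Registry      # rewrite the image registry for this cluster
        operator: replace
        value: registry.us-west.example.com   # placeholder registry
```

Karmada resolves the propagation policy into target clusters, applies the matching overrides, and then generates the per-cluster Work resources.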
Okay, so the last project that we are going to look at is EMCO. EMCO is a Linux Foundation Networking project, and it was open sourced under the Linux Foundation about a year back. EMCO was built from the ground up keeping in mind all the requirements that we discussed at the beginning of the presentation. EMCO has an intent-based architecture. It supports deploying multiple microservices to multiple clusters, and it supports various types of edges and clouds: on-prem data centers, public clouds, and also GitOps-based clusters like Flux v2, Azure Arc, and Google Anthos. EMCO has a highly extensible architecture with built-in and third-party controllers. Users can specify placement intents, and EMCO makes an intelligent selection of clusters to place the workloads based on criteria like hardware capabilities, latency, cost, etc. Currently, EMCO comes with a hardware-platform-aware placement controller, and other types of placement controllers can easily be added to EMCO as it is very extensible. Tenant isolation is provided using logical clouds: logical clouds group and partition clusters for multiple tenants. This partitioning is done by creating a distinct, isolated namespace in each of the Kubernetes clusters that is part of the logical cloud and where the application is going to be deployed. RBAC policies and resource quotas can be applied to the logical cloud.

EMCO supports customization of resources based on which cluster the microservice is going to be deployed on, and EMCO has a comprehensive status-monitoring framework for all the resources deployed by EMCO across all clusters. EMCO provides a consolidated status of the application across clusters and provides a notification framework where an external application can register to get gRPC notifications when an application is deployed on all the clusters, or on any cluster, or is ready on these clusters. EMCO provides intent-based distributed traffic management for cross-cluster communication: users can specify which microservice needs to communicate with which other microservice, and these intents are fulfilled by registered sub-controllers based on the cluster on which the microservice gets deployed. Out of the box, EMCO supports an Istio controller for mutual-TLS-based cross-cluster communication; based on the cluster the microservice is getting deployed on, the relevant Istio policies will be applied automatically. As mentioned before, EMCO is extensible, so other controllers can be written and will work seamlessly with EMCO to provide automation of cross-cluster communication if some other service mesh technology is being used on some clusters, and other controllers can be added to automate other infrastructure services. EMCO takes care of resource customization based on cluster types. EMCO also has support for a Temporal workflow engine, which can be used to configure the microservices after they are deployed.

Some of the use cases made possible with EMCO are the use cases that we saw at the beginning of this presentation: the 5G use case, SASE-based use cases, edge computing applications, and distributed apps spread across multiple clusters and multiple edges. EMCO is capable of supporting all these different kinds of applications.

So, in summary, there are many solutions available in the market, and some of these solutions are specialized for particular use cases. We also saw how EMCO meets many multi-cluster requirements, like dynamic application orchestration along with automation of infrastructure services, selection of Kubernetes clusters matching various criteria, and fleet deployment of applications with automatic customization based on cluster capabilities. With this, we come to the Q&A section. Thank you, everyone, for listening. Thank you, everyone.