Hi everybody. Welcome to our talk about clusterless architectures. This is a new design coming out of the virtual cluster space that lets you seamlessly use clusters of clusters to run workloads wherever you need to. Faye and I will be talking to you about this; we're both maintainers of the project, along with some other folks. Faye works at Alibaba and I work at Apple.

Why are we talking about this, and why is it important? As you start to use Kubernetes, you typically begin with a single cluster: you build out your architecture, deploy workloads into that single environment, and then start to see problems. I feel like everybody experiences this in their Kubernetes journey. Where folks usually go next is a multi-cluster style. The whole idea is that it allows you to scale out your architecture: taking a single cluster, which has defined scalability targets, and figuring out how to run, say, 4x the capacity you currently have. Scaling your workloads horizontally like that gets difficult. On top of that, customers want to deploy and run workloads in the specific regions where their data is located. If you're running batch workloads that process a lot of data, you don't want to run them from the completely opposite end of the country, for example. In this environment you're also bringing in a lot more compute resources, and everything becomes harder to manage and aggregate across them. So you start looking at the multi-cluster space and bringing in things like HA strategies: how would you actually increase the availability of your workloads?
How do you set up regions you can fail over into when something goes wrong? I imagine everybody here has experienced etcd key-space problems, where you get a corrupted key and end up with an entire cluster that just doesn't function anymore. How do you actually recover from those things? On top of this, we get into a new world of building out security compliance when your workloads are now distributed across different environments, as well as the autonomy of those workloads going forward. And there's an ever-growing space that's important to acknowledge here: hybrid and multi-cloud. Workloads that might have been deployed into on-premise data centers are now expanding out to get the flexibility of the multi-cluster world. This brings in new aspects, and what Faye and I work on most closely is the multi-tenancy space.

All of this brings on a lot of challenges. As you can see, there are at least five bullets here, each with a lot of components underneath, but there are already a lot of tools in this space that folks have been working on. First, lifecycle management: how do you actually go and create new clusters? Are you using public cloud tools like EKS, GKE, and ACK to deploy them? Are you using open-source tools like Cluster API (CAPI), which Faye and I work in, or Open Cluster Management, or any of the off-the-shelf vendor tools? On top of that, there's governance: now that you have workloads distributed across many clusters, how do you do centralized management? How do you make sure that security policies are set up and abide by what the security teams within your company actually need, and that they're customizable without exposing you to anything too far?
On top of that, there's the vast space of monitoring and tracing: the many tools we have to work with nowadays, dashboards to visualize across those clusters, where the data actually lives, and how all of that functions.

What we're really going to talk about in this session are the fourth and fifth areas. Because there are already tools addressing the other challenges, we wanted to look at how to abstract workload management while still, in essence, giving you the same experience you have today. The tools that currently exist and that folks have been using, things like KubeFed and Argo CD, bring that same level of abstraction, but they come with tradeoffs in what you can actually do at a granular level. The last area is scheduling. If you have multiple clusters, you need to figure out how you're going to schedule workloads across them. Is it something manual, where everybody knows about every single cluster potentially accessible to them and goes and decides, "I want to deploy to this cluster in this region because of this specific thing"? Or is there an automated system that does that work for you, thinking about the GitOps world and deploying to some cluster based on a decision in a declarative script?

So, enter clusterless. In a multi-cluster environment, the idea behind clusterless is that we want to achieve the single-cluster user experience: taking everything that everybody knows and loves about Kubernetes today and asking what it looks like across many clusters, without really introducing new APIs, so that every single off-the-shelf tool you currently use could potentially be deployed into that same exact experience.
Say you deploy some operator, maybe Prometheus. How do you make it so that a user with the single-cluster experience can deploy that operator and have it naturally translate into a multi-cluster environment?

Now, there are some caveats to what we've designed so far and where it sits in the current landscape. It doesn't work as well for redundancy or availability of workloads. It also doesn't yet suffice for heterogeneous hardware, where you have different SLAs and different models of the hardware you're running on. But it's really great when you just have a lot of identical worker clusters and you're trying to run more of what you already have in a single cluster. Realistically, it tries to minimize integration cost by treating the whole thing as just another off-the-shelf Kubernetes deployment. So here's where we're headed: we'll talk about the design of how clusterless functions, then go into the architecture and the implementation, and at the end we'll summarize what we've talked about.

First, cluster abstraction. What's really important here, and where we started with the project, is looking at how to take one or more clusters and abstract them into that single-cluster space. What we wanted was to introduce no new workload controllers into those worker clusters. So we took the lowest-level abstraction in Kubernetes, the pod, and asked how you would spread pods across those clusters. This is something we already see other open-source tools doing: tensile-kube and Liqo basically do this today, but they use Virtual Kubelet as their abstraction. They run a Virtual Kubelet, which then addresses one or many clusters under the hood for actually deploying those workloads.
Now, in our world, we started with the toolchain we've been working on most heavily, VirtualCluster, and tried to bring that same level of abstraction to that environment. So it's something you're already seeing; we're just extending it into the tools we have today.

To take an aside for a second, there's an important conversation to have about the abstraction level we're talking about, and that's pods versus workloads. If you look at the entire space, not just tensile-kube and similar tools, there are others: there's OCM, there's Argo, there's Karmada. A lot of these tools work at a different abstraction. They require you to deploy specific CRs, custom resources, that define the entire workload you're deploying. So you end up doing very specific policy-based replication of those objects and synchronizing those workload objects to the downstream clusters that run them. What you end up with is controllers running in those worker clusters that operate on those workload objects and make sure they're scheduled as you'd expect. Where this leads you is not having any pod objects in your user's control plane: I, as a user of one of those tools, deploy a CR that defines a full workload, and it goes to a specific cluster and gets scheduled. That's the difference between pod and workload granularity.

With pods, you get no new APIs. You get to use the standard off-the-shelf Kubernetes resources that we all know and love and have been using for many years at this point. We're not fragmenting those resources in any new ways.
We'd also argue it's better from a utilization perspective, because it's singularly focused on the pod as the unit of Kubernetes scheduling. But it breaks down in one regard: if the centralized user control plane that creates those objects becomes disconnected, and something then goes wrong with the pods on those lower-level clusters, there's no orchestration piece left that continues to keep them up.

From the workload-granularity side, you do have to run those extra controllers, but you get the benefit of distributed orchestration. If something goes wrong with the top-level user control plane, the underlying CR can continue to run; if a cluster has a failure and comes back to a healthy state, it can continue running those workloads. The negatives are that you have to run those new APIs to assist with propagating the resources, and it's inconvenient to debug those workloads, because now there's a difference between what's in the user cluster and what's in the worker cluster that's actually running things.

That brings us to scheduling. Now that you understand where we're coming from with the architecture, you can look at the different routes we could have taken for building a solution like this. First, there's single-tier scheduling; if you're familiar with the Mesos world, a lot of these concepts will come into play. Single-tier scheduling, which looks very much like the Kubernetes scheduler, has all of the nodes allocated to a single scheduler, which picks and chooses where workloads run among those nodes. This also limits your abilities at the end of the day.
A single scheduler has scalability limits on how many pods and how many nodes it can actually handle, because it has to do checks against all of the nodes in the cluster and make sure every component of the environment is up and ready to accept workloads.

Taking a step down, we get to two-level scheduling. This is what tools like tensile-kube do: the top-level user cluster runs a first-tier scheduler. Again, this is very similar to the Mesos architecture, where we delegate tasks so that each layer can do one thing very well. The first tier goes through and, in essence, picks a cluster the pod can be dispatched to: among the many clusters connected to the single user-facing control plane, where should the workload actually be scheduled? Once it picks, it drops the pod onto that lower-level worker cluster, which decides which node it will run on. So we distribute the decision making to get faster scheduling, but it does introduce some failure modes that need to be accounted for. Imagine you deployed a pod, and it got scheduled to a cluster that, at the same exact moment, had a job scaling up, and the cluster ran out of capacity. Maybe you didn't have cluster autoscaler set up, or those controls aren't available to you, and now that pod can't actually schedule in the new cluster. Those are the downfalls of that type of architecture.

The last option, and the last design decision to consider, is distributing the scheduling decision across many schedulers.
That means coordinated schedulers talking to each other and making resource requests and reservations against sub-schedulers. This is a really interesting architecture because it pretty much eliminates the chance of those race conditions, but it requires a lot of coordination: the top-level scheduler needs to know about every single potential sub-scheduler and make requests against them while the pod is being scheduled, which adds significant latency to your actual workloads.

So enter where we mostly work again, VirtualCluster, coming at this from the multi-tenancy side. If you're deploying into a multi-cluster world, you're likely dealing with a lot of different tenants and trying to make sure all of their workloads get scheduled somewhere. Multi-tenancy is a legitimate concern when utilizing a large amount of aggregate cluster resources: we have tons of CPU and memory now, and we need to aggregate it all together and expose it to a subset of users so they can actually deploy their workloads onto it. And since we work with VirtualCluster, we took a lens toward that implementation, which abstracts out the user control plane, what we call the tenant control plane in our space. It deploys a Kubernetes API server, controller manager, and etcd, completely off-the-shelf components, into your cluster and exposes that to the users of your system. Each individual tenant, however you want to scope the architecture, gets access to one of those dedicated control planes. This solves the hard multi-tenancy problem by isolating customers from each other, so you can really focus on making the super cluster good at running workloads, while the top-level tier gives the flexibility everybody wants out of Kubernetes.
So if somebody wants to deploy a CRD into that tenant cluster, they can, and it's not going to affect the lower-level workloads. That originally was scoped to one super cluster, but it became a natural progression to ask: can we make this run on multiple clusters using that same exact syncing mechanism we already have in place?

The way it functions is by adding another interface in the middle. We talked about two-level scheduling and how that could work; this, in essence, builds a top-tier scheduler that operates on resources most multi-tenant environments already use today: namespaces and resource quotas. The top-level scheduler looks at the amount of resources you're requesting for a specific namespace and makes the first decisions there. Phase one is creating the namespace and, at the same time, picking which cluster it should be deployed into based on the amount of quota you've requested for that single namespace. That can be a very fast decision, because it's picking purely on capacity, without intermingling the pods potentially deployed into it or any of the other decisions. Once a cluster with capacity has been picked for the namespace, we move on to pod creation. The scheduler, in essence, annotates the pod and the namespace with the name of the cluster the workload is supposed to be deployed into, allowing the downstream syncers to pick up just those specific workloads and letting the pods get scheduled there.

So, what's the architecture of the entire clusterless solution? As Chris mentioned before, the entire architecture is an extension of the existing VirtualCluster architecture.
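To make the namespace-placement flow just described concrete, here's a minimal sketch. The cluster capacities, quota units, and the annotation key are all illustrative assumptions (the real annotation names and data structures in the project may differ): a first-fit pass picks the first cluster whose free capacity covers the namespace's requested quota, and the chosen cluster name is then stamped onto the namespace and pod objects for the syncers to act on.

```python
# Sketch of phase-one placement: pick a cluster for a namespace based on its
# requested quota, then annotate the namespace/pod with the choice.
# Capacities, quota units, and the annotation key are illustrative assumptions.

PLACEMENT_KEY = "scheduler.clusterless/cluster"  # hypothetical annotation key

def place_namespace(quota_cpu, clusters):
    """First-fit: return the name of the first cluster with enough free CPU."""
    for name, free_cpu in clusters.items():
        if free_cpu >= quota_cpu:
            clusters[name] -= quota_cpu      # reserve the quota for this namespace
            return name
    return None                              # no capacity anywhere

def annotate(obj, cluster):
    """Stamp the placement onto a namespace or pod object (a plain dict here)."""
    obj.setdefault("metadata", {}).setdefault("annotations", {})[PLACEMENT_KEY] = cluster
    return obj

# Usage: two worker clusters, a namespace requesting 8 CPUs of quota.
clusters = {"c1": 4, "c2": 16}
target = place_namespace(8, clusters)        # c1 is too small, so c2 is chosen
ns = annotate({"metadata": {"name": "team-a"}}, target)
```

Because the decision only compares the requested quota against free capacity, it stays fast regardless of how many pods the namespace will eventually hold, which is the point made above.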
In VirtualCluster, we have a syncer that synchronizes objects between the tenant control plane and the underlying super cluster. As a natural extension, we create a syncer for each worker cluster, and we add a new scheduler that watches for worker cluster capacity changes and for the objects created in the tenant control planes. Again, in the VirtualCluster model the syncer can watch multiple tenant clusters, and the same applies to the scheduler. So overall, the new architecture is: on top of VirtualCluster, we have modified syncers, plus a new first-level scheduler. In practice, both the syncer and the scheduler are normally managed by a separate meta cluster, but for simplicity I'm not showing that in this figure.

Next, I'm going to talk about some of the implementation details behind the clusterless prototype. As I mentioned before, the first thing is enhancing the syncer to support selective object synchronization. Why do we want that? In theory, you could simplify the solution by just copying all the tenant objects within a namespace, except the pod objects, into all of the underlying worker clusters, to prepare for the pod being scheduled to any of them. But that introduces the unnecessary overhead of storing objects that are completely wasted in clusters where the pod never runs. To resolve that problem, we enhanced the syncer to support selective synchronization, where the decision is determined by the placement result specified by the scheduler. Looking at the figure on the right: in this example, namespace A in tenant control plane T1 has a quota, with the quota slice size configured such that there are two slices, and the scheduler has decided on a placement that puts one slice in cluster C1 and one slice in C2.
Namespace B in tenant control plane T2, on the other hand, has one slice, and it is scheduled to C1. So the syncer for C1 will synchronize objects from both namespaces, but the syncer for C2 will synchronize only one namespace: namespace A in T1. Again, the syncer synchronizes all objects except pods to the underlying worker clusters, and we make sure each pod is synced to only one worker cluster, the target cluster determined by the scheduler.

Of the whole implementation, the most challenging part was the scheduler cache. Unlike the traditional Kubernetes scheduler, which only has to watch one API server, this scheduler has to watch the status of multiple clusters, both the tenant control planes and all the worker clusters. There are a lot of failure points in this setting: a tenant control plane can go offline, a worker cluster can go offline, or a single node in a worker cluster can go offline. We need to make sure that when those failures happen, the scheduler cache stays consistent, and we put a lot of effort into that. Another general problem is whether the scheduler should watch all the node events coming from the underlying worker clusters. That could be a huge amount of network traffic, because Kubernetes generates frequent node heartbeats and node list API calls. To reduce the overhead, we chose to periodically scan the clusters, collect the node statuses, and compute the available cluster capacities. The downside is that there is a certain delay before the scheduler becomes aware of underlying cluster capacity changes, which can cause some wrong scheduling decisions, but we have a remediation process that can accommodate that effect.
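As a rough illustration of that tradeoff, here's a sketch with made-up node objects, where plain callables stand in for real list calls against each worker cluster's API server: instead of watching every node event, the scheduler cache periodically lists nodes per cluster and aggregates allocatable capacity, tolerating clusters that are temporarily unreachable.

```python
# Sketch of the periodic capacity scan: rather than watching node events,
# list the nodes of each worker cluster on an interval and aggregate their
# allocatable CPU into a per-cluster capacity cache. The list_nodes callables
# stand in for real API-server list calls and are illustrative assumptions.

def scan_capacities(list_nodes_by_cluster):
    """Return {cluster: total allocatable CPU}, skipping unreachable clusters."""
    capacities = {}
    for cluster, list_nodes in list_nodes_by_cluster.items():
        try:
            nodes = list_nodes()             # one list call per scan, not a watch
        except ConnectionError:
            continue                         # offline cluster: keep it out of the cache
        capacities[cluster] = sum(
            n["allocatable_cpu"] for n in nodes if n.get("ready", False)
        )
    return capacities

# Usage: c1 has two ready nodes (a NotReady node is excluded); c2 is offline
# during this scan, so its stale capacity simply drops out of the cache.
def c1_nodes():
    return [{"allocatable_cpu": 8, "ready": True},
            {"allocatable_cpu": 8, "ready": True},
            {"allocatable_cpu": 8, "ready": False}]

def c2_nodes():
    raise ConnectionError("cluster offline")

cache = scan_capacities({"c1": c1_nodes, "c2": c2_nodes})
```

The scan interval is the knob behind the delay mentioned above: a longer interval means less traffic but a staler view of cluster capacity.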
In terms of the actual scheduling algorithm, it is pretty simple as of now. For namespace quota-slice scheduling, we do two things. First, we try to choose the minimum number of clusters that can satisfy all the slice requirements; the reason is that we want to reduce the number of clusters that have to synchronize the non-pod objects from those namespaces. Then we just use a simple first-fit algorithm to pick the clusters. For phase two, pod-wise scheduling among the candidate clusters, the algorithm is also really simple: we just use first-fit or round-robin to find the target cluster. Note that there is a lot of room for improvement in the scheduling domain. For example, we could leverage a lot of the upstream scheduler's capabilities, such as support for affinity and anti-affinity, or even spread policies. With those in place, we could support scenarios we cannot support today, say, the redundancy aspect. But that area is beyond the scope of this talk.

Although the algorithms are simple, we can still draw insights from the design: namespace quota-slice scheduling is not on the pod-creation critical path, and our algorithm ensures that for pod scheduling the number of candidate clusters is small, so the scheduling overhead can be minimal. Overall, this lets us roughly achieve system-wide pod scheduling throughput that scales out, because we can leverage all the schedulers running inside the underlying worker clusters to schedule the pods sent to the top-level user-facing tenant control planes.

I also want to talk about some other features that are specific to multi-cluster scheduling. The first one is rescheduling, because the candidate clusters placed for a namespace can go offline at any time.
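Before moving on, here's a minimal sketch of the phase-two pod placement just described, with a hypothetical candidate list (the project's real data structures will differ): a round-robin cursor cycles through the small set of candidate clusters that phase one already picked for the namespace, so consecutive pods land on different clusters.

```python
# Sketch of phase-two pod-wise scheduling: round-robin over the candidate
# clusters chosen for the namespace in phase one. Candidate and pod names
# are illustrative assumptions.
import itertools

def round_robin_placer(candidates):
    """Return a function assigning each pod to the next candidate cluster in turn."""
    cursor = itertools.cycle(candidates)
    return lambda pod: (pod, next(cursor))

place = round_robin_placer(["c1", "c2"])     # candidates picked in phase one
assignments = [place(p) for p in ["web-0", "web-1", "web-2", "web-3"]]
# Consecutive pods alternate between c1 and c2.
```

Because the candidate set is kept small by phase one, this per-pod decision is O(1), which is what keeps pod scheduling off the critical path described above.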
Unlike traditional Kubernetes, which has a controller to evict pods in case of node failure, we don't have that controller ready yet for the multi-cluster domain. Instead, for now, we allow people to manually revoke a scheduling decision by removing the scheduling result from the namespace annotation. The scheduler will then kick in and reschedule the namespace, finding an online cluster to support those workloads, and if any stale objects remain in the old worker clusters, the syncer will garbage collect them.

Another aspect, in terms of the application workload runtime, is Service support. Clearly, native Kubernetes doesn't support a Service across clusters, but if you choose a LoadBalancer-type Service and point the traffic at a global load balancer that supports multiple clusters, the whole thing may still work. The current architecture should also work with other existing multi-cluster networking solutions, such as Cilium Cluster Mesh; our overall design supports those kinds of solutions for network services across clusters.

We have implemented the features I mentioned and put together a prototype. Due to the time limitation, I'm not going to give you a full live demo, but you can find the demo at this link. Essentially, in this demo I create one tenant control plane with two worker clusters, and I set the quota in the default namespace and the quota slice size so that we have two slices; the scheduler schedules one slice to each worker cluster. Then I create a deployment with two replicas, and the scheduler schedules those pods into two separate worker clusters. If you look at the screenshot, from the user-facing control plane you can see two pods running when you check the pods, and if you check the two worker clusters, named root1 and root2, only one pod is running in each worker cluster.
Again, for more details, please go through this link to see the full demo.

So, in summary: multi-cluster management solutions are becoming popular. They usually choose different workload abstraction models and different scheduling strategies for those workloads, and they normally bring new APIs, so integrating those solutions with existing ones is a non-trivial effort. Those solutions also normally require manual capacity planning, which means users have to have full knowledge of the resource usage of the underlying worker clusters, so that they can make sure their scheduling policies work well and their workloads can still run in those clusters. In this talk, we proposed clusterless, a solution focused primarily on the scheduling and multi-tenancy aspects of the multi-cluster management space. We implemented clusterless by extending the VirtualCluster framework; it can achieve scalable scheduling throughput, and the entire solution is easy to integrate because we don't introduce any new APIs to manage the workloads.

All right, this is the end of the presentation. I'm happy to take any questions. Thank you.