Hi everyone, welcome to this presentation. I'm excited to see so many people interested in boring stuff like defining APIs, so we tried to keep this fresh. This is more about welcoming everyone to the conversation: we just want to build new APIs to make things easier for multi-cluster. Let's start by saying that defining APIs is hard, starting with the name itself. I think you all came to the "Introducing Cluster Inventory and Cluster Feature API" session. The names have gone through many iterations, and there is actually a poll at the end of the presentation where we are settling on new names for the APIs. Every time we run a SIG Multicluster meeting, we propose new names. So welcome, again, to the cluster inventory and Node Feature Group API talk. It keeps changing, so maybe by the end of this talk it will be a different name.

My name is Carlos, and my co-speaker is Ryan. I'm from NVIDIA, he's from Microsoft, and we will try to keep this as light and interesting as possible. Who am I? This is my first in-real-life KubeCon. I did presentations during the pandemic, but for those who have never seen me, I've been around the container industry for seven years or so. Now I work at NVIDIA on the cloud-native team, basically trying to enable GPUs in Kubernetes and make things easier for all of you. And Ryan?

Hello, my name is Ryan, and this is also my first KubeCon. It's great that at my first KubeCon I get to talk to so many people here. Who am I? I currently work on Azure Kubernetes Service as a principal SDM. In my previous life, I worked on KubeVela and OAM as a founding member; I don't know how many of you remember OAM. And way before that, I did a PhD in grid computing. Again, I don't know how many of you know what grid computing is, but that's not the point.

OK, so today I'm going to talk about multi-cluster. In this talk, I'm going to first give you a brief history of multi-cluster, then I will present the motivations, and then we are going to do a deep dive into the two APIs that we are proposing and developing. At the end, there's a Q&A session. Before that, let's do some data-driven presentation. How many of you are working with Kubernetes clusters? Hands up. Oh great, this is KubeCon, nice. How many of you have only one cluster? You can leave; this talk is not for you. How many of you have more than 10? Well, still a lot. 100? Oh, OK. 1,000? OK, talk to me after the talk. You are my audience. I'm just kidding, you're all good.

So it's clear that multi-cluster is something we're all dealing with, and it's something that started almost as soon as Kubernetes, or the cloud-native movement, started, because we all foresaw that there were going to be multiple clusters. Interestingly, I've been to at least two multi-cluster talks every day at this KubeCon. This morning there was a multi-cloud talk, jointly between Google and Microsoft. I was like, wow, those two are working together. And then I went there, and it turned out to be another multi-cluster talk. Lots of people couldn't get in; I waited outside for like 10 minutes before I got in. So if any of you were not able to get into that talk, these two slides basically zoom through whatever they talked about, so you are not missing anything.

OK, so when we talk about multi-cluster, the first thing that comes to my mind is cluster management, namely creating, deleting, and upgrading your clusters. That's the basics.
I think many of you here know this: when you have maybe one or two or maybe five clusters, they're cute. You're just babysitting them, upgrading them, applying every patch and CVE fix. They're cute. But then when you grow to 10 or 15, they become teenagers. They rebel. They do all sorts of stuff. Then you get headaches. So that is kind of the first problem, the cluster management phase. On this front, there are some community solutions and some proprietary solutions. For example, you have Terraform, and you have Cluster API. For the proprietary ones, AWS has CloudFormation. Then there's Rancher, which is not really a cloud provider, but they showed me a demo somewhere, and it looked pretty good: you can see all your clusters, you click, click, and each one gets upgraded happily, in that demo. Then you have GKE; I don't know if there are any GKE people here, but GKE has rollout sequencing, where basically they take care of the rollout. And Azure has Azure Kubernetes Fleet Manager. We have stages, groups, all sorts of things you can think of to make sure you can upgrade your whole fleet safely and happily. At least that's the goal.

Then we have cluster configuration. It's similar, but a little bit different: you need to rotate your certificates, you need to do capacity management, access management. GKE also has a very detailed, what's it called, Config Sync solution there. And I think Terraform and Cluster API can sort of solve that problem, plus you can throw in whatever your favorite scripting language is. That part, I think, is relatively OK. But we all know that's not the whole story. Again, just a poll: how many of you have upgraded a cluster and then your application stopped working? And how many of you have nightmares or lose sleep before you click that upgrade button on your cluster? The reason I ask is that the key is actually not the cluster upgrade; it's the application. The cluster could be upgraded successfully, or so you thought, but your application stopped working. Application management is the real deal. We don't just create a cluster for fun; some application needs to run on it. So multi-cluster application management is the next frontier we are trying to tackle.

Again, if you have been to those multi-cluster talks, they will give you ten reasons why you want multi-cluster applications. I'm not going to repeat those; just assume you have these needs. The first thing coming to my mind, or everybody's mind, is workload scheduling. You have an application on multiple clusters, and you need to decide where it goes, when it goes, and how many replicas go there. There are a lot of community solutions out there. I won't say that part is solved, but there are solutions.

Then there's traffic splitting. Here I'm mostly talking about north-south traffic. If you have a workload, unless it's a silent guy just sitting there doing nothing, someone needs to talk to it, and then there's this traffic splitting. I'm not aware of any really good solutions out there. What exists sort of works for everything, but it doesn't always work; anybody who has dealt with it knows. It's great when it's great. Every cloud provider has their own solutions: global load balancers, traffic managers, front doors, whatever you call them. You can put them all together.
Maybe there are some customized solutions. Then there's cross-cluster communication, basically east-west. You may or may not be aware of this: there's a Multi-Cluster Services API in SIG Multicluster. We are in the middle of trying to revive it; it's been v1alpha1 for maybe three years. But that is a community API, and we are talking APIs here; we are happy with just APIs.

And then the real deal, I think, for multi-cluster applications is disaster recovery. Again, I've been to at least two talks about multi-cluster every day. The application vendors have to do this. For example, there's Bloomberg, right? Bloomberg has to have disaster recovery. Any serious business has to have a disaster recovery story; otherwise their managers, their bosses, are not going to be happy. But disaster recovery is not just putting the application onto two different clusters and having it automatically work. There are actually tons of manual steps if you're not careful. You need to find the capacity. Just think: if you have traffic split between cluster one and cluster two, and cluster one goes down, will things work if the traffic just goes to cluster two? It won't. Cluster two doesn't have the capacity; the traffic will just overwhelm that cluster. If you don't have cluster capacity management, it won't work. Autoscaling needs to work. All these things. And then you also need to carefully do the traffic splitting, right?

And then finally, last but not least, because there's no way to enumerate everything, there's batch scheduling. It's different from workloads because batch jobs last a lot less. They're not long-lasting; they're short-lived, they run and then they're gone. So you have more chances to balance the workload, or the utilization, between clusters. A workload like a traditional service stays there, happily running for days or even months, before you can move it. In a serious enterprise, normally they don't want you to touch the applications; they are very careful. But batch is a totally different story.

So those are the application management problems. And then we have this hybrid part, fleet management; I don't have a better word. Naming is hard. You will see why. Naming is definitely hard. I just call it fleet management. For example, there's identity federation. What does that mean? When your application runs in one cluster, it normally needs to talk to some dependency services, especially in the cloud provider world, so you need credentials. But when your application spans multiple clusters, how do you make sure all the replicas talk to the same service, or a global service, using the same identity? That's also not a given.

Another super difficult problem, and I don't think there are good solutions out there, is observability. There are two parts. One is cluster observability; that part is probably more solved. That means you have hundreds of clusters, like you guys do, and you need to know their utilization. Are any of them dead? Happy, not happy? Should I bring more clusters in because capacity is really low in one region? That's observability at the cluster level. And then there's the part where I'm not aware of any good solutions: application observability. If your application spans multiple clusters, how do you get an idea of how the application is actually running?
You don't have that single pane of glass where you can see everything. You can go to each cluster, and in each cluster you can see things, but you don't have the high-level picture of the application. That's another very difficult problem. I think I spent a little too much time on this, so I'll combine the last two together. There's team management: you have a fleet of clusters and a bunch of teams. At most companies you have different departments or different teams, and you want to assign each team to maybe a subset of the clusters, or in many different ways. You need to do quota management, you need to do access management: who can use what, at what capacity? Those are also hard problems to solve.

OK, I spent a little more time on that than planned, but the takeaway is pretty straightforward: this is a huge problem domain, and the solutions out there are not nearly good enough to solve these problems. We are working on this within SIG Multicluster. Again, this morning's talk mentioned KubeFed. There was KubeFed v1 and v2; both died pretty sadly, for different reasons. Ask me if you want to know why; for the time being I'm not going to get into details. Those two came with full-fledged implementations that you were supposed to be able to just install on your clusters and have work. Sadly, as I alluded to, the problem is just so difficult that it's hard to build one solution that works for everyone. So the SIG started to learn its lessons and focus mostly on APIs.

So here we have the Work API. I'm not sure if anyone has heard of that. Good, great. The Work API is kind of a primitive, a building block, for member or workload clusters talking to a management cluster. If any of you know KubeFed v2, one of its problems is that for any custom resource, you have to wrap it: for every custom resource you have a federated resource, and I don't know why people thought that was a good idea. The Work API overcomes that: we don't wrap anything; we basically take the whole resource as a blob of JSON data and treat it that way (there's a rough sketch of this idea below). And then there's the About API; honestly, I'm not sure what it is about. And then there's the MCS API, for east-west traffic: there are ServiceImport and ServiceExport. That's the API we are reviving; we're trying to push it to v1 because it is actually useful.

Anyway, I think it's pretty easy to see that the state of the art is that everybody still treats their clusters as pets. You pet them, you make sure they don't die, you look after them. And there are tons of custom solutions. Again, I attended a talk by Bloomberg; I don't know if someone from Bloomberg is here. They created a cluster with two API servers: one API server talks to a globally replicated PostgreSQL database, and the other talks to the normal etcd. After the talk and the Q&A, I still hadn't figured out how those two API servers work in one cluster. And then there's Elastic; I don't know if any Elastic person is here. They basically got rid of the API server and etcd altogether and rewrote the whole control plane. The only thing they make sure of is that controller-runtime still works with their customized server; it's not even an API server anymore, I don't know what it is. So those are the custom solutions, and they work, I think, for their specific use cases.
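Coming back to the Work API mentioned above, here is a minimal Go sketch of the JSON-blob idea: carrying arbitrary manifests opaquely instead of generating a federated wrapper type per resource. These are simplified stand-in types for illustration, not the actual Work API definitions (the real ones use Kubernetes machinery such as runtime.RawExtension):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// WorkSpec carries arbitrary Kubernetes manifests as opaque JSON blobs,
// so no per-type "Federated<Kind>" wrapper is needed. This is a simplified
// stand-in, not the real Work API type.
type WorkSpec struct {
	Manifests []json.RawMessage `json:"manifests"`
}

func main() {
	// Any resource, including custom resources, can be embedded as-is.
	deployment := json.RawMessage(`{
		"apiVersion": "apps/v1",
		"kind": "Deployment",
		"metadata": {"name": "hello", "namespace": "default"}
	}`)

	work := WorkSpec{Manifests: []json.RawMessage{deployment}}

	out, err := json.MarshalIndent(work, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // the management cluster ships this blob to member clusters
}
```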
But what we are looking at, from the SIG Multicluster point of view, is building something community-driven, and more generic than those very specific solutions. OK. I hope by now I don't really need to get into the motivation part; everybody is probably already very motivated to get a common solution here.

So here are some of the notable open source multi-cluster projects. One is OCM; most people have probably heard of that. That's from Red Hat. Clusternet is another CNCF project, mostly coming out of Tencent. And then there's Fleet Manager; I kind of built that project. It's open source, but not too many people know it. And then there are dozens more; probably one of you has built something similar already. There's KubeAdmiral from ByteDance; they recently made a big deal of open-sourcing it. And then the interesting thing is this: there's Kueue. I don't know how many of you know what Kueue is; you've probably heard of it. There's KubeVela, and there's Argo CD; I guess everybody knows Argo CD. These are not, strictly speaking, multi-cluster projects at all, but they all need to deal with multi-cluster. So Kueue has MultiKueue; I'll get into that in just a few slides.

Now, with all these different projects, the maintainers of three of them, OCM, Clusternet, and Fleet Manager, came together and hashed something out. We were like, yeah, this is just so fragmented, so many piecemeal solutions. We are trying to build this community up and find some common ground. Here are some principles we are trying to adhere to. First, we want to make sure that multi-cluster is still cloud-native, or Kubernetes-native. We are not going to reinvent the API server or get rid of etcd. No, we are going to stick to upstream. If Kubernetes ever moves to a PostgreSQL backend, that's OK, but we are not going to do that ourselves. Also, we are going to focus on APIs, because we already have so many solutions out there; we don't want to reinvent the wheel. And also, because, as I mentioned, I spent like five minutes on those problem spaces, there's no way we can tackle all of them. So we want to start very small. You will be kind of surprised how small. We want to start from a very specific problem, and that problem is: how do you represent a cluster inventory? Very straightforward.

OCM, Clusternet, and Fleet Manager all already have such a resource: basically a CRD representing a cluster in an inventory, with names like ManagedCluster or MemberCluster. The interesting thing is that KubeVela, I don't know how many of you know it, is an application model. But with an application model, as I said, you have multi-cluster applications; how do you represent that? They also need a way to represent a cluster inventory. And Kueue is basically a job scheduler, but same thing: when you start to want to schedule across multiple clusters, you need a way to represent a cluster. In their case, they call it a MultiKueue cluster. And we were like, yeah, everybody has this. Why not have one common interface instead of everybody reinventing different wheels, a wooden wheel here, a gold wheel there? We're trying to get to one single wheel. That's the whole point of this motivation, and of this talk, kind of. And yeah, the real benefit is that if we have this common API, two things can happen.
One is that we can all start to build on top of it. The second, more important part is that third-party applications, like Argo CD and Kueue, could integrate with any of those multi-cluster projects without being tied to them. Actually, I have an example later on, but at a really high level, this request came from Argo CD. Argo CD basically does GitOps and places applications; they have a CRD called ApplicationSet, and they want to put their resources onto different clusters. But they do not want to integrate with OCM, Clusternet, Fleet Manager, and blah, blah, blah, a dozen of them. They want to integrate with just one thing, the common API, and then it works; everybody works. In this way, we can build up the community. Again, I'll show how this works in the next few slides.

The non-goals: we are not going to provide a standard implementation, like a KubeFed v1, v2, v3, whatever. We are not going to define implementation guidelines for how you implement it. And, at least in this API, we are not going to offer any scheduling functionality or guidance; that's the next step. This API is purely a cluster inventory.

OK, with all that, I'm going to get into the details of the cluster inventory API. Here is a very simple example of the API, if you can see it. First thing: the naming is to be determined. Cluster inventory, inventory cluster, whatever; naming is to be determined. The spec is very simple: it only has two fields, a display name and a cluster manager. The display name is basically for humans to read, because we have a name-uniqueness issue; it's a long story. The cluster manager basically says who manages this cluster: in our case it would be OCM, Fleet Manager, or Clusternet, or bring in your own favorite multi-cluster project; if it supports this API, it can be the cluster manager. But the really important part is the status. We have a version, which I think is pretty self-explanatory. Then we have properties, which, as you can see, are just key-value pairs. That means you can put in anything, as long as it's a string. And then you have conditions. Currently we only have two. One is healthy, and it's actually control-plane health: what that literally translates to is that all the components in your control plane have passed their health checks. And then there's joined, which is basically a heartbeat.

OK. That's it. That's the whole API. I know what you're thinking: after 15 or 20 minutes of hype, that's it? I didn't expect any applause here; I was bracing for boos. Yeah, it's like a giant nothing burger. What is this?
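In case the slide is hard to read, here is a rough Go sketch of the shape just described. The field names are approximations of what the slide showed; the real API was still in flux at the time, starting with its name:

```go
package main

import "fmt"

// Rough sketch of the cluster inventory object described above. Names are
// placeholders; the real API was still being renamed and refined.
type ClusterInventorySpec struct {
	DisplayName    string // free-form, human-readable name
	ClusterManager string // who manages this cluster: OCM, Fleet Manager, Clusternet, ...
}

type Property struct {
	Name  string // loosely defined for now; typed properties (CPU, memory, GPU) may come later
	Value string
}

type Condition struct {
	Type   string // e.g. "Healthy" (control-plane health checks pass) or "Joined" (heartbeat)
	Status string // "True" / "False" / "Unknown"
}

type ClusterInventoryStatus struct {
	Version    string // Kubernetes version of the member cluster
	Properties []Property
	Conditions []Condition
}

type ClusterInventory struct {
	Name   string
	Spec   ClusterInventorySpec
	Status ClusterInventoryStatus
}

func main() {
	c := ClusterInventory{
		Name: "cluster-1",
		Spec: ClusterInventorySpec{DisplayName: "team-a-prod", ClusterManager: "fleet"},
		Status: ClusterInventoryStatus{
			Version:    "v1.29.0",
			Properties: []Property{{Name: "region", Value: "eu-west"}},
			Conditions: []Condition{{Type: "Joined", Status: "True"}},
		},
	}
	fmt.Printf("%+v\n", c)
}
```

The point is that everything a consumer needs, the version, free-form properties, and a couple of conditions, hangs off one tiny common object, so a consumer could, say, filter clusters on a hypothetical region or allocatable-gpus property without knowing which project manages them.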
I'll show you why this is important; here is where it becomes powerful. The link here is an Argo Workflows issue. The issue basically says: we want to use the cluster inventory API, whatever its name ends up being. I don't know how many of you are familiar with Argo Workflows; actually, I'm not. But just looking at this example, you can see there's an input called cluster inventories, which means exactly this API, and the cluster inventory's name is cluster one. The powerful part is that you don't have to explain anything anymore. It's just a cluster inventory; Argo knows what the cluster inventory fields are. For now that's almost nothing, well, not really nothing, but it's limited. Argo doesn't have to integrate with anything else; with this, they already have a common interface to integrate with. And inside the arguments, you can see inputs.clusterinventories.cluster1.name: basically, they can template these fields into their CRDs and their arguments. So immediately they can use these templated values within their workflows. And in the future, Argo Workflows has the potential to leverage the cluster inventory to do scheduling, which is almost always the next thing people think about. That's one example.

OK, so again, currently it looks like a nothing burger, but this burger will get bigger. We have plenty of roadmap ahead of us. First, we actually had an unconference session upstairs two days ago about credentials. The really nice part is that as soon as we brought this cluster inventory API into SIG Multicluster, different projects noticed it and realized it's a good API to integrate with. The credentials ask came from Kueue. Actually, I didn't know Kueue before that, but they came up and I realized, oh, this is actually a very important project. So we are going to discuss how to add credentials: push model versus pull model, spec versus status, sequencing, reference models, read-only. You get how difficult it is to add just one or two fields to this API. That's why we currently have this pretty much schema-less API.

Also, the key part, if I can get back to this, is that the real value is in the conditions and the properties. I want to make sure we get that right. The properties are currently very loosely defined, but in the future we would like to define real, typed properties, such as allocatable memory, CPUs, and, especially nowadays, GPUs. How do you find the GPUs in a cluster? That's a very common problem for everybody. GPUs are worth their weight in gold; you cannot find them, and if you have one, you want to use it 100%. How do you do that? From a meta-scheduler's point of view, you need to know where the GPUs are and how they're doing. Cluster health is another unsolved problem. I think most cluster admins would hope for one single signal that says this cluster is healthy, meaning not just that the cluster is healthy, but that all the applications on it are happily running. We need something like that.

And finally, we have this roadmap; for the time being I'm just going to zoom past it. Mostly, the first thing is that we want more consumer and provider integrations. The three members of the founding group of this project, OCM, Clusternet, and Fleet Manager, are all going to integrate; we are the providers. But we need more consumers. We now have MultiKueue, we have Argo, and if any of you are working on a project, we would definitely like to hear from you. And the final item: as Carlos mentioned, naming is hard. If you have your phone, you're coming at a really good time, because we are running a survey. We really spent three sessions of the SIG Multicluster meeting just discussing what the proper name is. We went back and forth over English grammar and even got into Merriam-Webster.
We tried to precisely define what a cluster inventory is, and we could not agree. So that's why we have this naming survey. I think after this talk you have enough background on what this is about, and hopefully enough of an opinion to vote. There's Cluster, ClusterRecord, ClusterProfile, ClusterMember, ClusterDetail; you get it. Yeah, OK. Hopefully you whip out your phone and vote. OK, great. Now I'm going to hand it over to Carlos to talk about why we have this Node Feature Group API, which will be very helpful for the cluster inventory API.

So, as Ryan mentioned, and I'm going to be very quick, we need consumers and providers. The idea of the Node Feature Group is to be a provider for the cluster inventory. Sorry, my head is still wrapping itself around the names. The Node Feature Group is going to be fed by NFD. For those who might not know NFD, here's a quick 101: Node Feature Discovery is a project that creates labels and annotations, and it has grown to the point that it now supports topology-aware scheduling with the topology updater, and we recently added a garbage collector. Basically, NFD provides an API for you to discover all the features in your cluster on a per-node basis: it advertises a NodeFeature CR, one per node. And how do you configure what NFD discovers? There's an existing feature called NodeFeatureRule, with which you can match on features: say you only want to discover specific memory, CPU, or kernel features. NFD is very extensive and very feature-rich.

So, based on these APIs that NFD already provides, we want to propose the Node Feature Group API. It's basically a way to create groups of nodes, or groups of systems, based on the rule and feature machinery that NFD already has. The Node Feature Group API will basically look like this: it's another API, an extra CRD that NFD is going to handle, and by writing rules you can create lists of nodes in your cluster.

Why is this important? As Ryan was hinting about other talks, I've been in some talks where they present multi-cluster as something easy. I was in a multi-cluster and MultiKueue talk where they said you can deploy a job on any cluster, and just kill one cluster, create another, and keep running. Some business logic is not that simple. Sometimes you want to know which clusters have specific features and which clusters have specific configurations. At some point your business logic will want specific drivers, specific kernel configurations, so you want to know where the nodes with those configurations are, even down to data locality: you want a list of the nodes that already have the data downloaded, so your workloads run faster instead of having to fetch it. So this API is going to feed the cluster inventory.
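Here is a rough Go sketch of what such a NodeFeatureGroup object might look like. The API was not yet released at the time of this talk, so the field names and the PCI matching rule below are illustrative assumptions, not the final NFD schema:

```go
package main

import "fmt"

// Hypothetical, simplified shape of a NodeFeatureGroup object. Names here
// are illustrative; the real CRD ships with NFD and may differ.
type MatchExpression struct {
	Feature string   // an NFD-discovered feature, e.g. kernel, CPU, or PCI info
	Op      string   // e.g. "In", "Exists"
	Values  []string // values to match against
}

type FeatureGroupRule struct {
	Name          string
	MatchFeatures []MatchExpression
}

type NodeFeatureGroupSpec struct {
	Rules []FeatureGroupRule
}

type NodeFeatureGroupStatus struct {
	Nodes []string // nodes currently matching the rules; kept up to date by NFD
}

type NodeFeatureGroup struct {
	Name   string
	Spec   NodeFeatureGroupSpec
	Status NodeFeatureGroupStatus
}

func main() {
	// Group all nodes that expose an NVIDIA GPU via their PCI features
	// (the feature name is an assumption; 10de is NVIDIA's PCI vendor ID).
	g := NodeFeatureGroup{
		Name: "gpu-nodes",
		Spec: NodeFeatureGroupSpec{
			Rules: []FeatureGroupRule{{
				Name: "has-nvidia-gpu",
				MatchFeatures: []MatchExpression{{
					Feature: "pci.device",
					Op:      "In",
					Values:  []string{"vendor=10de"},
				}},
			}},
		},
		Status: NodeFeatureGroupStatus{Nodes: []string{"node-1", "node-7"}},
	}
	fmt.Printf("group %s matches nodes: %v\n", g.Name, g.Status.Nodes)
}
```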
How is it going to feed the cluster inventory? We propose two models. In the push model, the cluster inventory just has a pointer to the Node Feature Group API on each worker cluster, and the controller or scheduler on the manager cluster uses that to make decisions. But this doesn't play well with some proxy or network configurations. So we also propose a pull model, in which NFD, or the worker cluster itself, pushes the Node Feature Group objects into the manager cluster, and you make decisions there based on your worker clusters. The pull model is designed more for those proxy or VPN configurations where some nodes are isolated and only have network permission to pull, not to receive pushed information.

So basically, what we propose is that the Node Feature Group, handled by NFD, becomes a feeder, a provider, for the cluster inventory API. That way your business logic, because there's Karmada, there's Kueue's MultiKueue, and other projects coming into the ecosystem, can just use these APIs as information to make an informed decision; they can do smart scheduling and routing using these two APIs.

Again, this is an invitation to join the Node Feature Group and cluster inventory conversation. These are proposed APIs; they are still not implemented. I think the KEP just got accepted for 1.30, and the Node Feature Group is going to be in the next NFD release, which will come maybe two weeks after KubeCon. These APIs are still alpha, and we want feedback from the community. This talk was mostly to bring the conversation to KubeCon and get feedback, because we already have one consumer, which is Kueue, but we want more consumers, more people telling us what works in these APIs and where we are putting in effort that really doesn't work. So we want you to join the conversation; you are welcome. This QR code will take you to SIG Multicluster. Please join the conversation, come to the meetings, and tell us what extra you need from the APIs, as the Kueue community already did. The Kueue community joined the meetings, and they are already proposing changes to the API so Kueue can use these APIs to make smarter batch scheduling decisions. That would be a nice thing to have: Kueue, an upstream SIG project, using these APIs to make batch scheduling better for Kubernetes. And thank you again for coming.