Good afternoon everyone, and welcome to our talk. My name is Sahiti Ailu, and this is my colleague Arun Krishnakumar. Both of us are engineers at VMware. Today we are going to share our experience building a Cluster API infrastructure provider for a multi-tenant cloud platform. We will also talk about a few of the challenges we faced along the way, lessons learned and discoveries made around problems you could run into in your own environments, and design patterns around Cluster API usage in a multi-tenant cloud platform. Lastly, we will cover how we built a Kubernetes-as-a-service layer with Cluster API as the underlying technology. With that, let's get started.

The agenda: for the first half of the talk, I will cover Cluster API internals and give you some resources on how to get started with an implementation. For the second half, Arun will cover design patterns around Cluster API usage in the context of a multi-tenant cloud environment, and lastly the Kubernetes-as-a-service layer on a multi-tenant cloud.

Before getting into the details of Cluster API, I would like all of us to have a common understanding of what a multi-tenant cloud is. A cloud delivers infrastructure as a service to its tenants in terms of compute, storage, and networking, while providing strict isolation and security between tenants. So who are the tenants here? A tenant could be an individual end user, but in our cloud platform, VMware Cloud Director, a tenant is an organization: an enterprise-level company with a group of users, and multiple organizations can exist within the cloud platform. These users can request Kubernetes clusters, and that is the solution we have built at large: a Kubernetes-as-a-service engine on top of a multi-tenant cloud platform, with Cluster API as the underlying technology.

So what is Cluster API? I'm going to quickly breeze through this slide. Cluster API is a Kubernetes project that brings declarative, Kubernetes-style APIs for cluster creation, configuration, and management. The idea is that end users run familiar kubectl commands against an existing cluster to create new clusters. The existing cluster with the Cluster API components installed is what is called the management cluster, and the child clusters created as a result are called workload clusters. That's the difference between the two: the management cluster is the cluster with the core CAPI components installed, and it is also where the API definitions of the workload clusters live, while the workload clusters are intended to host the actual workloads.

That's the end user's point of view of Cluster API. Now let's look at the developer's point of view. In simple terms, Cluster API is composed of one main component, the core CAPI provider, and three pluggable components: the infrastructure provider, the bootstrap provider, and the control plane provider. Each of these providers has its own set of responsibilities to meet and a few orchestration rules to adhere to, which are dictated by the core CAPI provider.
At a high level, the infrastructure provider's responsibility is to create the infrastructure required for the cluster on the chosen cloud environment, and that is what we are interested in for this talk. The bootstrap provider's responsibility is to generate a script that can convert any machine into a Kubernetes node, either control plane or worker. The control plane provider's responsibility is to manage the control plane nodes and deal with upgrades and so on.

The idea is that a user applies a cluster manifest file on the management cluster, and the resulting workload cluster gets created on the chosen infrastructure provider. For this talk, that provider is CAPVCD, the Cluster API Provider for VMware Cloud Director that we have built; VCD is the acronym for our multi-tenant cloud platform, VMware Cloud Director.

Here is a sample CAPI manifest file. It is basically a hierarchical structure of API objects. The Cluster is the root element, and it holds owner references to the objects associated with the other providers. Note the VCDCluster and VCDMachineTemplate custom resources: these are associated with our infrastructure provider, CAPVCD, and you would replace them with whatever infrastructure CRs you come up with for your own infrastructure provider. The sample CAPI manifest essentially says: I want a Kubernetes cluster with one control plane node and one worker node, with such-and-such settings, on such-and-such cloud provider.

Let's go a bit deeper now and see how Cluster API works. Two things enable the smooth interplay of core CAPI and all the providers we have just talked about: number one, the hierarchy of API objects, and number two, the Cluster API contract. On the hierarchy of API objects: the diagram here is a pictorial representation of the manifest we just saw on the previous slide. Everything in blue is a custom resource associated with core CAPI and the other providers; everything in green is associated with the infrastructure provider, which is our interest. When you apply that YAML file, this is what the resulting API object hierarchy looks like. All of these resources are watched by their associated controllers, which continuously attempt to reconcile the current state toward the desired state. The infrastructure cluster and infrastructure machine objects that are created as a result are likewise supposed to be watched by their respective controllers, the infrastructure cluster and infrastructure machine controllers, and that is what we are supposed to build as part of a Cluster API infrastructure provider. These CRDs and their associated controllers are, at a minimum, what make up an infrastructure provider.

Now that we have a fair understanding of what the hierarchy of API objects looks like, let's understand the Cluster API contract. As I mentioned before, all of these controllers from the providers are responsible for certain things, and all of them need to adhere to certain orchestration rules dictated by the main component, core CAPI. The controllers interact with each other through fields called well-known fields.
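To make that hierarchy concrete, here is a trimmed sketch of what such a manifest can look like. The resource names, namespace, and version strings are placeholders, and the VCDCluster/VCDMachineTemplate kinds stand in for whatever infrastructure CRs your own provider defines; the referenced VCDMachineTemplate and KubeadmConfigTemplate objects are omitted for brevity.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: demo
  namespace: default
spec:
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: demo-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: VCDCluster              # replace with your own infra cluster CR
    name: demo
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VCDCluster
metadata:
  name: demo
  namespace: default
---
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: demo-control-plane
  namespace: default
spec:
  replicas: 1                     # one control plane node
  version: v1.22.9                # placeholder Kubernetes version
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: VCDMachineTemplate    # replace with your own machine template CR
      name: demo-control-plane
  kubeadmConfigSpec: {}
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: demo-md-0
  namespace: default
spec:
  clusterName: demo
  replicas: 1                     # one worker node
  template:
    spec:
      clusterName: demo
      version: v1.22.9
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: demo-md-0
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: VCDMachineTemplate
        name: demo-md-0
```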
Let's take an example of how the cluster controller and the infrastructure cluster controller interact with each other, in a sequence diagram. Both of these controllers are watching their associated custom resources. The cluster controller is the first one to act: it sets the owner reference on the infrastructure cluster object, basically saying "I own you." The infrastructure controller's job from that point onwards is to create the base infrastructure needed for cluster creation to proceed, for example a load balancer and the networking setup that is unique to your own cloud environment, and it also needs to ensure that a control plane endpoint is either generated or specified by the user. Once the control plane endpoint is available, it marks itself as ready. The cluster controller then consumes that control plane endpoint, marks itself ready, and generates the kubeconfig secret so that end users can begin to access the cluster.

The infrastructure machine controller is mainly responsible for creating nodes. In the previous slide, the infrastructure cluster controller created the base infrastructure necessary for all of these controllers to proceed further; the infrastructure machine controller's main job is to create the per-node infrastructure. In parallel, the bootstrap controller generates the script needed to convert these machines into Kubernetes nodes: it generates the bootstrap script and stores it in a data secret referenced by a well-known field, to be consumed by the machine controller. The machine controller copies that reference into another well-known field on the object owned by the infrastructure machine controller, and the infrastructure machine controller at that point provisions the infrastructure using that bootstrap secret. Basically, it takes the cloud-init script out of the secret and runs it to convert the machine into a Kubernetes node, either control plane or worker. If it is a control plane node, you would also see another controller in the picture, the KCP (KubeadmControlPlane) controller. Once all of that is done, the infrastructure machine controller marks itself ready, then the machine controller marks itself ready and waits for the node to join the cluster. The bottom line, again, is that the infrastructure cluster and infrastructure machine CRDs and their associated controllers are what make up the infrastructure provider, and this is what we need to implement as a Cluster API infrastructure provider.

Now that we have a fair understanding of how the internals work, let's get started with the implementation. The implementation should become relatively easy, and the understanding of the Cluster API internals should also help you debug and troubleshoot when things do not go as expected. We used the kubebuilder command to create the project layout and the scaffolding. Kubebuilder is a framework for building Kubernetes APIs via custom resource definitions, and it generates a lot of boilerplate code for you, so you can just jump in and implement the business logic.

Now let's assume we have built the infrastructure provider for your own cloud environment. How do we get this infrastructure provider installed on an actual Kubernetes cluster? Basically, we need to set up the management cluster, and we can do that with the clusterctl tool, which helps with setting up the management cluster and generating cluster manifest files.
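As one concrete example of those well-known fields, the v1beta1 contract expects the infrastructure cluster object to expose a control plane endpoint in its spec and a ready flag in its status. The field names below follow that contract; the VCDCluster kind, name, and endpoint values are placeholders for whatever your own provider defines.

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VCDCluster                 # your infra cluster CR
metadata:
  name: demo
spec:
  controlPlaneEndpoint:          # well-known field: either generated by the
    host: 10.20.30.40            # infra controller or specified by the user
    port: 6443
status:
  ready: true                    # well-known field: the cluster controller waits
                                 # for this before marking the Cluster itself ready
```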
This is the command you need to run for your infrastructure; VCD is our platform. Basically, clusterctl pulls the provider components from your GitHub repository and installs them on the management cluster. What this also means is that you will have to update the clusterctl code to include your infrastructure provider in the long list of providers that clusterctl currently supports. There are also clusterctl generate commands to generate a sample Cluster API manifest, which you can then apply on the management cluster to create workload clusters.

OK, so now we have the management cluster fully ready with all the components installed, and a user comes in and creates a workload cluster. Is this workload cluster ready, as in ready to host modern applications? Not yet. We need a CNI to enable container-to-container communication and a CPI (cloud provider interface) to set the provider ID on the nodes. Note that the CPI is effectively a mandatory requirement from core CAPI: it expects a cloud provider interface to be installed on your workload clusters. So we need both the CNI and the CPI installed before the workload cluster can be called fully ready, and a CSI driver enables stateful deployments with persistent volumes. We use ClusterResourceSets (CRS) for installing these components. So if you are planning to build a Cluster API infrastructure provider, you should also plan to implement a cloud provider interface for your cloud environment.

Next, admission controllers and multi-version APIs. Now that we have the basic implementation of the Cluster API infrastructure provider, you can make it more robust by implementing admission controllers, namely defaulting and validating webhooks. They let you write custom code to set default values on the resources and to validate them before the data is persisted in the etcd database. Then there is multi-version API support. It is a big topic in itself; for this talk I am just going to go over the need and a few resources to get started. At some point you will have to think about bumping the API version, and what that means is that it becomes a necessity for the infrastructure provider to be backward compatible with older API versions. When a user requests an older API version, the Kubernetes API server is supposed to return the object in that API version; however, the stored version could be much further ahead, and the kube-apiserver needs to do the necessary conversions between the requested version and the stored version. These conversions go into conversion webhooks, which the kube-apiserver calls to do the conversions. You can create the scaffolding for these webhooks with kubebuilder as well, which is what we did.

A few of the lessons learned. The Docker provider is an excellent starting point to read through and modify; it is what we actually used in the beginning to familiarize ourselves with infrastructure provider implementation, and it is very simple. The bootstrap controller generates the cloud-init script from a template, and we had to do some tinkering to adjust it to our needs. Load balancers are pretty much a first-class component of the infrastructure. I have already talked about the CPI, and do remember to set the cloud provider as external in the kubeadm configuration.
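As a sketch of that last point, with a kubeadm-based control plane the external cloud provider is typically wired in through the kubeadm config spec; something along these lines, where the object name is a placeholder and the remaining KubeadmControlPlane fields are as in the earlier manifest sketch:

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: demo-control-plane
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          cloud-provider: external   # hand node initialization over to the out-of-tree CPI
      controllerManager:
        extraArgs:
          cloud-provider: external
    initConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          cloud-provider: external   # kubelet taints the node until the CPI initializes it
    joinConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          cloud-provider: external
  # ...replicas, version, and machineTemplate as in the earlier manifest sketch
```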
Lastly, on this slide: do take note of these two labels; they need to be set on your CRDs. This is an important step. They basically tell the core CAPI component which API version of your infrastructure provider to use, and it becomes even more important to set them once you have multi-version API support in your infrastructure provider. We actually hit this issue: we had multi-version API support ready, we somehow forgot to set these labels, and our newer API version resources kept getting reset to the older content. We had no clue why. Thanks to the Cluster API folks who helped us debug it; it is a simple change, but it is a very important thing to remember.

Finally, autoscaling more or less comes for free with Cluster API. All you have to do is download the cluster autoscaler, run it with Cluster API as the cloud provider, and set some annotations on the MachineDeployment object. There are more references here, and with that your clusters are autoscalable. With that, I'll hand it over to Arun. Thank you.

Thanks, Sahiti. So yes, let's move on from here. As you can see, there are lots of gotchas. The Cluster API documentation is one place to look, but as Sahiti mentioned, things like the labels are tough to find in it. The documentation is like a 350-page book if you print it out, the labels are only mentioned in particular places, and these issues are tough to debug as well. But the Cluster API community is very supportive, and we made use of their help quite a bit.

Now let's revisit how multi-tenancy looks in general on a cloud provider and how VMware Cloud Director fits into that model. In VMware Cloud Director we follow a principle similar to a virtual private cloud, though other mechanisms are also possible. You can have a set of organizations, org1, org2, and so on. Suppose there are two organizations: the cloud is partitioned into these organizations in terms of compute, network, storage, and so on. Each org is like a set of virtual data centers carved out of the cloud, with its own set of resources. The IDP is also carved out, in the sense that org1 users do not even know of the existence of org2, and org1 administrators likewise do not know the identities in org2. There is an über cloud provider at the top who can potentially see everything, but the IDP is unique to each org. So this is a very rich multi-tenant system.

We wanted to figure out what our goals were for VMware Cloud Director, what Cluster API brings to the table, and how to marry the two. Our goals are that tenant users, organization users, should be able to create clusters through Cluster API in a self-service manner; self-service is very important for us, and we will talk about why in a little bit. The second goal is to bring the features of our cloud platform, VMware Cloud Director, into the Kubernetes clusters. We have very strong user isolation, quota systems, and roles and rights, whereas Kubernetes has its own RBAC system, and the quota system of Kubernetes leaves a lot to be desired.
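For reference, here is roughly what both of those look like in practice. The contract label maps a core CAPI contract version to the API versions your provider ships, so its exact value depends on your own versions; the provider label shown alongside it, and the min/max sizes on the MachineDeployment, are likewise just illustrative values.

```yaml
# On each infrastructure CRD: tell core CAPI which of your provider's API
# versions implement which Cluster API contract. (CRD spec omitted for brevity.)
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: vcdclusters.infrastructure.cluster.x-k8s.io
  labels:
    cluster.x-k8s.io/v1beta1: v1beta1_v1beta2     # contract version -> your provider versions
    cluster.x-k8s.io/provider: infrastructure-vcd
---
# On the MachineDeployment: opt a node group into the cluster autoscaler
# (run with --cloud-provider=clusterapi) via annotations.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: demo-md-0
  annotations:
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "1"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "5"
```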
I mean, it has one authentication mechanism and a separate authorization mechanism, and the two barely talk to each other, so we were left wanting there. We wanted to be able to actually represent our user inside the Kubernetes cluster. The third goal is along the same lines of representation: we want to ensure that our tenant users can be represented using their own IDPs in the Kubernetes clusters. They must be able to authenticate, they must be able to create a cluster, and within the cluster they must be able to show their identity. The fourth goal is that we wanted to administer cloud policies on the user from the cloud side, but have them flow into Kubernetes operations as well. If a user has access to create only, say, 10 VMs, we want to ensure that the Kubernetes nodes, when they are autoscaling, do not scale beyond 10; they should stop at 10. That should be a policy enforced by the cloud and automatically honored by the clusters.

So then we came up with a set of questions. First of all, how do we satisfy the network requirement? One hidden aspect of Cluster API, or perhaps it is evident by now, is that the management and workload clusters need to talk to each other all the time; there needs to be network connectivity. Yet, as you saw, the orgs are disjoint with respect to networking. So how do you have network connectivity between the management cluster and the workload clusters? Maybe you cannot start off with one über management cluster. The second question is who creates the management cluster and manages it. There must be somebody who handles the lifecycle of the management cluster and keeps it secure, especially if the management cluster handles, say, a thousand workload clusters. There is also a skew of versions between the management and workload clusters: at some point, if you want to upgrade the workload clusters, you will also have to upgrade the management cluster. So there is a fair amount of management work, and the administrator of this cluster needs to be Kubernetes savvy. That is the cluster management side.

On the user management side: how do users create workload clusters in a self-service manner? The user needs to know that there is a management cluster, go and ask for some access to it, and then use it, so it is not very self-service; basically there has to be somebody on the management cluster side helping them out. How do we enforce the tenant boundaries on the user side, and how do we represent the user? Those are the next two questions. And finally, how do we audit the user's actions on the cloud side? The user may go and perform some Kubernetes actions, but those have to be audited against their own user identity.

To solve the user aspect, we made the user a first-class citizen inside Kubernetes. The user asks the cloud for a token, gets a refresh token, and embeds it as a secret in their cluster. What I am talking about from now on is a set of patterns in multi-tenant clusters which exist in other providers and other systems, but which are not really documented, and there is no clear acceptance of them as patterns. Basically we are describing what we did, and these seem to be the common patterns in use nowadays. One of them is representing the user using their secrets.
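As a rough sketch of that first pattern, the refresh token ends up living in an ordinary Kubernetes Secret that the infrastructure cluster object points at. The secret name, key, namespace, and the reference field below are purely illustrative and not the exact CAPVCD schema.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: alice-vcd-refresh-token          # hypothetical name
  namespace: alice-ns                    # the tenant user's namespace
type: Opaque
stringData:
  refreshToken: "<refresh token obtained from the cloud>"   # revocable credential, not the password
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VCDCluster
metadata:
  name: demo
  namespace: alice-ns
spec:
  userCredentialsContext:                # illustrative field name, not the exact schema
    secretRef:
      name: alice-vcd-refresh-token
      namespace: alice-ns
```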
In AWS, for example, it is done by mounting secrets, and so on. That way you can actually enforce the policies of Cloud Director on that particular user: the user just embeds their secret into the cluster they create, and the token in that secret is what gets used. We had to build a refresh token mechanism for this, because we did not want to expose the user credentials directly; a refresh token is something that can be revoked.

The next pattern is the network boundary. We have seen that there can be multiple organizations; as a result, there needs to be one management cluster per organization, with the workload clusters connected to that same org. This wall is essentially a network boundary, and since management and workload clusters need to talk to each other, they have to be in the same networking space, which means the same tenant. The user would then use their access token to operate on it. However, this also needs namespace-level multi-tenancy within Kubernetes itself, and the reason is on the next slide.

As Sahiti mentioned, the tenant can apply Kubernetes YAML to the management cluster and get a workload cluster. However, they should not be able to view the other tenants' clusters; their access should not extend to those. But how do you manage that? Because the tenant does need some access. And the third requirement is that the management cluster should not have long-lived access to the workload clusters' credentials. We solved this using namespaces: each tenant user gets their own namespace, they can access only objects within that namespace, and they can create workload clusters only within that namespace. The other part is that they need refresh tokens with short expiry times; I will come back to that when we get to self-service Kubernetes. This namespace gives exactly this tenant user access to Cluster objects, VCDCluster objects, and so on, a subset of the objects; a minimal sketch follows below.

Now let us quickly look at what the workflow looks like. Suppose there is a management cluster admin and a tenant user, Alice, and this user wants to go and create a workload cluster. First of all, she has to discover that there is a management cluster; there is no discovery mechanism, so she has to learn of it by word of mouth. Once the tenant gets to know of the cluster, she figures out who the owner of the cluster is, the management cluster admin, and asks for management cluster access. This is a human operation. The management cluster admin now creates a namespace for this user and gives them a kubeconfig, which is the way to access this management cluster. As you can see, at this point the management cluster admin's work is done and they can go out of the picture. The management cluster is available, the namespace is available, the user is there, and the kubeconfig is there. Now the user wants to create a cluster, so they create a cluster with the VCD token embedded as a secret; VCD is our cloud provider, and this is the refresh token.
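Here is a minimal sketch of that per-tenant namespace setup. The namespace name, the exact resource list, and the verbs are placeholders for whatever subset of objects your platform decides to expose to the tenant user.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: alice-ns                         # one namespace per tenant user
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: capi-tenant-user
  namespace: alice-ns
rules:
- apiGroups: ["cluster.x-k8s.io"]
  resources: ["clusters", "machinedeployments"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["infrastructure.cluster.x-k8s.io"]
  resources: ["vcdclusters", "vcdmachinetemplates"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["controlplane.cluster.x-k8s.io"]
  resources: ["kubeadmcontrolplanes"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["bootstrap.cluster.x-k8s.io"]
  resources: ["kubeadmconfigtemplates"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["secrets"]                 # the tenant's refresh-token secret and the generated kubeconfigs
  verbs: ["get", "list", "create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: alice-capi-tenant-user
  namespace: alice-ns
subjects:
- kind: User
  name: alice                            # identity handed out with the namespaced kubeconfig
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: capi-tenant-user
  apiGroup: rbac.authorization.k8s.io
```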
They get a workload cluster created from that, and they can monitor the creation and watch it come up. Again, seeing it getting created is a bit tricky, because they do not have access to the CAPI logs; we had to make the logs multi-tenant in a particular way so they can see whether there is an issue with the creation and act on it. They should not be able to see any other user's logs, so the logs also need to be multi-tenant at that point. Once the cluster is created, they want to get the kubeconfig of that cluster, the admin kubeconfig, which is the means by which they own the cluster. So they issue a get-kubeconfig; it is a small script they can run which essentially reads the secrets from the management cluster and hands back the kubeconfig of the workload cluster. They are done at that point. Later, when they want to update or upgrade this cluster, they again need to come to the management cluster and make those changes there, and their operations are satisfied.

So that is the whole set of operations, and as you can see, there are many steps in this lifecycle and it is not very self-service: you actually have to go talk to a person and do many things. There are other concerns with the management cluster as well, which we need to talk about in a distributed system: the blast radius and the security aspects. You have a management cluster handling multiple workload clusters, and you have multiple workload clusters where the users have stored their secrets, their tokens. What happens if there is a network partition on the management cluster side? The workload clusters cannot be managed at that point; you cannot upgrade or update them, so they are stuck waiting for the management cluster, or the network, to come back. Or what happens if the management cluster is compromised in some way, or becomes evil? Now all of the tokens are compromised. Even if they are short-lived tokens, the tenants cannot use the management cluster anymore to upgrade their clusters. They are in a sort of limbo state: we can always revoke the tokens, that is fine, but then you cannot do anything to your cluster. You can just let the current workloads keep running; even autoscaling will not happen at that point.

In order to solve this, we began to use something called a self-managing cluster. This is an old concept which has existed from the beginning of Cluster API; however, it is not commonly used as a pattern nowadays, and it has helped us quite a bit. You have the same system: you have the management cluster, you have the CAPVCD binaries, and you have the workload cluster's records, the CRDs that Sahiti was mentioning, the infrastructure cluster objects and so on. The user can apply resize and upgrade commands on this management cluster, and things operate on the workload cluster. What we do is install CAPI and CAPVCD on the workload cluster itself and run the clusterctl move command. The move can be scoped to a namespace, so you can move exactly the records of this cluster into the workload cluster.
All of the labeled objects I mentioned would move. Once they are moved, as you can see, the link between the management cluster and the workload cluster is gone, and you can apply all of your commands on the workload cluster itself. This sort of cluster is called a self-managing cluster. Once it becomes self-managing, you can autoscale it just by using its own records: you can do a kubectl apply, change the number of worker nodes, for example from three to five, apply the YAML on the workload cluster itself, and it will scale itself up. Likewise, it can scale down, and you can also do an upgrade operation, letting it upgrade itself from one version to another. It is completely self-managing. The only caveat is deletion: for deletion we still need the help of another cluster, and we are aware of that limitation.

We began to use this for self-service Kubernetes clusters. We want to have one SaaS layer through which you can create a cluster in a networking space you cannot access. There is a networking space somewhere, say 192.168.x.x, in tenant Pepsi or tenant Coke. There is a Kubernetes-as-a-service layer, and a Pepsi user cannot access the network of tenant Pepsi directly; he is on his laptop. However, he or she can access the API surface, and this user wants to be able to run a single command that creates a cluster in that space. The way we do that is by using the bootstrap cluster mechanism. Using the cloud provider's API, we create a VM in that tenant and bring up a small bootstrap cluster inside it. Then we create a workload cluster from that bootstrap cluster, move the objects over, install CAPI and CAPVCD on it, and make it self-managing. Finally, we destroy the bootstrap cluster. At this point the Pepsi user can go and manage their cluster on their own. There is no intermediate layer, there is no management cluster, there is no overhead of another person coming in and administering a cluster, and there is no scaling of requests to worry about. What happens if one management cluster is handling 10,000 workload clusters, and how do you ensure it scales and keeps working? You don't have any of those questions; each user accesses their own Kubernetes cluster, and the Kubernetes-as-a-service layer at this point is just a very thin wrapper. We ultimately built that, we have finished the implementation, and we are going to release it as CSE 4.0, so other users can also come and create their own self-managing clusters. It is extremely distributed, embarrassingly parallel even: there is no central authority trying to manage multiple clusters, each cluster is managing itself, and all of the requests flow directly to it.

Some shout-outs: thanks a lot to the CAPI community for the help, and to Giant Swarm and some of our other partners who brought us requests and pushed us to make CAPVCD better and implement features better. There are some references here and a QR code, and if there are any questions, we can take them.
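As an illustration of that scale-out step on a self-managing cluster, resizing is just a matter of patching the MachineDeployment that now lives in the workload cluster itself; the names, namespace, and version below are placeholders.

```yaml
# kubectl apply this against the (now self-managing) workload cluster
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: demo-md-0
  namespace: demo-ns
spec:
  clusterName: demo
  replicas: 5              # was 3; the machine controllers running in this same
                           # cluster reconcile the difference and add two workers
  template:
    spec:
      clusterName: demo
      version: v1.22.9     # bumping this is how an upgrade would roll out
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: demo-md-0
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: VCDMachineTemplate
        name: demo-md-0
```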
So these are the common patterns. The reason we have called them out is that when you go to the community and ask, there is no fixed answer saying, hey, this is the way to do it; some people have done it this way, some people have done it that way. What this presentation captures are the patterns that are very commonly used. Any questions or comments? Otherwise we are at time, or a bit beyond time.