Welcome to the Cluster API provider talk. I'm Ashutosh, and I work as an engineer at VMware on the cluster lifecycle team. I would like to invite Ankita and Richard to introduce themselves, and then we'll go forward with the talk. Hey, folks. I'm Ankita. I have been working at VMware for the past two years, and I have been an active contributor to Cluster API and its providers. I am also a maintainer of Cluster API Provider AWS. Richard, do you want to introduce yourself? Hi, everyone. My name is Richard. I work as an engineer at SUSE, specifically on Rancher and cluster provisioning. I'm also one of the maintainers of the Cluster API providers for AWS, GCP, MicroVM, and RKE2. OK, so this is what we're covering in the agenda. We are going to see a quick intro to Cluster API, we'll see how Cluster API providers work, and then we'll get updates on the Cluster API providers: what's been going on recently and what's on the roadmap. But before we go further, can I see a raise of hands for how many people are using Cluster API so far? OK, that's great. And how many of you have only recently learned about Cluster API, any new audience here? That's cool. OK, so with this, I'll hand over to Ankita to give us an intro to Cluster API. OK, so let's get started then. The project was originally started from the motivation that cluster lifecycle management is difficult. Historically, there have been many provisioning tools, depending on where you want to create your cluster, and there hasn't been much consistency in the user experience. So Cluster API is a solution for managing and automating the lifecycle of your actual Kubernetes clusters using Kubernetes-style declarative APIs. Cluster API is like a virtual Kubernetes-in-a-box tool that provides all the tools and components you need to assemble your own Kubernetes cluster. It's like a DIY project for Kubernetes enthusiasts, with endless possibilities for cluster creativity. Cluster API, with the help of its providers, will create all the necessary supporting infrastructure that you may need, so things like virtual machines, load balancers, and network configurations, and it will also handle the bootstrapping and configuration of the Kubernetes cluster on that infrastructure. As mentioned before, this is done using Kubernetes-style APIs in a declarative way, just like we are used to doing when managing workloads within a cluster. Extensibility is core to Cluster API: it should be relatively easy to add support for a new infrastructure environment, Kubernetes distro, or bootstrap mechanism. Core to this extensibility is the concept of providers, which are interchangeable to meet your specific needs. Just like a master chef follows a recipe book to create a delicious dish, Cluster API provides a recipe book of templates that define the ingredients and steps to create your Kubernetes cluster on any infrastructure. You can follow the recipe, customize the ingredients, and cook up your own unique Kubernetes cluster that suits your own taste. We'll hear more on providers shortly. With the provisioning side of Cluster API maturing, the project has started to build higher-order functions on top of the core and is moving towards day-2 operations. Cluster API has supported cluster templates for many years now, and recently it has added a new feature known as ClusterClass, which is a more powerful and flexible way to cookie-cut your clusters while still allowing all the customizations.
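As a hedged illustration of that cookie-cutter idea, a Cluster stamped out from a ClusterClass via its topology looks roughly like the sketch below; the class name, worker class, and variable are purely illustrative, not something shown in the talk.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster
spec:
  topology:
    class: my-cluster-class       # the ClusterClass acting as the "recipe"
    version: v1.27.3
    controlPlane:
      replicas: 3
    workers:
      machineDeployments:
        - class: default-worker   # a worker "flavor" defined in the ClusterClass
          name: md-0
          replicas: 3
    variables:                    # per-cluster customizations exposed by the class
      - name: region
        value: eu-west-1
```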
ClusterClass alone then allows you to reason about a whole set of clusters that belong to a class. The process of provisioning a cluster is rarely self-contained, and in many instances there is a wider provisioning landscape where action needs to be taken at various stages of the cluster lifecycle. So a new experimental feature has been introduced recently, known as runtime hooks, which allows you to plug extensions into defined hooks at various stages of the cluster lifecycle. We have weekly CAPI community calls every Wednesday, and separate calls for the various providers as well; you can find the details in the Kubernetes community calendar. If you want to dive deeper into very specific areas of Cluster API, then I would recommend watching the "Let's talk about..." video series created by Fabrizio, one of the maintainers of Cluster API; it's all available on YouTube. I would like to hand over to Richard now to talk more about providers. Thank you. So what is a provider in Cluster API? First and foremost it is a Kubernetes operator, which is independent from core CAPI. It implements infrastructure- or bootstrapping-specific functionality that is used together with core CAPI to manage the lifecycle of a Kubernetes cluster. The provider that you create will adhere to a contract that is implemented via its CRDs. The extensibility of CAPI that Ankita just mentioned comes from these last two points. Traditionally, there were three provider types in CAPI. The most widely available one is the infrastructure provider. As the name suggests, this type of provider is used to provision any base infrastructure that is required for creating the actual cluster in a specific target environment. The infrastructure provider doesn't provision Kubernetes itself; it relies on the help of a bootstrap provider. A bootstrap provider is used to actually create the Kubernetes cluster on top of the infrastructure, and there are generally two parts to bootstrapping. The first part is: what commands do I actually have to run to create the cluster? So think of kubeadm init or kubeadm join. The second part is: how do I actually run those commands on the infrastructure when a machine first boots? This is generally something like cloud-init or Ignition. So it's those two things together. And then we have the control plane provider, which is used to represent and manage the lifecycle of the actual Kubernetes control plane. This in itself can then use bootstrap and infrastructure providers to accomplish that goal. More recently, CAPI has introduced two new provider types. When you provision a cluster, you often need to install certain workloads, perhaps a CNI, into that newly provisioned cluster. Previously, the CAPI way to do this was using something called ClusterResourceSets. However, there are some limitations to ClusterResourceSets, and so a new, more extensible way has been introduced: the add-on provider. There is currently only one add-on provider, and that is for Helm, but more will be coming. Next, there are situations where you need to manage the IP addresses for a cluster that has been created. For example, you may need a VIP address for a load balancer, or you may need to manage the IP addresses for bare-metal machines. For these types of scenarios, a new provider type was created specifically for IPAM. There is a reference implementation that has been created by Deutsche Telekom that uses a pool-based model, and you can see an example on the screen of how you would declare such a pool.
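The on-screen example isn't captured in the transcript, so here is a rough, hedged sketch of what such a pool declaration can look like, assuming the in-cluster reference provider's InClusterIPPool kind; the pool name, address range, and exact field names are illustrative and may differ between provider versions.

```yaml
apiVersion: ipam.cluster.x-k8s.io/v1alpha2
kind: InClusterIPPool
metadata:
  name: example-pool
spec:
  addresses:                      # addresses handed out to machines that claim them
    - 10.0.10.100-10.0.10.200
  prefix: 24
  gateway: 10.0.10.1
```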
So moving on to the CRDs and the specs. CAPI has a number of CRDs that logically represent the parts of a Kubernetes cluster that you're going to create; these are all highlighted in orange. The Cluster logically represents a cluster as a whole, and it contains general configuration information such as the pod and service CIDR blocks. But it's really important to remember that the Cluster is the root and the owner of all the other resource kinds, and collectively they form a tree of ownership, as depicted in this diagram. Most of the core CAPI resource kinds will also reference resource kinds from a provider, and as you can see from this example on the screen, the infrastructureRef and the controlPlaneRef are referencing resource kinds from a provider. This is a common pattern: you'll see pairs of these resources. Moving on to MachineDeployments. A MachineDeployment represents a set of worker machines with a specific number of replicas. Using a MachineDeployment will result in individual machines being created from a template, so it's a bit like a cookie cutter in this respect. Like a Deployment with Pods, the MachineDeployment will manage the lifecycle of the machines and will orchestrate things like upgrades. Then we have the MachinePool. Very similar: it's used to manage a set of worker machines, but generally the pool is backed by a specific infrastructure service, so think of something like an Auto Scaling group in AWS or Virtual Machine Scale Sets in Azure. In turn, these services may also enable some form of auto scaling as well. This feature is still experimental in CAPI, so you have to explicitly enable it to opt in and use it. Now we can move on to the resource kinds for an infrastructure provider, all of which are in pink in the diagram. The infra cluster in this diagram represents the base infrastructure that is required for the cluster in a target environment. This normally relates to things like setting up the network or any security groups: set up the base before I do anything else. It doesn't contain anything related to machines, as these are covered by other custom resource kinds. For a given machine deployment, control plane, or even a machine pool, we may need to add a template that provides the infrastructure-specific definition for a machine, and this is where you would specify configuration like instance types or the images to use. This is then used to create instances of infra machines; again, the template is used as a cookie cutter to stamp out new cookies. So in the example you can see on the screen, the CAPI machinery itself will generate instances of GCPMachine from this template. A bootstrap provider will define resources that expose the configuration options for bootstrapping Kubernetes itself on a machine. The template will be used to create specific instances of the bootstrap configuration for control plane and worker machines, in most instances; there are some caveats to that. The bootstrap configuration, when reconciled, will result in a command or set of commands that will be executed on a machine, and these will either create a new Kubernetes cluster or join the machine to an existing one.
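To make that pairing pattern concrete, here is a minimal, hedged sketch of a MachineDeployment that wires a bootstrap config template and an infrastructure machine template together; the object names, the cluster name, and the choice of the kubeadm and GCP kinds are purely illustrative.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: md-0
spec:
  clusterName: my-cluster
  replicas: 3
  selector:
    matchLabels: {}               # CAPI fills in the matching labels
  template:
    spec:
      clusterName: my-cluster
      version: v1.27.3
      bootstrap:
        configRef:                # pairs with a bootstrap provider kind
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: md-0-bootstrap
      infrastructureRef:          # pairs with an infrastructure provider kind
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: GCPMachineTemplate
        name: md-0-machines
```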
A control plane provider will define a control plane kind, the same kind of thing we've just seen, and it's responsible for creating and managing the lifecycle of a Kubernetes control plane. It in turn can use bootstrap and infrastructure providers to accomplish that goal. You will often see that control plane and bootstrap providers come in a pair, and this is especially true of kubeadm. Now we can move on to updates from specific providers; over to you, Ashutosh. Thanks, Richard. Richard has covered very well how the mechanics of CRs and CRDs work in providers. I want to start by talking about Cluster API Provider Azure, or CAPZ. Just to repeat, it is again a Kubernetes operator, or put simply a controller manager, that reconciles various custom resources to provision resources on Azure and finally get you a Kubernetes cluster. It also includes a webhook server that validates your CR inputs and puts sane defaults on them, for example the replica count on a machine deployment, things like that. We saw more exhaustive diagrams in the previous slides, so we wanted to keep this one closer to the infrastructure side. As you can see on the screen, we have AzureClusterIdentity, AzureCluster, AzureMachineTemplate, AzureMachine, and AzureMachinePool. There are a couple more, but I'll keep to these five, as they are the most important for understanding how it works. AzureClusterIdentity is the CR that has all the information the CAPZ controller needs to authenticate and talk to the Azure APIs. AzureCluster contains data and information mostly around cluster-wide things, for example networking and security groups. And AzureMachineTemplate gives you the AzureMachine definition that is finally translated into a physical VM on Azure infrastructure. This is how an AzureClusterIdentity YAML looks; you're going to apply this when you want to create a cluster. We have some Azure nuances like client ID and tenant ID, which you need to fill in to be able to authenticate, and the type here is service principal. There are various ways to authenticate, for example manual service principal or user-assigned identity. Going forward, this is how an AzureCluster resource looks. As you can see, it has all the networking details, like what the control plane subnet and node subnet are, and so on. One important thing to notice is that there is an identityRef that refers to the AzureClusterIdentity. So for each workload cluster that you create, you can have a separate identity, and this is a way you can get some level of multi-tenancy: for different workload clusters, you can have different secrets. And this is the machine template. You can see we have details around the data disks, the OS disk, et cetera. This is the template that is used to finally create an AzureMachine CR, and once the AzureMachine CR lands in etcd, the reconciler is going to reconcile it to finally create a physical VM on Azure. I've covered this while walking through all the YAMLs, so just to mention one more point: the machine deployment and the kubeadm control plane also have references to these machine templates, which Richard has already covered.
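Putting the three YAMLs just described side by side, here is a hedged sketch of roughly what they look like; all names, IDs, and sizes are placeholders, and exact fields can vary between CAPZ versions.

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureClusterIdentity
metadata:
  name: cluster-identity
spec:
  type: ServicePrincipal
  tenantID: "<tenant-id>"
  clientID: "<client-id>"
  clientSecret:                   # Secret holding the service principal's secret
    name: cluster-identity-secret
    namespace: default
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureCluster
metadata:
  name: my-cluster
spec:
  location: westeurope
  resourceGroup: my-cluster-rg
  subscriptionID: "<subscription-id>"
  identityRef:                    # per-workload-cluster identity for multi-tenancy
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AzureClusterIdentity
    name: cluster-identity
  networkSpec:
    vnet:
      name: my-cluster-vnet
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureMachineTemplate
metadata:
  name: my-cluster-md-0
spec:
  template:
    spec:
      vmSize: Standard_D2s_v3
      osDisk:
        osType: Linux
        diskSizeGB: 128
      dataDisks:
        - nameSuffix: etcddisk
          diskSizeGB: 256
          lun: 0
      sshPublicKey: ""            # placeholder; supply your own public key
```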
Let's go a little deeper. If you look at all these small boxes: when you hear that the AzureCluster object is being reconciled by the CAPZ operator, these are essentially the small modules in CAPZ that get triggered so that the resources finally get created on the Azure infrastructure. For example, virtual network and subnets: a virtual network is created, and then subnets are created for the control plane and worker machines, which the nodes will be using. Similarly security groups: you can use security groups to filter network traffic between Azure resources in a virtual network. The NAT gateway is again one such resource; it is one of the recommended ways to set up outbound traffic from your cluster. And there are many more that I won't be able to cover here, but you can take a look at them in the documentation. So this is essentially what AzureCluster reconciliation means. Similarly for AzureMachines, we have public IPs, inbound NAT rules, and network interfaces. Whenever a reconciliation for an AzureMachine happens, these are the things that also happen in the background via the CAPZ operator. For example, if you need a public IP for your nodes, you can configure that. Similarly, inbound NAT rules define rules for inbound traffic to the nodes, and network interfaces are the basic VM networking infrastructure here. And the virtual machine box that you see is the actual VM, which finally gets created on the Azure infrastructure. One thing I just wanted to note here is that the creation of VMs happens via image-builder images. So you already have the VM image defined, and when the VM boots up in the infrastructure, you have the bootstrap provider set up, via cloud-init here, let's say, and all those commands, from kubeadm onwards, will execute there to finally get you a Kubernetes node. A couple of announcements that I wanted to make, especially around Cluster API Provider Azure. There has been a lot of traction around managed Kubernetes, and CAPI is doing very well on this path. CAPZ graduated managed Kubernetes from experimental in 1.8.0, so a shout-out to all the contributors; I can see Mike here, so thank you. Also, CAPZ has started to use the out-of-tree cloud provider for all of its upstream tests, because the in-tree cloud provider has been deprecated now. You can also use Flatcar Container Linux for your workload clusters if you want to, and support for VMSS Flexible orchestration has also made it in. Very recently, the proposal for Azure Service Operator got merged. What this means is that Azure Service Operator tries to decouple identity and management from CAPZ. So, for example, if you want to create a virtual machine, you just create a custom resource, and Azure Service Operator takes the responsibility for how it will authenticate to the Azure APIs and then finally create the VM. As far as the CAPZ operator is concerned, creating a VM, for example, will just mean creating a custom resource. Also, going forward, CAPZ is going to support workload identity, and it is expected to be delivered in an upcoming release. Workload identity is a keyless way of authenticating, based on the OIDC protocol. Essentially, in this setup, your management cluster, the Kubernetes cluster itself, becomes the identity provider, and you have to add a federated credential on Azure. Whenever the CAPZ pod comes up, there is a webhook, the workload identity webhook, that injects a token into your pod, and the authentication will work. To learn more about how workload identity works, I would recommend looking at the documentation. This is important because the previous way of authenticating via AAD Pod Identity has been deprecated and is going to be out of support very soon.
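As a very rough sketch of where that is headed (the feature was still being delivered at the time of this talk, so the final API may well differ), the expectation is that workload identity becomes another AzureClusterIdentity type alongside the service principal, with no secret involved.

```yaml
# Hedged sketch only: assumes workload identity lands as a new AzureClusterIdentity
# type; field names may differ in the released version.
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureClusterIdentity
metadata:
  name: workload-identity
spec:
  type: WorkloadIdentity          # keyless authentication via OIDC federation
  tenantID: "<tenant-id>"
  clientID: "<client-id of the federated user-assigned identity>"
```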
I would like to hand over to Ankita to provide the updates on AWS. Thank you. So, as Ashutosh explained, the Cluster API provider for AWS is also a Kubernetes operator, an extension that simplifies the process of managing the lifecycle of Kubernetes clusters on AWS infrastructure, and it uses the Cluster API resources as building blocks, as a base, to create the Kubernetes cluster. It abstracts away the complexity of AWS resources such as EC2 instances, VPCs, and load balancers into its own custom resource types, allowing users to define and manage their clusters using Kubernetes-style declarative APIs. I will discuss a few of the resources which are core to the Cluster API provider for AWS, which I'll refer to as CAPA from here on. AWSCluster is one such resource, and it defines the desired configuration of many of the AWS components. If we take the VPC as an example, it takes care of managing the lifecycle of the VPC, CIDR blocks, subnets, route tables, NAT gateways, and other network configurations, and reconciles all these network components to create the network infrastructure that is required for the cluster to function. It also takes care of managing the lifecycle of the security groups that define the inbound and outbound traffic rules for the resources in the VPC. It also manages load balancers to help distribute the traffic among the various worker nodes, and it reconciles the bastion node, which is created in the public subnet and provides the first entry point to the cluster through SSH access. We also make use of the S3 service for Ignition support: the AWSCluster resource reconciles the S3 resources, which are used to store the sensitive user data in encrypted form. The other important resource type in CAPA is AWSMachine. The AWSMachine resource creates and manages EC2 instances based on the desired state specified in the CRDs. This includes configuring the EC2 instance type, AMI ID, SSH key pair, IAM instance profiles, security groups, and tags. It is also responsible for registering the machines with the API server load balancer. We can also make use of spot instances to reduce the cost of compute resources by utilizing AWS spare capacity at a lower price. Many new features have been introduced in CAPA in the previous releases. I won't go into detail on each of them, but I would like to dive deeper into external resource garbage collection, which was also introduced recently; you can go and check out the other features in the Cluster API AWS book. External resource garbage collection: based on their use cases, users can choose to deploy network load balancers in their AWS clusters. Since a network load balancer is user managed, if we try to delete a workload cluster, the cluster deletion would fail, because the network load balancer still exists and is bound to the VPC that was created by CAPA. To overcome this, external resource garbage collection was introduced in CAPA, which makes sure that the resources created via any external cloud controller manager are cleaned up when the cluster is marked for deletion. The AWSCluster resource is marked for deletion via an annotation, such that the AWSCluster controller watches over those resources created by the external cloud controller manager and then deletes them in a specific order, so that we don't face any dependency issues.
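Before moving on, here is a hedged sketch of what the two CAPA resources just described can look like; the region, AMI ID, key name, and tags are placeholders, and exact fields can vary between CAPA versions.

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSCluster
metadata:
  name: my-cluster
spec:
  region: eu-west-1
  sshKeyName: default
  network:
    vpc:
      cidrBlock: 10.0.0.0/16      # CAPA derives subnets, route tables, NAT gateways
  bastion:
    enabled: true                 # bastion host in a public subnet for SSH access
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSMachineTemplate
metadata:
  name: my-cluster-md-0
spec:
  template:
    spec:
      instanceType: t3.large
      ami:
        id: ami-0123456789abcdef0   # placeholder image built with image-builder
      iamInstanceProfile: nodes.cluster-api-provider-aws.sigs.k8s.io
      sshKeyName: default
      additionalTags:
        team: platform
      spotMarketOptions: {}         # optional: use spot capacity at a lower price
```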
This is purely an experimental feature as of now, and it can be enabled via clusterawsadm, the generic tool used in CAPA for creating and managing the CloudFormation stack. From here, I will hand over to Richard to talk about the Cluster API provider for GCP. Thank you. Thanks, Ankita. So, moving on to the provider for GCP. Compared to the last two providers, this provider has historically been really unloved and languished for a very, very long time. I don't know if Carlos is in here, but without the help of Carlos, the provider probably would have died. But I'm happy to say, on a more positive note, that there is a lot more interest in the provider in 2023, and people are adding new features, so it is actually improving and getting to a decent state. By far the biggest new feature added this year is support for GKE. Adding that support means that within CAPI, the three main managed Kubernetes services are now covered for provisioning. As part of the GKE implementation, we also had to add support for node groups, which we have done via the machine pool construct. Like a lot of these new features, it is experimental, so you are going to have to explicitly enable it when you create your management cluster. And as it's experimental, we need help in adding features, so if you're interested in helping out, this is a great area to contribute and to start your contribution to CAPI. So, if you wanted to create a GKE cluster with CAPI, the first thing you need to do is create a GCPManagedCluster, and this is where you specify the general operating environment and GCP configuration, so things like the project to use or the network to use. It doesn't actually have anything in there for GKE at this moment in time. The GKE service is then exposed as a control plane provider via the GCPManagedControlPlane kind. This is where you specify all of the GKE-specific configuration; for example, here you would enable Autopilot or select the release channel. And finally, we need a way to add a node group to the GKE cluster. This is done using the GCPManagedMachinePool, and it allows you to specify things like the scaling configuration, for example. Just a word of warning: you will need to specify one of these when you initially create the cluster, otherwise it won't provision, just like GKE if you're using it through the console. We also have an LFX mentee called Fong, and he's specifically working on adding OpenTelemetry to the GKE work, with the idea of giving us greater visibility into the reconciliation process. And a quick detour on that topic: the LFX Mentorship Program is an excellent way for each of us to share our skills and experience so that we can help build the next generation of engineers, so I would encourage all CNCF projects to consider submitting projects to the Mentorship Program. So, back to CAPG. In addition to the GKE support, you can now specify credentials on a per-cluster basis. This basically means you no longer have to use the same credentials to provision all clusters. If you wanted to do that, you would first have to create a secret with the credentials, so basically the JSON key file. Then you reference that secret from your GCPCluster or your GCPManagedCluster using the credentials ref. Again, this is optional; if you don't supply it, the provider will use the default credentials that you supplied when you created your management cluster.
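Going back to the GKE flow, the three objects described fit together roughly as in the sketch below; this is hedged, the project, location, and pool names are illustrative, and field names and accepted values may differ between CAPG versions. The optional credentials reference just mentioned would be added to the GCPManagedCluster in the same way.

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: GCPManagedCluster
metadata:
  name: my-gke-cluster
spec:
  project: my-gcp-project
  region: europe-west2
  network:
    name: default
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: GCPManagedControlPlane
metadata:
  name: my-gke-control-plane
spec:
  project: my-gcp-project
  location: europe-west2
  clusterName: my-gke-cluster      # name of the GKE cluster in GCP
  releaseChannel: regular
  enableAutopilot: false
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: GCPManagedMachinePool
metadata:
  name: my-gke-pool-0
spec:
  nodePoolName: pool-0
  scaling:
    minCount: 1
    maxCount: 3
```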
And lastly from me, there are two other updates we'd like to share. The add-on provider for Helm, which as I mentioned is the only add-on provider at the moment, has been making great progress, and it enables you to install Helm charts into a newly provisioned cluster. The project is looking for more contributors, so if this is something of interest, please stop by the repo or the CAPI Slack and ask questions. And for those of you that like using RKE2, perhaps for FIPS compliance reasons, there is a new bootstrap and control plane provider for RKE2. It's currently at alpha level, and as it's at alpha level, we need the help of the community to drive it forwards. So that's it from me; back to Ashutosh. We need help. So with this we have reached the end of the talk, but the project needs help, so please come and join us. Help us in whatever ways you can: documentation, issues, PRs. Let's make Cluster API boring and make it the best tool for lifecycle management of clusters. You can bring whatever skills you have: writing skills, product skills, coding skills, whatever. If you're lost, just hit us up on Slack; there are a lot of folks here, or join the Slack channel and we should be able to help you get started. You can help us with documentation, or with writing solution reference architectures, or start with a good first issue; we also have help-wanted labels on issues to help new contributors get started. I've put up the Slack channels for CAPI and the other providers, but if you're interested in all of the providers, you can go to the core CAPI channel and navigate from there. These providers also have office hours; depending on the provider, the cadence is weekly or bi-weekly. For CAPZ it is weekly now, so you can join the office hours, ask questions, and provide feedback. That's all we had for today. If you have any questions, feel free to ask. I think we are over time, but I will be around, so we can catch up offline. So thank you very much. Thank you.