Over the last year, Cluster API has been adopted rapidly, and we have seen a lot of usage. And there are a plethora of Cluster API providers for you to choose from. But what if none of the existing providers suits your use case? In this session, we will learn about the different provider types, and we'll also evaluate whether we really need to write a new provider. If your answer is yes, we can walk you through how to build your own Cluster API provider. Over to my co-presenter.

Hi, everyone. My name is Richard. I'm a principal engineer at Weaveworks, and I'm currently one of the maintainers of the AWS and MicroVM Cluster API providers. This talk was supposed to be how to build your own provider the easy way; it's turning into how to do it the hard way. So we'll see what we can do.

So, just a recap on what a Cluster API provider is. Cluster API was originally designed with the premise that provisioning clusters, and managing the lifecycle of those clusters, is actually difficult. There have been many ways to provision clusters, depending on your target environment, and very little has been done to provide consistency from a user experience point of view. And this is really where Cluster API comes into it. If you've been building or provisioning clusters for a very, very long time, I'm sure you've played with things like kops and various other tools. Cluster API tries to make that experience, and the way that you provision clusters, consistent.

And it does this by having this concept of providers. Providers are essentially the parts that do the infrastructure- or operating-environment-specific operations, and they talk nicely with core Cluster API. You perform your operations against the core Cluster API types, and it handles that aspect for you. So in this session, we're going to walk you through some of the main topics in designing, developing, testing, and then releasing your provider. I'm doing this all from memory, without my slides.

As well as, I guess, the consistency from a user perspective, Cluster API also brings in higher-level functionality. Beyond the pure provisioning of infrastructure, and Kubernetes on top of it, it has higher-order functionality like automatically scaling, or automatically doing upgrades of Kubernetes versions, for example. The other area is automatically spreading machines across failure domains, because you don't want all the machines in the same failure domain: if that rack goes down, then your whole cluster goes down. So there's lots of functionality built into Cluster API like that.

So, we're still having slide issues. Core to Cluster API is this concept of a provider. And a provider can seem scary, but essentially a provider is just a Kubernetes operator. So if you've built a Kubernetes operator previously, then you should have no problem building a Cluster API provider. Is it working? Oh, the anticipation is amazing. Still nothing? So yeah, we've still got problems, so I'm back to entertain you. So where were we? Yes: the provider is basically a Kubernetes operator. And as such, there are going to be a number of custom resources for your provider, and those resources have to adhere to a contract. And that contract is dependent on the provider type that you are building.
We'll cover the provider types shortly, and hopefully we'll have the diagrams by then. So then, along with those custom resource definitions... fantastic, the slides are back. So I should have started with this, actually. A couple of quick questions — put your hands in the air. Who uses Cluster API already? Many people? Fantastic. Who actually contributes to CAPI or a provider? Brilliant. And who's thinking of building a provider? I'm actually quite surprised; that's really good. So this session is mainly focused on people in that last group, who want to actually build their own provider, but hopefully there will be something useful for everyone else. So let's skip forward to that.

A couple of things I didn't mention earlier: core to the lifecycle of Cluster API is the management cluster. The provider is based on an operator, it has to adhere to the contract, and that contract is dependent on the provider type. The contract is important because it allows that interaction between core CAPI and the providers, so that things can talk to each other.

So, I've mentioned provider types a couple of times. There are three provider types currently in Cluster API.

The first, and probably the most widely used, provider type is the infrastructure provider. As you can probably guess from the name, the infrastructure provider is mainly concerned with provisioning the infrastructure that is required for your cluster. It is not concerned with Kubernetes itself or with bootstrapping Kubernetes; it is purely building that environment for you. As an example, the Cluster API Provider for AWS, or CAPA, is an infrastructure provider. It provisions AWS resources such as VPCs, security groups, et cetera, and those are then used as the basis to create the Kubernetes cluster.

I mentioned that the infrastructure provider doesn't provision Kubernetes itself. That's actually handled by a bootstrap provider, which is our second type of provider. The bootstrap provider, as the name suggests, is used to bootstrap Kubernetes on top of the infrastructure that has been provisioned, and it will then create a cluster or join machines to an existing cluster. There are generally two parts to this bootstrap process. The first part is the actual commands: how do I create a Kubernetes cluster, and how do I join one? Think something like kubeadm here — that is the nuts and bolts of how you create a Kubernetes cluster. The second part is: how do I format those commands so that they can be run? Generally, this involves putting them into a specific format, something like cloud-init or Ignition, and making that available via a secret or S3. That secret can then be used by the infrastructure provider when provisioning machines, so that it can execute those commands as part of creating that machine, via user data or some other mechanism. We'll show a rough sketch of one of those secrets in a moment.

The third provider type is the control plane provider, and this essentially represents the control plane of your Kubernetes cluster. It can take advantage of bootstrap and infrastructure providers to do certain tasks, but it's purely focused on the control plane. If you have something like a managed Kubernetes service, this might actually be directly responsible for creating a managed control plane in something like EKS or AKS. But generally, it might also be responsible for provisioning the machines underneath it. The reason for this is that it means you can control the lifecycle of the control plane differently to the actual worker nodes.
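Here's that rough sketch. It's not from the slides — it's a hedged reconstruction of the kind of Secret a bootstrap provider publishes, with key names following the pattern used by the kubeadm bootstrap provider; check the CAPI bootstrap contract documentation for the authoritative details:

```yaml
# Hedged sketch: the bootstrap provider renders the bootstrap commands into a
# format such as cloud-init and publishes them as a Secret. The infrastructure
# provider finds this Secret via the bootstrap config's status.dataSecretName
# and passes its contents to the machine as user data.
apiVersion: v1
kind: Secret
metadata:
  name: my-machine-bootstrap   # hypothetical name
  namespace: default
type: cluster.x-k8s.io/secret
stringData:
  format: cloud-config         # or "ignition"
  value: |
    #cloud-config
    runcmd:
      - kubeadm join --config /run/kubeadm/join-config.yaml
```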
So, the first rule of creating a provider — and this is a bit tongue-in-cheek — is that you don't really need to create a provider. There are a lot of providers already, so the hope is that there is already a provider that does what you need. Creating a provider, or an operator generally, is a lot easier nowadays with things like Kubebuilder and controller-runtime, but it's still non-trivial, and there is a cost and an ongoing maintenance burden to you for building one.

So what actually constitutes a Cluster API provider? As I mentioned before, it's basically a Kubernetes operator — sometimes it's referred to as a controller manager. This basically means it has CRDs, and it has controllers that reconcile those CRDs. Additionally, there are going to be some Kubernetes resources to deploy your controller into the management cluster. This is plain old YAML: there will be a Deployment, a bunch of RBAC, maybe some secrets, whatever — just normal Kubernetes YAML. Within those, you have the option to tokenize various parts so that you can override them when the provider is installed. This might be useful for credentials or secrets that may be used to connect to AWS or something like that; you'll see a small example of this below. For this, kustomize and envsubst are used heavily. Lastly, there are some requirements on you as a provider implementer around how you structure your Git repo, your Git releases, and some files within your repo.

So I'm going to move on to the resource kinds. This is a diagram that we generally use, and the reason we're going to cover it is because it's really important when you implement a provider. CAPI has a number of different custom resources that are basically used to logically represent a Kubernetes cluster and its lifecycle. These are basically the gray boxes here: they represent the CAPI resource kinds. The Cluster at the top there represents the cluster as a whole, so think of that as the root. It has general configuration — things like pod CIDR blocks or service CIDR blocks — that is not specific to infrastructure or to how you actually provision or boot Kubernetes. It's really, really important to remember that the Cluster at the top here is the root and the owner of all of the other resource kinds, so you end up with essentially a tree of ownership. You will see more of this later on.

We then have a number of resource kinds that represent individual machines that are used as nodes for your Kubernetes cluster. We have Machine on the left there: that represents an individual machine and a node in a cluster, so it's a one-to-one mapping. Then we have something in the middle, which is called a MachineDeployment, and that represents a set of machines that share the same template; you specify a number of replicas for those (there's a sketch of one below). Then we have the MachinePool. This represents, again, a set of machines, but that pool of machines can scale up and down, and it's normally backed by an infrastructure-specific service — think of Auto Scaling groups in AWS or Virtual Machine Scale Sets in Azure.

So now we're going to move on to creating a specific provider type. If you're going to create a bootstrap provider, then you're going to need to create a custom resource that represents the bootstrap information for Kubernetes. It will need to contain all of the configuration: you need to represent within your custom resource how you will provision Kubernetes, whether that's creating a cluster or joining one.
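As a quick aside, here's what that tokenization might look like. This is a hedged, hypothetical example — the `${PODMAN_HOST_B64}` variable and the `capp-` naming are made up for our imaginary Podman provider — but the `${...}` token pattern is what gets substituted via envsubst when the provider is installed:

```yaml
# Hypothetical credentials Secret from a provider's deployment manifests.
# ${PODMAN_HOST_B64} is a made-up variable name; such tokens are replaced
# with user-supplied values at install time.
apiVersion: v1
kind: Secret
metadata:
  name: capp-manager-credentials
  namespace: capp-system
data:
  host: ${PODMAN_HOST_B64}
```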
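And here's the MachineDeployment sketch mentioned above — trimmed, with the infrastructure template kind belonging to our hypothetical Podman provider:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: my-cluster-md-0
spec:
  clusterName: my-cluster
  replicas: 3                        # the set of machines sharing this template
  template:
    spec:
      clusterName: my-cluster
      version: v1.26.1
      bootstrap:
        configRef:                   # bootstrap provider's template
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: my-cluster-md-0
      infrastructureRef:             # infrastructure provider's template
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: PodmanMachineTemplate  # hypothetical
        name: my-cluster-md-0
```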
Back to bootstrap providers: the canonical example here is kubeadm. CAPI comes with a kubeadm bootstrap provider, and this exposes the kubeadm configuration for you. You'll see in this diagram that I've highlighted the bootstrap providers in pink, and we've drawn them so they're encompassed by the CAPI resource kinds. This is on purpose: it's basically to show that the CAPI types own and reference the bootstrap provider's configuration. This is a common pattern. You can also see here that the bootstrap information is required by all of the machine variants — normal Machines, MachineDeployments, and also MachinePools.

Just to give an example here — I've noticed there's a bit of a typo towards the bottom there — this is a snippet of a Cluster, so the top level (there's a reconstruction of it below). In it, you can see it's referencing two different provider types: there's an infrastructure provider referenced by the infrastructureRef, and then there's a control plane provider referenced by the controlPlaneRef. Both of these will implement the relevant contract for their provider type. So it's really important: any time you see something called somethingRef, it's normally a reference to another provider, or similar.

Moving on to the infrastructure provider — this is the most common one. This is used to represent the infrastructure you will be creating in your target environment. So if you are creating clusters in AWS, this will be AWS-specific infrastructure configuration; in this diagram, these are the orange boxes. It normally relates to networking and security group stuff. It's not specifically about Kubernetes, and it's generally not about machines either — though it's not completely black and white; there are shades of gray within that. The infrastructure machine kinds will all contain configuration that is specific to creating whatever compute Kubernetes is going to sit on. This could be things like EC2 instances or GCP Compute instances. One other thing to note from this diagram is that it also gives an example of some of the naming conventions that most provider implementers follow. You'll see at the top here we have something called AWSCluster or PodmanCluster: the provider name is generally prefixed onto your custom resource types. So you probably have a good idea of how this is represented now, and that actually applies to the control plane provider as well. So now, over to Anusha.

Thank you. So, before getting into the nitty-gritty of writing a provider, let's do a small refresher on the basics. What is an operator? An operator is a way to create, manage, and configure complex Kubernetes applications. Suppose, say, you want to create a Kubernetes cluster. The steps involved are creating the infrastructure, bootstrapping Kubernetes (possibly using kubeadm), and then managing the versions. An operator codifies all of these steps for you. And the native way to do this in Kubernetes is via declarative APIs — CRDs, or custom resource definitions as we all know them. Using a CRD, you can specify the infrastructure of your choice, specify how you want to bootstrap Kubernetes, specify versions, et cetera. These CRDs are then monitored and reconciled by one or more controllers. That brings us to our next question: what's a controller? A controller is nothing but a control loop that watches the desired state of the cluster through the API server and continuously reconciles to move from the current state to the desired state. One last bit about the control loop.
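Before that, here is a hedged reconstruction of the Cluster snippet Richard referred to. The slide itself isn't reproduced here, but based on the fields he named, it would look roughly like this, with the infrastructure kind swapped for our hypothetical Podman provider:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  controlPlaneRef:                 # control plane provider
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: my-cluster-control-plane
  infrastructureRef:               # infrastructure provider
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: PodmanCluster            # hypothetical
    name: my-cluster
```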
So — the control loop. This is the core principle of Kubernetes. You watch for changes in the resources that you're interested in; if there's any diff, you take the necessary action to move from the current state to the desired state. And this works in an infinite loop. So basically, watch, diff, act, repeat is your mantra.

Richard spoke about three different types of providers, so let's look at which scenario calls for which kind of provider. If you are operating on a cloud or bare-metal service, then you will need an infrastructure provider. If you want a different way to bootstrap Kubernetes instead of kubeadm, then you would need a bootstrap provider, and maybe a complementary control plane provider as well. If you have a hosted Kubernetes control plane service — no surprises there — you'd need a control plane provider. If you want to use virtualization technologies like vSphere or KVM, then you also need an infrastructure provider. But make sure to check out the existing providers like vSphere, microvm, and KubeVirt, because more often than not they will solve your use case. If you want to provision your own infrastructure and get CAPI to manage Kubernetes on it, you can use the existing Bring Your Own Host (BYOH) provider, which is used to provision Kubernetes on top of your existing infrastructure. Or take something like the Cluster API Provider for AWS, wherein, say, you want to create or configure your own VPC and make that VPC part of your cluster. If none of the existing providers suits your use case, then you may want to write one — or all — of the different kinds of providers that we've mentioned so far.

OK, now you've decided you absolutely need to write a new provider, and none of the existing providers suits your need. Let's look at the basic steps in creating a new provider. For writing a new provider, Kubebuilder and controller-runtime are your friends. We start off with the kubebuilder init command (shown below). This generates the basic repository layout for your provider: it creates the necessary Dockerfile, Makefile, PROJECT file, and a starting main.go for your project. Also make sure to provide the versioning information, and this should conform to the Kubernetes versioning standard.

Kubebuilder also allows you to create controllers for the CRDs that you have created. To create CRDs, you use the kubebuilder create api command. Throughout this talk, we will be referring to a hypothetical infrastructure Cluster API provider for Podman, so all of the CRD and controller references will be with respect to this provider. So we use the kubebuilder create api command to create the CRDs, and we have the option to also create a controller for each CRD. Most of the time you will need a controller, because the controller is the one that continuously monitors the CRD that you created. But there are also times when you don't need a controller — something like a PodmanMachineTemplate. This provides a blueprint that Podman instances can be created from: the PodmanMachineTemplate is used to create PodmanMachine resources, which in turn are reconciled by the PodmanMachine controller. Therefore, you would not need a PodmanMachineTemplate controller.

You also need to specify metadata. This specifies the compatibility of your provider with the Cluster API contract, and it is mainly used by the Cluster API command-line tool, clusterctl, which can be used to initialize your provider.
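Concretely, the scaffolding commands might look something like this — the module path and kinds are hypothetical, but the flags are standard Kubebuilder ones:

```sh
# Scaffold the provider repository (hypothetical module path).
kubebuilder init \
  --domain cluster.x-k8s.io \
  --repo github.com/example/cluster-api-provider-podman

# Create an API type together with its controller.
kubebuilder create api \
  --group infrastructure --version v1alpha1 \
  --kind PodmanCluster --resource --controller

# Create a template type that does not need a controller of its own.
kubebuilder create api \
  --group infrastructure --version v1alpha1 \
  --kind PodmanMachineTemplate --resource --controller=false
```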
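The metadata lives in a metadata.yaml file, whose format comes from the clusterctl provider contract:

```yaml
# metadata.yaml — ships in the repo root and with each release.
apiVersion: clusterctl.cluster.x-k8s.io/v1alpha3
kind: Metadata
releaseSeries:
  - major: 0
    minor: 1
    contract: v1beta1
```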
So, like in the snippet here, major 0 and minor 1 means that 0.1.x of your provider conforms to the v1beta1 contract of Cluster API.

On to defining the API. Kubebuilder will have scaffolded some code for you: for an API, it creates a PodmanClusterSpec and a PodmanClusterStatus. But there are some fields your types must have for your provider to conform to the Cluster API contract. As in this example, we add a controlPlaneEndpoint, and this field represents the endpoint that is used to communicate with your control plane. Similarly, there's a ready field in the PodmanClusterStatus, which is used to indicate whether your infrastructure is ready or not. These fields are in turn read by the Cluster API CRDs, like the Cluster CRD, to indicate the overall readiness of your cluster. Apart from these absolutely necessary fields, you can also specify fields that make sense only to your provider — for example, in this case, providerID and extra mounts. (There's a sketch of these types below.)

Finalizers also play a very important role in writing your provider. Finalizers provide a way for the controller to clean up any external resources that it has created before the API resource itself is deleted. While writing a provider, you'll mostly end up creating external resources, so it's always a good idea to add finalizers to your API resources in the controller. This is one such example: a reconcileNormal function, which handles the create and update flows; you would have an equivalent reconcileDelete function for the delete workflow. As you can see, we are using controllerutil.AddFinalizer, and there is an equivalent RemoveFinalizer. Whenever you add this finalizer, make sure you are patching the object so that the change is persisted in the API server. Similarly, in the delete workflow, make sure to remove the finalizer.

Next, we go into implementing the controllers for the API types that we've created. Kubebuilder will have scaffolded controllers for you, and these pretty much contain two functions: one is Reconcile, and the other is SetupWithManager. The rest of the logic in Reconcile should be filled in by the provider implementer. You can do additional things like adding Kubebuilder annotations to watch any additional CRDs that you want, or basic logic like: if your cluster or your API resource is paused, make sure not to go ahead with the reconciliation; or don't reconcile if it is an externally managed resource.

This is the typical set of steps if you are writing a provider — this specific example is for an infrastructure machine controller. The first step is to get the instance of the API type being reconciled, using a Get call on the API resource. Then get the owning CAPI type (we'll get to owner references in a bit). For example, if we are reconciling a PodmanMachine, get the Machine resource; if we are reconciling a PodmanCluster, get the Cluster resource. If we don't have this owner reference set yet, we exit from the control loop. Optionally, you can also get the Cluster and infrastructure cluster objects. Then, from this top-level reconcile function, we branch into either a reconcileNormal or a reconcileDelete, and this depends on the deletion timestamp on the resource. If a resource with a finalizer set receives a delete call, Kubernetes does not directly delete the resource; instead, it updates the resource with a deletionTimestamp field. So if the deletionTimestamp field is present on the resource, then it means it's a delete request. The second sketch below pulls all of these steps together.
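First, the API types from the "defining the API" step. This is a trimmed, hedged sketch for the hypothetical Podman provider: the contract fields (controlPlaneEndpoint and status.ready) are real, and clusterv1.APIEndpoint comes from the Cluster API module; everything else is illustrative.

```go
// api/v1alpha1/podmancluster_types.go (trimmed sketch)
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// PodmanClusterSpec defines the desired state of PodmanCluster.
type PodmanClusterSpec struct {
	// ControlPlaneEndpoint is required by the CAPI contract: the endpoint
	// used to communicate with the cluster's control plane.
	ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint,omitempty"`

	// Provider-specific fields go here.
}

// PodmanClusterStatus defines the observed state of PodmanCluster.
type PodmanClusterStatus struct {
	// Ready is required by the CAPI contract: it tells core CAPI that the
	// provider's infrastructure is fully provisioned.
	Ready bool `json:"ready"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// PodmanCluster is the Schema for the podmanclusters API.
type PodmanCluster struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   PodmanClusterSpec   `json:"spec,omitempty"`
	Status PodmanClusterStatus `json:"status,omitempty"`
}
```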
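And here is the sketch of the top-level reconcile flow just described. Again, this is a hedged sketch under stated assumptions: the helpers used (controllerutil, and util and util/patch from the Cluster API module) do exist, but the Podman types and the MachineFinalizer constant are hypothetical, and the paused and externally-managed checks are elided for brevity.

```go
// controllers/podmanmachine_controller.go (trimmed sketch)
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	"sigs.k8s.io/cluster-api/util"
	"sigs.k8s.io/cluster-api/util/patch"

	infrav1 "github.com/example/cluster-api-provider-podman/api/v1alpha1"
)

type PodmanMachineReconciler struct {
	client.Client
}

func (r *PodmanMachineReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Get the instance of the API type being reconciled.
	podmanMachine := &infrav1.PodmanMachine{}
	if err := r.Get(ctx, req.NamespacedName, podmanMachine); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// 2. Get the owning CAPI Machine; exit if the owner ref isn't set yet.
	machine, err := util.GetOwnerMachine(ctx, r.Client, podmanMachine.ObjectMeta)
	if err != nil {
		return ctrl.Result{}, err
	}
	if machine == nil {
		return ctrl.Result{}, nil
	}

	// Patch helper: persists finalizer and status changes when we return.
	patchHelper, err := patch.NewHelper(podmanMachine, r.Client)
	if err != nil {
		return ctrl.Result{}, err
	}
	defer func() {
		_ = patchHelper.Patch(ctx, podmanMachine)
	}()

	// 3. A deletionTimestamp means this is a delete request.
	if !podmanMachine.ObjectMeta.DeletionTimestamp.IsZero() {
		return r.reconcileDelete(ctx, podmanMachine)
	}
	return r.reconcileNormal(ctx, podmanMachine)
}

func (r *PodmanMachineReconciler) reconcileNormal(ctx context.Context, m *infrav1.PodmanMachine) (ctrl.Result, error) {
	// Add the finalizer first; the deferred patch persists it.
	controllerutil.AddFinalizer(m, infrav1.MachineFinalizer) // hypothetical constant
	// ...create or update the external Podman resources here...
	return ctrl.Result{}, nil
}

func (r *PodmanMachineReconciler) reconcileDelete(ctx context.Context, m *infrav1.PodmanMachine) (ctrl.Result, error) {
	// ...clean up the external Podman resources here, then release the object.
	controllerutil.RemoveFinalizer(m, infrav1.MachineFinalizer)
	return ctrl.Result{}, nil
}
```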
In the reconcileDelete function, you clean up any external resources you created, and then remove the finalizer so that your resource can be deleted from the API server. If there is no deletion timestamp on the resource, then it means it's either a create or an update request, and you jump into the reconcileNormal function. The first step you do there is add the finalizer field, and then proceed with any actions for creating or updating your resource.

So, owner references are heavily used in Cluster API. An owner reference is a link to the resource that is the owner — for example, a Deployment owns Pods, or a Cluster owns a PodmanCluster and PodmanMachines — and it is implemented via the metadata.ownerReferences field. But now, what happens if the owner itself is deleted? It could be one of two things: either your child resources become orphan resources, or we do a cascading deletion. The latter is what Cluster API uses, which means it will wait for all the child resources to be deleted before the owner resource is deleted.

Now that we have APIs and controllers, you may also want to write webhooks for your provider. Webhooks are admission controllers that are used to add custom logic or validation to your CRDs. Kubebuilder can scaffold webhooks for you: you use the kubebuilder create webhook command, and you can pass either the defaulting flag or the programmatic-validation flag, or both. The defaulting flag creates a defaulting webhook, and programmatic-validation creates a validation webhook. We'll look at a couple of examples (and there's a sketch of both below). This one is for the validation webhook. As you can see, Kubebuilder created a ValidateCreate function for us; similarly, it will have scaffolded a ValidateUpdate and a ValidateDelete as well. It is our job to fill in the logic in these empty scaffolded functions. One thing to note here is that you can add as many rules as you need to your validate functions, aggregate all of the errors, and return them together. And this is one such example for defaulting. At the top — I don't know if you can read it — there's a Kubebuilder annotation setting the default number of CPUs to 2. So Kubebuilder also provides a way for you to add defaults via annotations. Only if that is not enough for your use case do you go and write a custom defaulting webhook. If you used the defaulting flag during the create-webhook step, it will have created a Default function, and that is where you write any bit of logic to add defaults for your fields. Well, these are the building blocks for writing a provider. Over to Richard.
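Before Richard picks up again, here is a minimal sketch of the validating and defaulting webhooks just described. It's hedged: the PodmanMachine type and its CPUs field are hypothetical, GroupVersion is assumed to come from Kubebuilder's scaffolded groupversion_info.go, and the method signatures match older controller-runtime scaffolding (newer versions also return admission warnings).

```go
// api/v1alpha1/podmanmachine_webhook.go (trimmed sketch)
package v1alpha1

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/util/validation/field"
	"sigs.k8s.io/controller-runtime/pkg/webhook"
)

var _ webhook.Defaulter = &PodmanMachine{}
var _ webhook.Validator = &PodmanMachine{}

// Default implements webhook.Defaulter: fill in anything the user omitted.
func (m *PodmanMachine) Default() {
	if m.Spec.CPUs == 0 {
		m.Spec.CPUs = 2 // mirrors the +kubebuilder:default=2 annotation example
	}
}

// ValidateCreate implements webhook.Validator. Collect every rule violation
// into an ErrorList and return them together, as described above.
func (m *PodmanMachine) ValidateCreate() error {
	var allErrs field.ErrorList
	if m.Spec.CPUs < 1 {
		allErrs = append(allErrs,
			field.Invalid(field.NewPath("spec", "cpus"), m.Spec.CPUs, "must be at least 1"))
	}
	if len(allErrs) > 0 {
		// GroupVersion is defined in the scaffolded groupversion_info.go.
		return apierrors.NewInvalid(
			GroupVersion.WithKind("PodmanMachine").GroupKind(), m.Name, allErrs)
	}
	return nil
}

func (m *PodmanMachine) ValidateUpdate(old runtime.Object) error { return nil }
func (m *PodmanMachine) ValidateDelete() error                   { return nil }
```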
So, I don't think we're going to have time to fully cover the rest of the slides, so we'll probably stop a slide or so after this one, so we have time for questions. Once you've developed your operator, you need a way to test, run, and debug it locally, and this is where tilt comes in. You can tell Cluster API, and its Tiltfile, about your provider: if you're building your provider, you need to create a file called tilt-provider.json in the root of your repo. There are a few important things to remember here. You'll see there is an image; that image name must match the name of the image within your deployment configuration. It normally has a tag of dev, and that will be replaced as part of your build process. The second important part is the live reload deps. Tilt has an iterative development model, so it will watch those files and those paths within those files, and if there are any changes to them, tilt will automatically recompile your provider, package it into your container, and then instruct the deployment to use that new container — all without reloading the rest of the controllers. Think of it as hot reloading for your solution. (There's a sketch of that file below.)

So I think we do have a lot more slides — we had probably about seven more in there — but it's probably a good time to pause. You can download the slides and see what we talk about with the testing. The debugging support within tilt is very, very useful, so I'd recommend that you look at that as well, along with the end-to-end testing. But maybe we have time for some questions.
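For reference, a hedged sketch of the tilt-provider.json Richard described. The name and image values here are hypothetical; the field names follow the convention used by existing providers, where the image must match the image in your deployment manifests and live_reload_deps lists the files and paths tilt watches:

```json
{
  "name": "podman",
  "config": {
    "image": "ghcr.io/example/cluster-api-podman-controller",
    "label": "CAPP",
    "live_reload_deps": [
      "main.go",
      "go.mod",
      "go.sum",
      "api",
      "controllers"
    ]
  }
}
```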