Hello, everyone. My name is Alex. I'm a software engineer at Red Hat, and I work on a project called OpenShift. This talk is about a part of OpenShift called the Machine API. You'll hear what the Machine API is, the motivation for having it, and which components it includes. Because this is a deep dive, I will assume that you know what OpenShift and Kubernetes are, and that you are familiar with concepts like custom resources and controllers and the relationship between them. Generally speaking, a custom resource is like a model in your database, and a controller is the piece of code that acts on changes to that model.

Okay, let's start. When you hear "Machine API", this picture might appear in your head, but in fact we are doing something else. Well, maybe in some sense we are trying to control machines. As you might know, OpenShift can run on different clouds, both public and private. There are some examples on this slide, like Azure, Google Cloud, AWS, and vSphere. That's not the full list, just a couple of examples. And in the end it's all just some piece of hardware. So the Machine API tries to solve the problem of managing cloud infrastructure. It sits in the center of everything and is crucial for running OpenShift: it talks to cloud APIs and provides a communication layer between OpenShift users and cloud providers. So, shortly speaking, it's a part of OpenShift, an extension to its API, that is responsible for creating and deleting hosts for nodes, and it tries to make this process as easy as possible for users. It operates on top of public or private cloud infrastructure, which means it talks to cloud APIs, and part of its logic is also responsible for horizontal cluster scaling.

But before we go on, it's very important to understand the difference between two concepts in OpenShift: the machine and the node. The machine is a representation of a virtual machine that runs on a cloud provider, while the node is the kubelet plus the Node object in OpenShift's API. The kubelet is the thing that turns the machine into a node; it's responsible for registering the Node object in the API.

Our first and main building block of the Machine API is the Machine object. The Machine object describes a host. The user creates or deletes machines, and our controller interacts with cloud APIs based on this action: it either creates or deletes the instance in the cloud. You can see that the machine is just a normal resource in Kubernetes. It has a namespace, a name, a spec, and a status. The important part here is the provider spec, because the provider spec is a template for describing these machines. The definition of the provider spec is a bit different for each cloud provider because it's based on cloud provider specifics, so each cloud provider has its own provider spec. The provider spec is where you specify options like instance size.

However, this resource needs a controller to manage its lifecycle, and for that we have a controller. It's just a piece of code that holds all the logic for a Machine object, basically an endless loop that tracks the resource state and performs certain actions based on it. The process of synchronizing the desired state and the actual state of the object is called reconciliation. It runs periodically and on create, update, or delete events, and it can also be triggered manually from the code. For managing machines we introduced phases. These phases are reflected in the machine status, and they represent where a machine is located in its lifecycle.
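To make that a bit more concrete, here is a minimal sketch of what such a Machine object could look like if you modelled it as Go types. This is not the actual machine.openshift.io API; all type and field names here are simplified assumptions for illustration, including the Phase field in the status that the next part of the talk is about.

```go
package main

import "fmt"

// A simplified, hypothetical mirror of a Machine resource. The real API
// lives in the machine.openshift.io group; the field names and shapes
// here are approximations for illustration only.
type Machine struct {
	Namespace string
	Name      string
	Spec      MachineSpec
	Status    MachineStatus
}

type MachineSpec struct {
	// ProviderSpec is cloud-specific: an AWS machine carries an AWS
	// template here, an Azure machine an Azure one, and so on. This is
	// where options like the instance size go.
	ProviderSpec map[string]string
}

type MachineStatus struct {
	Phase      string   // Provisioning, Provisioned, Running, Deleting, Failed
	ProviderID string   // cloud instance identifier, set once the instance exists
	Addresses  []string // IP addresses reported by the cloud provider
	NodeRef    string   // name of the Node linked to this machine, if any
}

func main() {
	m := Machine{
		Namespace: "openshift-machine-api",
		Name:      "worker-us-east-1a-abc12",
		Spec: MachineSpec{
			ProviderSpec: map[string]string{"instanceType": "m5.large"},
		},
	}
	fmt.Printf("%s/%s phase=%q\n", m.Namespace, m.Name, m.Status.Phase)
}
```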
There are five phases in total, and I will describe what each of them means over the next couple of slides. The phases are Provisioning, Provisioned, Running, Deleting, and Failed.

So let's start with machine creation. There is a very long diagram here, but I will try to simplify it for you. One of the first things our controller does is check whether the Machine object has a deletion timestamp. In Kubernetes, a deletion timestamp means that someone is trying to delete the resource, but right now we're talking about the case where the timestamp is not present. The next thing we do is check whether the machine is in a failed state or not. The failed state is set when a machine has an invalid provider spec or when the created instance disappeared from the cloud for some reason. If the machine is in the failed state, the controller does nothing.

So, our first step: we check whether the cloud instance exists. In our case it doesn't. What does that mean? It means that the instance was either never created or someone removed it, and we can tell which by looking at the machine status. If the cloud instance can't be found and the machine status is not populated with IP addresses or a provider ID, the instance was never created and we should start creating it. So the controller sets the phase to Provisioning and then creates the instance in the cloud using the cloud provider API. If, on the other hand, the provider ID or IP addresses are present but the instance can't be found, we set the phase to Failed and do nothing; we just assume that something is broken. So the Provisioning phase means that cloud instance creation was started.

Now let's get back to the first step of this diagram: what happens if the cloud instance exists? Then the controller calls update for this provider, and the update just synchronizes some data from the cloud provider or performs additional configuration that needs to be done after machine creation. Next it checks whether a node for this machine was created; I will explain how node-machine relationships work a bit later. If the node exists, the phase is set to Running; if the node doesn't exist, the phase is set to Provisioned. So Running means that the machine has a node associated with it and it has a cloud instance, while Provisioned means that the VM was created but the node is still missing.

The last part of the machine lifecycle is deletion. The controller sets the phase to Deleting, indicating that the deletion process has started. Then it performs node draining. Node draining is a process that blocks a node from scheduling new pods and terminates all running pods, and if node draining is combined with a PodDisruptionBudget, which is an API for controlling graceful shutdown of your workloads, then the impact should be minimal. After draining the node, the controller deletes the instance from the cloud, and the last thing it does is delete the Node object from the API.
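To summarize that flow, here is a rough sketch of the creation and deletion decision logic in Go. It is not the actual Machine API controller code; the machine type, the fake cloudAPI, and the helper functions are all stand-ins made up for illustration.

```go
package main

import "fmt"

// Hypothetical, heavily simplified machine state; the real controller
// works on the machine.openshift.io Machine type and a per-cloud actuator.
type machine struct {
	name              string
	phase             string
	providerID        string
	addresses         []string
	nodeRef           string
	deletionRequested bool // stands in for a deletion timestamp
}

// cloudAPI fakes the cloud provider: just a map of instance names to IDs.
type cloudAPI struct{ instances map[string]string }

func (c *cloudAPI) exists(name string) bool { _, ok := c.instances[name]; return ok }
func (c *cloudAPI) create(name string) string {
	c.instances[name] = "i-" + name
	return c.instances[name]
}
func (c *cloudAPI) destroy(name string) { delete(c.instances, name) }

func drainNode(m *machine)        { /* cordon the node and evict its pods */ }
func deleteNodeObject(m *machine) { m.nodeRef = "" }

// reconcile roughly follows the phase diagram described in the talk.
func reconcile(cloud *cloudAPI, m *machine) {
	// Deletion path: drain the node, remove the cloud instance,
	// then remove the Node object from the API.
	if m.deletionRequested {
		m.phase = "Deleting"
		drainNode(m)
		cloud.destroy(m.name)
		deleteNodeObject(m)
		return
	}
	// Failed machines are left alone.
	if m.phase == "Failed" {
		return
	}
	if !cloud.exists(m.name) {
		// The status says an instance existed, but it is gone from the
		// cloud: assume something is broken and mark the machine Failed.
		if m.providerID != "" || len(m.addresses) > 0 {
			m.phase = "Failed"
			return
		}
		// Otherwise the instance was never created: start creating it.
		m.phase = "Provisioning"
		m.providerID = cloud.create(m.name)
		return
	}
	// The instance exists: sync data from the cloud, then decide between
	// Provisioned (no node yet) and Running (node linked to the machine).
	if m.nodeRef != "" {
		m.phase = "Running"
	} else {
		m.phase = "Provisioned"
	}
}

func main() {
	cloud := &cloudAPI{instances: map[string]string{}}
	m := &machine{name: "worker-abc12"}
	reconcile(cloud, m) // first pass: starts provisioning
	reconcile(cloud, m) // second pass: instance exists, no node yet
	fmt.Println(m.phase) // Provisioned
}
```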
Okay, that was all about Machine objects. Now let's talk about the next resource in the Machine API, which is called MachineSet. A machine set is just a group of machines, and you can think about machine sets and machines the same way you think about replica sets and pods. That means machines are not meant to be managed on their own, just like pods, and there are resources that manage them; in our case, the machine set manages machines. If you delete a machine that is managed by a machine set, the machine set controller will recreate it, the same way it works for pods. And like replica sets, machine sets need to have a replica count set, and a provider spec should be present too.

The next thing I want to talk about, before moving on to health checks and autoscaling, is how machine and node relationships work. We created a controller for establishing this relationship and called it the node link controller. After the kubelet runs on the created instance and the Node object is created, the node link controller looks at the provider ID and IP addresses and tries to connect nodes and machines using this information. The provider ID identifies the node, and it's also present on the machine, and because the logic for creating this provider ID is similar on both sides, they should match. After finding a matching machine for a node, we set the node reference in the machine status, and once the node reference is set, the machine moves to the Running phase.

Our next resource is the machine health check. So how do machine health checks work, and what are they? It's a resource that provides automatic remediation of unhealthy machines in a group of machines. The logic here is quite simple: the controller finds unhealthy machines, and if the user has defined a value for the maxUnhealthy field in the machine health check spec, remediation is not performed if the number of unhealthy machines exceeds that limit. This is a short-circuit mechanism, and it's useful when a large number of nodes are down, for example because of a networking issue (there is a short sketch of this check below, after the talk). The last step is deleting the unhealthy machines, and because machines are managed by machine sets, new ones are created to replace the unhealthy ones. You might ask how we know that machines are not healthy. We do this by looking at the nodes associated with them; if a node is unhealthy, it's usually reflected in the node status. Like machines and machine sets, the machine health check is just another Kubernetes resource, and in its spec you specify a selector for the machines it should watch, a node startup timeout, and maxUnhealthy for the allowed number of unhealthy machines.

Now let's move on to autoscaling. There are two resources for configuring the scaling of a cluster. The first one is called ClusterAutoscaler, and it's a cluster-wide singleton resource that defines a set of rules for scaling the cluster. It defines the number of cores, the number of nodes allowed, memory, GPUs, and so on. It's required to have a ClusterAutoscaler to make autoscaling work; it's the resource that turns autoscaling on, because you cannot scale your cluster if there are no rules for scaling it. The second resource is the MachineAutoscaler. It sets the minimum and maximum replica count for a specific machine set and scales the number of machines by setting the replica count on that machine set. So it just controls the replica count of the machine set and tries to follow the rules set by the ClusterAutoscaler. Here you can see the comparison between these resources: the ClusterAutoscaler includes resource limits, such as the number of nodes, and it also includes the scale-down policy, where you define rules for scaling the cluster down, while the MachineAutoscaler only contains the minimum and maximum number of replicas and a reference to the target machine set.

Okay, so thank you all for listening. Today we talked about the motivation for having the Machine API, about machines, machine sets, machine health checks, and the cluster autoscaler, and here are some useful links. I'll try to attach the presentation somewhere in case you want to take a closer look. Let's see if we have any questions.
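Before the Q&A, here is the promised sketch of the maxUnhealthy short circuit, written in Go. The types and helper names are invented for illustration; the real MachineHealthCheck controller is considerably more configurable.

```go
package main

import "fmt"

// Hypothetical, simplified target: a machine plus the health we derived
// from the conditions of its linked Node.
type target struct {
	machine string
	healthy bool
}

// remediate decides which unhealthy machines should be deleted. If more
// machines are unhealthy than maxUnhealthy allows, nothing is remediated:
// this is the short circuit for large outages, for example a networking
// problem taking many nodes down at once.
func remediate(targets []target, maxUnhealthy int) []string {
	var unhealthy []string
	for _, t := range targets {
		if !t.healthy {
			unhealthy = append(unhealthy, t.machine)
		}
	}
	if len(unhealthy) > maxUnhealthy {
		fmt.Printf("short-circuit: %d unhealthy > maxUnhealthy=%d, skipping remediation\n",
			len(unhealthy), maxUnhealthy)
		return nil
	}
	// Deleting these machines is enough: their machine set will create
	// replacements to get back to the desired replica count.
	return unhealthy
}

func main() {
	targets := []target{
		{"worker-a", true},
		{"worker-b", false},
		{"worker-c", true},
	}
	fmt.Println("machines to delete:", remediate(targets, 1))
}
```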
Thank you very much, Alex. It has been great. We do have some questions for you, so let's get started. Let's start with Christian Pasarelli. He's asking, regarding machine deletion: is the virtual machine deleted directly after the node drain, or is it first gracefully shut down?

First we drain the node and then we delete the machine. So basically, when you do kubectl delete machine, we have a mechanism that blocks this node from scheduling new pods, we try to evict all the pods from the node, and then we delete the machine from the cloud. Shutting down the virtual machine and removing it is the last phase of the machine lifecycle.

Perfect. Let's go to the next question: is there a way to adopt existing nodes as machines?

I guess he means converting an existing node into a machine. I think you can try to. I'm not sure how exactly this works, but I suppose it should be doable; I personally haven't tried it.

Next question: is the Machine API making requests to the cluster machine approver? Or how does it validate whether the CSR is actually approved or not?

Good question. We have a controller called cluster-machine-approver, and it is responsible for approving certificate signing requests. And no, the Machine API doesn't send requests to the machine approver directly; that's done by the kubelet from the node, I believe. There is some smart logic for approving a CSR in the cluster machine approver. I might be wrong, but I believe we check that the name of the machine matches the name of the node, and there are some other security checks we do before approving the CSR, but the Machine API itself doesn't make any requests, as far as I know; it's only done from the node side.