Good morning, everyone. I hope you had fun last night. So today's topic is going to be about what goes down the drain. I'll introduce myself quickly. My name is Yanir Quinn. I've been working at Red Hat since 2016. This is my first DevConf, so I'm very excited. I've been working on several projects like oVirt, KubeVirt, OpenShift and some more. Okay, so let's get to the topic. When we want to manage our Kubernetes cluster, from time to time we want to perform maintenance operations. The synonym for a maintenance operation here is node drain. So to understand what node drain is, why we need it and how we perform it, we'll go through three questions. What are we actually draining from our Kubernetes nodes? Why do we do that — the cause? And how? In the how section we'll go through one or two Kubernetes APIs for node eviction, understand the complementary concept of pod disruption budgets, and if we have enough time, I'll dig deeper into the concept of server-side drain. So we have our Kubernetes cluster, which is managing and orchestrating our containerized applications, and it's considered to be fault-tolerant. One misconception might be: okay, if I have enough free resources and my node goes down, the Kubernetes scheduler will just make sure the pods are scheduled on another node, maybe autoscale or downscale the cluster, and everything is fine and dandy. But that's not always the case, and it's pretty much a misconception; to understand that, we'll dig a bit deeper into the reasons and the consequences. So we'll start with what is being drained from the node. We have our pods containing our containers, which are running applications, microservices and processes. Some prominent examples from our world are virtual machines running in pods, as we do in KubeVirt, and storage pods, like in OCS — and these are the heavy workloads.
Okay, so I want to talk about the reasons for maintenance. Before that, let's understand how pods get evacuated or dropped from a specific node, either intentionally or unintentionally. We can split it in two by disruption type. First, we have involuntary disruptions. These happen without us intentionally doing anything. The classic example is hardware failure: my machine simply breaks and goes down, and my node is down. Human error happens from time to time — let's say a cluster admin accidentally brought down a VM, and now I don't have that node. A VM can disappear on my cloud or hypervisor. Kernel panics, of course. A node can go missing if I have network partitioning, and it just disappears. All of these are classic day-to-day failures that happen not only in a Kubernetes cluster, but in all the other clusters that you know, even from 10 or 20 years ago. Another one is a node running out of resources; I left this for last because Kubernetes has a mechanism to handle it. So we want to prevent workloads running on our nodes from being lost because of involuntary disruptions. One preventive method is defining for a pod what resources it actually needs, so our node won't run out of resources. Another, of course, is high availability: we can replicate our applications across the Kubernetes nodes. Going even further, you can spread the application and make it even more highly available using anti-affinity and multi-zone clusters. So one of the examples I mentioned for preventing out-of-resources is defining in a pod how much memory or CPU I want a container to use. In this example, we can see two definitions: limits and requests. We don't want a container to cross a certain limit and crash the pod. If a container crosses its limits, it will probably get restarted or get a runtime error.
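As a rough sketch, such a requests-and-limits definition might look like this (the pod name, image and numbers are illustrative, not from the talk):

```yaml
# Illustrative pod spec: "requests" is what the scheduler reserves for
# day-to-day use; "limits" is the hard cap the container must not cross.
apiVersion: v1
kind: Pod
metadata:
  name: demo-app            # hypothetical name
spec:
  containers:
  - name: web
    image: nginx            # hypothetical image
    resources:
      requests:
        memory: "128Mi"     # normal working-set memory
        cpu: "250m"
      limits:
        memory: "256Mi"     # exceeding this gets the container killed and restarted
        cpu: "500m"         # CPU above this is throttled rather than killed
```

With requests set, the scheduler only places the pod on a node with that much free capacity, which is exactly the out-of-resources prevention mentioned above.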
We can allow a pod a certain limit if we know we'll have bursts or spikes, and usually it will also have a memory request for day-to-day operation. So this is one of the preventive methods. Now we're getting to the more interesting part of our talk: voluntary disruptions. These can be caused by either the application owner or the cluster owner. Starting with the application owner: we can just delete a deployment or a controller that manages the pods, and then the pods disappear. If we update a pod template and make some changes in the spec, it might cause a restart, and the pod will be evicted and then recreated by Kubernetes. Also, directly deleting a pod, although it might be by accident, causes a voluntary disruption. Now for the cluster administrator — and this comes down to maintenance. As a cluster administrator I can decide I now want to perform a rolling update, a kernel upgrade or a hardware replacement; but I know I'm going to do it, and I want to prepare for it. I will want to do it gracefully and allow my pods enough time to terminate. Another example of such a maintenance operation, a voluntary disruption, is draining a node from a cluster in order to scale the cluster down. We might also want to remove some pods from time to time to allow something else to fit on the nodes. But let's focus on the top two. So this is the why, and also the what: node drain means safely evacuating or evicting all of our pods, allowing them enough time to gracefully terminate, complete our workloads, workload migration and all complementary actions, and then shutting down the node after the pods are rescheduled somewhere else. The drain operation is comprised of two main actions. One is cordoning the node. Cordoning means I put a barrier on the node: I don't want to allow any new pods to be scheduled on it, because I'm going to eventually shut it down.
And the complementary action of node draining is evicting the pods, or deleting them if eviction is not possible. It's also possible to prepare for node eviction by calling kubectl cordon on the node before we actually want to start the eviction process. Okay, so the CLI for draining a node is just kubectl drain with the node name. The node becomes unschedulable — you can see it also presented as a field, and as a taint. After the node is cordoned, it tries to evict all the pods that are running on that node, if it can evict them; there are some exemptions, which I'll show soon. And if eviction is not supported, Kubernetes just calls the delete API. So let's start with the delete API. You can call the delete API to just delete the pods. The delete API also gives you a field which allows you to gracefully remove the pod, meaning I define a certain timeout before it evicts the pods, allowing some workloads or processes to complete. When we have involuntary disruptions — for example, if my node suddenly breaks down — we don't really get this grace period. We might have some sort of chance to get it in before we see a node go down, but again, that's involuntary disruption. So now we'll talk about the eviction API. The eviction API came to Kubernetes in 1.7 and above, and we use it instead of directly deleting a pod. Here we can also avoid calling an external command: we can utilize this API in our own code base or project if we want to, and we have finer control over the eviction process. You can see it looks pretty much the same as the delete API, with a special kind for eviction. And there is another concept that coincides with this eviction API: the pod disruption budget. So what is a pod disruption budget? It limits the amount of pods that can be evicted by the eviction API, according to the definition you put in the pod disruption budget custom resource. Some examples.
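As a sketch, an eviction request body looks roughly like this (the pod name, namespace and grace period are illustrative; in clusters of that era the API group was policy/v1beta1, while newer clusters use policy/v1):

```yaml
# POSTed to the pod's eviction subresource, e.g.
# /api/v1/namespaces/default/pods/my-pod/eviction
apiVersion: policy/v1beta1
kind: Eviction
metadata:
  name: my-pod            # hypothetical pod name
  namespace: default
deleteOptions:
  gracePeriodSeconds: 30  # same graceful-termination knob as the delete API
```

This is the same shape as a delete request plus the Eviction kind, which is why it looks pretty much the same as the delete API.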
For example, I have a stateless front end running UI operations, and I want to keep the capacity from being reduced by more than 10 percent. So I define a pod disruption budget with a parameter, minAvailable, at 90 percent. minAvailable can also be a number of pods; a percentage makes it easier to work with at high scale. For a single-instance stateful application, I want to block eviction completely unless I tell it to proceed — I want control over that. To do so, I can define maxUnavailable as zero. That means I don't want any pods to be evicted from that node, unless I remove this pod disruption budget, removing the limitation, and then the eviction API can go on and evict the pod. We also have a quorum example, like etcd in a multi-instance cluster, where I can utilize the pod disruption budget to define my quorum: setting maxUnavailable to one, or defining the minimum available pods that should stay up — say, three out of five — to maintain the quorum. So this is how the pod disruption budget CR looks. It's quite simple: in the spec you have the parameters I mentioned with their values, and a matching label for the pods I want to associate the pod disruption budget with. When trying to evict, and a pod disruption budget exists on the pods, I can have several scenarios. Either it's granted: the pod disruption budget doesn't block me from evicting my pods from the node, so the eviction API just goes through and we're all good. Or the pod disruption budget is not respected, so it blocks eviction and you get an error — a 429, which is a bit weird to me. Or I have a misconfiguration, assigning multiple pod disruption budgets to the same pods, and that results in an error. Sometimes this can result in a broken state. Let's say I have a controller.
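A minimal sketch of such a CR, using the stateless front-end example above (the name and label are illustrative; older clusters use policy/v1beta1, newer ones policy/v1):

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: frontend-pdb        # hypothetical name
spec:
  minAvailable: 90%         # or maxUnavailable: 0 for the single-instance case
  selector:
    matchLabels:
      app: frontend         # must match the labels of the pods to protect
```

The selector is the "matching label" mentioned above: the budget only governs evictions of pods carrying that label.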
This controller tries to evict pods using the eviction API, and it's getting blocked by a pod disruption budget. Or my replacement pod is not ready: let's say I terminated the pod and I need it to be rescheduled somewhere, and if it wasn't rescheduled, the operation never completes in my controller. Handling these situations can be done by pausing the controller's work, or by actually deleting the pod manually or forcefully after a long time. So, I was talking about the delete API as part of kubectl drain. We have some special pods that won't be deleted immediately. One prominent example is DaemonSet pods, which are not that terrible to delete from a node because they will spin up again anyway; in order not to block the eviction, you can specify a special flag to ignore DaemonSets and just go on. Mirror pods won't get deleted — mirror pods represent static pods on the node. Unreplicated pods are pods that don't have a ReplicaSet or Deployment, and they won't spin up again because they have no such controller. So if I'm sure it's okay to delete a pod from the cluster and evict it without it getting replicated, I can use the force flag of the kubectl drain command. Also pods with emptyDir volumes: if we're certain it's okay to delete or evict them, we can use the delete-local-data flag and the local data will be deleted. Okay, so beyond simple applications running in pods, we have some heavy workloads. Prominent examples are virtual machines. Let's say I want to perform an upgrade or a maintenance operation on a node, and I don't want my VMs to go down. Part of the classic process for that is migrating my VMs: I trigger a VM migration by initiating a drain command, because I don't want the VM to just go down. The same goes for storage nodes: I don't want the drain and eviction operation to complete until my storage process has completed, meaning replication and rebuilding of storage data.
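Putting those exemption flags together, a drain invocation might look like this (the node name and grace period are illustrative, and this assumes a live cluster; note that newer kubectl versions rename --delete-local-data to --delete-emptydir-data):

```shell
# Cordon first if you want to prepare ahead of time (drain cordons anyway)
kubectl cordon node-2

# Drain: skip DaemonSet pods, force-delete unreplicated pods,
# and discard emptyDir data, giving pods up to 120s to terminate
kubectl drain node-2 \
  --ignore-daemonsets \
  --force \
  --delete-local-data \
  --grace-period=120

# After maintenance, make the node schedulable again
kubectl uncordon node-2
```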
These are also very long and heavy transactions. To deal with that, we want to utilize a pod disruption budget with some sort of mechanism. We put a pod disruption budget on the pods running on the node, and we signal a controller that accompanies these processes: it checks whether eviction was successful and whether the complementary action — rebuilding or replicating storage data, or migrating a VM — was completed, before lifting the pod disruption budget and allowing the pods to finally be evicted. So this is one of the examples I was talking about. I have here a three-machine cluster in which I can allow only one machine, one node, to go down, and I would like to maintain it. We have our pod disruption budget controller that watches over all of our nodes, and let's say I'm trying to evict one node. First it will want to see that the process has ended — that the replication of data, for example, has finished. After it sees that, it will remove the maxUnavailable-zero pod disruption budget and allow those pods to be evicted. Another scenario: I want to evict two nodes at a time, but one process has already started — so we observe that as well. I'll mention again that I want to signal the process, like VM migration or replication, to begin — but how will I know when? We can use canary pods for that: I saw a canary pod getting evicted, so I start the migration or replication process. Or I saw a taint added on a node, so I start the process. All of these processes are usually initiated by the kubectl drain command, which is a client-side command. But what happens if I would like something extra, something more? I want server-side drain, meaning I just create some CR, or make an API call, and I get feedback. For example, if I want to use a UI to start draining my nodes, I don't really have an option right now using the CLI command.
So server-side drain would be great for that. I want to accompany the process: I want to see what events are going on, whether the maintenance or drain operation completed, or whether I have errors. Automation: for example, I have a machine configuration I want to change, and I want the process to be as automated as possible, so having drain CRDs for that process would be helpful. There are several solutions or semi-solutions currently around Kubernetes — you might take a library and put it to use — but then you end up with not one solution but 10 or 20. One server-side drain solution would reduce the use of multiple solutions. I see I have one more minute, but I want to show you a quick demo of something we wrote in KubeVirt for server-side drain. It will be really quick — one less question, but I think it's worth it. So this is an example of a controller I wrote utilizing server-side drain. I have a cluster of two nodes. Our controller, which we call the node maintenance operator, is running on node one, and I can see all the pods running on node number two. Now, to create a server-side drain, I'm using a CR for node maintenance. In that CR I just give it a general name, the node I want to evict, and the reason for it. Of course, later on you can add more fields, like status and events. So I invoke this CR. The controller detected that I created the CR, and it has already cordoned the node: as you can see down there, the node is unschedulable. We can see node two has been cordoned and scheduling is disabled, and also most of the pods have been evicted, except the DaemonSet pods that are running on the node. Looking at the controller logs, you can see that all the pods that could be evicted were evicted.
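The CR from the demo would look roughly like this — a sketch based on the KubeVirt node-maintenance-operator, so check the operator's repository for the exact apiVersion and schema:

```yaml
apiVersion: kubevirt.io/v1alpha1   # assumed; verify against the operator's CRD
kind: NodeMaintenance
metadata:
  name: nodemaintenance-node-2     # the "general name"
spec:
  nodeName: node-2                 # the node to cordon and drain
  reason: "Kernel upgrade"         # why maintenance is being performed
```

Creating the CR triggers the cordon and eviction; deleting it returns the node to a schedulable state.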
So to end that process and bring the node back up, we just delete the CR for that specific node, and the node becomes schedulable again, and all the nodes are ready. That was a very simple example, but it was a bit fast-forwarded: a lot of processes happened in the background. Pods were evicted; before that, the node was cordoned; all of our workloads migrated or were rescheduled on different nodes. And that's a nice example of server-side drain. If you want to know more, I would be happy to share some links afterwards, on the slide, for the node maintenance operator. So now I have three minutes for questions, if you guys have any. I should have made the demo longer. Okay. So the question was whether you can install these features via a CRD or an operator. I'll refine that: this specific feature exists as an operator — it's encapsulated as an operator, but it's basically a controller with its own custom CRDs. Once you deploy this operator, let's say in an OpenShift environment, you will have the custom CRDs; you can just create the specific CR for draining the node, and the controller will detect it and start the process. Yes. Good question. So the question is whether it's our own controller and where we stand on the Kubernetes side. We wrote this controller out of a need, because server-side drain didn't exist in Kubernetes. There is a Kubernetes enhancement process — by the way, if someone here is from SIG Node, come talk to me later — an enhancement procedure in Kubernetes to get it accepted, which would make life much easier for everyone if we had this type of operator upstream. It doesn't necessarily have to be one-to-one, but the concept is the same. So a KEP is pending; it's getting attention and more use cases, but I think that is currently the stage for server-side drain in Kubernetes. They're not too eager to push it forward at the moment, but we keep bugging them. Another question? Okay, thank you guys. Thank you very much.