Hi, good afternoon, everyone. It's time for my presentation, so I would like to start. Today I will talk about Kubernetes. The title is "Kubernetes Failover Improvement: Non-Graceful Node Shutdown". Can you hear me? OK, let's start.

Here is today's agenda. First, I will talk about the Kubernetes failover issue and explain what happens when we use Kubernetes in production. Then I will talk about the solution; the feature that solves this issue is called Non-Graceful Node Shutdown. After that, I will talk about how to use it in a production environment, and finally I will give the conclusion.

First, I would like to introduce myself. My name is Yui Komori. I am a software engineer working in NEC's open source community team. I joined the Kubernetes community in 2019, and I mainly contribute to SIG Storage and SIG Testing. Our company provides OpenShift, a product based on Kubernetes, to our customers. In this work we sometimes find Kubernetes bugs, or we notice that some features are needed, so we fix those bugs and add the features our customers need in the Kubernetes community.

Then, what is Kubernetes? Probably everyone in this room already knows Kubernetes well, so I don't need to explain it in detail; I just copied and pasted this diagram from kubernetes.io. Kubernetes is an open source system for automating deployment, scaling, and management of containerized applications, as you know. Kubernetes has many features, and one of the most attractive is self-healing: it restarts containers that fail, replaces and reschedules containers when nodes die, kills containers that don't respond to your user-defined health check, and doesn't advertise them to clients until they are ready to serve.

I'd also like to introduce some Kubernetes terms. First, a pod is a group of one or more application containers plus some shared resources for those containers. A StatefulSet is an object used to manage stateful applications like datastores; it manages pods that are based on an identical container spec. The unique point of a StatefulSet is that it maintains a sticky identity for each of its pods: the pods are created from the same spec but are not interchangeable, and each has a persistent identifier that it keeps across any rescheduling. A node is a worker machine in Kubernetes and may be either a virtual or a physical machine, depending on the cluster. A taint is a property of a node that lets the node repel pods; pods are only scheduled onto a tainted node if they tolerate the taint. Finally, a persistent volume is a piece of storage in the cluster.

Next, I'd like to talk about the Kubernetes failover issue: a pod with a persistent volume fails to migrate when its node goes down. Say there is a pod named pod1 on the node node1, and pod1 is attached to the persistent volume pv1. Also, pod1 belongs to a StatefulSet, statefulset1. When node1 goes down due to a hardware failure, we expect two things: that pod1 will be re-created on node2, and that pv1 will be detached from node1 and attached to node2. But actually, pod1 is not re-created on node2; it gets stuck in Terminating status indefinitely, as you can see here, and pv1 is not detached from node1. That's not what we expect, right?
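To make the setup concrete, here is a minimal sketch of a StatefulSet like the one in this scenario. The names, the busybox image, and the timestamp-writing loop are my own illustration (loosely modeled on the sample application in the demo later), and the storage size is arbitrary; the important part is volumeClaimTemplates, which gives each pod the persistent volume that has to fail over with it:

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: statefulset1
    spec:
      serviceName: statefulset1
      replicas: 1
      selector:
        matchLabels:
          app: sample
      template:
        metadata:
          labels:
            app: sample
        spec:
          containers:
          - name: app
            image: busybox
            # Append the current timestamp to the volume every second.
            command: ["sh", "-c", "while true; do date >> /data/out.txt; sleep 1; done"]
            volumeMounts:
            - name: data
              mountPath: /data
      volumeClaimTemplates:
      - metadata:
          name: data
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 1Gi

With this spec, the pod statefulset1-0 is bound to one persistent volume, and that binding is exactly what gets stuck when the node dies.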
The reason this issue happens is that node1 does not respond due to the node failure, so pod1 stays stuck in Terminating status, and pv1 cannot be detached from node1. One thing to note: this issue may not be visible to users on managed Kubernetes services such as AWS or Azure. For example, in the case of AWS, the monitoring system deletes the failed node when it detects a node failure and creates a new node. When the failed node is deleted from Kubernetes, the persistent volumes attached to that node are detached, so a new pod can be created on another node.

Next, I would like to talk about the solution for this issue: the feature called non-graceful node shutdown. The non-graceful node shutdown feature allows stateful workloads to restart on a different node if the original node is shut down unexpectedly or ends up in a non-recoverable state, such as a hardware failure or an unresponsive OS. This feature was released as alpha in version 1.24, and it is now stable as of version 1.28.

Let me introduce the steps of non-graceful node shutdown. When the administrator, a user, or a monitoring system confirms that node1 is powered off or isolated from the cluster, they add the out-of-service taint to node1. The taint you set is node.kubernetes.io/out-of-service=nodeshutdown with the NoExecute effect, or node.kubernetes.io/out-of-service=nodeshutdown with the NoSchedule effect. Then the Kubernetes controller forcefully deletes all pods remaining on node1, which now carries the out-of-service taint. Thereby, the pods from the shut-down node are newly created on node2, and pv1 is detached from node1 and attached to node2. Even after node1 comes back, the controller will not schedule new pods on it while the taint exists, so the administrator or user needs to delete the taint manually.

I wanted to show you a live demo, but I couldn't take enough time to prepare one, so I'd like to show you actual screenshots instead. Let's go on. There are three nodes: one is the control plane, and the other two, test-worker and test-worker2, are worker nodes. I created these nodes using kind, so they are Docker containers. There is also one pod in a StatefulSet. It is a sample application, which you can get at this URL; it outputs the current timestamp, like this, every second. This pod is running on test-worker2.

Then I stopped the node test-worker2. As I mentioned, these nodes are created by kind, so I ran docker stop, which stops the node; in other words, the node shuts down. Eventually the node status becomes NotReady, and as you can see, by default about 300 seconds later the pod status becomes Terminating. The reason is that when a node becomes unreachable, Kubernetes by default waits 300 seconds and then tries to evict the pods from the unreachable node.

Now I will show you the non-graceful node shutdown feature. I added the out-of-service taint to the node test-worker2, and then the pod was created on test-worker, a different node from test-worker2. So this pod has been moved from test-worker2 to test-worker. Let's check the sample application. It writes the current timestamp every second, but as you can see here and here, this line is 5:15 and this line is 5:21, so for about six minutes there is no output. During that period the node was stopped.
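For reference, the taint operation I just performed looks like this with kubectl; the node name test-worker2 comes from my kind cluster, and you should only add the taint after confirming that the node is really powered off or isolated:

    # Mark the shut-down node out of service so Kubernetes force-deletes
    # the pods remaining on it and detaches their volumes.
    kubectl taint nodes test-worker2 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

    # After the node is repaired and back in the cluster, remove the
    # taint manually (note the trailing "-") so it can receive pods again.
    kubectl taint nodes test-worker2 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-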
I will also talk about the division of roles in the non-graceful node shutdown feature. The role of Kubernetes is to forcefully delete the pods from the node to which the out-of-service taint has been added. On the other hand, adding the taint must be done by an administrator, a user, or something outside of Kubernetes. If a human does this operation manually, it is quite hard when there are 100 or more nodes.

Therefore, I will talk about how to use the non-graceful node shutdown feature in a production environment. I showed this table a few seconds ago and said that adding the taint must be done by an administrator, a user, or something outside of Kubernetes. Now I will introduce that "something outside of Kubernetes": the Node Health Check operator and the Self Node Remediation operator. These operators are developed in the Medik8s open source project. The Node Health Check operator monitors node conditions using a set of criteria, for example power-off or connection timeout, and detects unhealthy nodes. The other operator, the Self Node Remediation operator, remediates the unhealthy node: it reboots the node using a watchdog or some other mechanism such as a BMC.

I will also walk through the steps that combine the Node Health Check operator, the Self Node Remediation operator, and non-graceful node shutdown. Before using the Node Health Check operator, you define the unhealthy status in a YAML file (I will show a sketch of one after the demo), for example: the control plane has not received a heartbeat for, by default, more than 40 seconds. Then, when the Node Health Check operator detects that node1 is unhealthy according to this YAML file, the Self Node Remediation operator adds the out-of-service taint to node1. The following steps are basically the same as what I explained a few minutes ago. The different point is that the failed node is rebooted by the watchdog or some other mechanism such as a BMC. Once the failed node has been rebooted and there are no Terminating pods or other leftover objects, the Self Node Remediation operator deletes the out-of-service taint.

Then I will show you a demo of the Node Health Check operator, the Self Node Remediation operator, and non-graceful node shutdown. It's my first time showing a demo to you, so I'm a little bit nervous. OK, it's playing. As you can see, there are three nodes: a control plane and two worker nodes, kubernetes2 and kubernetes3. We start. I create a namespace named test, and I also create stateful pods; there are maybe seven or eight of them, I think, running across the nodes kubernetes1, 2, and 3. Then I stop one node. As you can see, I stop the kubernetes3 node: I power it off, which means a node shutdown, and it will restart later. It takes about five minutes for the pod status to change, so I will fast-forward. About five minutes later, as you can see, here is the out-of-service taint: on the kubernetes3 node, this taint has been added automatically by the Self Node Remediation operator. I will start the video again. As you can see, this pod was in ContainerCreating, and now all the pods are Running, so there is no problem. But all of the pods are running on the kubernetes1 or kubernetes2 nodes, not kubernetes3, right? These pods were moved to the other nodes by the non-graceful node shutdown feature. That's the end of my demo.
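Coming back to the YAML file I mentioned for defining the unhealthy status, here is a sketch of a NodeHealthCheck resource in the style of the Medik8s documentation. Treat the durations, the template name, and the namespace as illustrative assumptions, and check the exact schema for the operator versions you install:

    apiVersion: remediation.medik8s.io/v1alpha1
    kind: NodeHealthCheck
    metadata:
      name: nhc-workers
    spec:
      # Only watch worker nodes.
      selector:
        matchExpressions:
        - key: node-role.kubernetes.io/control-plane
          operator: DoesNotExist
      # A node counts as unhealthy when its Ready condition has been
      # False or Unknown (no heartbeat) for the given duration.
      unhealthyConditions:
      - type: Ready
        status: "False"
        duration: 300s
      - type: Ready
        status: Unknown
        duration: 300s
      # Delegate remediation (reboot plus the out-of-service taint) to
      # the Self Node Remediation operator via its template.
      remediationTemplate:
        apiVersion: self-node-remediation.medik8s.io/v1alpha1
        kind: SelfNodeRemediationTemplate
        namespace: self-node-remediation          # assumption: the operator's namespace
        name: self-node-remediation-automatic-strategy-template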
Then I would like to give the conclusion. Stateful workloads get stuck in Terminating status when a node is shut down or ends up in a non-recoverable state. The non-graceful node shutdown feature allows stateful workloads to fail over to a different node in such a situation. And the Node Health Check operator and the Self Node Remediation operator make it easier to use non-graceful node shutdown in a production environment. That is all for my presentation. Thank you for listening. Any questions?

Thanks for your presentation. Can you please explain once more why the pods end up in the Terminating status and why they stay there? I didn't quite understand why that happens.

Your question is why the pod status is stuck in Terminating, right?

Yes, what causes it to stay there?

The Kubernetes control plane requests the termination of the pod, but it then waits for a response from the kubelet, which runs on the failed node, and there is no fallback mechanism right now. That's the reason the pod stays in Terminating status forever. Before this feature, the one action we could take was to delete the node. That's what happens on some managed Kubernetes clusters: there is a monitoring system outside of Kubernetes that watches the node status, and when it detects a failure, it deletes the node, including the Node resource in Kubernetes. At that point the associated resources are deleted completely. But we would like to keep the node for investigating the failure reason or for other purposes.

OK, thank you.

Thank you. That's all. Thank you.