Hello everyone. We are very glad to be here today. Let's get started with our presentation. The title is "Cattle, Not Pets, but Don't Delete It until You Investigate It."

Let us introduce ourselves first. My name is Masaki Kimura. I have been contributing to the OSS community as a developer. My areas of concern are Kubernetes, storage, and reliability. I'm one of the main developers of the raw block volume and CSI features, and I designed and implemented the prototype of the cross-namespace volume data source feature, which became alpha this year.

And my name is Keisuke Saito. I have contributed to developing infrastructure, and my hobby is fishing, as shown in the picture on the right. Recently, I have been engaged in providing OpenShift Managed Services to enterprise customers and in improving functions such as machine scaling.

Here are the contents of our presentation today. I will first talk about the background and the issues. Then Masaki will take over and talk about existing technologies, solutions, and the summary.

Now let's begin. The first topic is the background. First of all, have you ever heard of the analogy called "pets versus cattle"? This analogy is useful for understanding how infrastructure management has changed up to today, so you might have heard it somewhere. By the way, the dog on the left is my boss's dog, named Dorup, and the picture on the right was taken by Masaki.

Okay. Pets describe a service model of pre-virtualization, like on-premises. On the other hand, cattle describe a service model of post-virtualization. Let me explain the details of pets and cattle.

Pets represent a service model of pre-virtualization, and in this approach the individual is important. Some of you may have dogs, cats, or other animals at home. You give your pets a special name, greetings, and care. When they get sick, you give them thorough treatment until they are cured completely. This shows our idea of pets. Let's see how services are managed in this analogy. Servers are carefully maintained because they cannot be replaced. A unique name is assigned, and when an application or server fails, it is recovered with significant time and effort, and the causes of the failure are investigated to prevent recurrence. That's the idea of pets.

Next, I will explain cattle. Cattle represent a service model of post-virtualization, and in this approach the group is important. Unlike pets, there may be only a few people in this room who have cattle. You do not give your cattle a special name, greetings, and care; you give them uniform treatment and care. When they get sick, the affected individuals are removed. This shows our idea of cattle. Let's see how services are managed in this analogy. Servers are equally maintained because they can be replaced at any time. A random name is assigned, and when an application or server fails, the affected server is quickly discarded without recovery and replaced with a new one for service continuity. This is the cattle analogy.

We've compared the idea of pets with cattle so far. In recent cloud-native service management, we put emphasis on service availability, so an application or server is discarded and replaced with a new one if it fails. For example, last year in Japan, many chickens were disposed of because of a bird flu outbreak at a poultry farm. The cause was investigated, and it was revealed that highly pathogenic avian influenza viruses had been transmitted from wild birds to the chickens. After the investigation, many chickens were disposed of to prevent the infection from spreading.
Originally, it started with one chicken, but this decision was made while keeping that one chicken for detailed investigation. In other words, it is important to investigate the individual when a problem occurs, especially when the problem often recurs or spreads. The importance of investigating the causes of problems also applies to Kubernetes. However, there are failover issues that are an obstacle to investigating the causes. In the next slides, I will explain how failover gets in the way of investigating the causes.

So let's see the issues. I will explain the failover issue of pods with volumes. First of all, failover is a function to switch to a backup system when an application or server fails. It helps to improve availability and reliability. In Kubernetes, pods are also automatically failed over. However, we need to be careful when the pod is stateful, especially when the pod has volumes whose data may be corrupted by concurrent writes.

I will explain an example with a diagram. Let's assume an environment that has one control plane and two nodes, node 0 and node 1. On node 0, one pod with a PV is running. If node 0 fails and really stops, the ideal failover is that the pod moves from node 0 to node 1. On the other hand, in a distributed system, it is not always possible to assume that the state of the nodes is properly recognized. Thus, even if node 0 appears to be stopped from the control plane's point of view, for some reason node 0 may actually still be running. In this case, the pods on both node 0 and node 1 will write to the PV. As a result, data may be corrupted by concurrent writes. To prevent this problem, in Kubernetes version 1.23 and earlier there is no other way to fail over than deleting the node. These are the failover issues for pods with volumes on Kubernetes.

Next, I will explain the details of the failover by node deletion that I mentioned. In Kubernetes version 1.23 and earlier, to guarantee that there are no writes to the PV, the pod is failed over after the node is deleted. Let's see the flow of failover with node deletion. Firstly, the admin detects the node 0 failure. Secondly, the admin deletes node 0. Then we can guarantee that there are no writes to the PV, because node 0 has already been deleted. Finally, when the scheduler on Kubernetes recognizes that node 0 is deleted, the pod with the PV which was running on node 0 is restarted on node 1. Because the PV is written by a single pod, the pod can be failed over with no concerns about concurrent writes. The pro of this method is that availability and reliability can be regained quickly. On the other hand, the con, which is also related to the machine health check we will explain later, is that data on node 0, such as metadata and statistics that cannot be retrieved from the application, is deleted, and it becomes difficult to determine the cause of the failure.

Now, let's talk about what kind of failover is better, that is, what is ideal. To solve the issues that we discussed so far, there are two requirements for the ideal failover of pods with volumes. Firstly, failover is done. Secondly, the node is not deleted. These are the goals to keep the data necessary to investigate the causes. Next, let's see the flow of the ideal failover. First, the node 0 failure is detected. Second, the node is fenced without being deleted; fencing is a new word that I will explain later. Third, the pod which was running on node 0 is failed over to node 1.
As a result, the node is not deleted, and the cause of the failure can be investigated. This is the ideal failover of pods with volumes.

Now, let me talk about fencing in detail. I will explain the definition and purpose of fencing. The definition is to isolate the failed node from the cluster, and the purpose is to prevent unintentional access to shared resources in order to ensure data integrity. Fencing can be achieved in other ways than deleting nodes. As examples, I introduce two approaches to fencing. The first one is the approach of stopping the node itself. One method is to power off the node to prevent it from issuing I/O. This is called power fencing, and rebooting the node is also included in this approach. The second one is the approach of prohibiting access to shared resources. I introduce SCSI persistent reservations and switch configuration changes. The former uses the persistent reservation commands of the SCSI protocol to limit which nodes can issue I/O to a particular disk. The latter prohibits routing from specific nodes to the storage. Of course, there are other methods of fencing, and you can choose the fencing method that suits your purpose and requirements. In the next chapter, Masaki will describe the current failover technologies implemented in Kubernetes.

As existing technologies, I will explain machine health check (MHC) and non-graceful shutdown. Machine health check (MHC) keeps checking the health of nodes and deletes unhealthy nodes. MHC is handled by the machine health check operator. Let's see the detailed behavior step by step. First, the machine health check operator keeps checking the health of nodes and detects a failure of node 0. Second, the operator requests deletion of node 0 via the cloud API, and the instance is deleted. After the instance is deleted, the operator deletes the node resource via the Kubernetes API. Finally, once the Kubernetes scheduler detects that node 0 is deleted, it can guarantee that there is no volume access from the pod on node 0, so the scheduler decides to fail over the pod. As explained earlier, if nodes are deleted automatically, pods with volumes can fail over after the nodes are deleted. On the other hand, because the nodes are deleted, information that is required to investigate the cause of a failure cannot be obtained from the node.

The next existing technology is non-graceful shutdown. Non-graceful shutdown is a feature to allow failover of pods with volumes without node deletion. It became stable in Kubernetes 1.28. A summary of its mechanism is that once administrators or tools complete fencing of a node and add a special taint to the node, the Kubernetes scheduler regards that there is no risk of data corruption and starts failover of the pods. Let's see the detailed behavior step by step. First, an admin or a tool detects a failure of a node and fences node 0, for example by powering off node 0. Second, once the admin or the tool ensures that the node is fenced, it adds a special taint to the node. The taint, which is quite long, is node.kubernetes.io/out-of-service=nodeshutdown:NoExecute. The taint is used to tell the Kubernetes scheduler that the node is surely fenced, in other words, that there is no write access to any volumes from any pods on the node. Finally, once the Kubernetes scheduler detects that the taint is added to the node, it decides to fail over the pods to another node. In this way, we can achieve failover of pods with volumes without node deletion.
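Not from the slides, but as a minimal sketch of what the "add the taint" step could look like with client-go; the function name and the error handling are our own, and only the taint key, value, and effect come from the non-graceful shutdown feature itself:

```go
package remediation

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// addOutOfServiceTaint adds the non-graceful shutdown taint to a node.
// Call this only after fencing of the node has been confirmed, because
// the taint tells the scheduler it is safe to fail the pods over.
func addOutOfServiceTaint(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	taint := v1.Taint{
		Key:    "node.kubernetes.io/out-of-service",
		Value:  "nodeshutdown",
		Effect: v1.TaintEffectNoExecute,
	}
	// Do nothing if the taint is already present.
	for _, t := range node.Spec.Taints {
		if t.Key == taint.Key && t.Effect == taint.Effect {
			return nil
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, taint)
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```

The important design point is the ordering: a real tool must confirm that the node is fenced before calling anything like this, since the scheduler will immediately start evicting and rescheduling the pods once the taint appears.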
The advantage of this feature is that because the failed node isn't deleted, the information needed for investigation isn't deleted either. On the other hand, the disadvantage of this feature is that no standard mechanism is provided to automate the fencing and the adding of the taint, so it requires time for failover and cost for operations.

Let's move on to the solutions. Before explaining the solutions, let's look back at the goal again. The goal is to achieve both "failover is done" and "node is not deleted". In this way, we aim to keep the data necessary to investigate the cause. To achieve the goal, we utilize the external remediation feature of MHC. External remediation is a feature to change the remediation process of MHC to any desired process. By using this feature, we will automate the fencing and the adding of the taint, to shorten the time to failover and to improve investigation of a failure.

I will explain the external remediation feature itself first. In external remediation, we need two additional components, a remediation CR and an external remediation operator, which work together. Let's see the detailed behavior step by step. First, the machine health check operator detects a failure of node 0. This behavior is the same as the existing logic. Second, instead of deleting node 0 here, the machine health check operator creates a remediation CR. This CR has all the information required for the external remediation operator to do its remediation process. Finally, the external remediation operator keeps watching the remediation CRs. Once it detects that the CR is created, it executes the remediation process for the node. The remediation process is up to the developer of the external remediation operator, so any process can be executed. We will utilize this feature to automate the fencing and the adding of the taint, to shorten the time to failover and to improve investigation of a failure.

I will first explain a case where the remediation process is powering off the node and adding the taint. The remediation consists of three steps, described as 3-1 to 3-3. Let's see the detailed behavior step by step. Step one and step two are the same as explained: the machine health check operator detects a failure of node 0 and creates a remediation CR. As the remediation process, the external remediation operator first requests the power-off of node 0 via a cloud API. Second, after the power-off succeeds, the operator adds the special taint of non-graceful shutdown to node 0. Finally, the Kubernetes scheduler detects that the taint is added to the node and starts to fail over the pod to another node. As fencing, powering off the node is done automatically, so safe failover is achieved. Also, the node is stopped instead of deleted, so we can avoid deleting the data on the node that is required to investigate the failure. As a result, we can achieve both "failover is done" and "node is not deleted". We have achieved the original goal at this point. However, the status of memory is lost when the node is powered off, because system memory is volatile. So, from the viewpoint of improving investigation, we still have room for improvement.

Let's consider a further improvement. To solve the issue that the status of memory is lost, we already have a mature technology called kdump. Kdump is a feature of the Linux kernel to write out the status of memory to a disk for later investigation. We can request to take a kdump by using a non-maskable interrupt (NMI), and then the status of memory can be written out to a disk.
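As a minimal sketch of such a request, assuming a libvirt-managed virtual environment where the guest kernel is configured with kernel.unknown_nmi_panic=1 so that the NMI causes a panic and kdump runs; the domain name "node0" and the helper function are hypothetical:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
)

// injectNMI asks the hypervisor to inject a non-maskable interrupt into a
// libvirt domain. If the guest kernel panics on an unknown NMI
// (kernel.unknown_nmi_panic=1), the panic triggers kdump, which then
// writes the memory contents to the dump disk.
func injectNMI(ctx context.Context, domain string) error {
	out, err := exec.CommandContext(ctx, "virsh", "inject-nmi", domain).CombinedOutput()
	if err != nil {
		return fmt.Errorf("virsh inject-nmi %s failed: %v: %s", domain, err, out)
	}
	return nil
}

func main() {
	// "node0" is a hypothetical libvirt domain name for the failed node.
	if err := injectNMI(context.Background(), "node0"); err != nil {
		fmt.Println(err)
	}
}
```

On bare metal, an equivalent trigger would be a diagnostic interrupt via the BMC (for example, ipmitool chassis power diag), and some cloud vendors expose an API for it as well.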
The dump data retains the status of memory at the time of the failure, so we can utilize it for later investigation. As described in the table, an NMI can be injected externally, because each platform, such as bare metal, virtual environments, and cloud vendors, provides a CLI, Web UI, or API for it.

From a fencing viewpoint, there is one more important thing to mention: after kdump starts, writing to disk is restricted to only the dump disk. fence_kdump is a feature to fence by detecting the start of kdump, utilizing exactly this fact. Note that there is a lag between the successful completion of the kdump request and the actual start of kdump, so we cannot be sure that kdump has started just from the successful completion of the request. That is why fence_kdump fences nodes by detecting the start of kdump itself.

Let's see the detailed behavior step by step. First, an admin requests to take a kdump, and the OS starts taking the kdump. Second, once taking the kdump has started, fence_kdump_send is executed on the OS, and a special packet to notify the start of kdump is sent. The destination of the packet is configured in advance by specifying the host name or IP address of the host that the fence_kdump listener is running on. Finally, once the fence_kdump listener receives the notification packet, it detects that the fencing is completed. In this way, we can separately handle the request to take a kdump and the detection of the start of taking it, and fencing can be achieved by using kdump.

We can combine these features to make the remediation logic "fencing with fence_kdump and adding the taint". Let's assume that kdump and fence_kdump_send are already configured on the node, and that the external remediation operator is implemented to handle the fence_kdump packet to detect the start of kdump. Let's see the detailed behavior step by step. The machine health check operator detects a failure of node 0 and creates a remediation CR in the same way as before. The remediation consists of five steps, described as 3-1 to 3-5. First, the external remediation operator requests to take the kdump by sending an NMI to the OS via a cloud API. Second, the OS starts taking the kdump. Third, fence_kdump_send notifies that taking the kdump has started. Fourth, once the external remediation operator receives the notification, it detects that fencing is completed, so it adds the taint to the node. Finally, the Kubernetes scheduler detects that the taint is added, so it starts the pod failover. In this way, pod failover is safely achieved while keeping both the node itself and the status of its memory. As a result, more detailed information for investigating a failure can be obtained.
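To make the notification step more concrete, here is a minimal sketch of the listener side, assuming fence_kdump's default UDP port 7410. The real fence_kdump agent also validates a magic number and protocol version inside the packet, which this simplified sketch skips; the function name and the addresses are our own illustration:

```go
package main

import (
	"fmt"
	"net"
)

// waitForKdump listens on the fence_kdump notification port and returns once
// a packet arrives from the expected node address, which we treat as the
// signal that kdump has started and the node is effectively fenced.
func waitForKdump(nodeAddr string) error {
	conn, err := net.ListenPacket("udp", ":7410")
	if err != nil {
		return err
	}
	defer conn.Close()

	buf := make([]byte, 64)
	for {
		_, from, err := conn.ReadFrom(buf)
		if err != nil {
			return err
		}
		if udp, ok := from.(*net.UDPAddr); ok && udp.IP.String() == nodeAddr {
			return nil // kdump has started on the node; fencing is complete
		}
	}
}

func main() {
	// 192.0.2.10 is a placeholder address for the failed node.
	if err := waitForKdump("192.0.2.10"); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("node is fenced; safe to add the out-of-service taint")
}
```

In an external remediation operator, this kind of wait would sit between sending the NMI and adding the out-of-service taint.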
In summary, as the background, we've explained the famous analogy of pets versus cattle, and that even for cattle, that is, in the cloud-native world, investigation of failures is important. As a specific issue of Kubernetes related to that background, we've introduced the dilemma of failover and investigation: on a Kubernetes node failure, it is hard to achieve both shortening the time to failover and improving investigation of the failure. This is because, to avoid data corruption, fencing of the failed node is required, and the existing technologies only allow us to choose either "automatic failover but not good for investigation" or "good for investigation but no automatic failover". The existing technologies are MHC, which provides automatic fencing by deleting nodes, and non-graceful shutdown, which provides manual fencing without deleting nodes. As solutions, we've shown that by utilizing the external remediation feature of MHC, the remediation process of non-graceful shutdown can be automated. As fencing methods, power fencing and fence_kdump can be used, and the latter is better for investigation. That's all for our presentation. Thank you for your attention. Any questions?

Q: Can you show slide 4-5? Yeah, 4-5. Just one question. You said that when a worker node has a failure, you have some mechanism to fail over with non-graceful shutdown or something like that. What happens when a control plane node has a failure? Does the same behavior happen?

A: Actually, the control plane should be configured to be redundant, so the control plane itself keeps working. Sorry, so your question is, when the failure happens on a control plane node itself, does the same failover happen?

Q: Yeah, that's my question.

A: Okay, so you can choose either of them: configure the control plane nodes to get the same behavior, or exclude the control plane nodes so that they are not fenced.

Q: Oh, okay. So that's the user's choice; you have to select which option to take.

A: Yeah, and I think there should be no problem with fencing a control plane node if it has a failure; it fails over to another node just like we described in the presentation. The control plane's pods are already redundant, so they don't need to be failed over in this sense, but you can configure it as you like.

Q: Thank you very much for your presentation. My name is Akihio Hasegawa. Could you tell me an example of using kdump? You mentioned you want to do investigation; I'd like to have some example of what you would investigate using kdump.

A: Okay, so you mean which kind of investigation?

Q: Yeah, because we're also operating systems, so sometimes we have a similar situation. I would like to know whether kdump is very useful or not, and for what kind of problems.

A: Actually, examples are kernel bugs that can only be investigated from there.

Q: Does that happen sometimes, or not?

A: It happens, but I'm not sure about the frequency; it depends on your environment and your workload.

Q: Okay, thank you.

Any other questions? I think there are no more questions, so thank you very much.