Hi, good morning everyone. Today I would like to talk about Kubernetes. The title is "Kubernetes Failover Improvement: Self-Healing of Pods with Persistent Volumes when a Kubernetes Node Is Down".

Here is today's agenda. First, I will talk about the Kubernetes failover issue. I will explain what happens when we use Kubernetes in production. I will also talk about the solution for this issue; it is currently being discussed in the Kubernetes community, so I will introduce it. Finally, I will talk about the development status in the Kubernetes community, and also about communication in the open source community. I would like to think about how we can make progress with development in an open source community.

First, I would like to introduce myself. My name is Yui Komori and I am a software engineer. I joined the Kubernetes community in 2019 and I am mainly contributing to SIG Storage and SIG Testing.

Then I will talk about Kubernetes. What is Kubernetes? I just copied and pasted this diagram from kubernetes.io. Kubernetes is an open source system for automating deployment, scaling, and management of containerized applications. Kubernetes has many features, and one of the most attractive is self-healing: it restarts containers that fail, replaces and reschedules containers when nodes die, kills containers that don't respond to your user-defined health check, and doesn't advertise them to clients until they are ready to serve.

Next, I would like to introduce some terms in Kubernetes. First, Pod: a Pod is a group of one or more application containers, plus some shared resources for those containers. Then Node: a Node is a worker machine in Kubernetes and may be either a virtual or a physical machine, depending on the cluster. Then Taint: a taint is a property of nodes. Taints are applied to nodes and forbid pods from being scheduled onto the nodes with matching taints unless the pods tolerate them.

I would also like to introduce some terms in Kubernetes related to storage volumes. First, PersistentVolume: a persistent volume is a piece of storage in the cluster. Next, the Container Storage Interface, called CSI, is a standard for exposing arbitrary block and file storage systems to containerized workloads on container orchestration systems like Kubernetes. Using CSI, third-party storage providers can write and deploy plugins exposing new storage systems in Kubernetes without ever having to touch the core Kubernetes code.

Let me also introduce the CSI architecture. A CSI driver consists of two plugins. One is the node plugin. The node component should be deployed on every node in the cluster through a DaemonSet. The node plugin communicates with the kubelet and serves the CSI Node service calls, which mount and unmount the storage volume from the storage system, making it available to the pod to consume. The other is the controller plugin. The controller component can be deployed as a Deployment or StatefulSet on any node in the cluster. It consists of the CSI driver that implements the CSI Controller service and one or more sidecar containers, like the external-provisioner or the external-attacher. These controller sidecar containers typically interact with Kubernetes objects and make calls to the driver's CSI Controller service.

I would also like to talk about the volume lifecycle in CSI. This figure represents the lifecycle of a volume when we use CSI, and these are the names of the RPCs; a small sketch of this lifecycle follows below.
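To make the lifecycle concrete before walking through it, here is a minimal, self-contained Go sketch that models it as a small state machine. The state and RPC names follow the CSI specification; everything else (the Volume struct and the transition helper) is illustrative only, not the actual CSI library.

```go
package main

import "fmt"

// VolumeState names follow the CSI spec's volume lifecycle.
type VolumeState string

const (
	Deleted   VolumeState = "DELETED" // not yet created / deleted again
	Created   VolumeState = "CREATED"
	NodeReady VolumeState = "NODE_READY"
	VolReady  VolumeState = "VOL_READY"
	Published VolumeState = "PUBLISHED"
)

// transitions maps each CSI RPC to its (from -> to) state change.
var transitions = map[string][2]VolumeState{
	"CreateVolume":            {Deleted, Created},
	"ControllerPublishVolume": {Created, NodeReady},  // attach PV to node
	"NodeStageVolume":         {NodeReady, VolReady}, // stage on the node
	"NodePublishVolume":       {VolReady, Published}, // mount into the pod
	// Reverse operations:
	"NodeUnpublishVolume":       {Published, VolReady}, // unmount from the pod
	"NodeUnstageVolume":         {VolReady, NodeReady},
	"ControllerUnpublishVolume": {NodeReady, Created}, // detach PV from node
	"DeleteVolume":              {Created, Deleted},
}

type Volume struct {
	Name  string
	State VolumeState
}

// Call applies an RPC if the volume is in the expected source state.
func (v *Volume) Call(rpc string) error {
	t, ok := transitions[rpc]
	if !ok {
		return fmt.Errorf("unknown RPC %q", rpc)
	}
	if v.State != t[0] {
		return fmt.Errorf("%s requires state %s, but %s is in %s", rpc, t[0], v.Name, v.State)
	}
	v.State = t[1]
	return nil
}

func main() {
	v := &Volume{Name: "pv1", State: Deleted}
	for _, rpc := range []string{"CreateVolume", "ControllerPublishVolume", "NodeStageVolume", "NodePublishVolume"} {
		if err := v.Call(rpc); err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("%-25s -> %s\n", rpc, v.State)
	}
}
```

Note that each reverse RPC only applies from its matching state: this is exactly why a dead node is a problem, as the talk explains next.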
First, when a container orchestration system like Kubernetes calls the CreateVolume RPC, the volume status moves to CREATED. Then, by calling the ControllerPublishVolume RPC, the status moves from CREATED to NODE_READY; in this RPC the persistent volume is attached to the node. Then, by calling the NodeStageVolume RPC, the status moves from NODE_READY to VOL_READY. Then, by calling the NodePublishVolume RPC, the status moves from VOL_READY to PUBLISHED; in this RPC the persistent volume is mounted into the pod. These RPCs have reverse operations: NodeUnpublishVolume does the unmount, and ControllerUnpublishVolume does the detach.

Next, I'd like to talk about the Kubernetes failover issue: self-healing of pods with persistent volumes when a Kubernetes node is down. Suppose there are two pods on a node, Node 1, and a persistent volume named PV1, and PV1 is attached to Node 1. When Node 1 goes down due to a hardware failure, we expect two things: first, that the two pods from Node 1 will be newly created on Node 2; second, that PV1 will be detached from Node 1 and attached to Node 2. What actually happens is that the two pods from Node 1 are newly created on Node 2 (that's good), but PV1 is not detached from Node 1. That's not what we expect, right?

Why does this issue happen? I'd like to talk about the reason. Currently, a persistent volume is not allowed to detach from a node while a VolumeAttachment object exists. A VolumeAttachment records the attachment between a node and a piece of storage; its existence means the persistent volume has not been detached yet. The VolumeAttachment resource can be deleted only once the volume is already unmounted, but when the node is down, the unmount can never happen, so we need a new flow which detaches the persistent volume from the node even if the node is down. (A small client-go sketch of the VolumeAttachment resource follows below.) Let me also explain what happens in the CSI volume lifecycle when this issue occurs: due to the node failure, we cannot call any of the RPCs that are served by the node plugin.

Next, I'd like to talk about the solution for this issue. A KEP has been posted in the Kubernetes community, so I will introduce it. First, what is a KEP? A KEP (Kubernetes Enhancement Proposal) is a way to propose, communicate, and coordinate on new efforts for the Kubernetes project. You can see the details on this page. The reason we need KEPs is that having them in one place makes it easier for people to track what is going on in the community and provides a structured historical record. A KEP makes sense especially when a new feature someone wants to propose has impact on two or more projects, because everyone in the different SIGs finds it easy to review.

Then I'd like to introduce KEP number 1116, which proposes a solution for this Kubernetes failover issue. The title is "Add Non-Graceful Node Shutdown". You can see this KEP at this URL; please check it. The goals of this KEP are, first, to increase the availability of stateful workloads when a node is shut down, and ultimately to achieve self-healing for stateful workloads. The non-goals of this KEP: first, node or control plane partitioning other than a node shutdown is not covered by the proposal, but will be addressed in the future and built on top of this design; second, implementing in-cluster logic to handle node or control plane partitioning, and enabling detach for all storage providers, are out of scope; and existing in-tree volumes are not targeted by this proposal. This KEP targets only CSI.
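Returning to the VolumeAttachment resource mentioned above, here is a hedged client-go sketch that lists the VolumeAttachment objects for a given node, the kind of record whose existence blocks detach. The kubeconfig path and node name are illustrative; the API calls themselves are standard client-go against the storage.k8s.io/v1 API.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Illustrative kubeconfig path; in-cluster config would also work.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	const nodeName = "node-1" // illustrative

	// VolumeAttachment objects record which PV is attached to which node.
	vas, err := clientset.StorageV1().VolumeAttachments().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, va := range vas.Items {
		if va.Spec.NodeName != nodeName {
			continue
		}
		pv := "<unknown>"
		if va.Spec.Source.PersistentVolumeName != nil {
			pv = *va.Spec.Source.PersistentVolumeName
		}
		// While such an object exists for the dead node, the PV
		// cannot be safely attached anywhere else.
		fmt.Printf("VolumeAttachment %s: pv=%s node=%s attached=%v\n",
			va.Name, pv, va.Spec.NodeName, va.Status.Attached)
	}
}
```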
Then I'd like to talk about what will change in the CSI volume lifecycle with this KEP. The KEP proposes a new flow like this, with new states, so I'll introduce them. First, QUARANTINED_S and QUARANTINED_SP are new states for the case where we cannot call the NodeUnpublishVolume or NodeUnstageVolume RPCs. When the volume status is PUBLISHED and the node goes down, we will be able to call a fenced ControllerUnpublishVolume, so the status changes from PUBLISHED to QUARANTINED_SP. Then, by calling the forced NodeUnpublishVolume RPC, the status changes from QUARANTINED_SP to QUARANTINED_S. Then, by calling the forced NodeUnstageVolume RPC, the status changes from QUARANTINED_S to CREATED.

Then I'd like to introduce the solution which the KEP proposes. When Node 1 goes down due to a hardware failure, the pod GC controller in kube-controller-manager deletes the pods on Node 1. The pod GC controller also applies a quarantine taint to Node 1; a client-go sketch of this step follows at the end of this part. The quarantine taint is a new taint which the KEP proposes. Applying it should happen before the pods are evicted, and while this taint is applied to a node, new pods should not be scheduled onto that node. Then the attach/detach controller in kube-controller-manager keeps checking whether the volume is still mounted, and deletes the VolumeAttachment resource (again, the VolumeAttachment resource records the attachment between node and storage). When the external-attacher detects that the VolumeAttachment object is being deleted, it calls the CSI driver's ControllerUnpublishVolume RPC. When ControllerUnpublishVolume is called, the CSI driver ensures that PV1 is not being used, and then PV1 is detached from Node 1. This is called storage fencing. One example of storage fencing: the persistent volume has a list called an access control list (ACL). With the ACL, we can manage which nodes can access the persistent volume, so by deleting Node 1's entry from the ACL of PV1, Node 1 can no longer access PV1. Then the pods are newly created on Node 2, and PV1 is detached from Node 1 and attached to Node 2. That's good.

Next, I will talk about communication in the open source community. As I said, we are discussing KEP number 1116. This KEP was posted in 2017, so we have been discussing this issue for over four years. You know, it's very long. I guess there are three reasons why managing this KEP has taken so long. One is that there are so many use cases we want to cover. The second is that this KEP impacts multiple projects. The third is a matter related to the release cycle. I will talk about these reasons in detail in the next slides.

The first reason: there are many use cases we want to cover. For example, the case where a shutdown command is executed can be solved by the graceful node shutdown feature, which enables Kubernetes to gracefully evict pods during a node shutdown, so we can set a grace period on the node. But the cases related to node failure, like when the node hardware is broken or when the kubelet is unresponsive, cannot be solved by the graceful shutdown feature, because when the node is down, the unmount cannot be performed, so the persistent volume cannot be detached from the node. It also seems there could be more use cases, like network failure, so in order to address this, listing up all the use cases and explaining them clearly in the KEP seems a good solution.
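Before moving on to the second reason, here is a hedged client-go sketch of the quarantine-taint step from the proposed flow: tainting a node so that no new pods are scheduled onto it. The taint key used here is purely illustrative; the actual key, effect, and which component applies it are defined by the KEP, not by this sketch.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Illustrative kubeconfig path; in a controller this would be in-cluster config.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	node, err := clientset.CoreV1().Nodes().Get(context.TODO(), "node-1", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Illustrative taint key; the real key is whatever the KEP defines.
	quarantine := corev1.Taint{
		Key:    "example.com/quarantine",
		Effect: corev1.TaintEffectNoSchedule, // keep new pods off this node
	}
	for _, t := range node.Spec.Taints {
		if t.Key == quarantine.Key {
			return // already tainted, nothing to do
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, quarantine)
	if _, err := clientset.CoreV1().Nodes().Update(context.TODO(), node, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
}
```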
The second reason: this KEP impacts multiple projects. The issue crosses over SIG Storage, which handles storage, and SIG Node, which handles the node components including the kubelet. So in order to progress with this development, we need both SIGs' agreement. This KEP was posted by a SIG Storage member and, at first, only SIG Storage members discussed it. Recently, SIG Node members have begun to participate in the discussion, so I think clarifying the stakeholders helps us make appropriate decisions.

The third reason: a matter related to the release cycle. This is the current release cycle of Kubernetes. You can see that the enhancements period is very short, and after the enhancements freeze, developers are too busy with other work to continue the discussion on the KEP. As a solution to this, always being aware of the release cycle, and checking the progress of the KEP in the weekly meetings, seem good to me.

Kubernetes is used in production and is said to be production-ready. But when we use Kubernetes in production, we sometimes feel inconvenienced or face issues. In such cases, we can share these issues and solve them in the community. In this presentation, I explained one such topic, around pod self-healing. If you are interested in this issue, please join us in our discussion. Thank you for listening. That's all. Bye-bye.