Hello, everyone. My name is Xing Yang. I work at VMware on the cloud storage team. I'm also a co-chair of CNCF TAG Storage and a co-chair of Kubernetes SIG Storage. Here is my co-speaker, Nick.

Hello, my name is Nick Ren, Yuquan Ren. I come from ByteDance, and I am now focusing on Kubernetes-related work.

Thanks, Nick. Today we will talk about how to keep persistent volumes healthy for stateful workloads. Here is today's agenda. We will first talk about what problems we could encounter after a volume is provisioned. Then we will introduce the Volume Health Monitor feature. We will give a demo of the existing functionality of the feature. Then we will talk about what we are working on in the Kubernetes 1.23 release to improve the feature. And finally, we will talk about what we are planning to do in the future.

So let's see what could happen to your volume after it is provisioned. As shown here, we have an application running using a StatefulSet with three replicas. It has three pods, each using a PVC. Everything looks perfect for a while, but then one day something happens. An admin was working on something and made a mistake by accident, and as a result, the volume was deleted from the storage system. The admin didn't realize that he or she did something wrong that could affect applications running on the platform. On the surface, everything looked fine, because Kubernetes does not check what happens to the underlying volume on the storage system after it is provisioned and used by a pod. Everything looked normal until the application tried to write to the volume. Then, of course, it failed. Now the user tried to figure out what happened by checking the logs, but there were no clear messages indicating that the volume was deleted. So the user opened a ticket with the infrastructure team, and the infrastructure team started to troubleshoot the problem.
It is not easy for them to root-cause the problem either, as there isn't enough information in the logs. So it's going to take a while for the real problem to be discovered.

There are other possible failures as well. The admin may have removed the disk for maintenance or replacement without knowing that it has an impact on the running applications. The disk that the volume resides on could fail. There may be configuration issues with the underlying storage system that affect the volume's health. The volume could be out of capacity, so the app can no longer write to the volume until the volume is expanded. The disk may be degrading, which affects its performance. There are other possible problems that could happen to the volume after it is provisioned and used by a pod. For local volumes, if the node where the local volume resides fails, the volume does not exist anymore either. There may be read/write I/O errors. The file system on the volume may be corrupted. The file system may be out of capacity. The volume may be unmounted by accident outside of Kubernetes. There could be other issues that are not captured here.

So how do we solve these problems and improve the user experience? To solve this, we introduced an alpha feature called volume health monitoring. Without volume health monitoring, Kubernetes has no knowledge of what happens to the underlying volumes on the storage system after a PVC is provisioned and used by a pod. With this feature, the CSI driver can communicate with the storage system, find out what happens to the volumes, and communicate that back to Kubernetes, so Kubernetes can report events on PVCs or pods if volume conditions become abnormal. This feature includes two parts. There is an external volume health monitor controller, a sidecar deployed with the CSI driver, that monitors volume health from the controller side. It reports events on the PVC when an abnormal condition is detected. Kubernetes also monitors volume health from the node side.
It reports events on the pod when an abnormal volume condition is detected.

Here we have a CSI deployment example that shows the various Kubernetes components, the CSI driver, and the storage system that is used to persist the data on the volumes. We have kube-controller-manager on the master node. The CSI driver controller plugin is deployed together with the Kubernetes CSI external-provisioner, external-attacher, and external-health-monitor sidecars. Note that the CSI driver controller pod does not have to run on the same node as the Kubernetes master, but it is recommended to run it on dedicated control plane nodes. The Kubernetes CSI sidecars watch Kubernetes API objects such as PersistentVolumeClaims, PersistentVolumes, and VolumeAttachments to detect create volume and attach volume requests. The sidecars call the CSI driver, and the CSI driver communicates with the storage system to complete those volume operations. The external health monitor controller works differently from the other sidecars: it periodically calls the CSI driver to retrieve volume health information from the storage system.

On the Kubernetes worker nodes, we have kubelet and the CSI driver node plugin deployed together with the node-driver-registrar sidecar container. The node-driver-registrar fetches driver information using NodeGetInfo from the CSI endpoint and then registers the CSI driver with the kubelet on that node. Kubelet directly issues the NodeGetInfo, NodeStageVolume, and NodePublishVolume calls against CSI drivers to get driver info and mount volumes. Kubelet also periodically calls NodeGetVolumeStats to get volume stats, including volume health information.

Now let's take a closer look at the external health monitor controller. This controller calls either the ListVolumes or the ControllerGetVolume CSI RPC and reports volume condition abnormal events with messages on PVCs if abnormal volume conditions are detected.
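The controller's choice between those two RPCs can be sketched roughly as follows. This is a hypothetical illustration, not the controller's actual source: the capability names here are simplified stand-ins for the constants generated from the CSI spec, and `chooseRPC` is an invented helper.

```go
package main

import "fmt"

// Stand-in names for the CSI controller capabilities discussed in this talk;
// the real constants live in the CSI spec's generated Go bindings.
const (
	CapListVolumes     = "LIST_VOLUMES"
	CapGetVolume       = "GET_VOLUME"
	CapVolumeCondition = "VOLUME_CONDITION"
)

func has(caps []string, c string) bool {
	for _, x := range caps {
		if x == c {
			return true
		}
	}
	return false
}

// chooseRPC mirrors the monitor controller's decision described in the talk:
// health reporting requires VOLUME_CONDITION, and ListVolumes is preferred
// over ControllerGetVolume (one call covers all volumes instead of one call
// per volume) when the driver advertises both.
func chooseRPC(caps []string) string {
	if !has(caps, CapVolumeCondition) {
		return "" // driver cannot report volume health at all
	}
	if has(caps, CapListVolumes) {
		return "ListVolumes"
	}
	if has(caps, CapGetVolume) {
		return "ControllerGetVolume"
	}
	return ""
}

func main() {
	fmt.Println(chooseRPC([]string{CapListVolumes, CapGetVolume, CapVolumeCondition})) // ListVolumes
	fmt.Println(chooseRPC([]string{CapGetVolume, CapVolumeCondition}))                 // ControllerGetVolume
	fmt.Println(chooseRPC([]string{CapListVolumes}))                                   // empty: no health support
}
```

The preference order matches the performance consideration mentioned in the talk: polling every volume individually would multiply the RPC load on the storage backend.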
If the CSI driver supports the LIST_VOLUMES and VOLUME_CONDITION controller capabilities, it must implement the ListVolumes controller RPC and report the volume condition in the response. If a CSI driver supports the GET_VOLUME and VOLUME_CONDITION controller capabilities, it must implement the ControllerGetVolume controller RPC and report the volume condition in the response. If a CSI driver supports both LIST_VOLUMES and GET_VOLUME together with the VOLUME_CONDITION controller capability, then only the ListVolumes CSI RPC will be invoked by the external health monitor controller, due to performance considerations.

There is also a node watcher component in the controller. If the enable-node-watcher flag is set to true when deploying the external health monitor controller, node failure events will be watched. When a node failure event is detected, an event will be reported on the PVC to indicate that pods using this PVC are on a failed node. The controller-side volume health monitoring has been alpha since the Kubernetes 1.19 release.

Now let's take a look at the volume health monitoring feature from the node side. When volume health monitoring was first introduced in 1.19, there was an external health monitoring agent that monitored volume health on the node side. In 1.21 we made a design change to avoid duplicate CSI RPC calls. Kubelet already calls NodeGetVolumeStats to retrieve volume metrics such as available, total, and used capacity. In addition to that, it now also retrieves the volume health condition through the NodeGetVolumeStats CSI RPC call and reports volume condition abnormal events with messages on pods if abnormal volume conditions are detected. Only CSI drivers with the VOLUME_CONDITION node capability support volume health monitoring in kubelet. We added a new alpha feature gate called CSIVolumeHealth in the 1.21 release to introduce this feature in kubelet. Here is how volume condition is defined in the CSI spec.
If there is a problem with the volume or the storage system, the abnormal field in the volume condition should be set to true. There is also a message field that you can use to add additional information explaining what the abnormal condition is. Note that if a volume is not found on the storage system, the abnormal field should also be set to true, so that an event can be reported on the PVC, which still exists in the Kubernetes cluster. So now I'm going to hand it over to Nick. He's going to show us a demo.

Okay, thank you, Xing. As Xing said, we created a demo to visually show what the volume health monitor can do right now. We expect it to send abnormal volume events to the pod. Xing, could you play the video?

First, we set up a Kubernetes cluster locally using the local-up-cluster script provided in the Kubernetes repo; the Kubernetes version we chose is 1.19. I made some changes to the deployment files in order to set up the testing environment and run the demo correctly. I will summarize the changes later, so we can ignore them for now. It may take some time for the cluster to get ready. Okay, the local cluster is ready. Then let's create the CSI sidecar containers and the hostpath plugin driver using the deployment files from the csi-driver-host-path repository; the PV data on the host is in this folder. Run the deploy script, and all the containers will be created. After that, we will create a storage class, a PVC, and a pod as well. Let's create the storage class and the PVC. We can see that the PVC is created and one PV is dynamically provisioned by the hostpath driver. Let's describe the PV. We can see it is dynamically provisioned by the hostpath driver. Then we create a pod using that PVC. We can see that the pod is created, and it is running now. Okay, then let's delete the mount path.
After deleting the mount path, we expect the volume health monitor to send an abnormal event to the pod. So let's describe the pod. There is no event right now; it may take several seconds for the volume health monitor to send the abnormal event. Still no abnormal event... Okay, the abnormal event is here. So we can see that the volume health monitor works as expected. Okay, next, please.

Here I summarize the changes I made to the deployment files. Since some storage driver containers need the privileged security option, I set the allow-privileged flag to true for the API server. This is the only change I made to Kubernetes. I also made a small change to the hostpath plugin YAML file: I changed the CSI provisioner image version to 3.0.0, because the previous version has some bugs. With these YAML files I deployed the containers, set up the local environment, and ran the demo correctly. So if you are interested in this feature, you can test it in your environment too. Okay, next, please.

As Xing just mentioned, the node-side monitoring component is now kubelet, but at the beginning we created an external agent that was responsible for the node-side monitoring work. In the end we decided to move all the agent work into kubelet. So for now, kubelet is responsible for checking volume health conditions periodically and sending abnormal events if it detects problems. We are also trying to let kubelet emit metrics too, and this is targeting the Kubernetes 1.23 release. Okay, next, please.

Apart from the current work, we are also discussing what we can do in the future for the volume health monitor. The most important topic we are talking about is automatic reaction to unhealthy volumes. I listed some cases here. For example, if one node breaks down and we have StatefulSet pods with local PVs running on that node, what can we do?
We could wait for more than five minutes for Kubernetes to delete the pods, or we could delete the pods ourselves. But since they are using local PVs, the newly created pods will get stuck in the Pending state forever if the node cannot recover. So another option is to delete the pods and the PVCs as well, so that the new pods have a chance to be scheduled to other nodes. Another case: if the volume is out of capacity, maybe we want to resize the volume automatically in order to avoid data loss. Also, if the underlying volume is deleted by mistake, maybe we prefer to delete the related applications automatically. If we have the ability to do reactions, another thing we may want to do is support a push mechanism too, so that customized drivers can report problems by themselves. Maybe we need to add a new PV or PVC status condition to indicate whether the underlying volume is healthy or not, and the reaction part can then react to this status condition. Some users are asking for this feature too. There may be some other important features we need to add for the volume health monitor, so we can discuss this offline if you have any ideas. Okay, next, please.

We added some resources here. I want to say that if you are interested, please join us to make this feature better. You can try this feature if you need it, you can report problems when you deploy or use it, and it will be super great if you can contribute your proposals to solve these problems. Thank you.