Today, Jan and I will be giving an intro and deep dive into Kubernetes SIG Storage. My name is Xing Yang. I work at VMware on the cloud native storage team, and I'm also a co-chair of SIG Storage. And I am Jan Šafránek. I'm from Red Hat, I work on OpenShift storage, and I am a tech lead of Kubernetes SIG Storage.

So, what are we going to talk about? We will talk about what SIG Storage is, what we did in a couple of past releases, what is being developed right now, and most importantly, how to get involved. At the end, we would like to have some time for questions and answers.

So, what actually is the special interest group for storage? It's a pretty loose group of people. We don't have any onboarding process or graduation; you just come and contribute, and that's how you become part of SIG Storage. We have two co-chairs, Saad Ali from Google and Xing Yang from VMware, and we have two tech leads, Michelle Au from Google and me from Red Hat.

We have several storage-related channels on Kubernetes Slack. Here are some numbers from the biggest one: the main SIG Storage channel has 5,000 people on it. But that doesn't mean that everybody contributes. Most people just come and ask, or find something in the history, and never say anything on the channel. The group that actually tries to answer questions, review pull requests, fix issues, and work on new features is pretty small. We have a regular bi-weekly SIG Storage Zoom meeting; on average we have 24 or 25 attendees, but again, not everybody is speaking. I would say about 30% are active and the rest are just listening. And over time we have accumulated 30 unique approvers across the different repositories and directories our SIG owns.

What do we do? We have a charter for that. We basically maintain the storage APIs for users: persistent volumes, persistent volume claims, snapshots, snapshot contents, storage classes, volume attachments, storage capacities, and so on.
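(For readers following along at home, here is a minimal sketch of one of those user-facing objects, a PersistentVolumeClaim, expressed as a Python dict for readability; the storage class name is a made-up example.)

```python
import json

# Minimal sketch of a PersistentVolumeClaim, one of the user-facing
# APIs SIG Storage maintains. The "fast-ssd" class name is hypothetical.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "data", "namespace": "default"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "fast-ssd",  # hypothetical class name
        "resources": {"requests": {"storage": "10Gi"}},
    },
}

print(json.dumps(pvc, indent=2))
```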
So all the APIs exposed in Kubernetes related to storage, we maintain them, and we also maintain the implementation of those APIs in Kubernetes. On the bottom end, we maintain the Kubernetes volume plugins that are in the kubernetes/kubernetes repository, like NFS, like Ceph RBD, and so on. Secret, ConfigMap, downward API, projected, and emptyDir volumes are an exception: we maintain those together with SIG Node, because SIG Node knows better than us how to get a secret in Kubernetes (they already have functionality for that), and we know how to mount volumes and present them to pods.

We also created and maintain the Container Storage Interface, CSI, and we maintain the Kubernetes implementation of it, both in the kubernetes/kubernetes repository and in many of our CSI sidecars. But we don't maintain too many CSI drivers. We maintain the generic ones, like NFS and Samba, but most of the more important drivers are owned by cloud providers (like the CSI drivers for AWS and the other clouds) or by other communities. For example, Rook maintains the CSI driver for Ceph RBD and CephFS.

We also started the Container Object Storage Interface, which is our attempt to provide object storage to pods; object storage in this case means S3 buckets and similar object stores. This is still in alpha.

So, what did we do in 1.26? As GA in 1.26, we allow CSI drivers to opt in to applying fsGroup to volumes themselves. fsGroup is the Kubernetes mechanism that allows a pod running as a random user and group to access data on a volume that is owned by another random user and group. With fsGroup, Kubelet basically changes the ownership of all files on a volume so the pod can access them. This is slow, and some storage backends have better options for doing that. So if the storage backend has some better way to apply fsGroup to a volume, its driver can opt in.
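To make that concrete, here is an illustrative sketch (Python, not kubelet source code) of the decision: a driver that advertises the CSI VOLUME_MOUNT_GROUP node capability receives the pod's fsGroup at mount time, while other drivers fall back to the recursive ownership change. The function and its strings are invented for illustration.

```python
# Illustrative sketch (not kubelet source): roughly how the fsGroup
# decision works. A driver advertising the CSI VOLUME_MOUNT_GROUP
# capability gets the pod's fsGroup passed down at mount time instead
# of kubelet recursively chown-ing every file on the volume.

def apply_fsgroup(driver_capabilities: set, fsgroup: int) -> str:
    if "VOLUME_MOUNT_GROUP" in driver_capabilities:
        # Kubelet hands fsGroup to the driver (e.g. as a mount option
        # such as gid=... for SMB); no per-file ownership change needed.
        return f"delegated: mount with volume_mount_group={fsgroup}"
    # Fallback: kubelet walks the volume and chowns every file,
    # which is slow on volumes with many files.
    return f"recursive chown to gid {fsgroup}"

print(apply_fsgroup({"VOLUME_MOUNT_GROUP"}, 1000))
print(apply_fsgroup(set(), 1000))
```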
This is really intended only for special CSI drivers like Samba, where you can apply fsGroup using a mount option, for example. If your CSI driver doesn't have any such option, please use the Kubernetes functionality, because if we or anybody else improves it, then all CSI drivers will benefit from that, not only your CSI users.

Also as GA, we are continuing with CSI migration, and we have Azure File and vSphere CSI migration. That means users can still use the old Azure File and vSphere storage classes and persistent volumes; they don't need to change anything. However, when they upgrade to 1.26, all the storage work will be done by the CSI driver and not by Kubelet. So from the user perspective, nothing changes. From the admin perspective, they need to install the driver, of course, but it should be seamless for users.

As beta, we are improving how the default storage class is applied to persistent volume claims. If you know something about default storage classes, it's just an annotation on a storage class, and previously, if there was no default storage class at the point when a user created a persistent volume claim asking for the default storage class, that claim would never be dynamically provisioned; it would stay pending forever, basically. So we fixed that: when you create a persistent volume claim asking for the default storage class first, and the default storage class is created later, we will retroactively assign it to the persistent volume claim, and the claim will get provisioned from the default storage class. Also, if you have two or more default storage classes for a short time, persistent volume claims will get one of them instead of an error.

And I would like to dive deeper into non-graceful node shutdown, which is a new feature in Kubernetes 1.26 as beta. First, what is graceful node shutdown? It is a Kubernetes feature in Kubelet, and it handles the situation when a user wants to turn off a node.
So they SSH to the node and issue a command to power off the machine, or they use some API to send a shutdown signal to the node, and Kubelet will gracefully stop all the containers, unmount all the volumes, and mark itself as not available. That's graceful node shutdown: when you can SSH to the machine and shut it down.

Non-graceful node shutdown is when you can't SSH to the node. For example, the network is split, there is a hardware failure somewhere on the way or even in the machine itself, the kernel can panic, Kubelet can get stuck; whatever happens on the node, you can't SSH to it and you can't shut it down cleanly. It's something like fencing. We are getting closer to fencing (we don't have fencing yet), but in 1.26, if you want to use the feature, the expected workflow is this. Of course, you enable the NodeOutOfServiceVolumeDetach feature gate. Then the cluster admin or some third-party software detects that the node is unhealthy, or the network is split, or whatever. Again, the cluster admin or third-party software shuts down the node, either by physically going to the machine and turning it off, or by using IPMI, iDRAC, cloud APIs, whatever. That's still nothing new; you could do it even before.

But if you do it in Kubernetes 1.25 without this feature, then you need to wait six minutes for the controller manager to see that the node is not available; only after six minutes do we detach volumes from that node. With this improvement in 1.26, if the cluster admin or third-party software applies the taint node.kubernetes.io/out-of-service, we will delete all the pods immediately (we can't kill them on the node, because it's not available) and we will detach all the volumes at once. So replacement pods that need the volumes can start on different nodes quickly. You don't need to wait for these magic six minutes.
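The workflow above boils down to applying a single taint once you are sure the node is really down. As a sketch (the Python dict is just for illustration; the kubectl command in the comment is the usual way to apply it):

```python
# Sketch of the taint a cluster admin (or automation) applies after
# confirming the node is really powered off. The NoExecute effect is
# what triggers immediate pod deletion and volume detach.
taint = {
    "key": "node.kubernetes.io/out-of-service",
    "value": "nodeshutdown",
    "effect": "NoExecute",
}

# Equivalent kubectl invocation:
#   kubectl taint nodes <node-name> \
#       node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
print(f"{taint['key']}={taint['value']}:{taint['effect']}")
```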
And as alpha in 1.26, you can provision volumes across namespaces, either from other volumes (cloning) or from snapshots. Tomorrow there will be a talk by Masaki and Takafumi; please go there if you are interested, I will not go into details. And 1.27 will be covered by Xing.

Thanks, Jan. So I will talk about what we did in the 1.27 release. You can see that we moved a lot of features to beta in 1.27. We have the SELinux relabeling feature that speeds up container startup time by mounting volumes with the correct SELinux label instead of changing each file on the volume recursively. We also have the robust volume manager reconstruction feature. This is a refactoring of the volume manager code; it allows Kubelet to record more information about how existing volumes are mounted, and this helps us rebuild and clean up volumes at Kubelet restart time. We also have node expand secret. This allows the CSI driver to pass secret information to the storage system when the file system is expanded on the node during volume expansion. And we also have a feature that we co-own with SIG Apps that provides an option to allow the PVCs created by a StatefulSet to be deleted automatically when the StatefulSet is deleted.

In 1.27 we also moved the ReadWriteOncePod persistent volume access mode to beta. This is a new access mode that we added. In the CSI spec there are a few access modes. One is called single node writer; that means the volume can be attached to one node as read-write at any given time. But it did not specify whether that means only one pod or multiple pods can access the volume. So we introduced two new access modes. One is called single node single writer, which restricts access to the volume to just one pod on that node. This corresponds to the new ReadWriteOncePod PV access mode in Kubernetes.
And there is another new mode, single node multi writer, which allows multiple pods to access the volume on that node. This corresponds to the ReadWriteOnce PV access mode in Kubernetes. If a CSI driver does not support these new modes, then we use single node writer; that's the existing behavior.

In 1.27 we also have an option to prevent unauthorized volume mode conversion; this also moved to beta. Without this feature, you can create a PVC from a snapshot and convert the volume mode from filesystem to raw block, or vice versa, without any control. This could potentially introduce security problems. On the other hand, the ability to change the volume mode from filesystem to raw block and retrieve changed blocks is a valid use case for efficient backups, so we need to be able to support it. So we added this feature and introduced some API changes in VolumeSnapshotContent: we added a new field for the source volume mode. The snapshot controller populates this field from the volume mode of the original PVC, and when a user wants to create a PVC from a volume snapshot, the external provisioner checks this field. If the PVC volume mode is different from the source volume mode in the VolumeSnapshotContent, there is an annotation on the VolumeSnapshotContent that allows volume mode changes; if it is set to true, then provisioning can proceed, otherwise the request will be rejected. Since this feature is beta in 1.27, we are planning to enable the prevent-volume-mode-conversion feature flag by default in the snapshot controller and the CSI external-provisioner in the sidecar releases for 1.28, so if your workflow requires volume mode conversion, please make a change accordingly, otherwise it's going to fail.
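The check just described can be sketched as follows. This is illustrative logic, not the actual external-provisioner code, and the field and annotation spellings used below (`sourceVolumeMode`, `snapshot.storage.kubernetes.io/allow-volume-mode-change`) should be verified against your external-snapshotter version.

```python
# Illustrative sketch of the volume-mode-conversion check. A restore
# that changes volumeMode is allowed only when the VolumeSnapshotContent
# carries the opt-in annotation.
ALLOW_CONVERSION = "snapshot.storage.kubernetes.io/allow-volume-mode-change"

def restore_allowed(pvc_mode: str, content: dict) -> bool:
    source_mode = content["spec"].get("sourceVolumeMode")
    if source_mode is None or pvc_mode == source_mode:
        return True  # no conversion requested, nothing to check
    annotations = content["metadata"].get("annotations", {})
    return annotations.get(ALLOW_CONVERSION) == "true"

content = {
    "metadata": {"annotations": {ALLOW_CONVERSION: "true"}},
    "spec": {"sourceVolumeMode": "Filesystem"},
}
print(restore_allowed("Block", content))  # conversion explicitly opted in
```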
In 1.27 we introduced an alpha feature to support volume group snapshots. We already have a VolumeSnapshot API for individual snapshots; however, some applications need a crash-consistent snapshot across multiple volumes. This is a very useful feature if you have an application that uses multiple volumes (for example, one for the data and one for the logs) and you want to take a snapshot at the same point in time across all the volumes used by the application, to ensure write-order consistency.

For this feature we added a few new Kubernetes APIs. We introduced a VolumeGroupSnapshot API; this is a namespaced object that represents a user's request for a group snapshot. There is also a VolumeGroupSnapshotContent, which is cluster-scoped and represents a group snapshot on the storage system, and a VolumeGroupSnapshotClass that defines how a group snapshot should be created; this is defined by the admin at cluster scope and includes information such as the CSI driver name, the deletion policy, and so on. There are also CSI spec changes: we introduced a new group controller service, and for this new service there is a new capability and three new gRPC interfaces: create, delete, and get volume group snapshots. So if a CSI driver wants to support this new feature, it needs to implement the new group controller service and the three new gRPC interfaces. And we added new control logic in the snapshot controller and the CSI snapshotter sidecar to support this feature. If you want to dynamically provision a volume group snapshot, you need to specify a label selector in the VolumeGroupSnapshot spec, so that the snapshot controller will find the PVCs with the matching label and snapshot them together.

CSI migration is a feature we have been working on for several releases. In 1.25, the CSI migration core feature moved to GA, and the OpenStack Cinder, Azure Disk, Azure File, AWS EBS, GCE PD, and vSphere migrations have all moved to GA. A few of the in-tree plugins have already been removed, and some of them are targeted for removal in later releases. You can refer to this table for the status of CSI migration, and this table shows in-tree driver removals for drivers that do not go through CSI migration; in 1.26 we removed the GlusterFS in-tree driver.

There are also a few features in design and prototyping. Changed block tracking is a feature the Data Protection Working Group has been working on: we want to design new APIs and a common approach to retrieve the changed blocks of data, for efficient backups. If you are interested in this feature, join the Data Protection Working Group meeting that happens bi-weekly on Wednesdays for this discussion. There is also Kubernetes volume provisioned IO; this feature is also being designed right now. We are trying to design APIs to dynamically provision and modify the IOPS and throughput of persistent volumes. We also have runtime-assisted mounting of persistent volumes and volume expansion for StatefulSets; those designs are also in progress, so if you are interested, take a look at the KEPs.

So now let me talk about how to get involved. We have a SIG Storage community page; it has a lot of information to help you get started. There is a contributing page there with links to some previous presentations. We also have our bi-weekly meetings on Thursdays; in that meeting we go over the features we are targeting for every release and make sure they are on track. We also recently added a new meeting: a weekly issue triage meeting on Wednesdays. So if you are interested in contributing to SIG Storage, I strongly encourage you to join our meetings. We also have a mailing list and Slack channels, and this Friday there is a SIG meet-and-greet; Jan and I will be there, so join us if you are around. Here are some additional resources for your reference. That's all for our presentation today. Thank you all for coming. Are there any
questions? Thank you. I was wondering if you could elaborate a bit more on the functionality of change block tracking, because for example we use vSphere as the underlying infrastructure, where you already have it enabled at the hypervisor level. What does it actually do at the Kubernetes container volume level?

At the container level you could use the same APIs, the VADP APIs, but here we are just trying to have common APIs in Kubernetes that allow you to retrieve those blocks. Let's say you have two snapshots and you want to get the diffs: you provide the snapshot handles and get the changed blocks, which makes the backup more efficient. The primary use case is backup, so backup vendors will use this API to get the changes between two snapshots and back up just the bits they need to back up.

Thank you. Any other question?

Just as a bit of a follow-on to the CBT question: what's the timeline, and what kind of things do you need help with?

Please come to our Data Protection Working Group meeting tomorrow. (I'm hoping to join the session tomorrow as well.) Yeah, sure, you can join; I won't be there myself because I have a conflict, sorry. We have actually been discussing this for quite a while now, and there are some design details we have not finalized. Initially we were trying to use an aggregated API server, but the API reviewers had some concerns about the performance impact on the API server, too much data, so we are trying a different approach. It's still in discussion; if you want to know the details, please join, and we will finalize the design. Sure, that would be awesome.

One question about something you mentioned at the beginning: the six minutes, the magic six minutes. What did you mean by that?

Initially in Kubernetes we added a six-minute timeout. When a node becomes unavailable, and we don't know it yet in Kubernetes, and somebody deletes a pod that uses
a persistent volume, we give Kubelet on that node six minutes to unmount the volume, and after those six minutes we force-detach the volume, if the cloud API allows it, if we have an API to detach the volume. So the volume could get corrupted, but in some cases this helps you start the new pod on a different node with the data that was on the volume. With non-graceful node shutdown, you can shut down the node and detach the volumes faster. Does that answer your question?

Yes, I guess. I'm just wondering about this non-graceful shutdown, because usually if you have a situation where the node does not work, you can try to cordon it, or pull the plug if you use vSphere or something similar. I'm not really sure what the use case is; if I can just remove the node, I can use the existing functionality.

You can, but not in the Kubernetes API. If you delete a node in the Kubernetes API, nothing really happens: the virtual machine is still there, and it still has volumes attached.

Yeah, but assuming I have access to the underlying infrastructure?

Then of course you can use the cloud API, but people usually want to use just one API, and that's usually Kubernetes.

Actually, even without this feature you can achieve this manually: you can forcefully delete the pods, and the same thing will happen. This feature just makes it a bit more formal, the Kubernetes way: you apply the taint, and, as Jan was saying, you don't have to wait for those additional six minutes; everything happens automatically after that, making it smoother. Right now you still need a bit of time to go and apply the taint, so the next step is to see if we can make this completely automated. That is one step further. Right now, at least, you can just apply that
taint instead of forcefully deleting the pods with a command yourself.

Nice, thank you. Another question, somewhere here.

I want to ask: are there also plans to create a compliant archive image? What do you mean, compliant? Compliant meaning that it can never be changed for a certain period anymore; perhaps you have thought about that, for an image for instance. We don't care about images in SIG Storage; we care about persistent volumes, only volumes. We don't pull images, and we are not responsible for image storage on nodes or how images get stored on nodes. That's SIG Node.

Any other question?

Thanks, first of all, for working on consistent volume snapshotting, the volume groups; my question is about that. If there is already a CSI driver that supports something like that, it may do something on its own, but this was not a Kubernetes feature before you introduced the group snapshot API. Are you asking if there are already CSI drivers supporting it, or are you asking about the future? I think we are planning to add support to ours. Oh, it's there. So right now we just introduced this feature as alpha, so we will be waiting for storage vendors to implement it. I'm just thinking ahead. You can be the first!

Any other question? Yes?

Hi, I just wonder why it's not possible to mount a PVC in the same pod twice on different mount points. I think it is possible. Is it now supported? Because in GKE it's not possible; in the pod spec it only works with different subPaths, so you can only mount it once. In the pod spec you have a generic volumes section where you declare which volumes you use, and then in each container you have a volumeMounts section, and there you can mount each volume multiple times, in as many different places as you want. I'm pretty sure GKE supports that.

It's about something you said at the beginning about storage classes. There could be a moment when there are two default storage classes, and you said that Kubernetes would decide and pick one; it's deterministic, it picks the one with the newest creation
timestamp, which is not something you should depend on; it's just so we have something deterministic. The problem is, if you want to change the default storage class, you can either delete the old default storage class and create a new one, and then there is some period when there is no default (we fixed that), or you can create two default storage classes and delete the old one afterwards. When you have two default storage classes, we pick the one with the newer creation timestamp, but you should not depend on that behavior; it's racy anyway.

I was thinking there is a very short window if someone is just deleting one. It depends on you: you can leave the cluster with two default storage classes for a long time if you want. I don't know why you would do that, but we will consistently pick the one with the newest creation timestamp.

And my point is, to me it would be wiser to not spawn the pods until that mistake is fixed, because I forgot to remove one and defined the default storage class twice. That's what we did in old Kubernetes: when you had two default storage classes and you created a new PVC, you got an error that the PVC could not be provisioned because of the two default storage classes. And that's not user friendly, because the user needs to go to the cluster admin to fix it and come back. So now we choose one default storage class among those two or three or however many you have.

If I am a distracted user, I could then live with two default storage classes for a while before I realize that I am looking at the storage in the wrong place. Well, you can have an alert for two default storage classes.

Hi, sorry, it's related to that. If you accidentally delete a storage class and you have PVCs belonging to that storage class, those PVCs get orphaned somehow, and as far as I remember it's difficult to delete them if they don't belong to an existing storage class. Is there anything in place to solve this kind of problem?

Ideally, the storage class should be used only during dynamic provisioning, and ideally all the information from the
storage class should bubble up into the PVC, so when you delete the storage class, you don't need to delete the persistent volumes. I know there are some cases, some storage backends, where the storage class is needed during deletion, but I would suggest fixing those drivers to put all the information from the storage class into the CSI volume attributes, the volume context, so it's available there for deletion or resize or whatever they need.

So is there any way to patch the PVCs to a different storage class, just to make them easier to delete? No way; the storage class field is sticky. Once it's set, it's not possible to change it. Thank you. You're welcome.

Hi. On the discussion about several default storage classes, another question, or idea. Let's say we have a StatefulSet, let it be Elasticsearch; it has three replicas, and we want the PVCs provisioned by the StatefulSet to be in different storage classes, because storage class one is data center one, storage class two is data center two, and storage class three is data center three. Are there any plans to implement, so to say, round-robin or random choice of default storage class? A colleague of mine, Andrew, can maybe explain his idea for an implementation.

Not across storage classes, but you can have one storage class that can provision volumes in all data centers, and then the volume topology feature, topology-aware provisioning, together with zone anti-affinity in your pods, will make your StatefulSet pods schedule in different zones, and the volumes will be provisioned for them in different zones. So you have one storage class, but it can provision storage in different zones or regions.

Hi, so while talking about volume snapshot groups, I think you said that for dynamic provisioning we have to apply labels. Can we briefly talk about that? Because, if I am correct, in the case of volume snapshots we don't have to apply any labels, obviously. So how is it different? Why do we have to apply labels in the case of
volume snapshot groups? This is because we just introduced the VolumeGroupSnapshot API (actually three APIs, but they are all about volume group snapshots), and we don't really have an API object that holds all the volumes. So we need a way to say which volumes you want grouped together, and this label is basically just to say: here are all the volumes, they are in the same application, let's put them in the same group snapshot. In an earlier version of the design we actually had a volume group construct; that was the initial design, but it was getting very complicated, and we would have had to introduce six new API objects. The original plan was to have it support both group snapshots and group replication, but because of that complexity, we decided to use just this one set of constructs for group snapshots.

Unfortunately, we are out of time, so thank you for the questions. Thank you.
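As a footnote to that last answer, a dynamically provisioned group snapshot request could look roughly like this. This is a minimal sketch against the alpha `groupsnapshot.storage.k8s.io/v1alpha1` API, which may change; the object names, the class name, and the `app: my-database` label are all made-up examples.

```python
# Sketch of a dynamically provisioned VolumeGroupSnapshot as discussed
# in the Q&A: the label selector tells the snapshot controller which
# PVCs in the namespace to snapshot together. Names and the label are
# hypothetical; the API is alpha and subject to change.
group_snapshot = {
    "apiVersion": "groupsnapshot.storage.k8s.io/v1alpha1",
    "kind": "VolumeGroupSnapshot",
    "metadata": {"name": "db-group-snap", "namespace": "default"},
    "spec": {
        "volumeGroupSnapshotClassName": "csi-group-snap-class",  # hypothetical
        "source": {
            # PVCs labeled app=my-database are snapshotted together.
            "selector": {"matchLabels": {"app": "my-database"}},
        },
    },
}

print(group_snapshot["spec"]["source"]["selector"])
```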