Welcome to the Kubernetes SIG Storage deep dive. Before starting, I want to ask for a round of applause for Xing and Patrick. Today they received a community award, the Chop Wood Carry Water award. Let's also wish her good luck with that thing at the airport. Right. Thanks for coming today. We are going to give a deep dive into Kubernetes SIG Storage. My name is Xing Yang. I work at VMware in the cloud-native storage team. My name is Mauricio. I work in the Anthos Storage team at Google.

Here's today's agenda. We will talk about who we are and what we did in the 1.25 and 1.26 releases. Mauricio will give a deep dive on CSI Windows, and we will talk about how to get involved. SIG Storage has two co-chairs: Saad Ali from Google and myself. We also have Michelle from Google and Jan from Red Hat, who are our two tech leads. There are also many other contributors in SIG Storage. We have more than 5,000 members in the SIG Storage Slack channel, and we also have several other channels: one for COSI, one for CSI, and one for CSI Windows. We have regular bi-weekly meetings, and we have approvers for SIG Storage-owned repos and packages.

What we do in SIG Storage is defined in our charter. SIG Storage is a special interest group that focuses on providing storage to containers running in a Kubernetes cluster. The most notable SIG Storage features include persistent volume claims, persistent volumes, storage classes, and dynamic provisioning. We have volume plugins. In addition to persistent volumes, which persist data beyond the pod's lifecycle, we also have ephemeral volumes such as Secrets, ConfigMaps, and emptyDir, whose lifecycle is tied to the pod's lifecycle. We also have the Container Storage Interface, CSI, which allows storage vendors to write a plugin and have it work in Kubernetes and other container orchestration systems. CSI is for block and file storage. Now we also have COSI, the Container Object Storage Interface, an alpha feature introduced in 1.25 that adds object storage support to Kubernetes.

So first let me talk about what we did in the 1.25 release. We have a few GA features. The first feature that moved to GA in 1.25 is CSI inline ephemeral volumes. This feature was introduced back in the 1.15 release. A CSI inline ephemeral volume, just like other ephemeral volumes, can be used as scratch space for a pod. The most important difference between this and other ephemeral volume types is that it is provided by a CSI driver, and by a special CSI driver, meaning that you either have to write a CSI driver just for this purpose or modify your existing driver to support this feature. To use this feature, the CSI driver needs to declare volumeLifecycleModes as Ephemeral, and in the pod you specify the volume type as csi and provide the driver name and volume attributes; a minimal sketch of both objects is shown below.
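To make that concrete, here is a minimal sketch of the two pieces involved. The driver name `inline.example.com` and the `size` attribute are hypothetical placeholders; real drivers define their own names and volume attributes.

```yaml
# CSIDriver object declaring that the driver supports inline ephemeral volumes
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: inline.example.com            # hypothetical driver name
spec:
  volumeLifecycleModes:
  - Ephemeral
---
# Pod requesting a scratch volume directly in its spec
apiVersion: v1
kind: Pod
metadata:
  name: app-with-inline-volume
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: scratch
      mountPath: /data
  volumes:
  - name: scratch
    csi:
      driver: inline.example.com      # must match the CSIDriver above
      volumeAttributes:               # driver-specific parameters, passed through as-is
        size: "1Gi"
```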
A CSI driver is a good fit for inline ephemeral volumes if it is special purpose and needs per-volume custom parameters, like a driver that provides secrets to a pod; the Secrets Store CSI driver is a good example of that. Another example is the cert-manager CSI driver. A CSI driver is not a good fit for CSI inline ephemeral volumes if it needs to support features such as volume snapshotting, cloning, or expansion, if provisioning is not limited to the local node, if it needs volume attributes that should be restricted to admins only (like the parameters in a storage class), or if you want to persist your data beyond the pod's lifecycle; those cases are not suitable for this feature.

The second feature that moved to GA is local ephemeral storage capacity isolation. This feature was introduced back in the 1.7 release, so it has been quite a while. Pods use ephemeral storage for scratch space, caching, or logging, but pods could be evicted because other pods are filling up the local storage space. This feature allows users to manage local ephemeral storage the same way they manage memory and CPU: each container of a pod can specify ephemeral-storage limits and requests, and if the pod uses more storage than specified in its limits, the pod will be evicted. You can also use this to manage resource quota for local ephemeral storage. Currently this feature cannot be used for CSI volumes; there have been discussions about extending it to support the CSI inline ephemeral volumes I talked about a minute ago.

In 1.25 we also have a few alpha features. The first one is SELinux relabeling with mount options. There is already a feature that allows you to skip the volume ownership and permission change at mount time to speed up pod startup, but that feature does not work if the system has SELinux enabled. That is why we introduced this feature: mount volumes with the correct SELinux context to avoid a recursive relabeling of the files on the volume and speed up pod startup. The second alpha feature is NodeExpandSecret. It is typical for a storage system to require credentials to be passed to the CSI driver for volume operations, and that includes volume expansion. Volume expansion can happen on the controller side, the node side, or both. On the controller side we already have a way to pass a secret to the CSI driver, so in 1.25 we added a way to pass a secret to the CSI driver when volume expansion happens on the node side, when the file system is resized. The third feature is retroactive default StorageClass assignment. This feature allows existing PVCs without a storage class name specified to be updated to use the new default storage class when that storage class becomes available.

The fourth feature that became alpha in 1.25 is the object storage API, COSI. COSI adds new Kubernetes APIs to provision buckets and to allow pods to access those buckets. The COSI components include a controller manager that binds the COSI-created buckets to bucket claims, a sidecar that watches the Kubernetes API objects and calls the COSI driver to provision or delete buckets, and the COSI driver itself, which implements gRPC interfaces and communicates with the storage backend to provision or delete buckets. There are two sets of COSI APIs. First, we have Bucket, BucketClaim, and BucketClass, which are very similar to PV, PVC, and StorageClass: a Bucket represents a physical bucket on the storage backend, a BucketClaim is a user's request for a bucket in the user's namespace, and the BucketClass specifies the type of bucket that you want to provision. The second set of APIs is BucketAccess and BucketAccessClass, which is how a pod gets access to the bucket; a sketch of the first set is shown below.
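As a rough illustration, here is what the first set of objects might look like with the v1alpha1 API. The class name and driver name are hypothetical, and since COSI is alpha the exact fields may still change.

```yaml
# BucketClass: analogous to a StorageClass, selects the COSI driver and deletion policy
apiVersion: objectstorage.k8s.io/v1alpha1
kind: BucketClass
metadata:
  name: example-bucket-class           # hypothetical
driverName: objectstorage.example.com  # hypothetical COSI driver
deletionPolicy: Delete
---
# BucketClaim: a namespaced request for a bucket, analogous to a PVC
apiVersion: objectstorage.k8s.io/v1alpha1
kind: BucketClaim
metadata:
  name: example-bucket-claim
  namespace: default
spec:
  bucketClassName: example-bucket-class
  protocols:
  - S3
```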
So now let me talk about what we are working on in the 1.26 release. In 1.26 we have some features that are targeting GA. Delegating fsGroup to the CSI driver is targeting GA in 1.26. In the pod security context there is an fsGroup setting that you can specify so that the volume is readable and writable by the pod if the pod runs as a non-root user. Based on that fsGroup, Kubernetes will recursively change the ownership and permissions of all the files on the volumes of that pod, unless the fsGroup change policy is set to skip it. However, not every CSI driver can modify volume ownership and permissions, for example NFS or other volume types that do not support Unix-style permission changes. That is why we introduced this feature, to let a CSI driver opt out. There are three different settings for fsGroupPolicy in the CSIDriver spec. The default is ReadWriteOnceWithFSType, which means we examine the volume at mount time and determine whether the modification needs to be applied. The second is File, which means always apply the modification, and the third is None, which lets the CSI driver opt out and never apply modifications.

We also have a few features targeting beta in 1.26. The first one is controlling volume mode conversion between a source and a target volume. We have a use case for backup and restore: sometimes the backup software needs to change the volume mode from filesystem to block to support efficient backups. However, that could lead to a potential security problem, so we added this feature to prevent unauthorized volume mode conversion. We also have the retroactive default StorageClass assignment, which we are targeting beta in 1.26. The third one is a project we co-own with SIG Apps, auto-removal of PVCs created by StatefulSets. This adds an option to allow PVCs created by a StatefulSet to be removed automatically.

We also have the non-graceful node shutdown feature targeting beta in 1.26. This is a feature we introduced in 1.24. It allows stateful workloads to fail over to a different node after the original node is shut down non-gracefully. A node shutdown is graceful only if the kubelet can detect it, and the kubelet depends on systemd's inhibitor lock to detect a shutdown. However, not every system supports this mechanism, so if it is not supported and the kubelet does not detect the shutdown, it becomes a non-graceful node shutdown. The graceful node shutdown feature also requires two grace period config parameters in the kubelet; if those are not configured properly, that feature will not work either. That is why we also need non-graceful node shutdown handling. When a non-graceful node shutdown happens and you don't intervene, your workload could be stuck on the shutdown node forever without moving to another running node. To use this feature, you first need to enable the feature gate, make sure that the node is really shut down, and apply an out-of-service taint on that node (a sketch of the taint is shown below). After that, the Pod GC controller will forcefully delete the pods and the attach-detach controller will forcefully detach the volumes, and the workload will move successfully to another node. So now we are targeting beta, and depending on user feedback we do plan to move this feature to GA eventually. Right now this feature requires manual user intervention; you need to manually apply the taint. We are also looking at how to automate this as a next step.
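For illustration, the taint itself looks like this; it can also be applied with `kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute`. The node name below is a placeholder.

```yaml
apiVersion: v1
kind: Node
metadata:
  name: shut-down-node                 # placeholder for the node that was shut down
spec:
  taints:
  - key: node.kubernetes.io/out-of-service
    value: nodeshutdown
    effect: NoExecute                  # triggers forced pod deletion and volume detach
```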
We had a session about this on Wednesday, so you can check out the video recording if you are interested. In 1.26 we also have a few features targeting alpha. The first one is provisioning volumes from a cross-namespace snapshot, which allows PVCs to be provisioned from a volume snapshot in a different namespace. We also have the volume group and volume group snapshot feature, which introduces new APIs to manage multiple volumes together and provides a volume group snapshot API to take a snapshot of a volume group. Here is a table that shows the CSI migration status. SIG Storage has been working on this for several releases now. In 1.25 the core CSI migration feature moved to GA. We also have several drivers, including OpenStack Cinder, Azure Disk, AWS EBS, and GCE PD, that moved to GA, and Azure File and vSphere are targeting GA in the 1.26 release. You can take a look at this table to find out the CSI migration status. I also have a table here that shows the in-tree storage driver removals; these drivers do not go through the CSI migration process. With that I am going to hand it over to Mauricio, who is going to give us a deep dive on CSI Windows.

Alright, thank you. Let's first look at Kubernetes, Windows, and CSI. In the diagram on the right we see a cluster that has two node pools, a Linux node pool and a Windows node pool. The idea is that Windows workloads run in the Windows node pool. Kubernetes supports three Windows Server operating systems: the Long-Term Servicing Channel releases 2019 and 2022, and the Semi-Annual Channel. The Long-Term Servicing Channel has a release every two to three years; the Semi-Annual Channel, I think, is being discontinued. In a Kubernetes cluster that is running a Windows node pool, the control plane components still run on Linux, so all of the components you are aware of, the API server, the kube-scheduler, the kube-controller-manager, are still in the control plane. In CSI we have three sets of RPC services representing the three components that CSI drivers need to implement: the node component, the identity component, and the controller component. The controller component still runs in the control plane, but for operations that need to run in the node component, that is where we need support for Windows too. So in CSI there are three components that need to work on Windows: the node component of the CSI driver, the liveness probe, and the node-driver-registrar. The last two, the liveness probe and the node-driver-registrar, are maintained by Kubernetes-CSI, and the CSI driver is maintained by the storage provider. Also, Kubernetes intentionally doesn't tell you how you should deploy your solution into the cluster, but the usual way to deploy the node component is a DaemonSet. In order to avoid having two different images in the Linux DaemonSet and in the Windows DaemonSet, we can use a multi-arch build pipeline where we create images for both operating systems and tie them together with an OCI image manifest. That way both DaemonSets refer to the same image, and it is on the node where the decision to pull the right image is taken; a trimmed-down sketch of such a Windows DaemonSet follows below.
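As a sketch of that deployment pattern, trimmed down to the scheduling-relevant fields: the driver name and image are hypothetical, the sidecar versions are only illustrative, and a real manifest would also mount the kubelet and CSI plugin directories.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: csi-example-node-windows       # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: csi-example-node-windows
  template:
    metadata:
      labels:
        app: csi-example-node-windows
    spec:
      nodeSelector:
        kubernetes.io/os: windows      # schedule only onto the Windows node pool
      containers:
      - name: node-driver-registrar    # Kubernetes-CSI sidecar
        image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.6.2
      - name: liveness-probe           # Kubernetes-CSI sidecar
        image: registry.k8s.io/sig-storage/livenessprobe:v2.8.0
      - name: csi-driver               # node component of the CSI driver
        image: example.registry.io/csi-example-driver:v1.0.0   # multi-arch manifest list
```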
So let's look at what happens in CSI under the hood. These are calls that the CSI spec defines. In the diagram on the right we see one case, the lifecycle of a dynamically provisioned volume. When a workload requests a PVC, there are components in the control plane, the CSI sidecars, that issue the calls implemented by the controller component of the CSI driver: CreateVolume and ControllerPublishVolume, in the red rectangle there. As I mentioned, those still happen on Linux. For Windows we are focused on the blue box, which has four operations. There are two operations for setup: NodeStageVolume and NodePublishVolume. I think it's worth mentioning what happens on Linux so that we can understand the difference with Windows. On Linux, when a device is attached to a node, we need to format it, and this happens in the NodeStageVolume call. The CSI driver usually uses a library from Kubernetes called mount-utils that has utilities to execute the mkfs command on Linux to format the device, and the mount command to mount it. On Windows we don't have those commands, but we have similar PowerShell commands. So the workflow on Windows is to create a Windows volume, format it to NTFS, and create a partition access path on the node; that is called the global mount. The next step is NodePublishVolume, which makes this volume available to the pod. On Linux this is done with a bind mount, but on Windows, because there is no support for that, it is done with a symlink to the partition access path; this is called the pod mount. That is the setup step, and at that point the workload can start using the PV. Then at some point the workload will go away, and we go through the reverse steps, which tear down the volume. On Linux this is done with the umount command, but on Windows it is done by reversing the steps above: in NodeUnpublishVolume we remove the symlink, and in NodeUnstageVolume we remove the partition access path.

In addition, on Windows these volume format and mount operations are privileged operations, and they can't be done from a Windows container. There is an asterisk there because it is now doable; I will talk about that in the next slide, but when we started working on this there was no support for it. As a workaround, a lot of components in Kubernetes that run on Windows do this through proxies, and we have a project called CSI proxy, which is a binary that runs on the host. When CSI proxy starts, it creates named pipes on the host. You can imagine that this is all happening on the Windows node, and yes, the CSI proxy binary is right at that point. Next we deploy the CSI driver, and the CSI driver needs to communicate with the CSI proxy binary through a client library that is also an artifact of the CSI proxy project. We have multiple versions of it, because it could be possible that a driver used the beta version while we were still working on making the API v1, and then at some point it became v1, so now we also need to manage this multiple-versions problem. And finally it comes time for those operations I mentioned before that are executed from the kubelet: the kubelet calls NodeStageVolume, it goes to the CSI driver, the CSI driver tries to format and mount the device, it uses the CSI proxy client library, and it talks over the named pipes with the CSI proxy implementation of all of those commands I mentioned above. It's worth noticing that even though this is the current CSI proxy architecture, we are planning on modifying it so that it is easier, and this is where host process containers come in, which I'll talk about next.
A few of the drawbacks of this model, which apply not only to CSI proxy but to most of the proxy components for the kubelet on Windows, are that we have an additional component to maintain with CSI proxy, and it is difficult to maintain: as I mentioned, we may have new features, and those need to be part of a new release, so it eventually becomes hard to maintain. There is also a tiny increase in latency, because now we have to make calls from the CSI driver to CSI proxy through the named pipes, with protobuf serialization and deserialization and so on. That is a tiny bit of extra latency, a small drawback of this design.

Okay, so luckily the SIG Windows team started working, around the same time we were working on this, on a feature called Windows HostProcess pods, which is a way to run a container as a host process. This is different from a privileged container in Linux; it is actually running as a process on the Windows host. One of the caveats is that there is no file system isolation, so the cluster should have additional policies so that ordinary workloads cannot become HostProcess pods, because they could access the volumes of all of the workloads. That is something to keep in mind. Now that we know of this feature, if a CSI driver becomes a HostProcess pod, then the CSI driver itself can run these privileged storage operations; it doesn't need to go through the CSI proxy binary. But the CSI driver still needs the format and mount functionality that CSI proxy provides, and the current plan is to transform all of the code we had in the previous slide into a library that CSI drivers can include to perform the format and mount operations directly. With this, CSI drivers become much easier to develop and deploy, because they are similar to their Linux counterpart, where you just bump the image and you have all of the format and mount functionality without an additional component. This initiative is being tracked in enhancement issue 3636.

This slide is for people who have already implemented Windows support in their CSI drivers. On the left we have the current implementation and on the right we have the new implementation. We wanted to reach a state with as few changes as possible. As you can see, one change is just renaming the imports in the files where the calls are used, and the other change is that the CSI driver deployment needs some additional HostProcess-related spec fields. In the current drivers we also mount the named pipes in the pod; all of those fields can be removed, because we no longer need access to the pipes. A rough sketch of the HostProcess-related fields is shown below.
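As a rough sketch, assuming a hypothetical driver image, the HostProcess-related fields in the Windows pod template look roughly like this:

```yaml
# Fragment of the Windows DaemonSet pod template for a CSI driver running as a HostProcess pod
spec:
  securityContext:
    windowsOptions:
      hostProcess: true                # run the containers as processes on the Windows host
      runAsUserName: 'NT AUTHORITY\SYSTEM'
  hostNetwork: true                    # required for HostProcess pods
  nodeSelector:
    kubernetes.io/os: windows
  containers:
  - name: csi-driver
    image: example.registry.io/csi-example-driver:v1.0.0   # hypothetical image
```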
And that should be it. I would like to thank Alexander Ding from Brown University for the work that he already did in CSI proxy to make this happen, and I also encourage everyone who has a CSI driver without a Windows implementation yet to add Windows support to their driver. I would also like to thank SIG Windows for working on this, because they enabled us to remove a lot of code. With that, I'll hand it back to Xing.

Thanks, Mauricio. On the SIG Storage homepage we have a lot of information for you to get started. There are some video recordings available from previous sessions, so check those out. We have bi-weekly meetings on Thursdays, and we have a mailing list and a Slack channel, so if you're interested, please join us and get involved. I have included a few resources here, and this is the QR code for the session. Please leave feedback. This is the end of the session. Mauricio, you can take this one. Let's go ahead.

I wanted to know if you have a reference implementation of a CSI driver for Windows, a sample CSI driver that does the mount and so on. Sure. There are at least three CSI drivers that have support for Windows: the GCE Persistent Disk CSI driver, the Azure File CSI driver for Windows, and there is also another CSI implementation. All of those are using CSI proxy under the hood. And is there variation across them? Because the only node-related place where I see variation is in how you find out where the disk is attached, pretty much; the rest of the code should be similar, right? How long will CSI proxy support running as a service versus as a library? That's a great question. I wrote the timeline for it in the KEP. We are thinking about putting it in maintenance mode because of the drawbacks that I mentioned, but that is still open in the KEP. If the community wants to keep it open for features, we could still support both, but my take right now is to keep it in maintenance mode and just go with the library mode. Perfect. I just saw your KEP, so I'll take a look at that. Thanks. We have time for one more question. Thank you so much. Thank you for coming. Thank you.