Hello, everyone. Today, Michelle and I will give an intro and deep dive of SIG Storage, Kubernetes SIG Storage. My name is Xing Yang. I work at VMware on the cloud native storage team, and I'm also a co-chair of SIG Storage.

Hi, I'm Michelle. I'm a software engineer at Google, and I'm a SIG Storage tech lead.

Here is today's agenda. First, we'll talk about who we are and what we did in the 1.28 release. Then we'll talk about what we're doing in the 1.29 release and what features we're designing and prototyping, and finally, how to get involved.

In SIG Storage, Saad Ali and I are co-chairs, and Michelle and Jan are tech leads. Beyond the leads, we have many members on the SIG Storage Slack channel, and about 30 unique approvers for SIG-owned packages.

What we do in SIG Storage is defined in our charter: SIG Storage is a special interest group that focuses on how to provide storage for containers running in a Kubernetes cluster. The most notable features owned by SIG Storage are persistent volume claims, persistent volumes, storage classes, and dynamic provisioning. We also own the volume plugins. In addition to persistent volumes, which persist data beyond a pod's lifecycle, we have ephemeral volumes, such as Secrets, ConfigMaps, and emptyDirs, which are tied to a pod's lifecycle and can be used as a pod's scratch space. We also support CSI, the Container Storage Interface, which defines a set of common interfaces so that a storage vendor can write a plugin and have the underlying storage consumed by containers running in Kubernetes and other container orchestration systems. CSI covers block and file. And we have COSI, the Container Object Storage Interface, which provides Kubernetes APIs and gRPC interfaces to support object storage in Kubernetes.

Now let me talk about what we did in 1.28. We have two GA features in the 1.28 release. The first one is retroactive default storage class assignment. This allows an existing unbound PVC that does not have a storage class name set to be updated later to use the default storage class when one becomes available.

The second GA feature is non-graceful node shutdown. This is different from graceful node shutdown. A node shutdown can only be graceful if Kubernetes can detect it and handle it gracefully. Let's say a user SSHes into a node and runs a power-off command, and Kubernetes detects that. Kubernetes depends on the systemd inhibitor lock mechanism to support graceful node shutdown; once Kubernetes detects the shutdown, it makes sure the pods are terminated in a normal fashion, so the node is drained and all the resources are released. However, if the node shuts down unexpectedly because of a hardware failure or a kernel panic, it becomes a non-graceful node shutdown, because Kubernetes cannot detect it and cannot handle it gracefully. Even in the case of a planned shutdown, where a user SSHes into a node and runs a power-off command, if the system does not support the systemd inhibitor lock, Kubernetes still cannot detect it and handle it gracefully. When a non-graceful node shutdown happens, pods that are part of a StatefulSet will be stuck in Terminating status and cannot be restarted on another running node. The volumes will also stay attached to the original node and cannot be detached and reattached to a new node. As a result, your application cannot function properly. That's why we introduced this feature to handle non-graceful node shutdown.
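To make the core objects mentioned earlier concrete, here is a minimal sketch of dynamic provisioning: a StorageClass naming a provisioner, and a PVC that requests storage from it. The driver name, class name, and sizes are placeholders, not anything prescribed by the talk.

```yaml
# A StorageClass tells Kubernetes how to dynamically provision volumes.
# The provisioner below is a hypothetical CSI driver name.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: example.csi.vendor.com
parameters:
  type: ssd
---
# A PVC requesting a dynamically provisioned volume from that class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast
  resources:
    requests:
      storage: 10Gi
```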
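As described next, recovery from a non-graceful shutdown is triggered by an operator marking the node. As a sketch, using the taint key and effect documented for this feature:

```sh
# Mark a node that has shut down non-gracefully so Kubernetes can
# force-delete its pods and detach its volumes.
kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
```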
To use this feature, you need to apply the out-of-service taint to the node that is shut down. After that, the Pod GC controller will forcefully delete the pods, and the attach/detach controller will forcefully detach the volumes, so that the workloads can move to another running node successfully. This feature was introduced as alpha in the 1.24 release, moved to beta in 1.26, and became GA in 1.28.

We also have features staying in beta where we made some bug fixes in 1.28. The first one is SELinux relabeling with mount options. Here, we mount volumes with the correct SELinux context to speed up pod startup time. The second feature remaining in beta is robust volume manager reconstruction. This is a code refactor of the volume manager: we let the kubelet record additional information about how existing volumes are mounted, so it can properly rebuild its state and clean up after a kubelet restart.

We also have alpha features in 1.28. Recovery from resize failure is a feature we introduced in 1.23, and we have been making enhancements to it. It allows a user to retry a volume expansion by specifying a smaller size than the originally requested size, so the expansion has a better chance of succeeding. In 1.28, we made some additional API changes. The second alpha feature introduced in 1.28 is the PV lastPhaseTransitionTime: the persistent volume status now has a new field with a timestamp that shows when the PV moved to its current phase.

CSI migration is something we have been working on for multiple releases. The core CSI migration feature moved to GA in 1.25, and the cloud provider plugins, including OpenStack Cinder, Azure Disk and Azure File, AWS EBS, GCE PD, and vSphere, have all moved to GA. Some of the in-tree plugins are already removed, and others are targeted for code removal. This table shows in-tree storage driver removal for drivers that do not go through CSI migration: the GlusterFS in-tree plugin was removed in the 1.26 release, and RBD and CephFS were both deprecated in 1.28 and are targeted for code removal in the 1.31 release. So that's all I wanted to cover here. I'll hand you over to Michelle to talk about what we're working on in 1.29.

Great, thank you. So we have a lot of exciting things going on in the 1.29 space. First, features that we are promoting to GA. The first one is ReadWriteOncePod, the persistent volume access mode. This is a new access mode on PVCs that enforces access to a volume per pod. This is in distinction from the existing ReadWriteOnce mode, which allows multiple pods to share the same volume if they're scheduled to the same node. The new mode makes the intent explicit, and there's actually enforcement of it. The next feature we're promoting to GA is the node expand secret in the CSI persistent volume source. This supports any CSI drivers that need to pass secrets for node expand operations.

Then, a couple of features we are promoting to beta in 1.29. First is the persistent volume lastPhaseTransitionTime. This adds a timestamp to the PV object whenever the phase changes, which you can use with monitoring tools when you want to know when PV states are changing.
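A sketch of the ReadWriteOncePod mode mentioned above; it is set the same way as the other access modes on the PVC, and the claim name and size here are placeholders:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: single-writer-data
spec:
  # Only one pod in the entire cluster may use this volume;
  # unlike ReadWriteOnce, this is enforced per pod, not per node.
  accessModes: ["ReadWriteOncePod"]
  resources:
    requests:
      storage: 5Gi
```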
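To inspect the new PV timestamp once the feature is enabled, something like the following should work, since the field lives under the PV's status:

```sh
# Print each PV's phase and when it last transitioned to that phase.
kubectl get pv -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,LAST_TRANSITION:.status.lastPhaseTransitionTime'
```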
The next thing we're promoting to beta is preventing unauthorized volume mode conversion. This addresses a situation where today it's possible to create a PVC from a snapshot but, at the same time, change the volume mode compared to when you actually took the snapshot. That potentially has compatibility issues: certain drivers won't support converting the mode. So what we're doing here is preventing the volume mode from changing when you restore from a snapshot, and that will be the default behavior. But for applications like backup software that may want to take advantage of mode conversion, we still provide an annotation that allows it.

The next feature we're promoting to beta, well, it's actually already beta today, is an enhancement we've been working on with SIG Apps to provide more volume management capabilities in StatefulSet. Today, when you create a StatefulSet, you can also specify volume claim templates, and the StatefulSet will create those PVCs automatically. But the StatefulSet won't delete those PVCs when, say, you delete the StatefulSet or scale it down. This new feature provides that ability: you can set a persistent volume claim retention policy saying that whenever my StatefulSet is deleted or scaled down, also clean up the PVCs it created.

In terms of new alpha features, in 1.29 we are introducing a feature called modifiable PVCs. This gives you the capability to modify a PVC after provisioning, to change certain properties of the volume, including performance properties like IOPS and throughput. To go into more detail, we can give an example of what it will look like. We're introducing a new concept called a volume attributes class. It's very similar to a storage class, except that the attributes you specify here are ones the CSI driver supports modifying after creation. So you can define, for example, two volume attributes classes, one called silver and one called gold: the silver one has a lower IOPS number and the gold one has greater IOPS. In your persistent volume claim, you specify that you want the silver volume attributes class. Later on, when you're scaling up your application and need more IOPS and performance, you can change the volume attributes class to gold, and underneath the covers we will invoke the CSI driver to execute that change on the back end.

All right, so those are all the new features being introduced or promoted. There are also a lot of features that we're currently designing and prototyping, under active discussion. First is changed block tracking. This feature will provide a way to get the changed blocks between two volume snapshot objects, which is very useful for backup software that wants to do incremental backups. The next feature we're designing right now is support for volume expansion in StatefulSets. Today, if you create PVCs through a StatefulSet and want to expand their capacity, you need to go directly to each PVC to update it.
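For reference, the escape hatch for backup software mentioned above is an annotation on the VolumeSnapshotContent object; to the best of our knowledge this is the annotation key used by the external-snapshotter for this feature:

```sh
# Opt a specific snapshot into volume mode conversion on restore;
# intended only for trusted backup tooling.
kubectl annotate volumesnapshotcontent <content-name> \
  snapshot.storage.kubernetes.io/allow-volume-mode-change="true"
```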
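A minimal sketch of the StatefulSet PVC retention policy described above, with placeholder names and image; both fields accept Retain (the old behavior) or Delete:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  # Clean up PVCs created from volumeClaimTemplates when the
  # StatefulSet is deleted or scaled down.
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete
    whenScaled: Delete
  serviceName: db
  replicas: 3
  selector:
    matchLabels: {app: db}
  template:
    metadata:
      labels: {app: db}
    spec:
      containers:
      - name: db
        image: registry.example.com/db:latest  # placeholder image
        volumeMounts:
        - name: data
          mountPath: /var/lib/db
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
```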
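And a sketch of the silver/gold example described above, assuming the 1.29 alpha API shape (storage.k8s.io/v1alpha1); the driver name and IOPS parameters are placeholders, and which parameters are accepted is up to the CSI driver:

```yaml
apiVersion: storage.k8s.io/v1alpha1
kind: VolumeAttributesClass
metadata:
  name: silver
driverName: example.csi.vendor.com  # hypothetical CSI driver
parameters:
  iops: "500"
---
apiVersion: storage.k8s.io/v1alpha1
kind: VolumeAttributesClass
metadata:
  name: gold
driverName: example.csi.vendor.com
parameters:
  iops: "3000"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: ["ReadWriteOnce"]
  # Start on silver; editing this field to "gold" later triggers
  # the CSI driver to modify the volume in place.
  volumeAttributesClassName: silver
  resources:
    requests:
      storage: 100Gi
```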
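Today, expanding the PVCs behind a StatefulSet means patching each claim by hand, roughly like this, where data-db-0 stands in for a hypothetical claim created from a volume claim template (the storage class must allow volume expansion):

```sh
# Grow one template-created PVC directly.
kubectl patch pvc data-db-0 \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
```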
But here, we're going to add that support to StatefulSets, so you can actually change the capacity in the StatefulSet template and have it pass through to the underlying PVCs.

Another feature under active discussion is storage capacity scoring. This is an enhancement to the storage capacity tracking feature that we promoted to GA a couple of releases ago. It adds logic to the scheduler to express preferences for nodes: either a bin-packing model, where we try to pack the PVCs onto a smaller number of nodes, or a spreading preference, where we try to spread the PVCs evenly across all the nodes.

The last thing under active discussion right now is figuring out how we can consolidate all of our CSI sidecars into a single component. This is relevant to you if you are writing a CSI driver today, where there are five or six different sidecars you need to add to your plugin. We're looking at ways of simplifying the maintenance of that by consolidating them into one single component. That will also simplify our own release processes and let us do more regular patch releases.

All right, so those are all the highlights we wanted to talk to you about today. If any of this sounds interesting to you, we are definitely looking for feedback and contributors to help with all of these efforts. So if you want to get involved, please attend one of our meetings, join our Slack channel and our mailing list, and we can continue a lot of the discussions there. And there are a couple more resources for you here; all of these slides will be uploaded with all the links available. OK, I think we'll just leave the rest of the time for Q&A.

Hi, thanks for the talk. A quick question about the removal of the Gluster driver. It sounded like the Gluster project was kind of moving away from the file system space. Do you have any recommendations for an open source solution that would replace Gluster? What's the natural progression for CSI with an open source storage solution, I guess?

Yeah, it's a bit of a tough situation, because it seems like the Gluster maintainers have decided not to continue supporting Kubernetes, so there isn't really a great alternative. I've heard there's a project called Kadalu, I think; they would probably be the best resource for a CSI alternative to the main Gluster project. I would reach out to them and see, if you have to keep all your existing Gluster volumes today and can't migrate to some other alternative storage system.

Thanks.

OK. Is this related to Gluster? Or, there's a line of people there if you want to go ahead.

Yeah, hey, that was super informative, thanks for the talk. I was just wondering, with AI being so important, and data and storage obviously being a really important part of that, whether you all have thought about what the features would be, from your perspective, to support that industry movement.

Yeah, that's a very interesting question. If I understand correctly, you're asking: with the shift towards AI workloads, what are the AI data problems that we'll want to look at?

Yeah, a much more concise wording of what I was asking.
So definitely, AI workloads have to consume a lot of data. From my understanding, it's a pretty different pattern from what we've traditionally thought of as stateful workloads, like databases, as an example. My understanding is that AI data access is largely sequential and read-only, so there are a lot of interesting opportunities here. AI data is also very parallelizable: many consumers all trying to read the same data over and over again. This is where things like read-only-many storage types will really play a big factor. Compared to databases, which often use read-write-once block storage, with AI data I largely see a trend where the storage is mostly a read-write-many solution that can handle a lot of concurrent readers and can do small files and sequential reads in an optimized manner. Object storage is also very prevalent in AI workloads, and there are FUSE adapters for object storage today, so you can use something like an S3 FUSE CSI driver or a GCS FUSE CSI driver. I think those are all very relevant for the AI layer.

Hi there, I'm Ahmed, one of the Kubernetes maintainers. I have a question about CSI, when you move the drivers out of tree: how much benefit are we going to see in testing and CI in particular? Because testing storage is a little bit tricky.

Yeah, so there's a little bit of a misunderstanding of what the CSI migration feature does. CSI migration is mainly a way to allow your existing workloads to continue running without breaking. But CSI migration itself doesn't give you the new features that CSI has, like snapshots and cloning. If you actually want to use those features, you do have to convert your existing persistent volumes to the new CSI persistent volume type. So the CSI migration feature was mainly an internal mechanism that allows your existing workloads to not break when we move all of the cloud provider dependencies out of Kubernetes.

Hey, my question is in the context of running Kubernetes locally, on premises, not in the cloud, and the usage of storage. In the migration table you showed, I recognized two file systems that are commonly used there: CephFS and Gluster, I think. My question is about plans to include others. Since this is a special interest group, we can discuss such things: are there efforts going on to include other local file systems? If not, what can one do to help? What are the requirements for file systems to be used for local storage? Can I help with adding one more, and how?

Yeah, so actually, the list we were showing, I don't know if you want to go back to it, but the list we were showing was the built-in Kubernetes storage drivers that are moving out of core Kubernetes. This is a general trend: we want to slim Kubernetes down to just the core kernel part and keep all the specific storage integrations out of the core. It doesn't mean we're not supporting other storage systems; it just means that the in-tree Kubernetes plugins won't include these drivers anymore. But all of these things, and even more, are available through CSI, and we have over 100 different CSI drivers now.
And it's very easy to extend and create your own CSI driver, too, if whatever storage system you're using isn't there.

So the answer is that I need to learn how to write such plugins, as you call them?

That's right. In one of the last slides, we have links to our CSI developer page. It has a sample CSI driver you can use, and a lot of documentation on what all of the CSI calls mean and how you should implement them. But I would first check the page we mentioned, with the list of 100-plus drivers, to see if the system you're using is already supported.

Michelle and Xing, thank you for a good presentation. Just curious to get your opinion on this: you support both CSI and COSI, per your second slide. Do you have an opinion on which will potentially take off two, three, five years down the road, in terms of technology? I know there's a lot of object storage out there, and there's a lot of backing behind CSI. If you start thinking about where the future is headed, what's your opinion? Nothing ever really dies, of course.

Yeah, it's a very interesting question. I think my answer is: it depends, and a lot of it depends on the workloads themselves and how their requirements change. One interesting trend I do see is that there are a lot of new databases being architected to work with object storage, to take advantage of the cost and portability properties that object storage has. These new databases use object storage to persist their data, but they also keep a local cache on some block device to speed up access. That's an interesting trend I see happening. Right now it's mostly the new databases, but I would expect more of the traditional databases to start adopting that paradigm as well.

Thank you.

A connected question, I guess: how much demand, how much interest do you see with COSI, from customers and vendors? How many contributions do you have in this space?

So right now we have four COSI drivers, if we've actually updated our README for COSI, so there are definitely vendors participating. We are looking at what the next phase is, because it has been alpha since, I think, the 1.25 release. So I would definitely like to have more contributors joining that effort to move it to the next stage. I hope that answers your question.

Yeah, and I think it's also: if any folks out here are writing operators, or you're managing some sort of stateful workload that uses object storage, and you can benefit from having a standardized interface to provision and manage your buckets, we definitely want to hear about the use cases you have. I don't know, is anybody here using COSI today, or interested in COSI as a customer? OK, cool. I'd be glad to talk afterwards. I largely see it as: if you're doing something multi-cloud or multi-environment, and you want a portable management interface in the same way you have the portability of PVCs today, then I think COSI would be a good fit.

OK, thanks.
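To give a feel for the COSI analogy to PVCs mentioned above, here is a minimal sketch assuming the v1alpha1 API shape; the driver name, class name, and protocol are placeholders, and the exact fields may differ since the API is still alpha:

```yaml
# A BucketClass set up by an admin, naming the object storage driver,
# analogous to a StorageClass for block/file.
apiVersion: objectstorage.k8s.io/v1alpha1
kind: BucketClass
metadata:
  name: standard-buckets
driverName: objectstorage.example.com  # hypothetical COSI driver
deletionPolicy: Delete
---
# A BucketClaim requesting a bucket, analogous to a PVC.
apiVersion: objectstorage.k8s.io/v1alpha1
kind: BucketClaim
metadata:
  name: app-bucket
spec:
  bucketClassName: standard-buckets
  protocols: ["S3"]
```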
Hi, just a question about the plug-in removals. Is the plan to eventually remove every plug-in? I guess more specifically: for NFS, is there a plan to remove that in the future as well, and would you then recommend just using the CSI NFS plug-in instead? Because, like other people who are migrating away from Gluster, we're going to replace it with NFS in one of our setups, and I was looking to see whether I could just use the in-tree volume plug-in for NFS or go the route of deploying the NFS CSI driver.

Yeah, so I think we don't have plans for that. Right now, the plug-ins we're removing are the vendor-specific ones that pull a lot of vendor-specific dependencies into Kubernetes. For generic things like NFS, which is just part of the Linux kernel, we don't have plans to remove them from the core of Kubernetes. But I would say the feature set of the in-tree NFS driver is going to be limited to just mounting existing shares. If you want more features, like being able to provision NFS shares, or even do snapshots and things like that, then you're going to want to move to the CSI driver.

Yeah, thank you. And are you aware there is already a CSI NFS driver? Have you been trying that one?

We're about to play around with it. I was just seeing if it's worth deploying and playing with, which we will probably end up doing. But for our use case, we're just going to mount into it; we're not really looking at provisioning new subdirectories or volumes, so maybe we'd just go the in-tree NFS route. But I like playing around with CSI, so I might do NFS with the CSI NFS plugin as well. Thank you.
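To illustrate the distinction discussed above, a sketch: the in-tree NFS volume just mounts an existing export, while the CSI NFS driver can also dynamically provision shares through a StorageClass. The server and paths are placeholders, and the csi-driver-nfs parameter names are to the best of our knowledge:

```yaml
# In-tree NFS: mount an existing export directly in a pod.
apiVersion: v1
kind: Pod
metadata:
  name: nfs-client
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    volumeMounts:
    - name: share
      mountPath: /data
  volumes:
  - name: share
    nfs:
      server: nfs.example.internal  # placeholder NFS server
      path: /exports/data
---
# CSI NFS driver: a StorageClass that can dynamically provision
# new subdirectories under an export as volumes.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
provisioner: nfs.csi.k8s.io
parameters:
  server: nfs.example.internal
  share: /exports
```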
So, kind of a connected question, I guess. One theme I've been hearing about is multi-network, multi-cluster types of setups, and how storage is planning on following that trend, because eventually the data has to follow the network wherever that goes. Are we considering replication-like policies that we can add into storage classes and so on?

Yeah, definitely. I think multi-cluster is definitely an interesting problem, and trying to figure out how to make storage replication work is definitely on the radar. One of the interesting challenges is that every storage system does replication its own way, so figuring out a model that works will be interesting. It might be that one storage system requires you to have two different PVC objects in the two clusters, with some relationship between them, while another storage system might just want a single handle covering the whole region of clusters. So I think those will be some interesting discussions we'll be having to support that use case. In our data protection working group, there was a proposal on replication that we discussed a while ago, but right now no one is actually working on it. If you are interested, you can join the group and we can discuss it again.

My second question is about data migration between PVC types. If you had one storage vendor over here that you had a PVC class for, and you want to move it over to another, is that the kind of thing that's in the works, or being considered, or not going to happen?

Sorry, was that within the same storage vendor or different storage vendors?

Different storage vendors.

Different storage vendors, that's a harder problem. Yeah, I mean, there are definitely ways you could do it that are not very efficient: you could just copy the data, but it would be very slow. I wonder if there are maybe some opportunities for vendor-agnostic snapshots or something that could facilitate that. I think right now the one way I know of is to use a backup-and-restore type of approach. For example, Velero has Restic, and now I think they use Kopia. That would be a vendor-neutral way of doing a backup, and then you can restore it to move the data from one system to another.

And it's vendor neutral, is that right?

Yeah. What was that project you were saying?

I was saying Restic, or actually, right now Kopia is better. If you look at Velero, I'm not sure if you're familiar with it, Velero has a way to do that, and the new way is called Kopia. It's a file system backup.

All right, any other questions? I think we're running out of time. All right, thank you. Right, thank you.