Hello everyone. Today Xing and I will be giving the intro and deep dive of Kubernetes SIG Storage. My name is Xing Yang. I work at VMware on the cloud native storage team, and I'm also a co-chair of SIG Storage, working with Jan.

I am Jan Šafránek. I work at Red Hat and I am a Kubernetes SIG Storage tech lead.

Here's today's agenda. We will talk about who we are, what we did in the 1.29 release, what we are working on in 1.30, features that we are still designing and prototyping, and finally how to get involved in SIG Storage.

Saad Ali and myself are co-chairs; Michelle and Jan are tech leads. Beyond the tech leads we also have a lot of members on our Slack channel: more than 5,000 members on the SIG Storage Slack channel, and we have various other Slack channels as well. We have about 30 unique approvers for SIG-owned packages.

What we do in SIG Storage is defined in the SIG Storage charter. SIG Storage is a special interest group focusing on how to provide storage to be consumed by containers running in a Kubernetes cluster. We have persistent volumes and persistent volume claims, we have storage classes and dynamic provisioning, and we also have volume plugins, in addition to persistent volumes that can store data beyond the pod's life cycle.
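As a quick illustration of the concepts just mentioned, a storage class plus a claim is all that dynamic provisioning needs; this is a minimal sketch, and the class name and CSI driver name ("example.csi.vendor.com") are hypothetical placeholders, not from the talk:

```yaml
# A StorageClass tells Kubernetes which driver provisions volumes
# and with which parameters.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast                          # illustrative name
provisioner: example.csi.vendor.com   # hypothetical CSI driver
parameters:
  type: ssd
---
# A PVC referencing that class; a PV is dynamically provisioned
# for it and outlives any single pod that mounts it.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast
  resources:
    requests:
      storage: 10Gi
```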
We also have ephemeral volumes, such as secrets, config maps, and emptyDir, that can be used as scratch space for the pod; their life cycle is coupled with the pod's life cycle.

We support the Container Storage Interface (CSI), which defines common interfaces for a storage vendor to write a driver so that its underlying storage system can be used by containers running in Kubernetes. We also have the Container Object Storage Interface (COSI), which is trying to add object storage support to Kubernetes; CSI is for block and file.

Now let me talk about what we did in the 1.29 release. In 1.29 we moved the ReadWriteOncePod persistent volume access mode feature to GA. Without this feature we have the ReadWriteOnce PV access mode, but it's not clear whether that means just one pod, or multiple pods on the same node, can access the volume. So we added this new volume access mode to make it clear that only one pod may access the volume, which is very important for some stateful workloads that require single-writer access to storage. We also made the corresponding changes in the CSI spec.

The next feature moved to GA in 1.29 is node expand secret. That allows you to expand your volume on the node, to expand your file system, if your underlying storage system requires credentials to be passed in.

We also have a feature moved to beta: the persistent volume lastPhaseTransitionTime, which adds a timestamp to the PV status when the phase of that PV changes from one to another.

We also have a brand new feature added in the 1.29 release: VolumeAttributesClass. Now, in addition to resizing, you can also modify other attributes, such as IOPS and throughput, after the volume is dynamically provisioned. So why do we need to add a VolumeAttributesClass when we already have storage classes? A storage class has parameters for dynamic provisioning, but those parameters are immutable, so you cannot change them after the volume is provisioned.
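As an aside, the ReadWriteOncePod access mode that went GA in 1.29 is requested on the claim like any other access mode; a minimal sketch, with an illustrative claim name:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: single-writer-data   # illustrative name
spec:
  # Only one pod in the whole cluster may use this volume
  # (ReadWriteOnce would allow every pod on one node).
  accessModes: ["ReadWriteOncePod"]
  resources:
    requests:
      storage: 5Gi
```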
That's why we added the VolumeAttributesClass. Now you can have parameters for dynamic provisioning defined in a VolumeAttributesClass, and those parameters are mutable. In the persistent volume claim, in addition to the storage class name, which cannot be changed after the volume is provisioned, we now also have a volumeAttributesClassName field, and that one can be modified after the volume is provisioned. The feature was introduced in the 1.29 release; it's staying in alpha in 1.30.

Now here's the example. We have two VolumeAttributesClasses on top. On the left hand side we have "silver", which has IOPS 5,000 under its parameters. On the right hand side we have a VolumeAttributesClass named "gold", with IOPS set at 10,000. At the bottom we have definitions for the persistent volume claim; both are for the same persistent volume, and we just change the VolumeAttributesClass from silver to gold. As you can see, the storage class name stays the same. So when the user updates the volumeAttributesClassName from silver to gold, that will trigger the volume to be modified by the underlying system, and as a result the IOPS will be changed from 5,000 to 10,000 in this example.

In addition to this new Kubernetes API, the VolumeAttributesClass, we also made changes in the CSI spec: we added a modify volume capability and the corresponding RPCs to modify a volume.

In the 1.29 release we also have volume group snapshot, which is staying in alpha. This was introduced in the 1.27 release; we continued to work on it and we finished the implementation in 1.29. This allows you to create a snapshot of multiple volumes at the same point in time, and we introduced new Kubernetes APIs.
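The silver/gold example described above can be sketched roughly as follows. This assumes the alpha API shape as of 1.29/1.30 (storage.k8s.io/v1alpha1); the driver name, parameter key, and claim name are hypothetical placeholders:

```yaml
apiVersion: storage.k8s.io/v1alpha1
kind: VolumeAttributesClass
metadata:
  name: silver
driverName: example.csi.vendor.com   # hypothetical CSI driver
parameters:
  iops: "5000"
---
apiVersion: storage.k8s.io/v1alpha1
kind: VolumeAttributesClass
metadata:
  name: gold
driverName: example.csi.vendor.com
parameters:
  iops: "10000"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast             # immutable after provisioning
  volumeAttributesClassName: silver  # mutable: edit to "gold" to raise IOPS
  resources:
    requests:
      storage: 100Gi
```

Editing volumeAttributesClassName on the claim from silver to gold is what asks the driver to modify the already-provisioned volume, while the storage class reference stays untouched.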
We have a VolumeGroupSnapshot API that represents a user's request for a group snapshot. We have VolumeGroupSnapshotContent, which represents a group snapshot on the storage system. And we have VolumeGroupSnapshotClass, which defines the type of group snapshot on the storage system and is usually defined by the admin. We also made CSI spec changes in order to support this feature: we introduced a new group controller service in CSI, and we also have new gRPC interfaces to create, delete, and get a volume group snapshot.

CSI migration is something that we worked on for multiple releases. CSI migration allows you to move from the in-tree plugins to out-of-tree CSI drivers. This makes it easier for a CSI driver to have an independent life cycle and an independent release cycle from the Kubernetes release cycle. Also, if there's a bug in the driver, it will not cause a crash of the in-tree Kubernetes components, so it will be easier to maintain those components. Note that this feature does not actually migrate any data; it is the control path that does the trick. You can continue to use in-tree PVs, PVCs, and storage classes, but underneath, the kube-controller-manager and kubelet will route those calls to out-of-tree CSI drivers.

In 1.25, core CSI migration moved to GA. We have CSI migration for OpenStack Cinder, Azure Disk and File, AWS EBS, GCE PD, and vSphere, all moved to GA, and some in-tree plugins have already been removed; others are targeted for removal.

Now, this table shows in-tree storage driver removal. These drivers do not go through CSI migration; that means after the in-tree plugin is removed, you can no longer use the in-tree PVs, PVCs, and storage classes. As shown here, the GlusterFS in-tree plugin was removed in the 1.26 release. Also, the Ceph RBD and CephFS in-tree plugins were deprecated and are targeted for removal.

Now, let me hand it over to Jan to talk about what we are working on in 1.30.
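Before the 1.30 section, the group snapshot objects just described can be sketched as follows. These are alpha CRDs shipped with the external-snapshotter (groupsnapshot.storage.k8s.io/v1alpha1 as of this talk); the driver, class, and label names are illustrative assumptions:

```yaml
apiVersion: groupsnapshot.storage.k8s.io/v1alpha1
kind: VolumeGroupSnapshotClass
metadata:
  name: group-snap-class            # illustrative name
driver: example.csi.vendor.com      # hypothetical CSI driver
deletionPolicy: Delete
---
# Requests a snapshot of every PVC carrying the label app=my-db,
# all taken at the same point in time.
apiVersion: groupsnapshot.storage.k8s.io/v1alpha1
kind: VolumeGroupSnapshot
metadata:
  name: my-db-group-snap
spec:
  volumeGroupSnapshotClassName: group-snap-class
  source:
    selector:
      matchLabels:
        app: my-db
```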
Thank you. So in 1.30 we are targeting GA for robust volume manager reconstruction, which changes how kubelet behaves during startup and how it discovers what is mounted where. This should be completely invisible to users and even cluster admins; we just use the feature process and feature gate. So we have a feature gate that you can disable if we break anything, and we actually broke Kubernetes 1.27: in 1.27.0 the feature was enabled by default and it broke some clusters, and in, I don't know, 1.27.1 or .2, it was disabled by default. So now it's GA, you cannot disable it anymore, and we hope it doesn't break anybody.

As GA, we prevent unauthorized volume mode conversion. Volume mode conversion is a term that we invented, so we have a name for the situation when you have a snapshot of a raw block volume and you restore it as a file system volume, so kubelet will then mount it. We call this volume mode conversion, and we want to prevent regular users from doing it, because converting a block volume into a file system volume and mounting it could have security implications. So only trusted users and trusted software should be able to do that. If you or your software needs to convert raw block volume snapshots into file system volumes, please read this KEP. If you use sidecars with these versions of external-snapshotter or external-provisioner, you or your software will need additional permissions, and the software will need to put a special annotation on the VolumeSnapshotContent. Regular users should not worry at all, because they should not be converting raw block volumes into file system volumes.

As alpha, we are continuing with the SELinux labeling speedups. If you have a Linux distro that actually has SELinux enabled, then every time you start a pod, the container runtime will recursively relabel the whole volume that the pod uses. It will go through all the volumes, all the directories on those volumes, and set the right SELinux labels on all the files on the volume. That can be slow, so we are trying to speed it up using mount options. In Kubernetes 1.29 we had a beta implementation for ReadWriteOncePod volumes, because with those volumes it cannot break anybody, so it's enabled by default; it should just work and you should not notice any difference. However, we are now extending this mount option support to all volumes, and there could be corner cases when you share a volume among several pods and each of them has a different SELinux context. So this is alpha, it is disabled by default, and we are doing it for testing; it will stay disabled for a release, maybe two. If you run Kubernetes with SELinux enabled, I strongly encourage you to talk to me or to test this feature while it's alpha, because it can really break some workloads.

The Container Object Storage Interface stays in alpha.

In design and prototyping we have a couple of other features. The first one is storage capacity scoring. If you remember, in Kubernetes 1.24, I believe, we introduced storage capacity tracking, where we track how much free capacity there is on each node for dynamic provisioning of local volumes. Typically, if you use TopoLVM, this is how TopoLVM tells the scheduler how much free capacity is on each node, so when the scheduler picks a node for a workload, it can pick a node that has some capacity. We are improving this with capacity scoring: the scheduler will not pick just any node with free capacity, it will pick the best node with free capacity. What is the best capacity? It's written in the KEP, and if you have any opinion on what the best capacity for scheduling of local storage is, please read the KEP and say your opinion, because now is the right time to influence it.

We are also trying to combine the CSI sidecars into a single sidecar: the external-attacher, external-provisioner, external-resizer, and external-snapshotter. We are trying to merge them into a single git repository.
The primary motivation for us is to save some maintenance costs, so we will maintain one repository instead of many. But also, if we have a single sidecar, we can save some memory during runtime, we can save some CPU, and we can share all the informers. It could also be more comfortable for CSI driver vendors to have just one sidecar instead of four.

Last but not least, what is in prototyping is changed block tracking. It is an attempt to help backup software take incremental backups more easily. The expected workflow is that the backup software takes a snapshot of a volume and takes a full backup of that volume; no changes there. But when it takes a second snapshot of the same volume, it can ask the CSI driver which blocks changed between these two snapshots, and the CSI driver just reports the IDs of those blocks. The backup software then backs up only those blocks and saves a lot of space and network traffic.

For that we introduced a new CSI snapshot metadata gRPC service that the backup software will talk to; the CSI driver implements it, and it will be a regular service in Kubernetes. We also introduced an external snapshot metadata sidecar. At least initially it's going to be a new sidecar that will be exposing this gRPC service. It will do authentication, authorization, and encryption, everything, using Kubernetes RBAC rules. So the CSI driver will just implement a new CSI call to get the diff between snapshots, and it doesn't need to worry about TLS, authorization, or anything. At least initially it's going to be a separate sidecar, so we can rapidly prototype and release new versions quickly without worrying about the other sidecars, but eventually we hope it's going to merge into this common sidecar.

So that was it for features. How can you get involved? The best way is to look at our landing page, which has all the links, all the work groups that we have, and all the meetings. We have quite a lot of meetings.
I would say the main meeting is the one for overall SIG Storage; we track all the features there and we also have time for any discussion. It happens every two weeks on Thursday morning, Pacific time. We have a weekly issue triage meeting, again morning Pacific time, now on Wednesday. We would gladly welcome any contributors here, because we have far more issues than we can fix. In these meetings we usually go through the issues, and we can see, okay, we acknowledge it's an issue, but we put it into the backlog because we don't have anybody to fix it. So any contribution is welcome; don't be afraid to join. We can find some easy issues; well, most of the easy issues are fixed, so we have hard problems to fix, but we can help you, we can do some mentoring, we can find something, don't worry.

All the sub-projects and work groups have their own meetings. We have a weekly COSI meeting on Wednesday, a weekly CSI meeting on Monday, and a biweekly data protection meeting on Thursday, I believe, all of them morning Pacific time, which is late afternoon or early evening European time, sorry.

We have a mailing list, which is not used very much because everybody is on Slack, so we have a couple of Slack channels. And that's basically all. Here are some links for further study if you want; this deck is available for download in the schedule, and here is a QR code for any feedback you have on this session.

So, any questions? If you have a question, there is a microphone. Don't be shy.

Okay, I'll get started. Very interesting, all the information. I was wondering about volume snapshots: is there a way, and maybe there already is and I missed it, but I looked for it and didn't find one, to report the real size used by a snapshot? If you know what I mean.

Yeah, that's not there yet.

Okay, well, we would extremely welcome having it, if I may say, because if you take a multi-terabyte snapshot but it's only using a few gigabytes, you just really don't know.
Yeah, I think the storage system has a way to report that, but currently we do not have it. We did discuss it a little bit, actually, in one of those meetings; we have not reached consensus, though everybody wants it. Yeah, if you think this is very important, we can definitely talk about it again.

One thing about snapshot size is that it tends to grow: if you take a snapshot and then write to the original volume, the snapshot grows. So it's actually hard to report the size, and that was the contention point when we discussed it in our meetings, because nobody knew how to solve it and how to report it at least somewhat accurately.

Well, actually, yeah. So it's hard to just report a snapshot size for one individual snapshot, but you can actually report the total snapshot size for the volume. It can change, you know, the whole snapshot chain, but there is actually a way for the underlying storage system to report that. I know it's difficult, and I think some CSI vendors may have concerns regarding performance, but we can definitely talk about that, because we do have to solve this problem for our own product. We did have a solution, but of course, for the whole community, we need to reach consensus if we want to solve it with a common API.

Yeah. Well, for what it's worth, for us, we're running databases, so it's quite important to support snapshots well and to inform the user of the size of the snapshots. Definitely it's dynamic, obviously changing over time, so let me rephrase my statement: more than reporting it, being able to query it, because the drivers, they know, or they may know, right? And this is very valuable information.
Happy to share more details in conversation if you want, but from my side, a plus one, or a plus one hundred, for this feature.

So you basically just want to have a way to query it after the snapshot has been created?

Yes, right.

But even that query, well, it's going to affect performance, even that query call alone. But that's not an issue for you, you think?

No, it's okay.

All right. Yeah, I think we should definitely talk about that again. Yeah, thank you.

I'm happy to help, because I do know that there's definitely a problem: we're not counting that towards your quota, right? The quota is only for the persistent volumes, but your snapshots are still taking space, and we don't really count them. That's definitely a problem; you can actually run out of space.

Yeah, absolutely. Yeah, that may happen, of course.

And I don't see a big queue behind me, so I may take a second one, if that's okay. Maybe it's not exactly related to SIG Storage, correct me if I'm wrong; it's more about the snapshotter process. Are there any plans to add the capability of adding, let's say, a layer on the container image while the container is running?

Oh, that's more a question for container runtimes than for storage. I don't know about such plans.

Yeah, I haven't heard of any either. I would talk to SIG Node, because that's really a container runtime problem; they would need support in the container runtime interface.

Yeah, and the snapshots are... that's why I came here. Well, you don't take snapshots of these container layers, do you? You want to snapshot them? And why?

No, just to add another layer.
Let's say I want to add one. In my particular case, in databases, in Postgres, we have extensions, which are kind of plugins for the database. You can add new functionality, and you plug them right into the database. We don't want to ship a container with 200 extensions, because that's a security nightmare. Instead we want to add them dynamically, and one option would be to say, okay, I have zero extensions now, now one extension, and they could be provided as an additional container layer. And I don't want to restart the container, because that means bringing down my database.

Yeah, that's definitely one for the node team. Okay. Okay, thank you. You're welcome.

Any other questions?

I just want to know, please, why COSI remains in alpha.

Because we lack volunteers. Yeah, so we actually have weekly meetings; if you are interested, you can join, and if you can help, that'll be great. Well, we really just need people to work on it. We do have a few people; I don't know if Blaine is here today. There's also Matthews, who is on the team. Oh, hi! Yeah, so Matthews, he's working on it, he's here, and he has a test plan. We are trying to add some tests, because to move to beta you need to have tests in the pipeline, you have to have e2e tests and all of that. So he's working on that. Blaine is working on updating the KEP and filling out the production readiness review. So all of those are steps needed for us to move to beta.

No other questions? Everything works, everybody understands everything? Right, well, job well done.