Hello, everyone. Today, Xiangqian and I will give a deep dive on the Kubernetes Data Protection Working Group. My name is Xing Yang. I work at VMware in the cloud storage team. I'm also a co-chair of CNCF TAG Storage and Kubernetes SIG Storage, and I co-lead the Data Protection Working Group with Xiangqian.

Hello, everyone. This is Xiangqian. I'm a software engineer at Google focusing on storage in GCP. Next slide, please, Xin.

So, today's agenda. Xin and I will go through the following items. We'll talk a little bit about the motivations for forming this Data Protection Working Group and the organizations that are involved in the group, and I will give you a few updates on what is going on in the working group. Xin and I will spend the vast majority of the time on a deep dive into the listed components. Some of the components are directly driven by community members from the group, and some of the projects are highly related to achieving our goal of protecting stateful workloads in the Kubernetes context. Lastly, Xin has a slide on how you can get involved in this community if you are interested. Next slide, please.

Some motivation. Many of you have already been deploying stateful workloads into Kubernetes environments to benefit from its rich APIs and functionality for managing those workloads. Stateful workloads can include, for example, relational databases like MySQL, time-series databases like Prometheus or InfluxDB, key-value stores, message queues, et cetera. Developers and users achieve that via Kubernetes-native APIs, namely persistent volume operations: using the native PVC-PV constructs, you can allocate persistent volumes from your underlying storage system. And to manage your workloads, there are APIs like Deployment, StatefulSet, DaemonSet, et cetera, which allow you to automatically scale your workloads up and down based on different scenarios. We observe more and more stateful workloads moving into Kubernetes environments. However, there is apparently a gap there for data operations to protect your data in this context. There are tools like GitOps which allow you to back up your Kubernetes configuration or to achieve application rollback when an upgrade fails, for example, but a gap remains in this space when it comes to protecting the persistent volume data that sits behind your PVCs. Next slide, please.

With that, we formed the Data Protection Working Group, and the following organizations are highly involved in this community. If you see that your organization is not listed there, please feel free to reach out to Xin and me, and we can update the slides. Next slide, please.

Key updates. In the past year or two, the community members have been working very hard to define the scope of the Data Protection Working Group. This year, we published the first version of our white paper; thanks to all the community members, and special thanks to the authors, who are all listed there. Within the white paper, we try to define what exactly data protection looks like in the Kubernetes context, why we need it, especially in the Kubernetes environment, and simple use cases, for example, how we protect an application, how we protect a namespace, et cetera.
There are a couple of sections talking about the existing components that are available to you in Kubernetes as of today, and also sections about the missing building blocks in Kubernetes that act as roadblocks for backup vendors trying to provide meaningful data protection tools for stateful workloads in Kubernetes. We also try to list a lot of application-specific use cases: how to perform backup and restore of different databases or message queues, et cetera. Again, this is considered a huge milestone for the Data Protection Working Group. If you are interested, the white paper link is in the slides.

The other key update is that we are moving towards the fully supported VolumeSnapshot v1 API, and we are removing the v1beta1 API in Kubernetes 1.24. The detailed KEP is listed on this slide. In short, a non-backward-compatible change from v1beta1 to v1 was introduced for the VolumeSnapshot API. There are three phases for moving the VolumeSnapshot API from v1beta1 to v1, to minimize the impact on your applications. At this moment, in Kubernetes 1.24, we are executing the last phase, phase three, which deprecates the v1beta1 API: from 1.24 onwards, the community will stop supporting the v1beta1 API. So if your application still depends on this API, please make sure you upgrade to the v1 API as soon as possible. If you need any details, please read through the KEP; it should give you the guidance, as well as the pitfalls, if any, you want to avoid during this upgrade process.

The last item there is that today's session is going to be very different from previous talks, which already covered this material in many, many sessions. If you want to look at the previous sessions, those are the links; you can find them across different years and different venues. Next slide, please.

Now let's do a deeper dive into individual areas. The first area is volume mode conversion. Why are we doing this? Imagine you have a block PVC and then you take a volume snapshot of it. As of today, the volume snapshot API and the PVC construct do not prevent you from creating a filesystem PVC from that snapshot, even though the snapshot was taken on a block device, and this is problematic because it potentially exposes vulnerabilities in the kernel. This is considered a security loophole that we are trying to fix. However, volume mode conversion is also, in a way, considered a feature of the volume snapshot API. Imagine you have a filesystem PVC and you take a snapshot of it; backup vendors, or even you yourself, sometimes want to restore, or rehydrate, the snapshot into a block device in order to achieve an efficient volume backup workflow. That way, you can do block-level diff calculations and only ship the changed blocks to the remote site. So this is a dilemma: on one hand we might introduce vulnerabilities into the kernel, and on the other hand it is a feature that is needed. This is something the community is trying to solve. Next slide, please.

So we introduced the following changes to mitigate the security concerns while still maintaining the functionality. There will be API changes in the VolumeSnapshotContent API: a source volume mode field will be introduced. This field will be used to keep track of the source PVC's volume mode when a snapshot is taken.
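To make that source-mode tracking concrete, here is a rough Go sketch of the check it enables. The type and field names below are illustrative stand-ins rather than the exact API defined in the KEP, and the exception flag it references is the annotation described next.

```go
package main

import "fmt"

// Illustrative stand-in types; the real API lives in the external-snapshotter
// project and the exact names are defined in the KEP.
type VolumeMode string

const (
	Block      VolumeMode = "Block"
	Filesystem VolumeMode = "Filesystem"
)

type VolumeSnapshotContent struct {
	// Records the volume mode of the source PVC at snapshot creation time.
	SourceVolumeMode VolumeMode
	// True only when a privileged user or controller has explicitly allowed
	// a mode change on this (non-namespaced) snapshot content.
	AllowVolumeModeChange bool
}

// validateRestore sketches the rehydration check: a PVC created from the
// snapshot must use the same volume mode as the source unless the change has
// been explicitly allowed.
func validateRestore(content VolumeSnapshotContent, newPVCMode VolumeMode) error {
	if newPVCMode != content.SourceVolumeMode && !content.AllowVolumeModeChange {
		return fmt.Errorf("volume mode conversion from %s to %s is not allowed",
			content.SourceVolumeMode, newPVCMode)
	}
	return nil
}

func main() {
	content := VolumeSnapshotContent{SourceVolumeMode: Block}
	if err := validateRestore(content, Filesystem); err != nil {
		fmt.Println("rejected:", err) // block to filesystem is refused by default
	}
}
```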
Besides that, a system-level annotation called allow-volume-mode-change will be placed on the VolumeSnapshotContent as well. This is a boolean annotation, which allows only false and true values. The behavior now becomes: when rehydrating a volume snapshot into a new PVC, if a volume mode change is detected by checking the source volume mode, that is, the new PVC has a volume mode different from the source volume mode, the controller will automatically reject the rehydration. The only exception is when the snapshot content carries the allow-volume-mode-change annotation set to true. Because VolumeSnapshotContent is not namespaced, only privileged personnel, or controllers that have the corresponding privileges, can attach such an annotation, and that satisfies the feature use case where we actually want to rehydrate a filesystem PVC's snapshot into a block device. This is now checked in. Next slide, please.

You can find more details in the KEP linked there, and there is a blog post that is going to be out very soon. Right now, the plan is to enable this as alpha in 1.24. A special call-out to the developers listed there; they have put a lot of effort into making this happen. With that, I'm handing over to Xin to continue the deep dive into the other areas. Xin, thanks.

Thanks, Xiangqian. I'm going to talk about Volume Populator. Without Volume Populator, we can only create a PVC from another PVC or from a VolumeSnapshot. But what if the backup data is stored in a backup repository such as an object store? The Volume Populator feature allows us to provision a PVC from an external data source such as a backup repository. In addition, it allows us to dynamically provision a PVC with data populated from that backup repository, and to honor the WaitForFirstConsumer volume binding mode during restore, to ensure that the volume is placed on the right node where the pod is scheduled.

I will talk about the components needed for a Volume Populator. Every volume populator must have one or more CRDs that it supports. This CRD can be specified in the data source reference of a PVC. The volume populator needs to have a controller that understands this CRD; it watches PVCs with data sources pointing to the CR. When there is a request to create a PVC with that data source, the CSI driver will create an empty persistent volume first, and the populator controller is responsible for filling the volume with data based on that CR. The PV doesn't bind to the PVC until it's fully populated.

There are a few utility components that facilitate the Volume Populator feature. There is lib-volume-populator, a shared library for use by volume populators. A volume populator can rely on this library for the Kubernetes API-level work; the populator itself only needs to make sure that data is written to the volume based on the CR. There is also a sample populator included in the lib-volume-populator repo. The second utility component is the volume-data-source-validator. It is a controller responsible for validating PVC data sources, and it generates warning events on PVCs with data sources for which there is no populator.

To use the Volume Populator feature, the AnyVolumeDataSource feature gate needs to be enabled. It is beta in 1.24, so the feature gate is enabled by default now. The user deploys the volume-data-source-validator controller and the volume populator. The user creates a CR; here's an example on the right-hand side. It's a Hello CR.
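Since the slide itself isn't reproduced here, the following is a minimal Go sketch of the two objects involved: a Hello custom resource and a PVC whose data source reference points at it. The group hello.example.com and the spec fields are assumptions for illustration; the real sample CRD lives in the lib-volume-populator repo.

```go
package main

import "fmt"

// Assumed shape of the sample Hello CR: the populator writes the given
// contents into a file on the newly provisioned volume.
type Hello struct {
	APIVersion string // assumed group/version, e.g. "hello.example.com/v1alpha1"
	Kind       string // "Hello"
	Name       string
	FileName   string
	Contents   string
}

// Trimmed-down view of the PVC fields that matter here: the data source
// reference points at the Hello CR instead of another PVC or a VolumeSnapshot.
type DataSourceRef struct {
	APIGroup string
	Kind     string
	Name     string
}

type PVC struct {
	Name             string
	StorageClassName string
	DataSourceRef    DataSourceRef
}

func main() {
	hello := Hello{
		APIVersion: "hello.example.com/v1alpha1",
		Kind:       "Hello",
		Name:       "example-hello",
		FileName:   "example.txt",
		Contents:   "Hello from the volume populator!",
	}
	pvc := PVC{
		Name:             "populated-pvc",
		StorageClassName: "standard",
		DataSourceRef: DataSourceRef{
			APIGroup: "hello.example.com",
			Kind:     "Hello",
			Name:     hello.Name,
		},
	}
	fmt.Printf("PVC %q will be populated by the %s populator from CR %q\n",
		pvc.Name, pvc.DataSourceRef.Kind, hello.Name)
}
```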
The user creates a PVC with the data source pointing to this Hello CR, as shown on the right-hand side. The volume populator makes sure a PV is created and populated with data from the data source, and binds it with the PVC.

Ben has been leading the Volume Populator feature. The AnyVolumeDataSource alpha feature gate was introduced in the 1.18 release and had a redesign in 1.22. Now it is beta in 1.24. I included references for the KEP, blogs, and repos for lib-volume-populator and volume-data-source-validator.

Next, I'm going to talk about CBT. This is a feature that the Data Protection Working Group is actively working on. CBT stands for Changed Block Tracking. It's a feature that identifies blocks of data that have changed. It enables incremental backups, which identify the changes since the previous backup and write only the changed blocks. Without CBT, backup vendors have to do full backups all the time. This is not space efficient, takes a longer time to complete, and needs more bandwidth. Another use case is snapshot-based replication, where you take snapshots periodically and replicate them to another site for disaster recovery purposes. Without the ability to extract CBT, it is impossible to make snapshot-based replication a practical solution. So what are the alternatives? Without CBT, we can either do full backups or call each storage vendor's API individually to retrieve CBT, which is highly inefficient.

Currently, the Data Protection Working Group is working on a design for CBT. This diagram is from the Data Protection Working Group white paper. Here we have a new differential snapshot service that handles CBT. It shows a backup workflow that utilizes the differential snapshot service to increase backup efficiency. Let's take a look and see how this might work for block volumes. First, we create a volume snapshot of the PVC to be backed up. Then we create a new PVC, call it PVC2, using the volume snapshot as a data source. Then we query the differential snapshot service to get the list of changes between the two snapshots, that is, the snapshot from the previous backup and the new one. For block devices, the list of changes takes the form of a list of changed blocks between the two snapshots. For filesystem volumes, the list of changes is a list of changed files between the two snapshots; we are only focusing on block volumes here. Based on this list, we can back up only the changed data from PVC2. If the list of changes cannot be obtained, then we can back up the entire volume.

In the CBT design, we are proposing a Kubernetes API object, ChangedBlocks. There is a work-in-progress KEP for it. I will show how the proposed API looks in a minute. We are also proposing CSI spec changes: it's either going to be a new capability in the existing CSI controller service or a new optional service for differential snapshots. It will be a new CSI RPC, GetChangedBlocks. We also discussed potential performance problems with this new API, because the changed blocks list could be huge, and we are investigating how to use API aggregation to address this performance problem.

Here's the proposed API. The ChangedBlocks spec contains two snapshot handles: a snapshot base, which is optional, and a snapshot target, which is required. The changed blocks will be the difference between the two snapshot handles. There is also a volume ID, which is optional. We also have a start offset and max entries; those are for pagination. There is also a secrets field; this is for vendors that need to use secrets to access snapshots. We also have parameters here; these are vendor-specific, opaque key-value pairs.
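As a rough Go sketch of how that spec might look: the names and types below are paraphrased from the talk rather than taken from the work-in-progress KEP, which is still under design and may change.

```go
package cbt

// ChangedBlocksSpec is an illustrative paraphrase of the proposed spec; the
// authoritative definition is in the work-in-progress KEP.
type ChangedBlocksSpec struct {
	// SnapshotBase is optional; SnapshotTarget is required. The changed
	// blocks are the difference between the two snapshot handles.
	SnapshotBase   string
	SnapshotTarget string

	// VolumeID is optional.
	VolumeID string

	// StartOffset and MaxEntries support pagination of large results; the
	// exact types are not spelled out in the talk.
	StartOffset string
	MaxEntries  int64

	// SecretRef is for vendors that need credentials to access snapshots.
	SecretRef *SecretReference

	// Parameters are opaque, vendor-specific key-value pairs.
	Parameters map[string]string
}

type SecretReference struct {
	Name      string
	Namespace string
}
```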
Now let's look at the ChangedBlocks status. In the status, we have the changed block list. Each of the changed blocks has a context; that's additional vendor-specific information, which is optional. It has a logical offset and the size of the block data; those two fields are required. There is also zero-out, which is a boolean. If it is true, that means this block in the snapshot target is zeroed out; this is for vendors who want the data mover to avoid transferring zero blocks.

Here is the status of CBT. Feng has been leading this project and organizing the design meetings. There are also other community members who are contributing, including Yvonne, Dave, Ed, and Shaan, and many others. Right now, the design and POC are in progress. I added links to the POC repo, the work-in-progress KEP, and the CBT meeting minutes for your reference.

Next, I will talk about the backup repository. A backup repository is a location, or repo, to store data. This can be an object store in a cloud, it can be an on-prem storage location, or it can be an NFS-based solution. There are two types of data to be backed up that we need at restore time: the first one is Kubernetes cluster metadata, and the second one is snapshot data. We need to back them up and store them in a backup repository. Currently, there is a proposal for an object store backup repository: that's the proposal for object bucket provisioning, or COSI. COSI proposes object storage Kubernetes APIs to support orchestration of object store operations for Kubernetes workloads. It has APIs for Bucket and BucketClaim; those are like PV and PVC. It also has a BucketAccess, which is for a pod to get access to the bucket. It has a BucketClass, which is like the StorageClass, and there is also a BucketAccessClass. COSI also introduces gRPC interfaces for object storage providers to write drivers that provision buckets. COSI components include a COSI controller manager that binds COSI-created Buckets to BucketClaims, similar to the binding of a PV to a PVC. There is also a COSI sidecar that watches COSI Kubernetes API objects and calls the COSI driver to provision buckets. We also have a COSI driver that implements the gRPC interfaces to provision buckets.

Here's the status of COSI. Sid has been leading the COSI project, and there are many other community members contributing to it. Kubernetes COSI is already a sub-project in SIG Storage. It has weekly design meetings. The KEP review is in progress. We missed 1.24, so now we are targeting alpha in the 1.25 release.

Now I want to talk about quiesce and unquiesce hooks. We need these hooks to quiesce an application before taking a snapshot and unquiesce it afterwards, to ensure application consistency. We investigated how quiesce and unquiesce work in different types of workloads; they have different semantics. We want to design a generic mechanism to run commands in containers, but we want to mention that application-specific semantics are out of scope for this design. We currently have a proposal called ContainerNotifier. The KEP is submitted and it is still being reviewed. Now let's take a look and see what is proposed in the API. In phase one, we are adding a ContainerNotifiers list in the Container type. Inside the ContainerNotifier, there is a handler that defines a command; for example, it could be "lock tables" for MySQL. The user creates a PodNotification to request a specific ContainerNotifier to be triggered. The PodNotification status contains the aggregated status of the container notification statuses.
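To make the phase-one shape a bit more concrete before moving on, here is a rough Go sketch of the types as described. The ContainerNotifier KEP is still under review, so these names are paraphrased from the talk and may not match the final API.

```go
package notifier

// ContainerNotifier would be declared on a container and names a handler that
// can be triggered on demand, e.g. to quiesce a database before a snapshot.
type ContainerNotifier struct {
	Name    string
	Handler NotifierHandler
}

// NotifierHandler defines the command to run inside the container, for
// example a script that issues a "lock tables" statement for MySQL.
type NotifierHandler struct {
	Exec ExecAction
}

type ExecAction struct {
	Command []string
}

// PodNotification is created by the user to request that a specific notifier
// be triggered on a pod; its status aggregates per-container results.
type PodNotification struct {
	PodName      string
	NotifierName string
	Status       PodNotificationStatus
}

type PodNotificationStatus struct {
	ContainerStatuses []ContainerNotificationStatus
}

type ContainerNotificationStatus struct {
	ContainerName string
	Succeeded     bool
}
```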
In phase two, we are proposing a Notification type. This can be used to select pods with specified notifiers to be triggered. There are also different pod selection policies: either we can support all pods, or pre-existing pods only. We mentioned earlier that the KEP is still being reviewed. Xiangqian and I are leading this work.

We talked about the ContainerNotifier proposal, which tries to ensure application consistency. But what if you can't quiesce the application, or quiescing the application is too expensive, so you want to do it less frequently but still want to be able to take a crash-consistent snapshot more frequently? Also, an application may require the snapshots of multiple volumes to be taken at the same point in time. That's where the consistent group snapshot comes into the picture. There is a KEP on Volume Group and Group Snapshot. It proposes to introduce a new VolumeGroup CRD that groups multiple volumes together, and a new GroupSnapshot CRD that supports taking a snapshot of all volumes in a group to ensure write-order consistency. The KEP is being reviewed. I'm leading this feature.

We have snapshot APIs for individual volumes, but what about protecting a stateful application? There is a KEP submitted that proposes a Kubernetes API that defines the notion of stateful applications and defines how to run operations on those stateful applications, such as snapshot, backup, restore, and so on. Xiangqian and I are planning to pick it up and work on this KEP. This is still at a very early design stage.

We talked about the features we're working on in the Data Protection Working Group. Now I'm going to talk about how to get involved. Here is the community page for the Data Protection Working Group. It has information on how to get involved. We have bi-weekly meetings on Wednesdays at 9 a.m. Pacific time. If you are interested in joining the discussions, you are welcome to join our meetings. We also have a mailing list and a Slack channel, as shown here. This is the end of the presentation. Thank you all for attending. If you have any questions, please don't hesitate to reach out to us. Thank you.