Hello, everyone. Today, Xiangqian and I will give a deep dive on the Kubernetes Data Protection Working Group. My name is Xing Yang. I work at VMware in the Cloud Native Storage team. I'm also a co-chair of Kubernetes SIG Storage and a co-chair of the Data Protection Working Group, working with Xiangqian.

Hello, everyone. My name is Xiangqian Yu. I work for Google as a software engineer in the Cloud Storage department. I also lead the working group with Xing.

Thanks, Xiangqian. Here's today's agenda. First, we will provide key updates. We will talk about who is involved in the Data Protection Working Group and what our motivation is. Xiangqian will give a deep dive of the change block tracking proposal. And finally, we'll talk about how to get involved.

We have published a white paper on Kubernetes data protection workflows. Here are the authors who have contributed to the white paper. We also have annual reports; those are in our community repo. And here we also provide links to our previous KubeCon talks. There are many companies that are involved in and supporting this data protection initiative, so we have them listed here.

So let me talk about the motivation for this working group. In Kubernetes, Day 1 operations for stateful workloads are well supported. There are PersistentVolumes and PersistentVolumeClaims to support volume operations. There are workload APIs such as StatefulSets and Deployments for declarative management of your workloads and the components of those workloads. According to the 2022 report by the Data on Kubernetes Community, more and more stateful workloads are moving to Kubernetes. This includes different types of workloads such as databases, analytics, AI and machine learning, streaming and messaging, and so on. They move to Kubernetes because they need rapid iteration and agile deployment. They want built-in scalability, which can be provided by Kubernetes. And they also want to be portable.
So all of those can be facilitated by Kubernetes. On the other hand, data operations for stateful applications, such as data protection, are still limited in Kubernetes. GitOps workflows have limitations for supporting stateful workloads. For example, Secrets and ConfigMaps cannot be stored in Git, and data stored in persistent volumes cannot be stored in Git either. So we need a better way to support data protection for stateful applications.

Data protection in Kubernetes refers to the process of protecting valuable data and configs of applications running in a Kubernetes cluster. The result of the data protection process is typically called a backup. When an unexpected scenario happens, for example, data corruption by malfunctioning software or data loss due to a disaster, such a backup can be used to restore the protected workload to the state preserved in the backup.

In Kubernetes, a stateful application contains two primary pieces of data. The first piece is Kubernetes metadata: a set of Kubernetes resources stored and managed in the etcd database and accessible through Kubernetes APIs. The second piece is persistent volume data. Kubernetes has the PersistentVolumeClaim API to allow users to provision persistent volumes for user workloads. Persistent volume data is managed and stored on the underlying storage system. Data protection aims at providing backup and restore of the above-mentioned two pieces of data.

Part of the Data Protection Working Group's charter is to define and implement a list of Kubernetes-native constructs to enable backup and restore at different levels. This figure shows the backup workflow with existing and missing building blocks in Kubernetes. The blue color shows the process. The green color represents the existing Kubernetes components. Yellow means work in progress, and orange means missing Kubernetes components.
To back up an application in Kubernetes, we need to back up two pieces of data. As mentioned earlier, the first one is the Kubernetes metadata, and the second one is the data stored in the persistent volume. There are two ways to back up the volume data. You could use a native data dump, such as mysqldump, or you could use the controller-coordinated approach, where a volume snapshot is created to ensure application consistency. The application should be quiesced before taking the snapshot and unquiesced afterwards. The backup of both the Kubernetes metadata and the volume data will be exported to a backup repository. A backup repository is a location or repo to store the data and the metadata. This can be an object store, NFS, or another type of storage. It could be in a cloud or an on-prem location.

We have a few components here that are green. These are existing features in Kubernetes. We have the SIG Apps owned workload APIs, such as StatefulSet. We have volume snapshots, which have been GA since the 1.20 release.

We also have a few alpha features. COSI, the Container Object Storage Interface, introduces a set of Kubernetes APIs to allow object buckets to be provisioned and accessed by the pods. There is also a set of gRPC interfaces so that a storage vendor can write a driver to create object buckets in its object storage backend. COSI can be used together with a backup repository. Consistency group snapshots are another alpha feature, introduced in the 1.27 release. Sometimes application consistency may not be possible or may be too expensive. An application may require the snapshots of multiple volumes to be taken at the same point in time for crash consistency. This feature introduces Kubernetes APIs to create a consistent snapshot of a group of volumes.

Another existing feature is volume mode conversion. When creating a PVC from a volume snapshot, a user can change the volume mode from filesystem to raw block. This can introduce a potential vulnerability to the kernel.
On the other hand, volume mode conversion is needed for an efficient backup workflow. To solve this problem, we introduced the prevent unauthorized volume mode conversion feature. We added a sourceVolumeMode field in the VolumeSnapshotContent. Volume mode conversion is only allowed if an annotation called allow-volume-mode-change is added on the VolumeSnapshotContent. If the annotation is not there, a request to create a PVC from a volume snapshot with the volume mode changed will be rejected. This feature has been beta since the 1.27 release. In the upcoming 1.29 release, we are going to set the feature flag to true by default in the external-snapshotter and the external-provisioner. So for any application that relies on this workflow, action is required: you must update your application accordingly, otherwise your application will fail.

We have change block tracking in a yellow box here. This is the feature that the Data Protection Working Group has been actively working on. Xiangqian will give a deep dive about it later.

Now let me talk about the restore workflow with existing and missing building blocks in Kubernetes. During application restore, we need to import the backup from the backup repository, restore the Kubernetes metadata, and restore the PVCs and PVs. If the volume was backed up natively, we need to restore from the native data dump. Otherwise, we need to rehydrate the PVC from the volume snapshot or volume backup. Here we have another beta feature, called volume populator, that is very useful during restore. It allows you to create a PVC from an external data source, not just a volume snapshot or another PVC. It also supports the WaitForFirstConsumer volume binding mode. The volume populator makes sure a PV is created, populated with data from the external data source, such as a backup, and bound with a PVC. So that's all I have for the restore workflows. Now let me hand it over to Xiangqian to give a deep dive on CBT. Thanks, Xing.
So, okay, let's get started. So what exactly is change block tracking? From the community's perspective, it's nothing but a set of interfaces which allows backup applications, or any other related features, to retrieve the set of changed blocks on a storage device between arbitrary two snapshots of the same volume. So there are some key things here. One is that it's a set of interfaces. Secondly, it operates on snapshots of the same persistent volume, which means that you cannot use this API to operate against snapshots of different persistent volumes. The purpose is to utilize the underlying storage system's capability to efficiently calculate those changed blocks for backup purposes or for replication purposes.

Now, look at the graph underneath. In a simple example, let's assume there's a volume with nine blocks, labeled one to nine. At time T1, you take a snapshot, and that's exactly what the data looks like on your disk, on the persistent volume. After a certain time, you take another snapshot at time T2. At this moment, the blocks marked in red have been touched by the application, meaning there are changes to those blocks. The purpose of these interfaces is to efficiently return the changed block indexes: 2, 6, 8, 9. Now, you can think about it this way: it's not the purpose of this API or interface to return the changed data content. The only thing it returns is the list of changed block indexes.

So with that, why do we even need this? Next slide, please. The main purpose of change block tracking is to do backups or replication efficiently. In this context, it's more about backup. Why is that? Because this enables incremental volume backup.
First of all, from the space perspective, you'll be using less space when you conduct the backup, because you only copy the deltas to the destination instead of the entire volume, as compared to a full volume backup, of course. In this case, let's assume your change rate is about 10%: between two snapshots, you only copy 10% of the data. The second thing is that it may greatly reduce network bandwidth requirements. Why is that? Simple, right? Less data to transport means less bandwidth required. And as a consequence, you can lower your RPO. Mind you, it might increase your RTO a little bit, because during restoration you need to replay all the changed blocks from in between the two snapshots. But from the RPO's perspective, since there's less data to transmit and it's incremental, it can effectively reduce the RPO.

So that's one. The other main purpose: there are so many storage systems in this world, right? Enterprise storage arrays, cloud provider storage systems, et cetera. Backup systems in general do not want to have a customized solution for every single one of these storage systems; rather, they want to deal with one set of generic interfaces. That effectively decouples the backup system, or the replication system, from the storage system. Of course, there are other approaches backup applications can use. They can use full backups, where they will suffer in performance and be less efficient in space, et cetera. They may choose to couple with the storage system just for the sake of utilizing more advanced features from the storage system. And there are also tools like Restic. Restic is storage system agnostic; it calculates the changed blocks by itself.
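A quick back-of-envelope sketch of the space and bandwidth argument, using the talk's illustrative 10% change rate (the 1 TiB volume size is our assumption for the example):

```go
package main

import "fmt"

// incrementalBytes returns how much data an incremental backup copies,
// given the volume size and the fraction of blocks changed since the
// previous snapshot. A full backup always copies volumeBytes.
func incrementalBytes(volumeBytes int64, changeRate float64) int64 {
	return int64(float64(volumeBytes) * changeRate)
}

func main() {
	const tib = int64(1) << 40 // 1 TiB volume (illustrative)
	inc := incrementalBytes(tib, 0.10)
	fmt.Printf("full backup:        %d GiB\n", tib>>30)
	fmt.Printf("incremental backup: %d GiB\n", inc>>30)
}
```

With a 10% change rate, each incremental transfer is a tenth of a full backup, which is why snapshots can be taken, and shipped, far more frequently for the same bandwidth budget, lowering RPO.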
The main drawback for backup systems utilizing Restic is that, while they don't need to worry about what exactly the underlying storage system is, it makes utilizing the underlying storage system's efficient way of calculating changed blocks impossible, because Restic does the calculation by itself.

So, moving on to design principles. We have a couple of goals and non-goals. Next slide, please. The main goal here is to utilize CSI for the change block tracking interface. Remember, one of the main things we want to achieve is decoupling the backup system from the storage system. CSI is a standard interface, so it is independent of the underlying storage system. The second goal is that we have to meet the Kubernetes security standard; it cannot be worse. And the third is that we have to avoid overloading the API server. Why? Because the amount of metadata per persistent volume needed to represent the list of changed blocks is tremendous. We estimated about five gigabytes of metadata per one terabyte of persistent volume. And we all know that the API server is actually one of the known bottlenecks of a Kubernetes cluster.

Of course, there are a couple of non-goals. Data access interfaces for changed blocks are not under consideration, because they are so tied to the underlying storage system and to the backup system. As I said, when the interface returns, it returns only the indexes of the blocks; it does not return the data content of those blocks. And there's another non-goal, which is keeping track of changed files. Restic does provide that, and some storage systems provide that: basically, a view of the catalog of file changes derived from block changes. It is not the goal of this community to provide such functionality in the change block tracking framework.
Having talked about goals and non-goals, let's go through some of the difficulties we went through during the design. Next slide, please. The first blocker we hit during the process was whether to make the APIs or interfaces imperative versus declarative. As you all know, the Kubernetes resource model is all declarative: you define your YAML, your CRD, you store the custom resources in the API server, and there's a reconciler that keeps reconciling on that. The community took that route initially, utilizing CRDs plus an aggregated API server. However, there is massive metadata that needs to be served, and the scenario is especially bad for the first change block tracking request, because for the first backup you do, you literally back up the entire volume. That imposes a huge risk of overloading the API server, and thus the approach was rejected. The current thought is to provide imperative interfaces: instead of providing CRDs and utilizing the API server as a data serving path, we'll be providing gRPC interfaces that allow software like backup systems to interact with the service to get the data out. That's one.

The second thing is: how do we hide the heterogeneous storage systems that are available to the Kubernetes community? The answer seems pretty straightforward: define the interfaces in CSI. There's a little caveat here, which is that we need to define a community-maintained gRPC service in front of the CSI interface. I'll explain why we did this on the next slide. From the storage system's perspective, they implement the CSI driver, and from the community's perspective, we provide the gRPC service, which offers generic gRPC interfaces for a backup system to use. And lastly, we need to be able to meet the security standards.
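The shape of the imperative interface can be sketched as follows. This is a toy Go model, not the actual proposed CSI or gRPC definitions; the type and method names are illustrative only:

```go
package main

import "fmt"

// BlockMetadata describes one changed extent. Field names are illustrative,
// not the exact fields of the proposal.
type BlockMetadata struct {
	ByteOffset int64
	SizeBytes  int64
}

// SnapshotMetadataService models the imperative service the talk describes:
// callers ask for the delta between two snapshots of one volume and receive
// block metadata only, never block data.
type SnapshotMetadataService interface {
	GetMetadataDelta(baseSnapshotID, targetSnapshotID string) ([]BlockMetadata, error)
}

// fakeService is an in-memory stand-in for a real driver-backed service.
type fakeService struct {
	deltas map[string][]BlockMetadata // keyed by "base->target"
}

func (f *fakeService) GetMetadataDelta(base, target string) ([]BlockMetadata, error) {
	d, ok := f.deltas[base+"->"+target]
	if !ok {
		return nil, fmt.Errorf("unknown snapshot pair %s -> %s", base, target)
	}
	return d, nil
}

func main() {
	svc := &fakeService{deltas: map[string][]BlockMetadata{
		"snap-t1->snap-t2": {{ByteOffset: 4096, SizeBytes: 4096}, {ByteOffset: 20480, SizeBytes: 8192}},
	}}
	delta, _ := svc.GetMetadataDelta("snap-t1", "snap-t2")
	fmt.Println(len(delta), "changed extents")
}
```

Because the client pulls this data directly from a gRPC endpoint rather than from custom resources, the potentially gigabytes of metadata never transit the API server, which is the whole point of going imperative.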
Actually, that's the main reason why we need to hide the CSI interface and instead provide a community-maintained gRPC service: because we need to do a good amount of authN and authZ across all the clients that try to utilize this feature. Moving on to the next slide, please. I'll go through some very high-level detailed designs. All the way on the left side is the backup system, or anything that will be utilizing this feature, and on the right-hand side are the core components. There will be in total seven steps before the backup system, or any client, can actually obtain the changed block list based off of two different snapshots.

The backup system initially talks to the API server via the TokenRequest API to obtain an authN token. What is contained in this token is the identity of the backup system. It can be your service account, or, if you're a cloud provider, it can be your cloud service accounts or some identity that you plug into the Kubernetes cluster. Then, after it obtains the token, the backup system includes the token in a call to the generic interfaces we provide as a gRPC service, called GetMetadataDelta or GetMetadataAllocated. There are some differences between these two calls; I will not dive deep into those differences. This is a TLS connection directly to the service. Once the metadata service, which is the community-maintained gRPC service, receives the call from the client, it retrieves the token from the request and does a TokenReview against the API server. That way, the metadata service knows who the call is from. The TokenReview basically says: this is actually an authenticated user in the Kubernetes cluster allowed to use my service. And once this passes, the metadata service knows exactly who this request was sent from. Then it does an authZ check, a SubjectAccessReview. Why do we need to do this here?
This is to validate that the service account of whoever is calling the gRPC interface has access to the snapshots it names. Basically, this is nothing but validating whether there exists an RBAC rule in the system that grants that particular user the permissions to access the snapshot information. Once that is done, the metadata service talks to the API server again and obtains the volume snapshot IDs from the system; those are stored in the VolumeSnapshot API and the VolumeSnapshotContent API. After all this, the real work gets started. The snapshot metadata sidecar, which is the community-maintained sidecar, issues a CSI call, in this case a private CSI call, to the underlying storage system's CSI container with the snapshot IDs provided by the client. This is a CSI gRPC call over the Unix domain socket. Once this is done, the metadata sidecar wraps the response in the format defined by the public interface and then returns the changed block list to the backup system. This whole thing shows the entire workflow, at a very high level, of how, step by step, changed blocks can be retrieved from two snapshot IDs. The green lines stand for API server calls from the controller or from the client, and the red lines stand for gRPC interactions between different systems.

So this is just the high level. With that, next slide, please. In reality, this is what it exactly looks like. You probably won't be able to read this diagram. A huge callout to the folks who worked dedicatedly on this: Carl, Prasad, and Ivan, et cetera. This is taken from the slides they presented to the community in our Data Protection Working Group meetings.

With that, let's conclude with the status of the current project. The dev team is mainly composed of Ivan, Prasad, and Carl. They've been spending a lot of effort on this; thanks to them. And the work is now alpha in Kubernetes 1.29.
The current status is that the KEP is under review. Both the KEP and the new CSI spec changes are under review right now. The team has also provided a POC as a prototype; I put the link right there. That's all I have for change block tracking.

So, if you want to get involved, next slide, please: feel free to visit the homepage and get an understanding of what we are doing, et cetera. There are bi-weekly meetings hosted on Wednesdays at 9 a.m. Pacific time. There are a lot of meeting recordings available on YouTube, and there's an agenda doc which will tell you what exactly will be discussed during a particular meeting. And we have a Google Groups mailing list; get yourself subscribed to that group so that you can receive up-to-date news from the group. We also have a Slack channel listed over there. If you have any questions, feel free to communicate with us via any of those methods, and feel free to join our bi-weekly meetings on Wednesday. Thank you, that's all I have. Thank you, bye. Bye.