Let's get started. Right on time, actually, 30 seconds before. Welcome, everybody. My name is Xiangqian. It's kind of hard to pronounce, so just call me Xiang; everybody calls me Xiang. I work for Google. And standing next to me is Xing. Hi, my name is Xing Yang. I work in the cloud native storage team. I'm also a co-chair of Kubernetes SIG Storage, and I work in the Data Protection Working Group with Xiangqian.

Cool. We have a long agenda today, so we'll try to go fast. The agenda is mainly a deep dive into this whole Data Protection Working Group: what we are doing, what progress we've made so far, and what problems we're trying to solve. As usual, we'll start with the motivation, move on to the organizations that have been involved and actively contributing to this working group, and give you some key updates on what has happened in the past year or two. Then we'll look into the KEPs, the individual Kubernetes Enhancement Proposals. And lastly, we'll close with how you folks can get involved.

I believe everybody here has been either a user of Kubernetes or an active contributor to the community, so you're probably well aware of the fundamental constructs that support stateful workloads in the Kubernetes environment: namely PersistentVolumeClaims, the user-facing API that gives you a volume, as well as workload APIs like StatefulSet, Deployment, et cetera. Any of these APIs today allows you to attach a volume to your workload, and when you bring down your workload and bring it back again, your volume persists, so your data is not lost in the process.

Another trend, which some of you who went to the Data on Kubernetes meeting on Monday will recognize, is pretty obvious: more and more stateful workloads are moving into Kubernetes. Kubernetes was originally built so that you can safely bring your application down and back up at any given point in time, at any scale. But a very fundamental problem still to be solved for stateful workloads is: how do I make sure my data is protected properly? Day-two operations in Kubernetes still have a couple of gaps here. That said, there are tools out there. For example, GitOps is very popular today; it allows you to save your configuration in a Git repository while still letting you do application rollback, recovery from upgrade failures, et cetera. However, the main gaps have been found in the areas where we want application-level consistency, snapshots or backups of your system, and then the restore pieces, along with the data stored in your persistent volumes. So the entire motivation of this working group is to design and build the basic components to support stateful application protection in the Kubernetes environment.

With that, these are the organizations whose contributors are very active in this working group. If you're interested, feel free to reach out to us, too.

Key updates. In the past year or so — first of all, this is a very hard topic, because it is really challenging, and there are very mature commercial products for VM workloads, in terms of a database running on VMs, et cetera. In Kubernetes, not a lot had been done in this area until now. So we published the first-ever white paper in the community.
One of the key things I think a lot of people can benefit from in this white paper is learning what kinds of modern applications are being considered for, or are moving to, the Kubernetes environment, and what mechanisms those applications use to protect their data. That includes relational databases, message queues, and key-value stores. It is a very long white paper, so if you're interested, please take a look. Then there's an annual report which documents all the KEPs the working group has been tracking in the past year or two. We also provide links to all the previous talks in this slide deck; if you're interested, feel free to take a look.

Cool. With that: this is a very busy slide, and I'm not going to dive too deep into it, but conceptually it gives you a rough idea of how an application backup can happen to achieve application consistency in the Kubernetes environment. The key point I want to call out is that the green-labeled components are already available, the blue ones are workflows, and the orange and yellow ones are either in progress or still being designed. You can see there's an important one called COSI, which is already alpha in 1.25; it provides the ability to provision a bucket and grant access to it in the Kubernetes environment, supporting, I think, a couple of vendors right now: GCP, AWS, as well as Azure. And volume mode conversion is right now in alpha; that one belongs to the backup workflow. On the restore side, this is another busy slide. Again, not diving too deep in, but highlighting COSI, which serves as the source where you store your backup, as well as the volume populator, which allows you to plug in vendor-specific implementations to do a volume restore at runtime.

With that, I'm going to wrap up the overview and start deep diving into each of these KEPs. The first one is volume mode conversion. Some of you may be aware: as of today, the volume snapshot feature allows you to take a snapshot of a persistent volume in your Kubernetes cluster. This is a point-in-time snapshot of your data. It also allows you to restore a block volume into a file system volume and vice versa. However, there is an interesting dilemma here: that volume mode conversion can actually introduce a vulnerability in the kernel, and this is considered a CVE. What happens is, if you dd to a block volume, for example, or you tamper with it and maybe put some malware on it, take a snapshot of it, and then accidentally restore it into a file system volume, that can cause the kernel to crash. We don't want that, because it crashes the entire node.

On the flip side, this is actually a feature that backup vendors badly need. What a backup vendor wants to do is: I'll take a snapshot of your file system volume, but I want to do a very efficient backup, meaning I only want to back up the data that changed between two snapshots. And in order to do that, they want to do block difference calculation, where the vast majority of implementations really just calculate hash values over the blocks of the device and check: if anything changed, back it up; otherwise, leave it alone. So this is a much-needed feature; however, it introduces the vulnerability.
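To make the risky scenario concrete, here is a minimal sketch of a cross-mode restore: a snapshot taken from one PVC and restored into a PVC with a different volumeMode. The object names and the snapshot class are made up for illustration:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: app-data-snap
spec:
  volumeSnapshotClassName: csi-snapclass    # hypothetical class name
  source:
    persistentVolumeClaimName: app-data     # assume this PVC is volumeMode: Block
---
# Restoring that snapshot into a Filesystem-mode PVC is exactly the
# conversion that can trip the kernel if the block data was tampered with.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data-restored
spec:
  volumeMode: Filesystem          # differs from the Block source above
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: app-data-snap
```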
So, to fix the CVE, of course, we introduced this volume mode conversion protection. The idea is that the VolumeSnapshotContent resource now has a field called sourceVolumeMode. It tells you whether your snapshot came from a block device or from a file system volume. With this field, the behavior becomes: when you do a restore, if the source volume was a block volume and you're restoring into a file system volume, it will not work; the restore is blocked. However, to support the backup systems, there is an officially supported annotation that allows you to actually do the conversion. This basically leaves the door open for backup systems rather than closing it entirely, because the assumption is that they know exactly what they're doing, while still preventing the kernel attack by default. Right now this is in alpha in 1.24; the KEP and the implementation are in place. Great thanks to Raunak, who has been working on this for, I think, a couple of months, because it involves an API change in VolumeSnapshotContent.
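Here is a minimal sketch of how this surfaces in the API, assuming the external-snapshotter's field and annotation names: the VolumeSnapshotContent records the source mode, and a backup tool that knows what it is doing opts in via the annotation:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: snapcontent-1234
  annotations:
    # Escape hatch for backup systems that know exactly what they are doing;
    # without it, a Block -> Filesystem restore is rejected.
    snapshot.storage.kubernetes.io/allow-volume-mode-change: "true"
spec:
  sourceVolumeMode: Block         # recorded when the snapshot is taken
  driver: hostpath.csi.k8s.io     # example CSI driver
  deletionPolicy: Delete
  source:
    volumeHandle: vol-1234
  volumeSnapshotRef:
    name: app-data-snap
    namespace: default
```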
Moving on, this is an interesting one: the volume populator. As of today, you can create a PVC by providing a reference as a source in the PVC API. However, that source traditionally only supports a VolumeSnapshot. In many storage systems, a volume snapshot is just a point-in-time local snapshot; it does not necessarily back up your data. It depends on the vendor's implementation, but if your cluster crashes or your storage system crashes, you still don't have the ability to recover. So what we want in this case is really to back your data up out of your storage appliance to, let's say, a storage bucket or a different appliance. To support that, we need a mechanism that allows vendors to read the backup data and restore it into the volume you want to use later, after a disaster or an application failure.

To do this, the vendor provides a CRD that can serve as the data source reference in the PVC. dataSourceRef is a newly added field, right now in beta. Much like a StorageClass is recognized by name and points to a particular CSI driver, the CRD is recognized by name and points to the particular volume populator. One thing you don't want to happen is that you have a backup created by storage vendor one, and storage vendor two's volume populator kicks in and tries to read the data; that's never going to work, and it also introduces problems. The volume populator controller watches PVCs that have that data source field specified, and it is smart enough to say: oh, this is actually this specific volume populator's job. So it picks up the PVC's volume and starts populating the data out of band.

To support this, two things were introduced. One is a library, lib-volume-populator, which has all the common logic and allows a vendor to quickly build a populator, saving them the effort of watching all the PVCs, Kubernetes API-level changes, et cetera. And then there's a validator. The validator is very important because, if you have a PVC with a data source reference but there is no corresponding CRD and no corresponding populator to help you recover the data, you as a user want to understand what's going on. The validator is there to tell you whether or not there is a matching populator for your volume recovery.

This is the API. It looks pretty straightforward and basic. At this stage it's beta in 1.24, and it's gated by the AnyVolumeDataSource feature gate in Kubernetes. If you want to try it out, turn on the feature gate, and you should be able to plug in your volume populator and do the whole volume backup and restore process. The slide outlines the other steps for how you carry out all these operations. With that, I'm going to hand over to Xing to continue our journey. Xing, please.

So I'm going to talk about CBT. This is the next feature we are working on. CBT stands for Changed Block Tracking. It identifies the blocks of data that have changed, which enables incremental backups. Without this, backup software has to do full backups all the time, and that is not space efficient, takes a long time to complete, and takes more bandwidth. Another use case is snapshot-based replication, where you periodically take snapshots and replicate them to a remote site for disaster recovery purposes. Without CBT, that solution becomes highly inefficient. So what is the alternative? If we don't have a common CBT API, we either have to do full backups all the time, or we have to call each storage vendor's proprietary API to retrieve CBT, and that is not ideal. Right now we have a KEP that's being reviewed. It is based on an aggregated API server, because we are trying not to save all the CBT records in the main API server and overload it. However, there are concerns from the reviewers: even if we do not save those records in the API server, there could still be a large amount of changed-block data going through the aggregated API server. So right now we are looking into other design options. Yvonne and Fang have been leading this project.

Next, I'm going to talk about the backup repository. A backup repository is a location, or a repository, used to save data, and there are two types of data we need to save: one is the Kubernetes metadata, and the other is the snapshot data. We need to save them in this backup repository so that we can use them at restore time. This can be an object store, an NFS share, or another storage location, and it could be on-prem or in the cloud. There is a project, COSI, the Container Object Storage Interface, that is aimed at supporting object storage in Kubernetes. COSI introduces Kubernetes APIs to provision buckets and to allow pods to access those buckets. COSI also introduces gRPC interfaces that allow an object storage vendor to write a plugin to provision and delete buckets.

There are several COSI components. We have a COSI controller manager that binds the COSI-created buckets to the bucket claims. There is a sidecar that watches the COSI API objects and calls the COSI driver to provision buckets. And there's the driver, implemented by the object storage vendor using the gRPC interfaces, which communicates with the storage backend to provision and delete buckets. There are two sets of COSI Kubernetes APIs. The first set is Bucket, BucketClaim, and BucketClass; those are similar in relationship to PV, PVC, and StorageClass.
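As a hedged sketch of that first API set — the field names follow my reading of the v1alpha1 objectstorage.k8s.io API, and the driver name is hypothetical, so double-check the current spec — a user requests a bucket like this:

```yaml
apiVersion: objectstorage.k8s.io/v1alpha1
kind: BucketClass
metadata:
  name: backup-buckets
driverName: s3.objectstorage.example.com   # hypothetical COSI driver name
deletionPolicy: Delete
---
apiVersion: objectstorage.k8s.io/v1alpha1
kind: BucketClaim
metadata:
  name: backup-bucket
  namespace: default
spec:
  bucketClassName: backup-buckets
  protocols: ["S3"]   # the claim asks for S3-style access
```

The controller manager then creates a Bucket object and binds it to the claim, much like a PV is bound to a PVC.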
We also have BucketAccess and BucketAccessClass; those are for granting access to the buckets. As shown here, a Bucket is the representation of a physical bucket in the storage backend, a BucketClaim is a user's request for a bucket, and a BucketClass describes the type of bucket you want to provision. We support three protocols right now: S3, Azure, and Google Cloud Storage. As shown here, the BucketAccessClass specifies the authentication type; it can be a key or IAM. And we also have BucketAccess: in the BucketAccess we specify the bucket access class name, the bucket claim name, the credentials, and the protocol. Then the user just creates a pod with a projected volume pointing to the secret from the BucketAccess; the secret contains the bucket info and is mounted at the specified location. Sid has been leading this project, and there are many other contributors as well. The project reached its first alpha stage in the 1.25 release. The COSI team has been meeting every week; they are busy fixing bugs, working on documentation, and trying to get more storage vendors to write drivers. If you are a storage vendor with an object store, you are welcome to join the COSI team and write a driver. I added a link here to our blog post for your reference.

Next, I'm going to talk about quiesce and unquiesce hooks. We need these to ensure application consistency: to be able to quiesce the application before taking a snapshot and unquiesce it afterwards. We looked into the quiesce and unquiesce mechanisms of different applications, and they all have different semantics. We want to design something generic; the application-specific logic is out of scope. We have a KEP called ContainerNotifier that proposes an addition to the Pod definition that lets you run a command inside a container. This use case is actually general; it is not limited to just quiesce and unquiesce. The KEP is still being reviewed. Xiangqian and myself are leading this effort.

Next, I'm going to talk about consistent group snapshots. We just talked about ContainerNotifier to achieve application consistency, so why do we still need a consistent group snapshot? Sometimes application consistency is too expensive, or it's just not possible to quiesce the application, so you don't want to do that frequently, but you still want to be able to take a crash-consistent snapshot frequently. Also, some applications want to take a snapshot of multiple volumes at the same point in time. And there's a performance element here too: taking a snapshot of multiple volumes at the same time is more efficient than taking one snapshot at a time. That's why we need a consistent group snapshot. There is a KEP being reviewed, targeting alpha in 1.26; I'm working on this KEP.

This KEP introduces two sets of new APIs: one set for volume groups and the other for volume group snapshots. A VolumeGroup is a user's request for a group; a VolumeGroupContent represents a group on the storage backend, or it could be a logical grouping of volumes; and a VolumeGroupClass specifies the type of the volume group. Similarly, a VolumeGroupSnapshot is a user's request for a group snapshot of a group of volumes, a VolumeGroupSnapshotContent represents a group snapshot on the storage backend, and a VolumeGroupSnapshotClass specifies the type of group snapshot you want.
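Because the KEP is still under review, the exact schema may change; but a hypothetical sketch of a group snapshot request that selects its member PVCs by label could look roughly like this:

```yaml
# Hypothetical sketch; the group snapshot API is still under review.
apiVersion: groupsnapshot.storage.k8s.io/v1alpha1
kind: VolumeGroupSnapshot
metadata:
  name: app-group-snap
  namespace: default
spec:
  volumeGroupSnapshotClassName: csi-group-snapclass   # made-up class name
  source:
    selector:
      matchLabels:
        app: my-database   # snapshot every PVC with this label at the same point in time
```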
We are planning to introduce controllers that manage the lifecycle of the volume group and the group snapshot. We are also proposing new gRPC interfaces in the CSI spec, including create, delete, modify, list, and get for volume groups, and create, delete, list, and get for group snapshots.

Moving on to application snapshot and backup. We already have APIs to take a snapshot of an individual volume, but what about an application? How do you do a snapshot and a backup of that? There is a KEP that tries to define a stateful application and propose a way to do a snapshot and backup of that stateful application. This is still at a very early stage of design.

So I'm showing this diagram again. As shown here, we have COSI, which is alpha in 1.25, and volume mode conversion, which is alpha in the 1.24 release. We are also working on changed block tracking, consistent group snapshots, and so on. And in the restore workflow, we see the volume populator, which is beta in 1.24. So compared to where we were two years ago, when we first established this working group, we have made progress. COSI and the volume populator were originally in yellow boxes, meaning work in progress, but now they are green. We hope that we can turn more of the yellow and orange boxes green with your help.

So here's the homepage of the Data Protection Working Group. You can find a lot of information there. We have bi-weekly meetings, a mailing list, and a Slack channel. If you are interested, please join us and get involved. Here's the QR code; please scan it and provide feedback. That's the end of our session. Are there any questions? Oh, hold on, I missed one slide: the volume populator is actually led by Ben. I missed that piece completely, my apologies. Cool, we are open for questions, please.

Can I ask you a question? Yes, please. Can you elaborate on the use cases of quiesce and unquiesce? Yeah, sure. For example, if you are running MySQL. Can you still hear me? Okay. All right. So before you back it up, before you take a snapshot, you want to quiesce the application, right? So that you can take an application-consistent snapshot. You want to make sure your application is consistent, so that when you restore it, you are still able to use that application. That's the reason for the quiesce and unquiesce.

I was just saying that every application has a different way of doing the quiescing. Yeah, but the ContainerNotifier will allow you to specify what commands to run. Say you have a MySQL database, right? While your database is still serving, it keeps a lot of state in RAM to achieve better performance. MySQL actually allows you to say: stop accepting writes, flush my RAM to disk, then do whatever you want to do, and then resume me. This quiesce hook is there to make the process atomic around taking the snapshot of the entire volume, so you get a full picture. Without that, it is possible you will lose the state held in RAM, and it may also corrupt the data on the persistent volume. That's the whole point of the quiesce and unquiesce hooks. We're trying to provide a more general mechanism, some common APIs; currently, without this, applications are of course doing it themselves.
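To make that concrete, here is a purely illustrative sketch of what such a hook pair could express for MySQL. The ContainerNotifier KEP is still under review, so the notifiers field below is hypothetical, not a shipped API:

```yaml
# Hypothetical schema: the ContainerNotifier KEP is still under review.
apiVersion: v1
kind: Pod
metadata:
  name: mysql
spec:
  containers:
  - name: mysql
    image: mysql:8.0
    notifiers:                      # hypothetical field proposed by the KEP
    - name: quiesce                 # invoked before the volume snapshot is taken
      exec:
        command: ["mysql", "-e", "FLUSH TABLES WITH READ LOCK;"]
    - name: unquiesce               # invoked once the snapshot completes
      exec:
        command: ["mysql", "-e", "UNLOCK TABLES;"]
```

One caveat: in MySQL, FLUSH TABLES WITH READ LOCK only holds while the issuing session stays open, so a real hook implementation would have to keep that connection alive between the two notifications; the sketch just illustrates the quiesce/unquiesce pairing.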
You're curious about why it does not scale, is that the question? The storage system itself, is that what you're saying? Ah, you're not saying that. I think the concern is that even just passing that through the API server is itself a concern, right? I also want to emphasize that this is only the control path; we're not really talking about the data itself going through there. Yeah. Okay, thanks for coming, and please join us, we need a lot of help.