All right, hello, and welcome to the talk about Manila CephFS share backups. My name is Robert, and I work at CERN as a fellow software engineer. I mainly focus on storage integration into container environments and their deployment at CERN. So this is the agenda for the session. First, we'll have a look at what kind of storage we actually want to back up. After that, we'll briefly describe the individual components. Right after that, we'll look at exactly how we want to carry out the backups, what workflow we would like to follow. After that, we'll have a peek into the future work that still needs to be done. Because as you'll see later on in the presentation, there are still a couple of blocker issues that we are facing, and those need to be resolved first. So I really take this whole talk more as a project update rather than a showcase of a final, ready-to-use product. And in the end, we will conclude with the summary. So quickly about CERN itself: it's the European Organization for Nuclear Research. It's located in Geneva, Switzerland, at the Swiss-French border. It was founded in 1954, and its main mission is fundamental science, which means it's trying to answer questions such as: what is the 96% of the universe made of? What is dark matter, dark energy? What was the state of matter just after the Big Bang? And many more questions of this kind. To try to answer those questions, CERN has built these very large machines called particle accelerators. The one you can see in the photo here is the largest of its kind. It's called the LHC. It's circular, 27 kilometers in circumference, and it's placed in a tunnel 100 meters below ground. In this tunnel, there are two beams of protons traveling in opposite directions, accelerated very close to the speed of light. And they are made to collide with one another inside another kind of huge machine.
This one is called a detector, which is then able to monitor the aftermath of the collision. And this is basically what the physicists at CERN then study and analyze. Those detectors generate huge amounts of data, and even after all the filtering is done, it's still tens of petabytes per year that need to be stored and analyzed. And for that, we are running a private cloud based on OpenStack. This is a pretty recent screenshot of our Grafana dashboard. You can see we have around 460,000 physical cores and 87,000 VMs running on almost 8,000 hypervisors. We are also running around 360 Kubernetes clusters. We have OpenShift as well, but Kubernetes leads by far in the numbers. For storage, we have these kinds of numbers: block devices at around 3.8 petabytes, file shares at almost 900 terabytes, and object storage at around 48 terabytes. This is all backed by Ceph. It's mostly application and user data. For things like physics data and machine learning models, we use EOS. This one peaks at around 500 petabytes, but that also includes the tape archives. So this is all a lot of data. But for the purposes of this presentation, we will only focus on the file shares, so that's around 900 terabytes. Not all of this amount needs to be backed up, though, because a pretty big chunk of it is used by CI, by QA, by testing environments and development environments. We don't really need to back those up. So we did some accounting, and by some estimates, we have around 65 projects that are deploying 159 production Kubernetes clusters, and those then store around 74 terabytes of Manila CephFS storage. So that's basically the kind of numbers we are looking at. So I mentioned Manila CephFS shares. It's in the title of the presentation as well. What exactly are those? Ceph is a scalable distributed storage system that we rely on heavily.
It offers three interfaces to this storage in the same package: an object store (the RADOS Gateway), block devices (RBD, RADOS Block Devices), and then a shared file system called CephFS. This last one is what we are focusing on right now. Then we have OpenStack Manila, which is the Shared File Systems service for OpenStack. It's the point where shared storage systems are able to interact with the OpenStack cloud. It supports a lot of different technologies, around 35 of them, both proprietary and open source. But for us, the main backend that we use Manila with is the CephFS file system. It also provides multi-tenancy and quota management, all very important features that we rely on. Then lastly, we have CSI, the Container Storage Interface. It forms an interface between some storage system and the container orchestrator, for example Kubernetes. It allows storage vendors to write their own storage drivers, and those drivers are called CSI drivers. Basically, it acts as a middleman between the storage and the orchestrator. There is also some integration into the orchestrators themselves: you can create PVCs, and those PVCs are then fulfilled by whatever that particular CSI driver is doing. Both CephFS and Manila have their own CSI drivers that we rely on in our container workloads. So let's take a look at them right now. For Manila CSI, this is what the structure looks like. It's split into two main components. The first one is the controller plugin, which handles the cluster-wide operations, like creating volumes using the Manila service at that particular storage backend. The second component is the node plugin, which handles all of the node-local operations, like mounting a volume on a node and then exposing it to the workloads on that node. The important thing to note here is that Manila CSI doesn't do the mounts by itself.
It relies on other third-party CSI drivers that are dedicated to whatever file system we are using on that particular Manila share. So for example, in the case of CephFS, the workflow is basically this: when a kubelet tells Manila CSI to mount a volume, the CSI driver asks the Manila service what information is needed to mount that share. In the case of CephFS, this is the monitor IPs, the root path of the volume, and the CephX credentials. Those are then forwarded to the CephFS CSI node plugin, which is a completely separate CSI driver. It has the tools needed to actually carry out the mount on the node and then expose it to the consumer pod. So let's say we have a PVC with a Manila CephFS share. How would we go about backing it up? This is the workflow we would like to follow. It consists of six steps: quiescing the application, creating a snapshot, unquiescing, creating a volume from the snapshot, then backing up this intermediate volume, and removing it along with the snapshot. So let's break down those steps one by one. Quiescing the application serves two purposes. The first one is that it stops or pauses the application from processing any further requests. This is because you usually don't want to snapshot a live volume that's still being written to at the same time as you are taking a snapshot of the data you want to back up. The second point is that the application whose volumes you are backing up might need to be aware of the fact that you are taking the snapshot, because it might hold some in-memory buffers or caches that haven't yet been written to disk. So it needs to flush those caches, and then you can take a snapshot of the volume. Otherwise, you could get inconsistent data. This is, of course, very application-specific, and not all apps need that.
Some applications would even consider this disruptive, because they might prefer availability and not being paused over having a consistent state on the disk being snapshotted. So we have created a snapshot now. Then step number three, unquiescing the application, means resuming it and making it available to process requests again. And now we can actually think about backing up the data that we have just snapshotted. The problem, however, is that you can't really back up a snapshot. As far as CSI and Kubernetes are concerned, snapshots are completely opaque, storage-specific objects; you cannot access the underlying data from within Kubernetes. What you can do, though, is create a volume from that snapshot. This volume you can actually mount somewhere, walk the directory structure, and copy the files from it into your backup location, for example an S3 bucket. So that's step number four. And step number five is basically just copying the data into the backup location. Lastly, we can remove this intermediate volume and the snapshot, because they are no longer needed; we have written all of the data we meant to back up. For restoration, the workflow is much more relaxed in terms of the number of steps. We just have to download the data from the backup location into the original volume, and then we restart the application somehow. This is, of course, very specific to each case, but in general, as far as volume backups are concerned, this is what we would do. So what this means for the Manila and CephFS CSI drivers is that as long as they have the capabilities to fulfill all of the actions we have seen in the workflow, users can choose whatever backup solution they wish to use. As long as this solution also supports CSI and is able to execute this workflow, we are good to go. We see that the workflow relies very heavily on snapshotting.
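The six-step workflow above can be sketched as a small orchestration routine. This is purely illustrative: the helper names (`quiesce`, `create_snapshot`, and so on) are made up for this sketch and do not correspond to any real backup tool's or CSI driver's API; the fake in-memory storage class exists only so the sketch runs end to end.

```python
class FakeStorage:
    """In-memory stand-in for the storage backend, for illustration only."""
    def __init__(self):
        self.calls = []
    def quiesce(self, pvc):          self.calls.append("quiesce")
    def create_snapshot(self, pvc):  self.calls.append("snapshot"); return f"{pvc}-snap"
    def unquiesce(self, pvc):        self.calls.append("unquiesce")
    def create_volume_from(self, s): self.calls.append("clone"); return f"{s}-vol"
    def read(self, vol):             return f"data-of-{vol}"
    def delete_volume(self, vol):    self.calls.append("rm-volume")
    def delete_snapshot(self, snap): self.calls.append("rm-snapshot")

def backup_pvc(pvc, storage, backup_location):
    storage.quiesce(pvc)                       # 1. pause the app, let it flush caches
    snap = storage.create_snapshot(pvc)        # 2. take a point-in-time snapshot
    storage.unquiesce(pvc)                     # 3. resume the app right away
    tmp = storage.create_volume_from(snap)     # 4. snapshots are opaque to CSI,
                                               #    so clone one into a real volume
    backup_location.append(storage.read(tmp))  # 5. copy the files to the backup location
    storage.delete_volume(tmp)                 # 6. remove the intermediate volume
    storage.delete_snapshot(snap)              #    and the snapshot

storage, bucket = FakeStorage(), []
backup_pvc("pvc-1", storage, bucket)
```

The important property, which the ordering above preserves, is that the application is paused only between steps 1 and 3; the expensive copy in step 5 happens against the cloned volume while the application is already running again.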
And the good news is that both of those CSI drivers support snapshots and creating volumes from snapshots. The bad news, however, is that this operation is extremely expensive. That's because CephFS doesn't really know how to create a thinly provisioned volume from a snapshot source. So what you would normally have to do is copy the files from the snapshot into the target volume, and as you can imagine, that's very inefficient. So one of the things we've been working on for CephFS CSI is the ability to create volumes from snapshots in constant time. Technically, this is quite trivial, because CephFS exposes snapshots right there in the volume, in a special .snap folder. The subdirectories of this .snap folder are the snapshots that you have taken. They are read-only, but that's fine. So what we basically do is just store a reference to a particular snapshot in the persistent volume object during the provisioning phase. And when it comes to mounting this volume, we just navigate to the correct snapshot and present that to the pods on the node. This feature is basically done, and it will be part of the 3.7 release of CephFS CSI, scheduled sometime next month. So that's good. Similarly, the same feature needs to be implemented in Manila CSI. In this case, the implementation is made even easier by the fact that CephFS CSI already implements the logic to mount a CephFS snapshot, so Manila CSI only needs to pass in the correct parameters to let CephFS CSI know: yes, mount this snapshot. And that's basically it. So in theory, this is very easy. In practice, this is also where we hit our first roadblock issue. It's connected to how CephFS exposes snapshots within this special .snap directory. Because of a bug, or incorrect handling of snapshot names, we are basically unable to access snapshot data within this .snap directory.
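The .snap mechanism described above is simple enough to show in a few lines. A minimal sketch, with the caveat that the share path and snapshot name below are invented for illustration, and that a real node plugin performs an actual read-only mount of this path rather than just computing it:

```python
from pathlib import PurePosixPath

def snapshot_path(share_root: str, snapshot_name: str) -> str:
    """CephFS exposes each snapshot of a directory as a read-only
    subdirectory of that directory's special .snap folder.
    A "constant-time volume from snapshot" simply presents this path
    to the pod instead of copying the snapshot's files into a new volume."""
    return str(PurePosixPath(share_root) / ".snap" / snapshot_name)

# Hypothetical share root and snapshot name:
path = snapshot_path("/volumes/_nogroup/share-1", "backup-2022-05-01")
# -> /volumes/_nogroup/share-1/.snap/backup-2022-05-01
```

This is why the operation can be constant-time: no data moves at all, only a reference to the snapshot is stored in the PersistentVolume at provisioning time and resolved to this path at mount time.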
So we are sort of stuck. At this point, there are already patches for this issue upstream, and it's being worked on. But right now, Manila CSI cannot implement this feature. As soon as this is done, we can move to the next step, which is choosing the right backup tool to carry out the backup procedures. Because what we are doing is implementing the functionality needed by the CSI drivers; the other part of the equation is to actually have some tool to execute the backup workflow. And that's one of the tools you can see up here. Some of them are proprietary, some of them are open source. The approach we are taking with our users is that it's basically up to them which tool they wish to use, because the tools differ in quality and in the features they provide, and it's really the users who know their applications well and can decide which features are important for their use case. Right now, we are evaluating Velero. It's an open source tool. It provides scheduled backups, pre- and post-backup hooks, data retention: things that our users would be interested in. And actually, our colleagues from the Drupal infrastructure team, Edson, are already using Velero (without snapshots, though, because they cannot use them right now), and they were kind enough to share their experiences with us. So in general, Velero works very well when you want to back up the cluster resource definitions, the objects. But as soon as you want to back up persistent volumes and the underlying data itself, that's when things start to get more interesting. The rule is basically this: if the backend storage system is able to upload the data to the backup location by itself, in the background, without you having to deploy some sort of external tool in your Kubernetes cluster that does the copying for you, so as long as you don't need such a tool, you're fine. And this is the case for EBS,
Google Persistent Disks, Azure managed disks, and similar. There are storage systems where this is not the case, and Ceph is one of them. If that's your case as well, you need this external tool that copies the data out of your volume and into, for example, an S3 bucket. In the case of Velero, the tool is called restic. It does what I just said: copying a file system into an S3 bucket, and also the other way around, downloading the data from the bucket into a volume. It also provides deduplication, encryption, all very cool features. But this is also where we see most of the issues. The first one is large memory consumption. OK, this is not very visible, but our colleagues from the Drupal infra team have seen peaks of even 25 gigabytes of memory on the node. As of Velero 1.7.1, it's gotten significantly better; it's around 8 gigs. Still, this is somewhat expected, because restic needs to do the deduplication, and that takes a lot of memory to keep the indices around. So it's understandable. But the problem is that the node on which the restic pod is running might not be large enough to carry out the backup operation, at which point it just runs out of memory and is killed. Which brings us to the next issue: failed backups stay failed. There are no retries. So when a restic pod goes out of memory, it is killed, and then it is restarted, because Velero deploys it as a DaemonSet, so it is restarted like any other DaemonSet pod would be. You would expect that restic would be able to continue where it left off before it was killed for running out of memory. But this is not what we saw. What we observed was that it basically just got stuck until the backup action timed out, at which point the Velero controller just moved on to the next backup item. And all of this happened silently. We didn't see any errors until after the whole backup job finished, and only then could we see the issue. It wasn't a very good user experience.
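The deduplication that drives this memory footprint can be illustrated with a toy content-addressed store. This is a vastly simplified sketch of the general mechanism behind repositories like restic's and Kopia's, not their actual formats: real tools use content-defined chunking and persist their indices, whereas here fixed-size chunks and a plain dict keep the example short.

```python
import hashlib

def chunks(data: bytes, size: int = 4):
    """Split data into fixed-size chunks (real tools use content-defined chunking)."""
    for i in range(0, len(data), size):
        yield data[i:i + size]

def backup(files: dict, repo: dict) -> dict:
    """repo maps chunk-hash -> chunk bytes, so identical data is stored once.
    The hash index held during a run is, roughly, what costs the memory."""
    manifests = {}
    for name, data in files.items():
        ids = []
        for c in chunks(data):
            h = hashlib.sha256(c).hexdigest()
            repo.setdefault(h, c)   # chunk already present? skip the upload
            ids.append(h)
        manifests[name] = ids       # a file is just an ordered list of chunk ids
    return manifests

def restore(name: str, manifests: dict, repo: dict) -> bytes:
    return b"".join(repo[h] for h in manifests[name])

# Two identical "kernel trees": the second costs no extra repository space.
repo = {}
manifests = backup({"a": b"linux-kernel-tree", "b": b"linux-kernel-tree"}, repo)
```

Under this scheme, backing up the same content twice adds only a manifest, which is also why both tools are inherently incremental: unchanged chunks are never re-uploaded.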
And then the last issue we have seen is scaling. This is linked to the fact that Velero processes backup items in sequence, one by one. If you have a lot of PVCs, and all of those PVCs need to be copied out using restic, you can imagine you might have some problems. So again, our colleagues from the Drupal infra team are managing around 1,000 PVCs for their infrastructure. They are fairly large, and they wanted to have daily backups. But because backing up the whole infrastructure takes almost 48 hours, they literally cannot have daily backups. So where do we go from here? It's not all that bad. The Velero developers are working hard on improving all of those issues. You can see on their repository page that the community is very active, and from what I could see, they are planning to improve their CSI snapshotting capabilities, hopefully also providing support for the backup flow we saw a couple of slides earlier. After that, they are considering, I think, alternatives to restic, for example Kopia. And I'd like to point out that restic is by no means a bad tool. Our Ceph team uses it in their internal operations on millions of files, and it works just fine. It's just that the nature of a container environment is perhaps too volatile for it, and it's nice that we get to see some alternatives for the cases where it makes sense. Because we were pretty curious about this comparison between restic and Kopia, we made some benchmarks. Those are very much just preliminary benchmarks on a small data set in very controlled conditions, so not really something to be taken very seriously. But it gives an idea of what things might look like once Velero receives this Kopia support. So what we had was a volume with about 1.5 million files. It was just uncompressed copies of the Linux kernel, and it was the same Linux kernel.
So all of the files were the same, which let us exercise the deduplication feature and compare restic to Kopia really well. And as you can see from the numbers, just judging by the elapsed time, Kopia seems to perform better. It spends half the time restic needs to complete both backup and restore. For memory consumption, there is an even larger difference. I'm not exactly sure what's going on there yet, because it's quite significant. And the S3 bucket size is also lower in the case of Kopia, because Kopia maybe does a better job of splitting the data into objects, and it also does compression. So that's one of the reasons why the bucket size is smaller in Kopia's case. And for restoring, the story is very much the same. So to conclude: what we wanted to achieve was to provide our users with the ability to have consistent backups. For that, you need snapshotting support, and sadly, right now, the snapshotting support is not there yet. But it's being addressed. So if you need snapshot support as well, you have to wait, and so do we. If you don't, and your application is not that sensitive to having a possibly inconsistent state on disk backed up, then you're good to go. But also be mindful of the limitations that we have experienced: pretty large memory consumption and scaling issues. In the end, they are all being addressed, so we are looking forward to that. All right, so that's it. Thank you. Are there any questions? Here is the mic, by the way. How do you quiesce your applications? How do you quiesce and unquiesce your application? Because there is no native support in Kubernetes. There is none. But in Velero, you can run a script that does whatever the application needs in order to quiesce. So the support is there in Velero. I'm just wondering why you're not using the Ceph support for snapshots, so mirroring to another Ceph cluster. If I'm not mistaken, this is not yet available in Manila, or is it?
So we have a Manila developer right there. No, Ceph supports mirroring the snapshots to another cluster. Right. But in some cases, you might want to back up to something other than Ceph, and in those cases, you might need some general tool that copies into an S3 bucket. Just saying, it would save you lots of time. It was not clear how you do the backups here. There was one backup, but what about multiple backups, like scheduled backups? Are they incremental or not incremental? What do these different tools provide in that regard? So both restic and Kopia make incremental backups by default; in fact, that's the only thing they can do. And on the restore side, do they restore all the content before they allow the application to start? It depends. But yes, in general, you first need to restore your data before the application can be resumed. Yeah, because on one of the slides, you mentioned almost 65 terabytes of data on one of the volumes, right? So restoring that could probably take many hours. Yeah, well, there is no other way. I mean, you lost your data; you need to get it back first and only then continue. But there are tools other than Velero that enable you to define your own workflows. For example, this one, Kanister, which I found to be a very interesting project. You can define your own workflow, and if your application has specific needs, you can declare them in that workflow. And maybe that's the way. Thank you. Hi. What do you do with an application that you can't pause, that's impossible to stop because it keeps on processing? Well, then you might suffer from inconsistencies in the backups. But maybe having some backups is better than having none; maybe having inconsistent backups is just what you have to deal with. But yeah, it's an issue that we don't have a clear answer to. OK. Thanks a lot for the presentation. My question would be: have you built any automation to verify backups?
Like, to verify that you can actually restore the data after the backup? So right now, we are at the point of just evaluating this in our Kubernetes offerings, so we are not at the point of having a proper CI for this. But definitely, this is something we would aim to have. OK, I think we have actually run out of time: 35 minutes. Thank you for your attention, and enjoy your lunch.