Welcome, and thanks for coming. This is complete disaster recovery of stateful workloads, persistent volumes, and CSI snapshots via Flux and with Vault. My name is Kingdon. I'm a Flux maintainer. I work at Weaveworks in the developer experience department as an open source support engineer. Take a picture of this. If you want to visit us later or keep abreast of all the GitOps things that are happening, we have a nice website where you can see the Flux-centric view of GitOps Days and everything else at KubeCon. Okay, so here's our agenda. We're gonna learn about Kubernetes CSI, and we're gonna use a lovely little tool called Helm, which you may have heard of, that makes working with CSI volumes a little bit more approachable. Okay, so CSI is, by and large, not a beginner topic. So why is this a beginner talk? This actually is in the schedule as a beginner talk, if you didn't notice that. So how is this a beginner talk? Well, we're gonna start small, and we're gonna build upon what we know or what we've learned. So here's our scenario for today. This is what we're gonna start with. We are going to install some chart using Helm. I have picked Bookstack, completely at random. It has enough persistent volumes to be a useful example for this demo. Okay, and how does this work? Well, a persistent volume claim in the template is instantiated, and then Kubernetes fulfills the claim with a persistent volume. Okay, so why is persistent data different from other declarative resources? Well, it's stateful. Persistent volumes are stateful. They may contain information. Persistent volumes contain state. They have an identity. They cannot be wiped and recreated without spoiling the statefulness. So I bet you can guess what happens when the persistent volume goes away, when the claim goes away. I bet you can guess, but we're not here for guessing, we're gonna see. So we are going to install Bookstack. That's the example.
And I actually had to search for a while before I could find the maintained Helm chart for Bookstack. So does it have to be Bookstack? Again, this is just an example, we've chosen this chart. No, it does not. What are the requirements to make a useful demo here? Well, we wanna see something in the values.yaml file of our chart. We wanna see the words existingClaim, because that will allow us to use an existing volume later. So, and would you look at this? Actually, there's no such field in the values.yaml. So we are gonna have to pick a different chart. So why is this not a good example? Because if you look, here's the persistent volume claim. It's not wrapped in any conditionals. So this one creates a PVC, it's not moving. Okay, there, that one. Can everybody see? Okay, it creates a persistent volume unconditionally. So we've chosen poorly. We need that existingClaim field for a reason I'll be illustrating shortly. So let's, it's not moving forward. Okay, here we are. This is the slide. We need that existingClaim field. And sometimes a bad example is a good example. Okay, so I was actually aiming for a good example. So here's the good example in the Helm charts repo, from the stable repo, which is, yes, a bit long in the tooth. But this is the example that I was actually looking for when I thought of Bookstack. Can we all see, no, it's not moving forward. I'm not sure what to do about this. Okay, so here's the good example that we're looking for. It has a storage section for configuring the storage in the Helm chart. And you can set an existing claim. So we're gonna do a Helm install now. The old-fashioned way, actually. Yes, this is GitOpsCon, but I promise before too long we will get to GitOps. We don't need GitOps to have this problem. Let's put it that way. We're trying to get ourselves into a jam. And we don't need GitOps to illustrate the problem. But it's part of our ideal solution later.
It looks like I have to flip back and forth a few times. I'm just gonna try to do that and not talk about it. Okay, so we started with this question. What happens when the PVC goes away? We're gonna answer it. Right now. Okay, so here's the screenshot. Hopefully this is visible, or at least some of it is visible. I'm doing a Helm install. And just to note, at this point, as a consumer of the chart, we actually may have no idea that it's doing this. It's creating persistence on our behalf. And that's good and bad. The keen observer at this point may notice something else interesting. We have persistent stores for storage and uploads, and the distinction between them is not important here. But what you should notice is that we have no persistence for MariaDB. That's not good. So that's okay. We're just asking questions right now. We wanna know if the persistent volume can be deleted and whether that's going to become a disaster for us. And so let's uninstall and find out. That's right. It wasn't deleted. Wait, they're perfectly safe. Both volumes, wait. Our one volume is perfectly safe. What? I can't hear you. And it's gone. So, all right, what did we learn? It's not a disaster yet. It's a policy decision. We have chosen, or Kubernetes has chosen for us, to make the default policy Delete. Retain, Recycle, and Delete are all valid reclaim policy settings. But Recycle is not very interesting. So let's say our choices are Retain and Delete. So we told Kubernetes to delete it. Yes, it's the default. And where is that default from? Here it is in the storage class. You see that reclaim policy of Delete in the storage class. Did we not mention before that each persistent volume inherits its settings from a parent storage class? No, we did not. Okay, so we can edit it in place? No, actually we can't edit it in place. Updates are forbidden. So we have to define this when the storage class is created. So we'll just have to patch the PV after it's created. So that's what we'll do.
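To make the patching step concrete, here is a minimal sketch of what patching a PV's reclaim policy could look like. It only builds the `kubectl patch` command rather than running it; the PV name `pvc-1234` is a placeholder and not a name from the talk.

```python
import json
import shlex


def retain_patch() -> str:
    """Return the JSON patch body that sets a PV's reclaim policy to Retain."""
    return json.dumps({"spec": {"persistentVolumeReclaimPolicy": "Retain"}})


def patch_command(pv_name: str) -> list[str]:
    """Build the argv for patching one PersistentVolume by name."""
    return ["kubectl", "patch", "pv", pv_name, "-p", retain_patch()]


if __name__ == "__main__":
    # "pvc-1234" is a placeholder; list the real names with `kubectl get pv`.
    print(shlex.join(patch_command("pvc-1234")))
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; the same patch body would need to be applied to every PV the chart created.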
All right, so back around to where we started, but we're adding one thing we learned. We have to enable persistence for MariaDB. It doesn't come out of the box that way. And another problem: all these resources have a reclaim policy of Delete. So there's a piece of the Kubernetes docs, and let's talk about what problem we solved. We should not be unceremoniously deleting our important data. So we can prevent it from being deleted through the regular Helm lifecycle by setting the reclaim policy to Retain. And also we've added another persistent store. So we have really only dug the hole deeper at this point. But one thing at a time, we're building up to it. All right, let's uninstall the chart. Ooh, that's interesting. They did not all get the same status. I wonder why? That's interesting. I bet that has something to do with it. This one still has its claim. Maybe it's a StatefulSet thing. Let's have a look at that PVC. Oh boy, that is a lot. Is there anything we can do to clean it up? Yes, there is kubectl-neat. We'll trim it down a bit. What it does here is worth a pause to talk about for a minute. So this is kubectl-neat, a krew plugin. This just goes through that resource definition and takes out anything that's been added by Kubernetes as an expanded default. And it does a few other things to make this resource safe to apply and reapply on a continuous loop. It doesn't quite do everything we need for persistent volumes. So we're gonna take a few of these annotations away from the persistent volume claim. There's one for bind-completed and one for bound-by-controller. We're just gonna wipe those out. Then we have another detail. We can't really have known in advance which node the persistent volume should attach to; Kubernetes is about to fill that out. It has this WaitForFirstConsumer setting back in the storage class.
So when the pod is created and scheduled on a node, that's when the persistent volume is created according to the configuration in the storage class. So we now have solved another problem. We know how to prepare a PV and a PVC definition for rebinding. And we just understood the relationship a little bit better. Well, this isn't actually a problem we've encountered yet, right? But Helm creates a persistent volume claim, which spawns a PV, and we need them both to complete the puzzle. So now we can recall that our PVCs were deleted, except for the MariaDB one. So we could reconstruct them, but we're still just poking around. So let's put the Delete policy back and just wipe them out. Okay, and then when we delete that claim, the volume is really gone again. Okay. So we haven't really solved the problem yet, but so, okay. There is a video here, we're gonna try this again. And so we pipe each PV and PVC through kubectl-neat and save them on disk. Make the changes as we illustrated here. I'm not gonna play all these videos because we don't have time, but hopefully you have an imagination, and there will be more help later in case you don't have an imagination. So now my screen is not updating. I'm still on slide 56. All right, I'm not sure why it's doing that. Here we go. Okay, so once more we're gonna do a Helm install with the same options as before. We're gonna patch the resources with the Retain reclaim policy. We're gonna pipe each PV and PVC through kubectl-neat. We're gonna save them on disk. Is this going to work this time? There's only one way to find out. Let's reapply those definitions and finally, well, the short version is it did not work. There is more behind this video to see, but we missed one thing. We needed to erase the node affinity. It's not pictured in the previous video, and my slide is just not updating. Okay, so why did this not work? Here's another important thing.
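As a rough illustration of the cleanup described here, the sketch below takes a PV and a PVC that have already been dumped to dictionaries (for example from `kubectl get -o json` piped through kubectl-neat) and strips the fields that get in the way of rebinding. The two bind annotations are the ones named in the talk. Dropping the `volume.kubernetes.io/selected-node` annotation, the PV's `spec.nodeAffinity`, and the `uid` and `resourceVersion` inside `spec.claimRef` is my reading of "erase the node affinity" plus an assumption about what a fresh claim needs, so check it against your own driver.

```python
import copy

# Annotations the talk removes from the PersistentVolumeClaim.
BIND_ANNOTATIONS = (
    "pv.kubernetes.io/bind-completed",
    "pv.kubernetes.io/bound-by-controller",
)
# Assumption: the node picked under WaitForFirstConsumer is recorded here.
SELECTED_NODE = "volume.kubernetes.io/selected-node"


def prepare_pvc(pvc: dict) -> dict:
    """Return a copy of the PVC without binding or node-selection annotations."""
    out = copy.deepcopy(pvc)
    annotations = out.get("metadata", {}).get("annotations", {})
    for key in BIND_ANNOTATIONS + (SELECTED_NODE,):
        annotations.pop(key, None)
    return out


def prepare_pv(pv: dict) -> dict:
    """Return a copy of the PV that a recreated claim can bind to again."""
    out = copy.deepcopy(pv)
    spec = out.setdefault("spec", {})
    # Assumption: node affinity is dropped so the volume is not pinned to an old node.
    spec.pop("nodeAffinity", None)
    # Assumption: the old claim's identity is dropped, keeping only name and namespace.
    claim_ref = spec.get("claimRef", {})
    for key in ("uid", "resourceVersion"):
        claim_ref.pop(key, None)
    return out
```

Both functions copy their input, so the original dumps stay untouched and can be diffed against the cleaned versions before anything is committed to Git.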
No, it did not work, because we forgot that the database password no longer matches what we have in the database itself. That's pretty complicated. But Helm is generating a random password for us on each install, and the first time MariaDB runs, it actually records that information in the database, and if it doesn't match again later, well, MariaDB won't start. So if you're following along at home, now it's starting to heat up. So we've got a nice collection of values here. We're actually gonna put them in a file, values.yaml. And if you'd like to give this a try here, yeah, so we know we need to set the password. Actually, I'm gonna skip one step here and say there are two passwords we need to set, because there's a root user in the database and there's also a regular user. So we're gonna do that. We are going to uninstall and then apply these definitions to the cluster with Helm. So here's our amended list for now. Is this going to work? We've added passwords to the list this time, and if you follow the video, this time you know it's working. You better believe me, it took two tries to get that right. Yeah, the two-password thing, I tripped over that at least twice. So now all we're missing is Git. So if you're new to GitOps, I hope that this whole beginner's journey hasn't scared you away, but remember that we set out to do just one thing for now, and we've done it, right? So we're gonna do this next part, actually: how do we get this into Flux? How do we get it into Git and use that as part of our disaster recovery solution? So there is another video that should go here, and I apologize, I've been working up to the last minute. But even if you've read the Flux docs from cover to cover, this one might not be familiar because it's a bit new. It's about encrypting values.yaml and storing it for use in a secret, then how to use that secret with a Flux Helm Controller HelmRelease to drive the operation of your Helm chart.
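For readers following along, here is a hedged sketch of what the amended values might contain: existing claims for every store plus both database passwords pinned. Every key path below is an assumption about how a chart could lay out its values, not something read from the chart in the talk, so compare it with the chart's own values.yaml before using it. The output is JSON, which YAML parsers generally accept.

```python
import json


def build_values(uploads_claim: str, storage_claim: str, mariadb_claim: str,
                 db_password: str, root_password: str) -> dict:
    """Assemble Helm values that reuse existing claims and fix both passwords.

    All key paths are hypothetical; adjust them to the chart you install.
    """
    return {
        "persistence": {
            "uploads": {"enabled": True, "existingClaim": uploads_claim},
            "storage": {"enabled": True, "existingClaim": storage_claim},
        },
        "mariadb": {
            "master": {
                "persistence": {"enabled": True, "existingClaim": mariadb_claim},
            },
            # Pin both passwords so a reinstall matches what the database stored.
            "db": {"password": db_password},
            "rootUser": {"password": root_password},
        },
    }


if __name__ == "__main__":
    # Placeholder claim names and passwords, for illustration only.
    values = build_values("uploads-pvc", "storage-pvc", "mariadb-pvc",
                          "change-me", "change-me-too")
    print(json.dumps(values, indent=2))
```

Because this file now holds secrets, it should not be committed in plain text; the talk goes on to encrypt it before it reaches Git.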
So here's the Helm repository. We're gonna create it using the Flux CLI, and then we're gonna use flux export source git, and here is that HelmRepository a little bit zoomed in with some details highlighted. There's a reason that the stable repo was deprecated: it's large and unwieldy. So we actually have to increase the timeout to make this work, and we're gonna do the same for the HelmRelease, converting the Helm install command to a declarative form, okay? So there are some details from that last page that you ought to read the Flux docs to understand if they're not immediately obvious, but there's one that I'll talk about here. The valuesFrom is part of the guide I linked to a few slides ago, and this is straight out of that guide. This is kustomizeconfig.yaml. This points to a field in the HelmRelease. It doesn't need any changes. This one does change just a little bit. Okay, prior to this I was using podinfo, so let's change that to bookstack, and then let's make sure that all the file names match what we've created. We're gonna need that values.yaml file that we created earlier also, and this reminds us we're gonna have to declare the namespace if we want this to work with GitOps. Now, I think that is everything except one more thing, all right? We said we're gonna have secrets in a values.yaml file, so that means we need to encrypt it, and that means we need a SOPS configuration. So if you do follow that link that I posted before, this is the same link, and it has information connected to the SOPS guide in the Flux docs to tell you how to set up SOPS from scratch with Flux. So, all right, we're gonna do that. Just gonna follow the guide and not talk about it, and commit everything to Git, and Flux's Helm Controller is going to take over this release, and if we've lined everything up correctly, then you should see Helm Controller taking over the Helm release. You should have Flux bootstrapped already at this point.
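The declarative form of the install can be pictured roughly as follows. This sketch builds a HelmRelease as a dictionary, with `valuesFrom` pointing at the Secret that holds values.yaml and a raised timeout. The resource names, the interval, and the `helm.toolkit.fluxcd.io/v2` API version are assumptions; use whatever version your Flux installation serves.

```python
import json


def helm_release(name: str, namespace: str, chart: str, repo_name: str,
                 secret_name: str, timeout: str = "10m") -> dict:
    """Build a HelmRelease manifest that reads its values from a Secret."""
    return {
        # Assumption: adjust the API version to match the installed Flux release.
        "apiVersion": "helm.toolkit.fluxcd.io/v2",
        "kind": "HelmRelease",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "interval": "10m",
            "timeout": timeout,
            "chart": {
                "spec": {
                    "chart": chart,
                    "sourceRef": {"kind": "HelmRepository", "name": repo_name},
                }
            },
            # The Secret is expected to hold the values under the key values.yaml.
            "valuesFrom": [
                {"kind": "Secret", "name": secret_name, "valuesKey": "values.yaml"}
            ],
        },
    }


if __name__ == "__main__":
    # All names here are placeholders, not the ones used in the demo.
    release = helm_release("bookstack", "bookstack", "bookstack",
                           "stable", "bookstack-values")
    print(json.dumps(release, indent=2))
```

In the setup described in the talk, the Secret name is produced by a Kustomize generator, which is why the Kustomize configuration has to point at the `valuesFrom` name field so that the generated hash suffix is substituted in.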
Thanks, Pinky and Sentochi. Throughout the day, you'll see Flux bootstrap. We're not gonna do it here, but hopefully we've already seen enough of that or we'll see enough of that, and hey, it worked, cool. So there's our Bookstack wiki. We've restored it, and it actually has all the data still on it. So what's missing at this point? Actually, we still don't have a backup. That was the first thing we wanted to do. So that could be a problem. But we're well prepared now for what comes next. Okay, so check out this talk from last year's KubeCon that already covered CSI snapshots at length. We can use them in GitOps, I'm sure, although unfortunately I can't prove that right here, and I'm about to tell you more about why. And there are still some problems for us, so we cannot manage our Azure snapshots. I don't think I mentioned that I am on Azure here, but with GitOps we cannot manage our Azure snapshots without getting another third-party tool involved. Kubernetes cannot do it directly. Azure does not support it. And this is actually not too uncommon. I did not start on Azure when I was preparing this. I wanted to make it accessible, so I tried not to make it about this Synology file server that I have at home, which has an open source CSI driver that does support the common snapshot interface. So the demo is basically all the same there. And if you don't have one at home, you can try it on these clouds. And if anybody shouts a name right now that also supports external snapshots, I will happily add it to this list. No questions asked. So, okay, let's move on. So there's some documentation that I've been working on to go with this talk: a disaster recovery runbook. And this is the beginning of a runbook that will eventually cover everything we've discussed here and more. But it needs some work yet. So I don't think I can do this by myself. But you can help. I'm sorry the slides are not in better sync with what I'm saying. But hopefully it's still possible to follow.
So, get used to saying that you need help. I need help. Future you needs help from present you. Or maybe present you needs help right now. If you need help, you know where to find us, I hope. We're in the CNCF Slack, or I am, Kingdon B, and I'm usually in the Flux channel. So this runbook has a section that is focused on what to do in a disaster. It assumes you will have already followed all of the architecture guidance. So the architecture section is much longer. The runbook is short. I thought for a while about how to publish this. I decided not to use Bookstack. Okay, so lesson number one. Actually, let's back way up. Don't get yourself into jams if you can simply avoid it. In this case, it would have been absolutely avoidable. MariaDB can be run externally. You can refer to it as an attached resource in a secret, just like we've been doing since 2011 or earlier, as described by 12-factor. If your organization is large and fairly competent, it is highly likely that they have backups worked out already to a science. So if figuring that stuff out is your job, then I hope this talk has helped. If you don't have such guidance, you may want to start with a better resource than my runbook that I started three months ago. So there is prior art without a focus on GitOps. Velero actually solves most of these problems and others too. About snapshots, another thing to consider is that if your snapshots are housed in a resource group, or another analog like we have on Azure, they can be cascade-deleted. So if you're creating snapshots as your plan for a backup, well, that's not the only limitation, right? We talked about mixing secrets in our values.yaml file with the rest of the configuration. Is that really the best way to do it? I don't think so. I think you should put your secrets in a purpose-built secret store like Vault or any KMS solution that you can access with an external secrets operator. But it does have limitations when you use that with Flux.
Flux's Kustomize Controller needs those things together. So that example that I showed, and the other examples that are in the family, the config map, and there are a couple of examples there. If you follow that link and just scroll up a little bit, there are examples without encryption, for things that aren't secrets, on how to use a configMapGenerator or a secretGenerator in Kustomize. And you need the secret right there in order to do that. It needs to be decrypted in the context of Kustomize. So there is some stuff to think about here. So let's talk about the risks again. You might have a single cloud account. Say the mode of failure is a total account takeover. This is actually a kind of similar scenario, in that we may have lost access to the account and we're unable to recover it, or we've lost access to the snapshots like we would have if we deleted the resource group. Case in point, you need a backup outside. So what else can we do for backups? Well, if your architecture for backups and disaster recovery planning includes these possibilities, you probably are gonna wanna check out restic. It has many scenarios for copying in ways that weren't supported, or desired to be supported, by upstream APIs like the Azure API. Now we can't show them all here today, unfortunately, due to time constraints, but we will add these things to the runbook soon, and we will continue adding them as people are interested, if that should come to pass. If we're not prepared to go outside of what our cloud provider supports natively, there are other solutions we can approach too. So you can lock the resources, again, using the Azure API or the Azure portal, to prevent their deletion. I'm sure other cloud providers likely have a similar structure that's available, and Terraform also has the prevent_destroy lifecycle setting, if you're using Terraform, but again, one can just comment that out and run it again. So that's really not a complete solution. So what else can we do?
We said offsite backups, something we wanna consider. They should not live in the same cloud account if we wanna protect against a total account takeover or really anywhere that's accessible from that account. So we've reached the end of our content and you can take a picture of this again if you missed it earlier. Thank you very much for your attention. I think we might have time for a question or two. We do, yeah. Yeah, one over there. We've got a microphone on the way here. So all the steps you were going through to create a backup manifest of the persistent volume claim, persistent volumes. So ultimately what'd you do with that manifest? That was a little bit, I wasn't sure about that. Was that to put it back into your own version of the Helm chart or put it somewhere else? What we did with that, maybe I did skip over that a little bit. So what we did, we uninstalled the Helm chart or we, yeah, we did uninstall it. So we have taken an initial definition, the values.yaml with, here, this is the slide. This is what we did. We took on the left where we have no existing claim defined in our initial installation and we have storage enabled so that the volumes will be created by Helm and then we took them over. We captured their definitions, we stored them in Git and we changed a few things. We changed existing claim to refer to the existing claim and at this point I uninstalled the chart and reinstalled it. I had a problem when I tried just straight upgrading it there but actually the problem somewhere in here you'll notice if you didn't believe me when I said it all worked at the very end here, here is a mistake. It's not called my book stack release, it's called my book stack wiki so I did have an error at the very end and I had to fix that. 
I wound up uninstalling and reinstalling, but after that, now your volumes are part of your definition. They exist, and as long as they exist, or they can be restored to where they were or to another place, now you can do a full restore on the new cluster. So you tear down the cluster, you bootstrap a new cluster, and it should restore all of the data from all of your persistent volumes. This works basically on every CSI driver with minor variations. Thanks. Yeah, all right, great. Come visit me at the Flux booth. Thanks for your attention. Thanks, Kingdon.