Hey, everyone, this is KubeCon and CloudNativeCon North America 2021, and this session is about what you need to know before using local persistent volumes on Kubernetes. My name is Sebastian. Most people call me Seb, so you can just do that. I'm a software engineer at Elastic, where I mostly work on our Kubernetes operator for the Elastic Stack that we call ECK, as in Elastic Cloud on Kubernetes.

On the agenda today, we'll mostly talk about local volumes: we'll start with explaining how they work and how things are plugged together in the Kubernetes world. Then we'll cover how you can provision those local volumes in different ways. And finally, I think the most important part of this presentation is going to be the last part, about what I call operational gotchas, which is basically a list of things you need to pay attention to to ensure you are using your local volumes, and especially your stateful workloads along with them, the right way.

So let's get started. How do things work? What are persistent volumes at all? Usually when we talk about persistent volumes, we associate that concept with the stateful set concept, which is a way to deploy stateful workloads in Kubernetes. Let's take an example here. On the left, in that pink box, I'm deploying a stateful set with three replicas. Let's imagine this is an Elasticsearch cluster, but it could really be any sort of distributed database in this example. And in this stateful set, I'm specifying that I want 100 gigabytes of storage per pod. The stateful set controller will internally create three pods because I want three replicas: the first pod with ordinal number zero, then ordinal number one, and the last one with ordinal number two. And because I requested a claim of 100 gigabytes, each of these pods is going to be associated with a persistent volume, a storage unit of 100 gigabytes.
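To make that example concrete, a stateful set like the one on the slide could look roughly like this sketch. The names, labels, and image tag are assumptions for illustration; the important part is the volumeClaimTemplates section requesting 100Gi per pod:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-es-cluster
spec:
  serviceName: my-es-cluster
  replicas: 3
  selector:
    matchLabels:
      app: my-es-cluster
  template:
    metadata:
      labels:
        app: my-es-cluster
    spec:
      containers:
      - name: elasticsearch
        # hypothetical image reference; any stateful database works the same way
        image: docker.elastic.co/elasticsearch/elasticsearch:7.15.0
        volumeMounts:
        - name: data
          mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
  - metadata:
      name: data            # combined with the pod name to name each claim
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi    # one 100-gigabyte volume per replica
```

The stateful set controller derives one pod and one claim per replica from this single resource.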
And the nice part about this stateful set and persistent volume concept is that we have a direct one-to-one relationship between a pod and a persistent volume. So for example, this pod my-es-cluster-0 is associated with this persistent volume called data-my-es-cluster-0. This property is very nice because whenever, for example, pod number two gets deleted, either on purpose because you want to upgrade it or accidentally, the relationship with its volume stays intact. The stateful set controller will just recreate that missing pod, and that pod will automatically be associated with the same persistent volume. That's a very nice way to treat stateful workloads, because whenever you need to, for example, upgrade the version of that database or change its configuration, you basically roll out the pods one by one, and those pods stay bound to the same volume and the same data. So there's no data loss associated with that operation, which makes it very nice.

If we get into more details — Kubernetes is all about declarative resources, right? — let's take a look at this specification. We've got our stateful set with three replicas. It has a name, my-es-cluster; the image here is Elasticsearch, but it could be anything. And I specify a volume claim template so that each of the pods of the stateful set gets a persistent volume of, here, 100 gigabytes. The stateful set controller reads that resource, and it's going to create, for each replica, the pod itself, of course, along with an associated persistent volume claim. So we have the pod here at the top and the persistent volume claim here at the bottom. And the claim itself is not the volume yet; it just expresses the desire for the pod to acquire a storage unit, which is going to be the persistent volume. At this point we have a pod, and we have a claim.
The pod is basically waiting for that claim to be bound to a real volume. And you can see that there's a very nice naming convention in all that system. The stateful set here that is called my-es-cluster determines the name of the pods, so the first pod is going to be my-es-cluster-0. The volume claim reuses that pod name, my-es-cluster-0, and prepends to it the name of the volume claim template that we declared in the stateful set, which is also used in the pod definition itself. So there's some sort of implicit relationship between all those names.

Now we have this claim saying: I want 100 gigabytes of storage. And what usually happens — we'll see it's not always the case — if you, for example, create a stateful set or a persistent volume claim on a cloud provider, is that a provisioner, an external component, a small controller, is going to provision a volume that matches the claim. So there's a 100-gigabyte volume expressed in that claim, and something external, a provisioner, is going to create that volume for you — for example, a persistent disk in GCP. And it's going to be the same size as the claim you requested.

So now we have a persistent volume on one hand, a persistent volume claim on the other hand, and the volume was created especially for that claim. The next step is for another controller in Kubernetes, the PV controller, to bind the volume and the claim together. Whenever there's a claim, that controller will look for any available volume that can match it. In our example here, the volume was created just to match the claim, so it's very easy for that controller to bind the two together. There's a relationship between those two that we can inspect in the YAML: here we see that in the volume we have a reference to a particular claim.
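As a sketch of what that two-way relationship looks like in the resources themselves — the volume name and the GCE disk here are made up for illustration — the bound volume carries a claimRef, and the claim carries a volumeName:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-example
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadWriteOnce"]
  gcePersistentDisk:
    pdName: example-disk          # hypothetical GCP persistent disk
    fsType: ext4
  claimRef:                       # filled in when the volume is bound
    namespace: default
    name: data-my-es-cluster-0
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-my-es-cluster-0
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi
  volumeName: pv-example          # filled in once the claim is bound
```

Each side points at the other, which is exactly the binding the controller establishes.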
So that's the claim that was bound to that volume — it has the same name here as our claim on the left. And we also see that the claim itself was updated with the name of the volume to which it is bound.

I usually like to separate two categories of volumes. The first is local volumes, which is really what all this presentation is about, and the second is what I like to call network attached volumes. And usually, if you try out persistent volumes or stateful sets with those volumes on a cloud provider, you'll get the default type of volume, which is pretty often a network attached volume. There are a lot of different implementations of those network attached volumes, and that's what you get by default, depending on your Kubernetes provider.

There's always a trade-off in the choice between network attached volumes and local volumes, and it takes a lot of research to make sure you understand that trade-off. But basically it all comes down to the performance you need from that volume. If you use an NVMe SSD locally attached to the VM, you're going to get much better performance than if you mount a file system over the network, although the performance of those network attached volumes is getting better and better. It's also possible that the price is very different: you may get a cheaper volume by using the local hard disk directly rather than paying for the units of storage you use over the network. But again, this depends on your provider. Another big difference is that those local volumes are usually not there by default when you deploy Kubernetes. You probably have to provision them yourself, or install a provisioner, to get access to those local volumes, and that's what we're going to see next. But really the main difference between those two is that a local volume is really bound to a particular host.
Since it directly matches a physical disk that is plugged into that virtual machine, that volume can only exist on that particular host. And it's a big difference because, on the other hand, when you use a network attached volume, you could simply delete your pod, your workload, and that pod could be recreated anywhere with the same volume, because the same volume can be attached over the network again. That makes things much simpler for operations, whereas local volumes need to be operated with this constraint in mind: they are bound to a particular host, and this host cannot change for a given volume.

So how does this work in practice? It's just using the affinity mechanism of Kubernetes, the same one you would use, for example, to stick a particular pod to a particular subset of nodes in your Kubernetes cluster. Here, for example, in this local persistent volume, we see that there's a local path defined — that's really a directory on the file system where this volume is mounted — and an affinity. This affinity setting (I'm using a GKE cluster here) specifies that this volume can only exist on the host with that particular name. So whenever the scheduler schedules a new pod, and that pod is associated with a persistent volume claim, which itself is bound to a persistent volume, then the pod itself can only be scheduled on the same host as the persistent volume.

So how can you make use of those volumes yourself? You have to provision them, because they are likely not just there by default. I think there are, in general, three different ways to provision those volumes. The first one is what I call manual provisioning, where you basically create that persistent volume resource yourself: write the YAML file, create the resource for that volume.
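A manually created local volume with the node affinity just described could look like this sketch. The path, storage class name, and node name are assumptions; the shape of the local source and required node affinity is the standard one for local persistent volumes:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-node-a
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd0         # directory on the host backing this volume
  nodeAffinity:                   # pins the volume to exactly one host
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - gke-cluster-default-pool-node-a   # hypothetical node name
```

Any pod bound to this volume through a claim can only ever be scheduled on that one node.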
Then there's what I call static provisioning, where you run a program or an agent on each host that will automatically discover the disks that are available on a given host and create the corresponding volumes automatically. So for example, if you have a virtual machine with three different hard disks attached, you could run this tool to automatically create three different volumes of the same size as the disks. That's static provisioning. And in both these cases, manual provisioning and static provisioning, you usually create the volumes of the size you want in advance, and then the pods and stateful sets you create can make use of those available volumes.

Whereas the third category, dynamic provisioning, relies on a controller — you run an additional controller or operator in your Kubernetes cluster — that is responsible for automatically creating the persistent volumes on demand, depending on the claims that were created. So if you work with dynamic provisioning, you have no volumes at first, but as soon as you create a workload that claims, for example, 100 gigs of storage, then this dynamic provisioner is going to create a volume of 100 gigs of storage. So really, three different ways, going from the simplest on the left to the most complex on the right.

So let's take a look at manual provisioning and how you can create a persistent volume yourself. Basically, it's just a YAML file: you create this persistent volume resource, give it a name, specify its capacity, and indicate the path in the file system that you want to use. You need to be careful here that the capacity you advertise in the spec technically doesn't have to match the real capacity you have on the file system. You could very well declare a volume of 100 gigs of storage here, while behind the scenes that file system is actually one terabyte of storage.
There's nothing that is going to block you from doing that, and nothing is going to prevent you from, for example, using 500 gigabytes instead of the advertised 100 gigabytes of storage, because the file system behind the scenes is much larger than that. Nothing is going to check that size. Be careful. And of course you want to define the node affinity to make sure that particular volume belongs to that particular host. And basically that's it: just apply that YAML file to create the local volume. This volume can then be bound to a claim as soon as you create a claim that matches it.

As you see, that's a lot of manual work, which you can of course script to create all these volumes in advance. But what's maybe smarter to do, if you know that you want to run a particular stateful workload on your Kubernetes cluster, is to automatically create the volumes that correspond to the disks you have on each virtual machine. And there's a very useful tool maintained by the Kubernetes storage special interest group, called the local persistent volume static provisioner. It's essentially a daemon set that runs on your Kubernetes cluster and inspects, on each host, the partitions and disks that are available on the file system of the host. For example, if that tool sees that there are five disks it could use, it's going to automatically create five different persistent volumes matching those disks. So you basically just install the tool, and after a few seconds you have all the persistent volumes available, matching the disks you have. So it's very useful if you sized your machines and their disks according to the workload you expect to have in the future.

And finally, maybe the most advanced way to provision local volumes is to use dynamic provisioning. You have a lot of options there; I'm just going to mention two of them that I find interesting.
The first one is the ZFS CSI driver from OpenEBS, which allows you to provision volumes backed by a ZFS file system that is created on demand, with the exact size that is required in the claim. And the second one is called TopoLVM; it's actually very similar in nature, except that instead of provisioning volumes based on ZFS, it provisions volumes based on LVM and then formats the volume as you desire. So both allow you to do kind of the same thing, right? And really, with dynamic provisioning, it's all about creating the right volume of the right size automatically, whenever a user requests a volume of a specific size. So it's much more dynamic, and you can do more advanced things: for example, on a given host you could build a RAID array out of 10 disks, and then, out of that array, maybe 10 smaller volumes of different sizes are going to be created.

All right, so now that you know a bit more about what persistent volumes are, especially local persistent volumes, and how you can provision them, let's look at how we deal with operations. And actually, you'll understand why local persistent volumes can be much more difficult to operate than network attached volumes. This part of the presentation gets more complicated, so I included animated GIFs to make it more entertaining.

So we're going to look here at a list of cases and see how the system behaves with local persistent volumes. Let's take the host failure case. We have the same example here, with a stateful set of three replicas. The host that is holding pod number two — so really the third replica, but with ordinal number two — suddenly dies: it's unavailable, disconnected from the fleet entirely. Maybe the virtual machine was completely recycled and we cannot make use of it anymore. The pod becomes pending; it cannot be started on that host again.
Let's assume the host is completely unrecoverable, and the data itself is also completely unrecoverable: there's no way you can get it back and bind it to a new pod. Usually, when you use stateful sets for distributed workloads, for a distributed database, there's some kind of replication system where the different members of that distributed workload replicate their data among themselves. So chances are, if you configured this properly, that all the data that used to be in that volume is also replicated somewhere else, on another volume — we can imagine we have several chunks of data there that are replicated to at least one other member of the system. And that's fine; it's very useful. The applications you run this way are probably designed so that losing a single member of the stateful workload doesn't cause a total loss of availability, or data loss, of your storage.

So when that happens, you'd like the system to recover automatically, right? That's what Kubernetes is supposed to do: say you run a deployment, if one pod is killed, Kubernetes is going to recreate another one automatically. But with stateful sets, and especially with local volumes, it's more complicated. You'd really want that pod to be recreated automatically on that node D here — and that's not happening. The reason is that the pod is bound to a volume that is bound to a host, and even though that host is now dead, nothing is going to change that relationship. So the pod is going to stay pending.

You could say, well, that's easy: let's just delete that pod and Kubernetes will recreate it. But it doesn't work, because the pod is going to be recreated bound to the same claim, which itself is bound to the same volume, and that volume is bound to a particular host. So in the end the pod is going to be scheduled on that particular host, which doesn't exist anymore. It doesn't work.
What you really have to do here is first delete the claim, and then delete the pod. Now there's no pod anymore and no claim anymore, so the stateful set controller is going to recreate a new pod, but it's also going to recreate a new claim, and that claim is going to be bound to a different, new volume. So that's really your way to start fresh, with a new pod and a new volume, and then you can rely on your application to replicate and recover the data the way it's supposed to.

Be careful here: there's a small race condition when you delete the PVC and the pod in older versions of Kubernetes. What could happen is that you delete the PVC, then you delete the pod, but that pod gets recreated before the PVC deletion is complete. So the pod will still be waiting and pending, because it's bound to that PVC that you deleted afterwards. You may actually need to delete the pod twice to make it work again. But that's fixed in recent versions of Kubernetes, so just upgrade.

So you see here that Kubernetes is not really optimizing for us the way to recover this missing member of our stateful workload. Even though the primitives are there, some manual action is still required: acknowledge the loss, delete the persistent volume claim, delete the pod, and let them be recreated normally. And of course this can be automated, because maybe you don't want to be woken up in the middle of the night just because one host failed out of your 15-node cluster; you'd rather have that pod recreated somewhere else and recover its data normally. So what you could do here is write a small script or a small controller that you deploy in Kubernetes, and that controller would basically just watch the Kubernetes node resources.
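The core decision of such a controller could be sketched like this — pure logic only, with hypothetical names; a real controller would list nodes and claims through a Kubernetes client, and then delete the PVC first and the pod second, as described above:

```python
from dataclasses import dataclass


@dataclass
class ClaimInfo:
    """A persistent volume claim and the node its local volume is pinned to."""
    name: str
    node: str


def claims_to_release(claims: list[ClaimInfo], live_nodes: set[str]) -> list[str]:
    """Return the names of claims whose node no longer exists in the API.

    Only claims pinned to nodes that are gone from the cluster entirely
    are safe to delete: a node that is merely unhealthy may come back
    with the volume and its data intact.
    """
    return [c.name for c in claims if c.node not in live_nodes]
```

Keying the decision on node *existence* rather than node *health* is what distinguishes the unrecoverable case from a transient reboot.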
And whenever there's a change on any of the nodes — like an indication that one of the nodes is suddenly unavailable — you list the existing persistent volume claims and check whether one of them is on one of those failed, dead nodes. And if that's the case, then you just remove that claim and remove the pod, and both are going to be recreated somewhere else automatically. This can be made fully automatic.

You should note that it's important here to distinguish between a host that is completely dead and unrecoverable — there's no chance ever that the same pod is going to run with the same data on that host; it's over, done, not possible — in which case you really want to recreate the pod somewhere else, versus a case where there's a chance you can recover the data and recreate the pod on the same host. Say, for example, someone just restarted the virtual machine, and after a few seconds we know that the virtual machine is going to come back; then the pod is going to be started again with the same volume, which is still there. So you really need to differentiate those two cases. And I think one useful way to do that is to consider that if the node is completely out of the fleet, you might as well set up a process to remove the node resource from Kubernetes. Then that small automation just has to check whether the node still exists in the API, and if it doesn't, you know you can safely remove the PVC and the pod. However, if you just check, for example, the health of that node, chances are it's going to come back later. So you need to be really careful about distinguishing those two cases.

Let's take another example where we don't have a sudden failure of one host, but rather we know we want to take one particular host out of the fleet. A good use case for that is Kubernetes version upgrades. What people tend to do, because it's simple, is to treat all the Kubernetes nodes as immutable.
So whenever you want to upgrade the Kubernetes version, rather than upgrading the kubelet in place on the same machine, you'd rather spin up a new virtual machine and decommission the old one. What people usually do is spin up that new VM, then drain the node they don't want anymore. Drain really translates to marking that node as unschedulable, so no new pods can be scheduled on it; it also translates to deleting all the pods that are living on that node, so that Kubernetes can reschedule those pods elsewhere.

A nice property of Kubernetes is that this drain concept is combined with the pod disruption budget. The pod disruption budget is another Kubernetes resource that you create in advance, where you specify how many pods of the same workload you allow to be taken down at the same time. For example, if we have an Elasticsearch cluster with three pods, it can be very useful to create a pod disruption budget where you specify that you only allow one pod to be taken out at a time. That way you're sure that the drain command is not going to delete two pods of the same Elasticsearch cluster at once.

So once that's done, once you have no more pods running on that host, you have, again, as we've seen before, to delete both the PVC and the pod. The pod was already deleted, but you also need to delete the PVC to make sure that this pod can be recreated somewhere else with a fresh new volume. And then, relying on the replication achieved by the application itself, you recover your workload without losing availability and without losing data. So that's what we have here, with this red box representing a node that is being drained; before that, we replaced that node with a new node using the new Kubernetes version.

Now, unfortunately, things are not that easy. First, as we've seen, someone or something needs to delete the PVCs, because that's not done automatically, right?
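The pod disruption budget mentioned above could look like this sketch for our three-pod Elasticsearch example (names and labels match the earlier hypothetical stateful set):

```yaml
apiVersion: policy/v1            # policy/v1beta1 before Kubernetes 1.21
kind: PodDisruptionBudget
metadata:
  name: my-es-cluster-pdb
spec:
  maxUnavailable: 1              # drain may evict at most one pod at a time
  selector:
    matchLabels:
      app: my-es-cluster
```

With this in place, a drain that would take a second pod of the cluster down is blocked until the first evicted pod is healthy again.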
And usually, if you enable the automatic version upgrades of cloud providers, they're going to run the drain mechanism automatically, without requiring you to do anything yourself. But that mechanism is definitely not going to delete the persistent volume claims, so this is something you have to plan for. The other important limitation of those automated version upgrades is that it's very common for the pod disruption budget to only be respected for a limited amount of time — something like 60 minutes in GKE, for example. You may think 60 minutes is enough for most workloads, but say you have an Elasticsearch cluster where each node is holding maybe five or ten terabytes of data: it's possible that 60 minutes is not enough to migrate all that data over the network to the other pods, right? And because the pod disruption budget is no longer respected after those 60 minutes, the upgrade system moves on to the next pod. This is where you can lose availability of the data, or even worse, lose the data entirely, because the application may not have enough time to recover from the deleted pod — to replicate the data from the other members to the new member of the system. That's very dangerous.

So what you should rather do instead — the safest option — would be to manually create a new node pool with the new Kubernetes version and make sure that you migrate your workload yourself: change the affinity rules for the new pods to live on that new node pool. And finally, once there are no pods left on the old node pool, delete the old node pool. That's much more manual and custom. Really, when you use local volumes with a lot of data, chances are these automated Kubernetes cluster upgrades from cloud providers won't be good for you.
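Steering the new pods to the new node pool could be a node affinity rule in the stateful set's pod template along these lines — the label key and pool name are GKE-flavored assumptions:

```yaml
# In the stateful set's pod template spec:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cloud.google.com/gke-nodepool   # hypothetical pool label
          operator: In
          values: ["pool-v1-22"]               # the new node pool
```

As you then delete PVCs and pods one by one on the old pool, each replacement pod can only land on the new pool, at whatever pace the data migration allows.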
The problem with this mechanism is that it really relies on deleting one pod at a time, letting that pod be recreated fresh with a new volume somewhere else, and relying on the application itself to replicate the data there. This is some sort of planned maintenance, right? You know you're going to remove that pod, and it's a bit sad to rely on the data failover and recovery mechanism for something you know is going to happen. And even worse: what if, at the same time as this planned maintenance, you get another, unplanned disruption — another host dies, for example? Suddenly you have two pods taken down at once, and this is where you could get potential availability loss or data loss, which could be dangerous.

And actually, we'd be in a better place if, instead of deleting the pod and recreating it somewhere else afterwards, we could first create an additional replacement pod, migrate the data there, and then remove the old pod. But unfortunately that's very hard to do with stateful sets, because stateful sets have this concept of an ordinal for each pod. Say we have pod zero and pod one. If you want to get rid of pod zero, what you would like to do is create pod number two, migrate the data there, and then delete pod number zero — and that's it, we're left with pods number one and two. In practice, that's not possible; stateful sets don't work like that. They work the opposite way: with the assumption that we delete the pod and then recreate it, rather than what is closer to the deployment resource, where a rolling upgrade creates the new pods first and then progressively deletes the old ones. I haven't found a good solution to that problem with stateful sets yet. Some people manage pods directly; some people just can't do that. So another thing you could do is increase the number of replicas.
So here we suddenly have three replicas. Then we remove pod zero, but it's going to be recreated. So if we want to get back to two replicas, we have to scale down, which deletes the last pod again. We end up with some sort of double data migration: we first migrate the data to that new pod, and then, once the old pod is gone and recreated, we migrate back. Double data migration — that's not very useful. There are no good solutions to that. There's an interesting project from the OpenKruise team, with CRDs called CloneSet and Advanced StatefulSet, that gives more features to the stateful set to do things like that, but there's nothing built into stateful sets themselves.

Another thing I wanted to warn you about is what I call concurrent stateful set upgrade and concurrent pod scheduling at the same time. Say we have a stateful set with two pods, and you make an upgrade to the spec of that stateful set. When that happens, pods are deleted, then recreated with the new spec, one by one, relying on the same volumes to preserve the data. So first a pod is deleted, then it's recreated on the same host, because there's still this constraint that it needs to be bound to the same volume. But what if, in the middle of these two operations, which are not atomic, another pod gets scheduled into that empty slot? That can really happen: say this host has 16 gigs of RAM, and a new pod with two gigs of RAM gets created in the meantime. When the stateful set controller tries to recreate pod number one with its updated spec, that pod cannot be scheduled on the same host, because something else took the empty space. And then you're in a really bad situation: you just wanted to upgrade one stateful set, and one of its pods stays pending. So there are several ways you can deal with that.
The first is to give higher priority to the pods that use local volumes; you can use the priorityClassName setting for that. You can also work with taints and tolerations to isolate workloads with local volumes on a particular subset of nodes, so you know other pods are not going to be scheduled there. And finally, what I think is probably the simplest solution to this whole problem is to plan for your workload in advance. Either you know a node is going to be completely dedicated to a particular pod with a single local volume, or that host can hold several pods with local volumes, but then you use a fixed ratio between RAM and storage. That way you know that, for example, if all the storage is occupied, then no new pod can be scheduled, because the RAM is sort of reserved for that storage: you cannot schedule a pod with 10 gigs of RAM while another one is being recreated, because the storage that goes along with those 10 gigs is not available. That's a way to avoid the problem altogether.

One other important thing is the awareness of storage capacity in Kubernetes. When you use dynamic provisioning, it's possible that a pod gets scheduled onto a node that doesn't have the necessary storage space available. I think this has been fixed starting with Kubernetes 1.21. Before that version, one way to deal with it, again, is to have this fixed ratio between RAM and storage, so you know that if the pod can be scheduled based on its RAM request, the storage is also available.

And finally, another concern is that there's no enforcement that you get a volume of the exact size you requested. Kubernetes is going to do its best trying to match the claim size with a volume size, but it can happen that, for example, you request a one-gigabyte claim and the only available volume is one terabyte — and Kubernetes is going to do the match here.
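Coming back to the fixed RAM-to-storage ratio idea, here's a small illustration with made-up numbers (the ratio and sizes are assumptions, not recommendations). Because every pod's RAM request implies a proportional storage share, checking RAM and checking storage become the same check:

```python
# Hypothetical ratio: every 1 GB of RAM requested reserves 10 GB of local storage.
STORAGE_GB_PER_RAM_GB = 10.0


def can_schedule(free_ram_gb: float, free_storage_gb: float,
                 pod_ram_gb: float) -> bool:
    """Decide whether a pod fits on a node under a fixed RAM:storage ratio.

    The pod's RAM request implies a storage share. If all pods respect the
    ratio, the two conditions below always agree, so RAM acts as a proxy
    for storage and an interloper pod can never steal a recreated pod's slot.
    """
    implied_storage_gb = pod_ram_gb * STORAGE_GB_PER_RAM_GB
    return free_ram_gb >= pod_ram_gb and free_storage_gb >= implied_storage_gb
```

For example, a 2-gig pod fits only when 20 gigs of storage are also free; if a deleted stateful pod's volume still occupies that storage, the node correctly refuses the newcomer.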
And a related problem is that the scheduler ignores the proximity of volume sizes across the different nodes. So the scheduler could pick a node that is not the best pick, because there are other nodes with volumes of a closer size available. And again, this is going to be fixed in newer versions of Kubernetes, starting as alpha in 1.21.

All right, I hope that with this presentation you got a bit more intimate knowledge of how local volumes work, and especially of the things you need to pay attention to if you decide to use them in production. Most likely, those involve a bit more work and a bit more thinking than what you'd have to do if you were using network attached volumes. Thank you for attending.