Okay. Hello, everyone. Welcome to our talk. We'll tell you the story of our journey migrating in-tree volumes to proper CSI ones. We'll start with a quick reminder about how storage works in Kubernetes, then what the CSI migration is and how we made it work for us. I'm Antoine, a senior software engineer at Datadog. I joined the company six years ago, and I've been working on our Kubernetes stack for two years now. And today I'm with... Hi, I'm Baptiste, a software engineer at Datadog, where I do everything related to storage inside Kubernetes.

First, let me give you a brief introduction to how storage works inside Kubernetes. If you want access to some kind of persistent data inside Kubernetes, the most basic way is to use the node's local disk, for instance through an emptyDir or a hostPath. But if you want access to more advanced storage, a pod can create what's called a PVC, a persistent volume claim, which is the way for the pod to ask for access to some kind of volume with a certain size, together with a storage class, which is the way for the pod to define what kind of volume it wants. For instance, here the pod can ask for a gp3 volume on AWS EBS. The cluster will try to satisfy this claim using a persistent volume, which is the way for Kubernetes to represent the backing disk used by a pod, and which can be backed by multiple kinds of volume providers like AWS EBS or GCP persistent disks. Once we have a PV and a PVC, they can be bound together, and the pod can access the underlying data.

Initially in Kubernetes, the way to get persistent volumes was through in-tree volumes. In-tree because the code responsible for creating those volumes was part of the Kubernetes source code itself. It was the historical way of getting volumes inside Kubernetes, but it had several drawbacks precisely because it lived inside the Kubernetes source code. Mainly, storage providers had to open source their driver code inside Kubernetes and follow the Kubernetes release cycle, which is cumbersome when you just want to make small changes to your driver.

How in-tree volumes worked was: you have a pod that asks for a PVC, and the controller manager, in particular the volume controller, watches for those PVCs. When it notices one, it calls the volume provider to create a disk. Once the disk is created, the volume controller creates an in-tree PV and links it to the volume provider's disk. After that, it binds the PVC created for the pod to the in-tree PV. Finally, the kubelet on the node where the pod is running mounts the volume inside the pod so that the pod can access it.

After in-tree, there was a need for a more versatile way to get persistent volumes inside Kubernetes, and so CSI was created. CSI stands for Container Storage Interface. It was introduced in Kubernetes 1.9 and has been GA since 1.13. CSI gives volume providers a common interface to create volumes inside Kubernetes, and it gets rid of the main drawback of in-tree volumes: CSI drivers are maintained out of tree from Kubernetes, so each volume provider can have their own driver and maintain it at their own pace. CSI also brings a bunch of new features to volume management in Kubernetes, mainly volume snapshotting, volume cloning, and volume resizing.
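To make the PVC and storage class pairing a bit more concrete, here is a minimal sketch of creating a claim against a CSI-backed storage class with client-go. The storage class name, namespace, object names and size are made up for illustration, and it assumes a recent k8s.io/api where the PVC resources field is VolumeResourceRequirements (older releases call it ResourceRequirements).

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (illustrative; error handling trimmed).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// "ebs-gp3" is a hypothetical storage class name backed by the EBS CSI driver.
	scName := "ebs-gp3"
	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "data-myapp-0", Namespace: "default"},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			StorageClassName: &scName,
			// Ask for a 100Gi volume; the provisioner behind the storage class
			// decides what kind of disk actually backs it (e.g. a gp3 EBS volume).
			Resources: corev1.VolumeResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("100Gi"),
				},
			},
		},
	}

	created, err := client.CoreV1().PersistentVolumeClaims("default").
		Create(context.Background(), pvc, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created PVC", created.Name)
}
```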
And CSI works in a very similar way compared to in-tree, except that you have CSI components deployed in your cluster, such as the CSI node plugin and the CSI driver, and those components take care of managing the volumes in place of the kubelet and the controller manager. So the core Kubernetes components essentially become a pass-through for the CSI components.

Now we had two kinds of volumes available inside clusters, in-tree and CSI, and there was a need to get rid of the old in-tree volumes. Here comes the CSI migration, which is activated using a feature flag on the API server and the kubelet, and which has been GA since 1.25. When you activate this migration, the in-tree volumes become managed by CSI the same way the CSI volumes are. So initially you have in-tree PVs managed by the controller manager and the kubelet, and once you activate the feature flag, you still have in-tree PVs, but they are now managed by the CSI node plugin and the CSI driver. So it looks like a great migration, right?

So far it looked easy: an official migration, just a couple of flags to switch on. But after going through the process, we looked at our clusters and we think we might have misunderstood the term "migration" here. We were still left with volumes that were still in-tree. They had an extra annotation indicating that the migration had happened, but they were still referencing the in-tree driver in their spec and still using the old storage classes.

Was it a big deal? Maybe it could have been fine, but we wanted to use the new CSI features, and while those features worked fine with the CSI volumes, with the migrated ones they were failing. We looked at GitHub issues and found that it was not a bug, it was a deliberate choice: CSI migration was never intended to support new features for in-tree volumes. This also meant that any new feature in the future wouldn't be supported by those volumes. So it was a problem for us, because we had a bunch of volumes already provisioned, and it would have left our platform in a very heterogeneous state where we would have to deal differently with legacy volumes and new ones. We could imagine recreating all the volumes by hand, but we couldn't afford that kind of disruptive operation involving the application teams, especially at our scale. So we needed a true migration that fit our scale.

I mentioned our scale. What does that look like exactly? At Datadog, we have self-managed clusters running on three different cloud providers. We run hundreds of different clusters, which represents tens of thousands of nodes, and our biggest cluster can have up to 4,000 nodes. For the stateful part, we have thousands of StatefulSets running, owned by a dozen different teams. That represents hundreds of different services and tens of thousands of persistent volumes.

So we started looking into how to make this an actual migration. Of course, it needed to be done in place, without any downtime, and if possible we didn't want the application teams to even notice something was going on. The first obvious step was to make sure we were not provisioning legacy volumes anymore, so we made the CSI storage classes the default ones. Then, after some testing, we realized it was not an issue to have mixed StatefulSets, meaning StatefulSets with both in-tree volumes and CSI ones. So we decided to replace our storage classes in place, taking the in-tree storage classes and turning them into their CSI equivalents.
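To illustrate what "still in-tree" means for the migrated volumes we just mentioned, here is a rough sketch of the two PV shapes using the core/v1 Go types. The disk names, project and PV names are made up; the annotation keys and driver names are the real ones used by the GCE PD migration.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// migratedInTreePV is what the official CSI migration leaves behind: the spec
// still references the in-tree GCE PD plugin, and only an annotation records
// that the CSI driver now handles it at runtime.
var migratedInTreePV = corev1.PersistentVolume{
	ObjectMeta: metav1.ObjectMeta{
		Name: "pv-legacy-0",
		Annotations: map[string]string{
			"pv.kubernetes.io/provisioned-by": "kubernetes.io/gce-pd",
			"pv.kubernetes.io/migrated-to":    "pd.csi.storage.gke.io",
		},
	},
	Spec: corev1.PersistentVolumeSpec{
		PersistentVolumeSource: corev1.PersistentVolumeSource{
			GCEPersistentDisk: &corev1.GCEPersistentDiskVolumeSource{
				PDName: "my-disk", // hypothetical disk name
				FSType: "ext4",
			},
		},
	},
}

// nativeCSIPV is the shape we actually wanted: a spec that references the CSI
// driver directly, so snapshots, cloning and resizing all work.
var nativeCSIPV = corev1.PersistentVolume{
	ObjectMeta: metav1.ObjectMeta{Name: "pv-csi-0"},
	Spec: corev1.PersistentVolumeSpec{
		PersistentVolumeSource: corev1.PersistentVolumeSource{
			CSI: &corev1.CSIPersistentVolumeSource{
				Driver:       "pd.csi.storage.gke.io",
				VolumeHandle: "projects/my-project/zones/us-central1-a/disks/my-disk",
				FSType:       "ext4",
			},
		},
	},
}

func main() {
	_ = migratedInTreePV
	_ = nativeCSIPV
}
```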
Then we had to take care of the volume objects themselves. We wanted to go from what you can see on the left to what's on the right. Actually, there was already code written for that: the code used by the CSI migration process, in the CSI translation lib. The problem is that you can't change a volume's spec. It's an immutable object, so the API server will reject your changes.

So we thought, how can we work around that? The first idea was to patch the objects directly in the etcd backing store. But that felt very hacky, and we were not sure about possible race conditions and the interactions with the caches of the different Kubernetes components. So we thought it was better to patch the API server itself to allow the modification at the API server level. That way the operation is just like any other update on an object, and it goes through all the necessary checks to avoid race conditions. The patch itself is very, very small: it's just commenting out seven lines of code, basically removing the immutability check on the PV spec in the API server.

Now we had to integrate this properly to run the migration safely on our fleet. The migration tool implementation is very straightforward: connect to the patched API server, loop over all the in-tree volumes we have, use the translation lib to turn them into CSI volumes, and then push the changes. How does it look? You have the API server connected to the etcd backing store. We deploy a patched API server pod, and we make sure it's not part of the regular API server pool, so that its only client is our migration tool. The migration tool loops over all the in-tree volumes, translates them into CSI ones, and pushes the changes. And now you can tear everything down and the migration is done.

Well, not so fast. We then wanted to use the new snapshot feature, but it was still failing on GCP. Why? When we took a closer look at the result of our translation, we saw a small difference in the volume handle: instead of the correct project, we had "UNSPECIFIED", basically a placeholder. That made the volume handle invalid and prevented any operation from succeeding, including snapshots. So we were kind of back to square one.

Looking at the in-tree plugin interface in the translation lib, we saw a method called RepairVolumeHandle, which is only implemented for GCE Persistent Disk. It didn't look very nice, but we had broken the volume handle, so why not repair it? So we added this step to a modified script, ran it on an experimental cluster, and then, boom, we got an avalanche of detach events while the pods were running. It was a catastrophe: the pods couldn't access their disks anymore. It turned out there was a small side effect to calling that method. If you looked at the volumes now attached to a node, we actually had duplicates: the one with the UNSPECIFIED placeholder and the one with the correct project. That had two consequences. When a pod was rescheduled, mounting the volume twice caused errors and the pod got stuck in Terminating. And when the controller manager started and tried to reconcile the state, it detached all the volumes. So the RepairVolumeHandle fix allows us to create snapshots, but it doesn't work as we expected. Let's embark on a new journey to fix the fix.
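Before that, for the curious, the idea behind the migration loop can be sketched roughly like this. This is a minimal sketch, not our actual tool: the kubeconfig path is made up, error handling and pagination are trimmed, the RepairVolumeHandle step for GCE PD we just discussed is omitted, and the exact csi-translation-lib signatures vary a bit between Kubernetes versions.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	csitrans "k8s.io/csi-translation-lib"
)

func main() {
	// Point the client at the *patched* API server, the one that accepts
	// PV spec updates ("patched-apiserver.kubeconfig" is a made-up path).
	cfg, err := clientcmd.BuildConfigFromFlags("", "patched-apiserver.kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	translator := csitrans.New()

	pvs, err := client.CoreV1().PersistentVolumes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	for i := range pvs.Items {
		pv := &pvs.Items[i]
		// Skip PVs that are not backed by a migratable in-tree plugin.
		if !translator.IsPVMigratable(pv) {
			continue
		}
		// Rewrite the in-tree source into its CSI equivalent. Depending on the
		// library version this call may also take a logger argument.
		csiPV, err := translator.TranslateInTreePVToCSI(pv)
		if err != nil {
			fmt.Println("skipping", pv.Name, ":", err)
			continue
		}
		// Push the new spec; this only succeeds against the patched API server,
		// a vanilla one rejects changes to the immutable PV spec.
		if _, err := client.CoreV1().PersistentVolumes().Update(
			context.Background(), csiPV, metav1.UpdateOptions{}); err != nil {
			fmt.Println("failed to update", pv.Name, ":", err)
		}
	}
}
```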
So first, let me get back to how CSI works. Well, I lied earlier, it's just a bit more complicated than what I explained. Let me go back to square one. We have the controller manager inside the cluster that will create a pod and a PVC in etcd. Once it's created, there is a CSI component called the CSI provisioner that watches for those PVCs. Once it notices one, it calls the CSI driver, which in turn calls the cloud provider to create a volume there. Once the volume is created, the CSI provisioner creates the CSI PV in etcd and links it to the cloud provider volume. Then the pod can get scheduled on a node by the scheduler. On the node where the pod is scheduled, the kubelet knows there is a pod that needs to run there, and since the pod needs access to a volume, a VolumeAttachment object gets created in etcd. This VolumeAttachment is just a way to ask for one volume to be attached to one particular node. There is another CSI component called the CSI attacher that watches for those VolumeAttachment objects, and once it notices one, it calls the CSI driver, which calls the cloud provider to attach the volume to the node. When the volume is there, the kubelet creates the pod and calls the CSI node component, present on every node, to mount the volume inside the pod. And there we have it, a more complete picture of how CSI works.

But what is the role of the UNSPECIFIED placeholder we saw earlier in all this? Well, when a PV gets processed by some of the components in the cluster, namely the kubelet and the controller manager, those components have a local cache where they store the state of the cluster. The key they use to store the volumes in this cache is called the unique volume name, and it is just a concatenation of the driver name and the volume handle. So for a CSI PV, we have a CSI driver and a CSI volume handle, and it gets stored in the cache. Everything is fine. But then we have an in-tree PV. An in-tree PV does not have a driver or a volume handle in its spec, so the kubelet and the controller manager call the same CSI translation lib that we used, to translate the spec in memory from in-tree to CSI. But they don't call RepairVolumeHandle like we did, so the volume is stored in the cache with the UNSPECIFIED placeholder. And when we run our own migration script and change the in-tree PV to CSI, we now have a proper CSI spec, which means another volume appears inside the cache, because the key is different. So from the point of view of the kubelet and the other components, we have two unique volumes which are actually the same PV behind the scenes. And this is our issue.

Do we have a solution for this? Well, we thought maybe we could hack something somewhere and fix it. If we take a look at all the components involved in the workings of CSI and set aside all the ones we don't really care about right now, or cannot act on, we are left with the three CSI components as well as the kubelet and the controller manager. Ideally, we would like to avoid patching the kubelet and the controller manager, because it would take a long time to roll out. Remember, we have tens of thousands of nodes and multiple versions of Kubernetes in our clusters, so we would need to backport the fix in every version. And we had already patched the API server.
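To recap why the caches diverge, here is a toy illustration of that unique volume name key. It only mimics the real construction very loosely (the kubelet builds it from the CSI plugin name, the driver and the volume handle); what matters is that the in-memory translated handle and the repaired handle differ, so the same disk ends up under two keys.

```go
package main

import "fmt"

// uniqueVolumeName mimics (very loosely) how the kubelet and the controller
// manager key volumes in their in-memory caches: driver name plus volume handle.
func uniqueVolumeName(driver, volumeHandle string) string {
	return fmt.Sprintf("kubernetes.io/csi/%s^%s", driver, volumeHandle)
}

func main() {
	driver := "pd.csi.storage.gke.io"

	// What the kubelet computes for a still-in-tree PV: it translates the spec
	// in memory but never calls RepairVolumeHandle, so the project is a placeholder.
	beforeMigration := uniqueVolumeName(driver,
		"projects/UNSPECIFIED/zones/us-central1-a/disks/my-disk")

	// What it computes after our script rewrote the PV with a repaired handle.
	afterMigration := uniqueVolumeName(driver,
		"projects/my-project/zones/us-central1-a/disks/my-disk")

	// Different keys, same disk: the caches now believe there are two volumes,
	// which is what triggered the duplicate attachments and the spurious detaches.
	fmt.Println(beforeMigration)
	fmt.Println(afterMigration)
	fmt.Println("same volume?", beforeMigration == afterMigration)
}
```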
So we were like, maybe if we can avoid patching another core Kubernetes component, and the unintended side effects that might come with it, that would be a good idea. So we set out to patch the CSI components instead, and unfortunately it didn't work. But why? Well, we took a deeper dive into how a volume is managed on a node by the kubelet when a pod needs one. I'm not going to go too much in depth here, because it would take a long time, but the important part is what happens when a pod gets terminated, and what the kubelet does to get rid of the volumes. This part is done once for every volume present in the kubelet's cache, and in our situation we have two volumes in the cache for the same disk.

On the host file system, the kubelet first tries to read the pod volume data file, vol_data.json, to get rid of the first volume. It's just some metadata about the volume. Then the CSI node plugin unmounts the volume, the volume contents are removed, some basic cleanup is done on the mounts directory, and finally the kubelet can delete vol_data.json and the volume directory, and get rid of the first volume in its cache. Nice. But then we move on to the second volume and try to read the pod volume data file, which does not exist anymore. So at this point, the kubelet goes into an infinite error loop trying to read a file that isn't there, and the pod gets stuck in Terminating.

So at this point we were kind of stuck, because we didn't want to patch the kubelet and the controller manager, or at least we really wanted to avoid it, and patching the other components doesn't work, because they don't act at the right moment in the call stack when unmounting the volume. So we thought, okay, maybe we could ask some folks at GCP, because, you know, it's a GCP-only issue; maybe they have an idea on how to fix it easily. Unfortunately, they had the same issue as us. So they were like, ah, that's annoying.

But one interesting thing that came out of the meetings with GCP is that maybe we can do something if we migrate the volume while the pod is not present on a node. Because if the pod is not on a node, the kubelet cache does not matter anymore: no node, no kubelet. And the controller manager cache is not an issue either, because our problem with the controller manager was that it was detaching volumes from nodes where pods were running; if the pod is not on a node anymore, it doesn't matter. But then, how can we make sure pods are not present on nodes? Well, it turns out this happens naturally when we roll out a pod, and it's fairly easy to catch this situation using an admission webhook.

So we came up with this solution. We have the usual setup of our cluster, with the API server and everything, and we deploy in this cluster an admission webhook along with the patched API server. The admission webhook watches for pod creation events, which happen when we start a rollout. Then we roll out a pod, using a StatefulSet or a Deployment, for instance. And it's important here to deny the recreation of the pod while its volume is still present on a node. Otherwise, you could have a race condition where the pod gets recreated, gets rescheduled on the same node, and then the kubelet races between detaching the volume of the previous pod and attaching the volume of the new pod, and it breaks.
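To make that concrete, here is a highly simplified sketch of the kind of decision such a webhook could make on pod CREATE requests. It is not our actual implementation: the attachment check is a hypothetical stand-in, and the real flow also triggers the PV rewrite through the patched API server before letting the pod through.

```go
package main

import (
	"fmt"

	admissionv1 "k8s.io/api/admission/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// volumeStillOnNode is a stand-in for the real check: it would look up the PV
// bound to each of the pod's PVCs and see whether a VolumeAttachment for that
// PV still exists, i.e. the disk has not yet been detached from its old node.
type volumeStillOnNode func(pod *corev1.Pod) bool

// reviewPodCreate decides whether a pod being recreated during a rollout is
// allowed in. If one of its volumes is still attached somewhere, we deny the
// admission; the StatefulSet or Deployment controller simply retries later,
// once the CSI components have detached the volume and the PV has been migrated.
func reviewPodCreate(pod *corev1.Pod, stillAttached volumeStillOnNode) *admissionv1.AdmissionResponse {
	if stillAttached(pod) {
		return &admissionv1.AdmissionResponse{
			Allowed: false,
			Result: &metav1.Status{
				Reason:  metav1.StatusReasonConflict,
				Message: "volume not yet detached from previous node; retry after migration",
			},
		}
	}
	return &admissionv1.AdmissionResponse{Allowed: true}
}

func main() {
	pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "myapp-0"}}
	// Pretend the old attachment is still around: the pod is denied for now.
	resp := reviewPodCreate(pod, func(*corev1.Pod) bool { return true })
	fmt.Println("allowed:", resp.Allowed, "-", resp.Result.Message)
}
```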
So then you wait for the CSI components to get rid of the volume. Once it's done, the controller manager can recreate the pod. It goes through the admission webhook, where we call our patched API server to migrate the in-tree PV to CSI in place. Then the pod gets recreated and scheduled on a node with its volume, and everything is fine. Our migration is done.

So yeah, to put it in a nutshell: we started with no CSI support at all, we deployed the various CSI drivers, and then we went through the official CSI migration, quickly realizing it was not going to cut it for us. So we evaluated different solutions and eventually came up with this migration, and along the way we had to work around a couple of issues. In the end, our custom migration process took us six months and it didn't cause any incidents. If you need to go down that road, or are just curious, we made the code available on GitHub.

So what did we learn? It's okay sometimes to take an unorthodox path, as long as you make sure you mitigate the risk as much as you can. Also make sure you discuss it with your team and everyone is on board; you don't want people to learn about your innovative approach during an outage that it caused. Also, the further down the stack you go, the more you have to deal with cloud provider specifics or implementation details; RepairVolumeHandle is a pretty good example of this. And there is a lot of value in talking to other people in the community. When we started this, we thought that maybe we were the only fools trying to do it, and by talking to folks at Google, we realized they had the same kind of issues. Presenting our problem was a good way to take a few steps back, and it led us to find a working solution. Thank you for your attention. If you have any feedback or questions, now is the time.

Thank you for the presentation. I think this is hackery at its finest. Do you think this would be easier, for example, on AWS with EBS? Because I think you can create a CSI volume from a snapshot, so from an actual snapshot in the cloud provider. Would it be an easier approach to, not even unmount, but basically create a new volume from the snapshot, which would then be a native CSI volume? Yeah, I think it's the recommended way of doing it, but at our scale it's just not possible. I mean, if you do that, you have to involve the application teams and they have to do the process, or you have to do it for them, but you're going to have to have some kind of meetings and coordination with them. And in the end, we just want to change some metadata. There is no real reason to snapshot the data just to put the same data in another object. Okay, thank you.

Thank you for the presentation. I have one question regarding the process you went through. What guarantees that you have resolved all the issues? What was your testing process? Well, honestly it was a fairly manual process for us. What we did test was migrating a few volumes in experimental clusters, checking that the features were working, and obviously that after weeks there was no issue with the volumes. That's actually how we noticed the avalanche of detach events that we had.
So we are not sure that there will never be an issue, but the migration has now been done for four months in our clusters, including our production clusters, and we did not see anything. So we are fairly confident that it's okay right now.