All right, welcome everyone. Hope you're having a wonderful KubeCon so far. It's the last day, so thanks for dropping by before you head home. There are going to be two of us today: I'm Deep from Apple, my co-speaker is Feng from Databricks, and we'll be talking about: what if I don't want my persistent storage to be yet another bind mount?

Here's a quick introduction to our talk. First, we'll describe how the current lifecycle and flow of events works for pods that mount persistent storage. We'll go over some of the key assumptions in the kubelet, as well as in container runtimes, around how volume mounts are set up for pods that mount persistent storage. Next, we'll go over the typical bind mounts that take place today as part of surfacing persistent storage to pods. After that, we'll look at some alternatives to the standard flow for mounting persistent storage into pods, mainly from the perspective of runtimes other than runC, such as microVMs, and specifically Kata. We'll also go over some of the challenges that arise when some of the base assumptions in the kubelet and the container runtime are no longer true. Finally, we'll cover the enhancements we have implemented in CSI plugins to surface this alternate flow, which we call the direct assignment model. We initially prototyped it by adding a fair amount of awareness of Kubernetes and the Kata runtime within the CSI plugin, and we'll go over a future direction where we're exploring a KEP to make this support much more generic, so that CSI plugins do not need to be aware that they're operating within a Kubernetes framework, or that the underlying runtime is actually Kata. And finally, we'll conclude with some takeaways and how you can contribute to this overall effort.

So let's start all the way at the beginning and look at how a pod comes to life from the perspective of the kubelet. Typically, a pod is submitted either via kubectl or by a workload controller such as a StatefulSet. Once the pod gets submitted, it gets created in the API server. The kube-scheduler picks up the pod and selects the most suitable node in the cluster for it. Once the node name is set by the scheduler, the kubelet on that designated node picks up the pod and starts processing it to bring it to life.

One of the first things the kubelet processes is the volumes section of the pod. There are two main categories of volumes that pods can mount in Kubernetes. The first kind are inline volumes, which mainly present ephemeral data to the pod. These include things like ConfigMaps and Secrets, which are typically powered by code that lives within the kubelet itself because they're so common and pervasive. However, ephemeral volumes may also be backed by a specialized CSI plugin, for instance when you have a very specific secret store, in which case the kubelet works with that backend through the CSI plugin. The more common category, and the focus of our talk, are volumes backed by PersistentVolumes, which are bound to objects called PersistentVolumeClaims; the claims are what's actually specified in the pod spec.
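To make that concrete, here's a minimal sketch of a pod mounting persistent storage through a claim; all names here are hypothetical:

```yaml
# The pod references the claim, never the PersistentVolume directly.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc                # hypothetical claim name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: nginx                # any image; just for illustration
    volumeMounts:
    - name: data
      mountPath: /var/lib/data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-pvc       # binds the pod to the claim above
```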
Now, PersistentVolumes can be backed by a variety of different storage backends. It could be, say, EBS in the AWS cloud, GCE Persistent Disk in a GCP environment, or your own SAN in an on-prem setup. It's impossible for the kubelet to know how to discover all these various storage backends and how to mount volumes from them. For that, it depends on plugins associated with these backends, shipped by individual vendors and installed on specific nodes; these are called CSI node plugins. By utilizing a CSI node plugin, the kubelet is able to discover a piece of storage that's attached to a node and then instruct the node plugin to prepare the volume so that it can be mounted at the appropriate location, from which point onwards the kubelet can start processing it.

Once the file system on the volume has been mounted, the kubelet executes a few very important actions. The first involves looking at the fsGroup specification in the pod and applying it based on what is called the fsGroupChangePolicy. What is the fsGroup? It's essentially a supplemental group ID that a pod may specify so that different pods running with different user IDs can share the data on the same persistent volume. In older versions of Kubernetes, the kubelet would blindly mount a volume, look at the fsGroup, and do a recursive traversal through the entire file system to apply the group ownership. More recent versions of Kubernetes added a new fsGroupChangePolicy called OnRootMismatch that makes this process much more optimized: in this mode, the kubelet just looks at the top-level directory, and if it's already owned by the supplemental group ID specified as the fsGroup, it skips the entire file system traversal.

Next, containers within a pod may wish to mount not the whole volume but only part of it. These are defined as subPaths, and the kubelet needs to make sure that the subPath is safe and secure, and that a malicious pod is not trying to escape the pod sandbox by specifying relative paths with "../" segments. There have been vulnerabilities associated with this, so it's quite important that the kubelet probes the subPath, ensures it's safe, and then locks it in place so that it cannot change between evaluation time and when the containers are actually brought up.

And finally, if SELinux is enabled on the node in enforcing mode, the kubelet needs to detect that and make sure that, if an SELinux label has been explicitly specified, the label gets passed down to the container runtime and applied to the entire file system tree, so the files and directories carry the right SELinux labels and the containers in the pod can read them. If no SELinux label is specified, the kubelet still instructs the container runtime to go ahead with SELinux labeling, using a label dynamically allocated at runtime by the container runtime.
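All three of those post-mount knobs show up directly in the pod spec. A minimal sketch, with hypothetical names and values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-data
spec:
  securityContext:
    fsGroup: 2000                       # supplemental group applied to the volume
    fsGroupChangePolicy: OnRootMismatch # skip the recursive chown if the root already matches
    seLinuxOptions:                     # explicit SELinux label, passed down to the runtime
      level: "s0:c123,c456"
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: data
      mountPath: /var/lib/app
      subPath: app-a                    # mount only this subdirectory of the volume
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-pvc
```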
So once the file system mounts are ready, the kubelet moves on to the next phase, which is invoking a CRI implementation, typically containerd or CRI-O, which in turn invokes a lower-level container runtime such as runC or Kata to create the pod sandbox environment (just a set of namespaces in the case of runC, or a microVM in the case of Kata), pull down the container images, and create the individual containers specified within the pod. As part of that first step, the CRI implementation also typically interacts with the CNI plugin to set up networking for the pod sandbox.

Now, once the pod and the containers are up, the job around file system operations is not quite done. Throughout the pod's lifetime, there are other volume management operations that need to be executed by the kubelet. Specifically, there are Prometheus metrics around file system stats for the mounted volumes, and the kubelet needs to query the file system stats for all the volumes through the corresponding CSI node plugin to report them. Also, the administrator may wish to expand a persistent volume, in which case the kubelet again needs to work with the appropriate CSI node plugin to ensure the mounted file system is expanded and the right backing size is reflected on the file system mount.

So here's a complete picture of how all of this fits together. To recap: at the very left, we have the kubelet talking to a CSI node plugin to first discover the backing storage and then get it mounted with a specific file system onto what's called the global mount path; this typically happens as part of the CSI NodeStageVolume call. Next, this global mount path gets bind-mounted to a pod-specific mount path as part of the NodePublishVolume call. And finally, the kubelet works with the CRI container runtime and a lower-level runtime handler like runC to prepare the pod sandbox environment and bind-mount the pod path into the mount namespace of the pod sandbox.

The key thing to notice here is that a couple of bind mounts took place as part of the entire setup: first from the global path to the pod-specific path, and then from the pod-specific path to a path inside the container. And again, to recap, the kubelet assumed that the file systems for all the volumes were fully mounted before it went on to the container bring-up stage. This is important because the kubelet as well as the container runtime make these assumptions in order to execute three actions on the file system mount: applying the fsGroup settings, making sure the subPaths are sanitized and probed, and finally applying the appropriate SELinux labels.

Now, while this sequence is very standardized and works across the board, can we have a different sequence? Specifically, how about a model where we decide to mount the volumes after the containers are brought up? And we want to do it without using the raw block mode that's available in Kubernetes today; we want to continue to use persistent volumes that specify a file system mode. A specific use case for this is microVM environments, such as Kata.
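For context before we switch gears: a pod opts into a runtime like Kata through a RuntimeClass. A minimal sketch; the handler name depends on how containerd or CRI-O is configured on the node:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata              # must match the runtime handler configured in containerd/CRI-O
---
apiVersion: v1
kind: Pod
metadata:
  name: kata-pod
spec:
  runtimeClassName: kata   # this pod's sandbox becomes a microVM instead of namespaces
  containers:
  - name: app
    image: nginx
```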
To describe this in a lot more detail, I'll switch over to my co-speaker Feng.

Thank you, Deep. Next, let's talk about how we mount a PV in a microVM-based runtime. A microVM-based container runtime is a category of runtime that runs the container workload inside a virtual machine. Typical examples include Kata Containers and Firecracker. Compared to traditional container runtimes based on Linux namespaces and cgroups, such as runC, Kata Containers provides better workload isolation and much better security, which is really important for use cases such as multi-tenant workloads in one cluster. It's also more lightweight compared to a full-blown virtual machine; for example, Kata Containers has made optimizations to improve startup time and reduce memory footprint. Kata Containers is also fully compliant with the OCI container format and the containerd/CRI-O interfaces, so users don't need to change their applications to migrate from a runC container to a Kata container.

A quick recap of the architecture of Kata Containers. Kata integrates with CRI container runtimes such as containerd and CRI-O through the shim interface. Under the hood, the Kata shim translates the OCI spec into a VM spec and launches a virtual machine through a hypervisor, such as QEMU or Cloud Hypervisor. Inside the guest VM there's also an agent, called the Kata agent. The Kata shim interacts with the Kata agent over a vsock using the ttRPC protocol. The agent is responsible for managing the container lifecycle inside the guest VM. The Kata shim also connects the I/O streams between the CRI runtime, such as containerd, and the containers inside the guest, again over ttRPC.

So how do we mount a volume in a Kata container? The traditional way is very similar to what Deep just described. The CSI plugin prepares the volume and bind-mounts it to the per-pod mount path. Then the Kata shim does its magic: it bind-mounts the per-pod mount path again to a shared location, and that shared directory is exposed to the guest through a shared file system called virtio-fs. This approach is very straightforward and works out of the box, but it does come with a few trade-offs. First is performance: virtio-fs has much worse performance than a paravirtualized block device such as virtio-blk or virtio-scsi. It also sacrifices isolation, because you now need to mount the file system on the host. There are also other gaps in virtio-fs: it doesn't implement all the features that exist in a native file system, such as open_by_handle_at. We actually ran some micro-benchmarks using fio. Compared to virtio-blk with SPDK as the data plane, virtio-fs has much worse performance on random writes and random reads; sequential write is similar, and sequential read is actually faster, which I think is largely due to caching and prefetching.

Because of these caveats, we explored an alternative approach, which is to delegate the PV mount and preparation to the container runtime inside the guest. We call it direct assigned storage. The main difference here is that instead of relying on virtio-fs, the Kata shim attaches the block device to the guest through virtio-blk, and then inside the guest, the agent handles the mount and the file system preparation.

Here is a much more detailed view of the sequence. When publishing the volume, the kubelet calls NodePublishVolume, and the CSI node plugin, instead of bind-mounting, invokes a new command line we implemented in the Kata runtime, called direct-volume add, and passes the mount information to the Kata runtime.
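The mount information itself is a small JSON document. Roughly what the CSI plugin hands over might look like the following; the exact field names come from Kata's direct-volume interface, so treat this as an illustrative sketch rather than the authoritative schema:

```json
{
  "volume-type": "block",
  "device": "/dev/sdf",
  "fstype": "ext4",
  "options": []
}
```

Here "volume-type" and "device" tell the shim what to hot-plug via virtio-blk, and "fstype" tells the guest agent which file system to mount; the device path is a hypothetical example.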
The Kata runtime then persists the mount information on disk. When the Kata shim is ready to start the pod sandbox, it reads this mount information from the file and passes it to the Kata agent; before that, it first attaches the virtio-blk device to the guest through the hypervisor. The Kata agent then mounts the file system and binds the volume into the container, finishing the whole process. The unmount is actually simpler: when the pod is gone, all the mounts are destroyed inside the guest, so the only thing left on the host is the mount info file. When the kubelet calls NodeUnpublishVolume, the CSI node plugin just invokes the direct-volume remove CLI, and the Kata runtime removes the mount info file.

With this approach, we also implemented volume stats. The CSI plugin can invoke direct-volume stats; the Kata runtime then sends a ttRPC request to the Kata agent, and the agent collects and returns the volume stats back to the runtime and on to the CSI plugin. Similarly, to resize the volume, the CSI plugin can invoke direct-volume resize; the Kata runtime first resizes the block device through the hypervisor, assuming the hypervisor supports this feature, and then sends a ttRPC request to the Kata agent to resize the volume, for which the agent just refreshes the file system metadata.

Next, I'll do a quick live demo. I already have a YAML file here. This YAML file has a container and two volumes: one is a CSI volume that uses direct assigned storage, and the other is a virtio-fs volume. We also have the StorageClass and the PVCs, which we can skip. Let's apply this YAML file. Okay, let's wait a couple more seconds. Okay, it's already running, so we can exec into the pod. Now we are inside the container in the guest VM, and we can run mount to show all the mounts. You can see here the virtio-blk volume, which is mounted from /dev/vdc; that's the virtio-blk device attached to the VM. And the type is ext4, the native file system we wanted. There's also the virtio-fs volume, which has type virtiofs; this one is shared from the host.

We can also log into the node to take a look at the mount info file. The mount info file is stored in a Kata Containers directory under /run/kata-containers/shared; the directory name is basically the volume path encoded in base64. Taking a look at the mount info, it's very simple: right now it has the device and the FS type, and there are a few other options we support, such as changing the owner of the file system. In this case it's actually a vhost-user block device.

So that's the demo. The current approach requires a specific CSI implementation, but there's no change to the kubelet, CRI, OCI, or CSI specs, and most of the post-mount configuration works. There are also a few drawbacks right now: for example, subPaths are not supported, and the CSI plugin needs to be customized and needs to be aware of the runtime class. So next, we'll talk about a future direction that can make this more generic.
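For reference, here's a rough sketch of a manifest along the lines of the demo above. All names are hypothetical, and it assumes a CSI driver that implements the direct-volume handoff plus a Kata RuntimeClass:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: direct-pvc                     # served by the direct-assign-aware CSI driver
spec:
  storageClassName: direct-assigned    # hypothetical StorageClass
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  runtimeClassName: kata
  containers:
  - name: app
    image: ubuntu
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: direct-vol
      mountPath: /mnt/direct           # shows up in the guest as ext4 on a virtio-blk device
    - name: shared-vol
      mountPath: /mnt/shared           # shows up in the guest as virtiofs, shared from the host
  volumes:
  - name: direct-vol
    persistentVolumeClaim:
      claimName: direct-pvc
  - name: shared-vol
    emptyDir: {}                       # a non-direct volume, so it goes through the virtio-fs path
```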
Great, thanks a lot, Feng, for the wonderful demo. So let's look at what the future of this would look like. We are exploring a Kubernetes Enhancement Proposal, KEP-2857 upstream, mainly with SIG Storage and with a little bit of SIG Node involvement as well, to address some of the limitations Feng just went over.

The idea there is: can we enhance the kubelet, the RuntimeClass, and the CSI spec in a way that, first of all, allows the kubelet to pass down all the necessary parameters around post-mount configuration to the CSI plugin, so that the CSI plugin does not need to do a kube-apiserver lookup? That would solve one of the negatives. The other piece of feedback has been: since the RuntimeClass is already there to specify the microVM runtime details, can it be enhanced with fields that allow the runtime to advertise to the kubelet that it is capable of handling this file system mounting delegated from a CSI plugin? The kubelet would then be enhanced to look up and match the capabilities of the runtime with those of the CSI plugin. If all the requirements of the pod spec match the capabilities of the plugin and the runtime around mount and post-mount configuration, the kubelet would automatically ask the CSI plugin to delegate the file system mount and the post-mount operations to the runtime. And one of the final aspects of this enhancement is that we want all post-mount configurations to be supported, including things such as subPath handling.

There are a couple of gotchas, though. One is: if you have scenarios where multiple pods with affinity relationships get scheduled on the same node and try to mount the same volume, there might be a problem. The standard ReadWriteOnce access mode on PVCs may not be safe in such scenarios, especially if you're using a standard file system like XFS or ext4: there will be file system corruption, because the same file system will get mounted by multiple guests at the same time. To guard against that, what we are recommending as part of the enhancement is to use a recently introduced access mode called ReadWriteOncePod; a minimal PVC sketch follows at the end of this part. In our experience, most pods do not have affinity relationships and should just work, but we wanted to call this out. The other interesting scenario to consider is that if mounting the volume requires a secret, this approach is not recommended, because the mountInfo.json file that Feng went over contains all the necessary mount options, and we do not want a secret to be persisted on the OS disk.

We also considered some other alternatives for how we could get this going. One question that came up is: how about enhancing CRI and OCI, and all the CRI and OCI runtimes as well, to plumb this storage support all the way through? The main limitation we ran into is that storage primitives are more or less absent in the runtime interfaces today. There are no API primitives through which mounted file system stats can be queried on a pod that is already running, and it's also impossible to say "resize this volume that is mounted by this pod." However, this could be a much longer-term effort once the runtime interfaces are made a little broader and scenarios like this become part of the set of problems they are trying to address.
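As mentioned above, the safe way to express "exactly one pod may use this volume" is the newer access mode. A minimal sketch, with a hypothetical claim name:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: direct-pvc
spec:
  accessModes: ["ReadWriteOncePod"]   # at most one pod in the cluster may use this claim
  resources:
    requests:
      storage: 10Gi
```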
So, a quick recap of the takeaways. We explored an alternative to the standard mount workflow for persistent storage, mainly from a microVM perspective, specifically Kata. We went over how the delegation of mount and post-mount configuration, as well as obtaining stats and resizing of file systems, works through a container runtime handler. We mainly looked at it from the perspective of block-based PVs. And the key thing we avoided is mounting the file system on the host, along with all the associated bind mounts and file system projections through things like virtio-fs. We would love to get your involvement and feedback, so feel free to reach out through the KEP process, which is KEP-2857, and you can reach us in the SIG Storage Slack channel in the Kubernetes Slack as well as in the Kata community Slack. Thanks a lot. That was our presentation. Any questions?

Awesome, thanks for the demo. I have a question about the usage of block storage versus file storage. In the demo, I'm assuming it was using a block device, and that's how you were avoiding the kubelet's bind mounting. But in the KEP, is the proposal to allow for delegation in general? I just wanted to clarify that point. Or would users of this feature still have to use block devices?

In the KEP, there's a bit of hesitance from the SIG Storage community to open it up to network file scenarios like NFS as well as block. Their recommendation is to start off at the alpha stage with just block and then move on to shared file systems. However, with the approach Feng went over, there's nothing that prevents you from applying the same mechanism to NFS or SMB as well. Essentially, the entire mount takes place within the microVM environment, and the kubelet just skips both of the bind mounts.

Got it, okay. So does that somewhat limit the user's configuration, if they're unable to pass through some of the other semantics of a file system in their storage class? And I guess it's kind of dependent on having another way to pass information to your CSI driver?

Exactly, yeah. It's mainly file system information that goes from the CSI plugin to the container runtime handler. And the idea is that storage class parameters should already get passed down to the plugin through the kubelet via existing means.

Cool. Any other questions? Going once, twice. That's it. So before we wrap, we also want to recognize a lot of contributions to this effort from Ebo Zheng and Eric Ernst, both of whom are here. So thanks a lot, guys.