Hi everybody. Welcome back from lunch. Thank you for joining us for the second half of Data on Kubernetes Day. Thanks for making it this far. I hope you got some tea, coffee, whatever you need. Do some calisthenics, wake yourself up. We have Jiaxun Song, who could not make it in person with us today, so we have a recorded talk from them, and we will be starting that in just a moment. Enjoy.

Hello everyone. My name is Jiaxun Song. I'm a software engineer on the GKE stateful workload team. Our team focuses on integrating various storage solutions with Google Kubernetes Engine. Today I'm excited to present a Kubernetes object storage solution for AI/ML data portability: object storage FUSE CSI drivers. This solution is derived from a project developed by our team, and I believe it will be beneficial to the Kubernetes community.

This is today's agenda. First, I will set up the background by asking a few questions; the answers will give you a better understanding of today's topic. Then I will discuss some problems with the existing designs in the open source community and present our solution to address those issues. I will use our team's product, the Google Cloud Storage FUSE CSI driver, as a concrete example to demonstrate some technical details. A demo will follow to show you how a typical AI application development process can be streamlined using an object storage FUSE CSI driver. Finally, I will discuss the constraints of this design and our future plans.

First of all, why are CSI drivers important? CSI stands for Container Storage Interface. It provides a standard interface between Kubernetes and storage providers, which makes it easier for different storage providers to deploy and manage storage on Kubernetes clusters. As the diagram shows, a typical Kubernetes CSI driver usually consists of a couple of components that allow Kubernetes to communicate with and manage storage providers. The CSI driver allows developers to easily consume the underlying storage using standard Kubernetes APIs, such as persistent volumes, persistent volume claims, and CSI ephemeral inline volumes.

Next question: why is object storage a big deal, especially for AI/ML workloads? It is because object storage provides a couple of advantages that are critical to AI/ML workloads. The first is scalability. AI workloads often require a massive amount of data to train and deploy models. Object storage can scale to petabytes and beyond, making it ideal for storing and managing AI datasets. Second, object storage is designed to deliver high performance for concurrent access to large files. This is important for AI workloads, which often need to access data quickly from multiple compute nodes. Moreover, object storage is typically more cost-effective than other types of storage. Last but not least, object storage is easy to use and manage. It usually provides access control, object versioning, and geo-replication features, and it is typically compatible with a wide range of AI platforms and tools.

Let's talk about FUSE drivers. FUSE, which stands for Filesystem in Userspace, is a Linux framework that lets you develop a file system in Linux user space. FUSE drivers can be used to translate object storage buckets into virtual file systems, so that they can be mounted as local file systems.
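To make the idea concrete, here is a minimal sketch of a user-space file system in Go, using the open-source hanwen/go-fuse library purely for illustration (this is not necessarily the library any of the products named below use, and the mount path is hypothetical). An object storage FUSE driver follows the same pattern, translating each file operation the kernel forwards into bucket API calls:

```go
package main

import (
	"log"
	"os"

	"github.com/hanwen/go-fuse/v2/fs"
)

// root is an empty file system node. A real object storage FUSE driver
// would implement lookup/read/write here by calling the bucket API.
type root struct {
	fs.Inode
}

func main() {
	mountDir := "/tmp/demo-mnt" // hypothetical mount point
	if err := os.MkdirAll(mountDir, 0o755); err != nil {
		log.Fatal(err)
	}
	// Mount the user-space file system; the kernel forwards every file
	// operation under mountDir to this process.
	server, err := fs.Mount(mountDir, &root{}, &fs.Options{})
	if err != nil {
		log.Fatalf("mount failed: %v", err)
	}
	server.Wait() // serve until the mount point is unmounted
}
```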
The major cloud providers have all released FUSE drivers for their object storage solutions. For example, Microsoft Azure has Azure Blob Fuse, Google Cloud Platform has Google Cloud Storage FUSE, and AWS recently announced their solution for S3, called Mountpoint.

So why do FUSE drivers matter? Object storage FUSE drivers provide two main benefits: portability and cost-effectiveness.

Let's take a look at portability first. Running the same workload in different environments usually requires code refactoring. For example, data scientists usually use a subset of the training dataset from a local file system to develop and tune their training code. When they need to lift the code to the cloud or deploy it to the production environment to train the model, the training job needs to consume the entire dataset stored in object storage buckets. In some cases, they will have to refactor the code using a cloud provider SDK or API to access the data. An object storage FUSE driver avoids this by allowing AI applications to access data in buckets directly through the file system, without dealing with cloud provider SDKs or APIs (see the sketch at the end of this section).

Another huge benefit is cost-effectiveness. AI applications usually require expensive computing resources such as GPUs or TPUs. Before the training job actually starts, the application needs to make sure the data is ready to feed the computing resources. Downloading or prefetching data from object storage to local storage can result in compute resource idle time, which is a huge waste of time and money. Object storage FUSE drivers allow AI applications to take advantage of these cost savings by eliminating the need to download data to local storage. All the data can be streamed, and the training jobs can start right away.
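Here is a minimal sketch of the portability point. With a FUSE mount, reading training data is plain file I/O; without it, the same read needs provider-specific client code. The mount path, bucket name, and object name below are hypothetical; the SDK half uses the Google Cloud Storage Go client (cloud.google.com/go/storage):

```go
package main

import (
	"context"
	"log"
	"os"

	"cloud.google.com/go/storage"
)

func main() {
	// With a FUSE mount: the bucket looks like a local directory,
	// so the training code needs no cloud SDK at all.
	data, err := os.ReadFile("/data/mnist/train-images-idx3-ubyte")
	if err != nil {
		log.Fatal(err)
	}
	_ = data

	// Without a FUSE mount: the same read requires the provider SDK,
	// here the Google Cloud Storage Go client.
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	rc, err := client.Bucket("my-training-data").
		Object("mnist/train-images-idx3-ubyte").NewReader(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer rc.Close()
}
```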
So how about we combine all these components together? An object storage FUSE CSI driver could be a great choice for mounting object storage buckets as local file systems on Kubernetes clusters. There are a couple of existing implementations in the open source community, and many platforms and products rely on these drivers. However, a number of challenges can arise with this combination, and these challenges prevent the solution from becoming a widely accepted choice.

So what are the problems? In the current implementations, the FUSE drivers typically run either inside the CSI driver container or directly on the node VM. Both approaches have drawbacks. If all the FUSE driver instances run inside a single CSI driver container, it can be a single point of failure. This also makes it very difficult to gracefully restart CSI driver containers without interrupting all the FUSE mount points, especially during a CSI driver upgrade. Besides the availability issue, another drawback is scalability. It can be difficult to scale to large numbers of concurrent users or large datasets, because all the FUSE mount points rely on a single CSI driver container, which can become a bottleneck for larger workloads.

Running the FUSE drivers directly on the node VMs could be a viable way to address the availability and scalability issues. However, because FUSE drivers can consume significant resources on the VMs, it can be difficult to manage and report the allocatable resources on the nodes if the resource consumption is not tracked by Kubernetes. Finally, whether the FUSE drivers run inside the CSI driver container or on the node VM, they usually use unified authentication for all the FUSE mount points, when it would be more ideal to use workload-specific credentials to access the FUSE volumes.

Now, introducing a sidecar-based solution. The FUSE instances run inside sidecar containers alongside the workload pods, and the solution takes advantage of the kernel's cross-process file descriptor transfer. Basically, it divides the FUSE mount into two steps handled by two different processes, and relies on file descriptor transfer to connect the two steps. I will show you the details later.

The solution addresses all of the issues above. First, availability and scalability: running as part of a sidecar container, the lifetime of the FUSE process matches that of the consuming pod, and it scales easily because each pod has a separate sidecar container. Second, resource attribution: the sidecar container resources can be individually configured. And third, security: the sidecar container can leverage workload identity features. Specifically, most cloud providers allow developers to use a Kubernetes service account as an IAM service account to authenticate with the cloud provider API. Furthermore, the file descriptor transfer technique allows the sidecar container to run as an unprivileged container. Additionally, and most importantly, our POC proves that this solution can be adopted by any FUSE driver, as long as the driver supports mounting using a file descriptor number.

Let's take a look at our team's implementation as an example. This design follows the common CSI driver architecture. The CSI driver node server runs as a DaemonSet pod on each node. Sockets are used for the communication between the CSI driver node server and the kubelet. For simplicity, the CSI driver controller server is not included in this diagram. The FUSE drivers run inside the sidecar containers along with the workload pods. The workflow starts from an admission webhook controller: the webhook monitors all the pod creation requests and injects the sidecar container into the pod if FUSE volumes are detected. In our design, users can configure the sidecar container resource limits using pod annotations. The FUSE drivers use the pod service account, with the workload identity feature enabled, to authenticate with the cloud provider API. The CSI driver node server implements the CSI spec to allow the kubelet to mount and unmount volumes.

Then comes the most interesting part of this design, the so-called two-step FUSE mount technique. In the first step, the CSI driver opens the /dev/fuse device on the node VM and obtains a file descriptor. It then calls the Linux mount command with the fuse3 file system type, passing the file descriptor. Finally, the CSI driver process calls the Linux sendmsg system call to send the file descriptor to the sidecar container via a Unix domain socket (UDS) in an emptyDir volume. In the second step, inside the sidecar container, a launcher process connects to the UDS and calls the Linux recvmsg system call to receive the file descriptor from step one. The process then starts the FUSE driver, passing it the file descriptor, to begin serving the FUSE mount point. Instead of passing an actual mount point path, we can pass the file descriptor number to the FUSE driver, as long as it supports mounting using a file descriptor. In this process, the sidecar container does not need to be a privileged container; only the CSI driver container requires privilege to open the /dev/fuse device on the node VM.
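Here is a rough Go sketch of that file descriptor handoff, using SCM_RIGHTS over a Unix domain socket. It is a simplification of the idea, not our driver's actual code; the mount source, options, and socket wiring are hypothetical:

```go
package mountfd

import (
	"fmt"
	"net"

	"golang.org/x/sys/unix"
)

// Step one, in the privileged CSI driver: open /dev/fuse, create the
// mount, and hand the device file descriptor to the sidecar. The mount
// is tied to the open file description, so the fd stays valid in the
// receiving process even though its number changes.
func csiDriverStepOne(target string, conn *net.UnixConn) error {
	fd, err := unix.Open("/dev/fuse", unix.O_RDWR, 0)
	if err != nil {
		return err
	}
	// Mount the fuse file system; the fd= option tells the kernel
	// which /dev/fuse descriptor will serve it (options illustrative).
	opts := fmt.Sprintf("fd=%d,rootmode=40000,user_id=0,group_id=0", fd)
	if err := unix.Mount("my-bucket", target, "fuse", 0, opts); err != nil {
		return err
	}
	sock, err := conn.File()
	if err != nil {
		return err
	}
	defer sock.Close()
	// Send the descriptor as SCM_RIGHTS ancillary data, with one
	// dummy payload byte so the message is always delivered.
	rights := unix.UnixRights(fd)
	return unix.Sendmsg(int(sock.Fd()), []byte{0}, rights, nil, 0)
}

// Step two, in the unprivileged sidecar: receive the descriptor and
// start the FUSE driver against it instead of a mount path.
func sidecarStepTwo(conn *net.UnixConn) (int, error) {
	sock, err := conn.File()
	if err != nil {
		return -1, err
	}
	defer sock.Close()
	buf := make([]byte, 1)
	oob := make([]byte, unix.CmsgSpace(4)) // room for one 32-bit fd
	_, oobn, _, _, err := unix.Recvmsg(int(sock.Fd()), buf, oob, 0)
	if err != nil {
		return -1, err
	}
	msgs, err := unix.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return -1, err
	}
	fds, err := unix.ParseUnixRights(&msgs[0])
	if err != nil {
		return -1, err
	}
	return fds[0], nil // the FUSE driver serves requests on this fd
}
```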
Now let's take a look at the demo. In the demo, I will deploy a Jupyter notebook server, a training job, and an inference workload. All the applications consume the same bucket. The demo will use the MNIST database and TensorFlow to train a model that can recognize handwritten digits.

First, here is the Jupyter notebook server spec. We are using a load balancer so that the notebook server is publicly accessible via a public IP address. In the pod spec, the GCS bucket is specified in a CSI ephemeral inline volume. By using the CSI ephemeral inline volume, we don't need to define PV or PVC objects. The pod has some annotations that tell the webhook that this pod uses a FUSE volume and that specify the sidecar container resource limits (both this spec and the PV/PVC pair are sketched at the end of the demo). With the workload identity feature enabled on this cluster, the pod uses a Kubernetes service account as a GCP IAM service account to authenticate with the Google Cloud Storage API.

Here is the Jupyter notebook server pod up and running. Let's inspect the pod. As you can see, the sidecar container was injected. Now let's take a look at the bucket we are using for the demo: some data has been uploaded to the bucket already. Now let's move on to the Jupyter notebook server. As you can see, the bucket is mounted as a local file system. Because the dataset is already uploaded to the bucket, we can easily use the Jupyter notebook to explore the dataset and do some sampling. You can also develop your training code against the dataset, just as if the dataset were in the local file system. The training code is also stored in the bucket.

Now I'm going to create a PV/PVC pair for the training job and the inference workload to access the bucket. The GCS bucket name is specified on the PV object using the volumeHandle field. Here is the training job spec. Similarly, the pod has the annotation and uses a service account; the difference is that this time the pod consumes the bucket using a PVC. The training job will fetch the training code from the bucket, train the model, and periodically write the training checkpoints and logs to the bucket. In the end, the job will persist the output model in the bucket.

Here is the training job pod. Let's take a look at the logs. The training has finished and the model was saved. Let's check the artifacts in the bucket on the Jupyter notebook UI. As you can see, the models were saved in the bucket, and here are the training checkpoints and TensorFlow logs.

Lastly, I will deploy an inference workload. The inference workload consumes the same PV/PVC pair, which points to the same bucket. The inference job will use the pre-trained model in the bucket to make some predictions. The inputs are also stored in the bucket. Here are the inputs. Here is the inference pod. Let's check the logs. The prediction has completed, and the prediction results look pretty nice.
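To recap the wiring, here is a rough sketch of the notebook pod from the demo, written with the Kubernetes Go API types (k8s.io/api/core/v1). The bucket name, image, and service account name are hypothetical placeholders, and the annotation keys and driver name are illustrative rather than authoritative; check the project repo for the exact values our webhook expects:

```go
package demo

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// notebookPod sketches the demo's Jupyter pod: a CSI ephemeral inline
// volume for the bucket, plus annotations for the sidecar webhook.
func notebookPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name: "jupyter-notebook",
			Annotations: map[string]string{
				// Tells the webhook to inject the FUSE sidecar (key illustrative).
				"gke-gcsfuse/volumes": "true",
				// Sidecar resource limits (keys illustrative).
				"gke-gcsfuse/cpu-limit":    "500m",
				"gke-gcsfuse/memory-limit": "1Gi",
			},
		},
		Spec: corev1.PodSpec{
			// Kubernetes service account bound to an IAM service account
			// through workload identity (name hypothetical).
			ServiceAccountName: "gcs-access-sa",
			Containers: []corev1.Container{{
				Name:  "notebook",
				Image: "jupyter/tensorflow-notebook",
				VolumeMounts: []corev1.VolumeMount{{
					Name:      "training-data",
					MountPath: "/data",
				}},
			}},
			Volumes: []corev1.Volume{{
				Name: "training-data",
				VolumeSource: corev1.VolumeSource{
					// CSI ephemeral inline volume: no PV/PVC objects needed.
					CSI: &corev1.CSIVolumeSource{
						Driver: "gcsfuse.csi.storage.gke.io",
						VolumeAttributes: map[string]string{
							"bucketName": "my-demo-bucket", // hypothetical
						},
					},
				},
			}},
		},
	}
}
```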
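And here is a similar sketch of the PV/PVC pair used by the training job and the inference workload, with the bucket name carried in the PV's volumeHandle field. Names, capacity, and access modes are hypothetical:

```go
package demo

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// bucketPV sketches a pre-provisioned PersistentVolume whose
// volumeHandle names the GCS bucket.
func bucketPV() *corev1.PersistentVolume {
	return &corev1.PersistentVolume{
		ObjectMeta: metav1.ObjectMeta{Name: "gcs-bucket-pv"},
		Spec: corev1.PersistentVolumeSpec{
			Capacity: corev1.ResourceList{
				corev1.ResourceStorage: resource.MustParse("5Gi"),
			},
			AccessModes: []corev1.PersistentVolumeAccessMode{
				corev1.ReadWriteMany,
			},
			PersistentVolumeSource: corev1.PersistentVolumeSource{
				CSI: &corev1.CSIPersistentVolumeSource{
					Driver:       "gcsfuse.csi.storage.gke.io",
					VolumeHandle: "my-demo-bucket", // the bucket name
				},
			},
		},
	}
}

// bucketPVC sketches the claim that both the training and the
// inference pods reference in their volume specs.
func bucketPVC() *corev1.PersistentVolumeClaim {
	return &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "gcs-bucket-pvc"},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes: []corev1.PersistentVolumeAccessMode{
				corev1.ReadWriteMany,
			},
			// VolumeResourceRequirements in recent k8s.io/api releases;
			// older releases name this type ResourceRequirements.
			Resources: corev1.VolumeResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("5Gi"),
				},
			},
			VolumeName: "gcs-bucket-pv", // bind to the PV above
		},
	}
}
```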
Here are some takeaways from today's talk. An object storage FUSE CSI driver could be a great choice for AI workloads on Kubernetes, and our sidecar-based solution with the two-step FUSE mount technique addresses the pain points of the traditional designs. Our solution has various advantages.

Okay, no design is perfect, so let's talk about limitations and our future plans. One restriction of the design is that it does not support FUSE volumes in init containers. Moreover, the implementation has to handle the sidecar container auto-termination after the main container exits. Fortunately, the Kubernetes sidecar container feature is a great solution that removes all of these limitations. The sidecar container feature is available in Kubernetes 1.28, and we are looking forward to using the feature on GKE after it is promoted to beta.

On Thursday, Todd and Sergey are going to talk about the sidecar container feature. If you are interested, please check it out. Here are some references. You can find our project repo on GitHub by searching for "Google Cloud Storage FUSE CSI driver". Feel free to create PRs, start a discussion, or open an issue. Thank you very much, and see you there.