Hello, everyone. Today we are going to talk about data locality in Kubernetes. We will discuss some research results and open source solutions for taking advantage of data locality for workflows and data-intensive pipelines on Kubernetes. I'm Chen Wang from IBM Research. I have been actively contributing to several Kubernetes projects, including Kubernetes vertical pod autoscaling, Kubernetes scheduler plugins, and other projects. Recently I have become more interested in sustainable computing projects, and I have been actively contributing to Kepler and CLEVER under the sustainable-computing.io organization. Besides that, my research interests mainly lie in resource management, using methodologies including reinforcement learning, time series prediction, and various data-driven approaches for container cloud resource management. My co-speaker Xiaowei is from Alluxio. Please introduce yourself.

Yeah, thank you, Chen. This is Xiaowei. I'm a co-maintainer of the Alluxio project, and at the same time I also serve as the open source product manager. My interests are mainly in distributed file systems and their co-design with Kubernetes.

So nowadays, a lot more AI and machine learning workloads are migrating to Kubernetes, and Kubernetes has become a very natural fit for running those AI and machine learning workflows. There are a lot of reasons to use Kubernetes for AI and data-intensive workloads. For example, Kubernetes naturally provides the scalability to scale resources to meet the needs of AI and machine learning training workloads. It can also provide the elasticity needed, through autoscalers, to scale production inference workloads. Kubernetes also gives you nice support for continuous development, which is required by nature for AI and machine learning workloads. Kubernetes can provide a layer of abstraction that allows data scientists to access various services without worrying about the underlying infrastructure. It can provide high availability and failover protection to improve your service level agreements and resilience. It also allows you to operate your AI and machine learning workloads across different clouds, including public clouds, private clouds, on-premise data centers, and even secure air-gapped locations, and migrating workloads across those boundaries becomes less costly. It can also provide a single consistent platform for the different stages of your AI and machine learning workflows and pipelines. Besides that, there are all kinds of open source frameworks available in the upstream community that run on Kubernetes, including, for example, Argo, Kubeflow, Ray, TorchX, etc.

So when you start running your AI and machine learning workflows and pipelines on a cluster, you may want to answer the following questions about the data needed for your machine learning and AI workloads. First, where is the data stored permanently? Then you want to know where the data is cached temporarily, whether inside Kubernetes or not. Then, given the data locality information, how should we place our compute pods? How do we load and distribute the data to those pods fast and efficiently enough? How can we scale resources if the data volume changes significantly? When we write an AI or machine learning pipeline job, what APIs should we use to access our data? And how can we guarantee the persistence and resilience of the data?
And how can we forward data between containers in workflows and pipelines? How can we manage the whole lifecycle of data in Kubernetes? And if we are running a multi-tenant environment, using Kubernetes in a multi-tenant way, how can we isolate and secure data access among different tenants? I will first start with some research experiments we did, which try to answer the highlighted questions shown here, and then Xiaowei will address the other questions and describe what existing open source tools we can use to solve those problems.

We started by trying a data-intensive AI benchmark workload called Tesseract. Tesseract is an open source optical character recognition (OCR) engine. It operates on images, which is data intensive, and it involves a lot of stages and tasks with non-trivial computational and data graphs. Between the different stages and components it features smaller data transfers compared to more IO-intensive workloads; however, it still involves a significant amount of data at scale. We first converted this workload from a monolithic application into microservice benchmarks. We converted Tesseract into two benchmarks based on its call graphs. The coarse-grained benchmark processes pages of documents in five different microservices, as shown in the granularity column. Similarly, we partitioned Tesseract into 11 microservice deployments, which serves as the fine-grained benchmark. Each microservice deployment can operate at a different level of granularity, for example per page, per block, or per text line. We studied the processing time and how long each microservice takes in the whole pipeline; for example, the major component that takes a large share of the processing time is the LSTM. On the data side, to feed the pipeline we used two single-page documents to create inputs of varying size, from 1 to 128 pages, to emulate different input sizes. More details can be found in the references and publications shown below.

Both benchmarks run as a pipeline of microservice jobs on a Kubernetes cluster. The Kubernetes cluster has three workers, each with eight cores, 32 gigabytes of memory, and 50 gigabytes of disk. We set up an NFS server in the same data center, in the same rack, to emulate the performance of remote storage access, and that should be close to the ideal case, because usually cloud object storage servers are much further away. The memory and the disk feature maximum IO rates of 17 gigabytes per second and 175 megabytes per second, respectively. The servers are connected through a 10 gigabit per second network switch.

We first did a very simple study on where we should cache the data. In the data pipeline benchmark, we basically have two types of containers: one runs as data producer tasks, which write data, and the other runs as data consumer tasks, which read the data, and we assume data is accessed only sequentially between producers and consumers. Intuitively, there are a few ways to share the data between producers and consumers: sharing through a persistent volume mounted on local memory, sharing through a persistent volume mounted on local disk, and sharing through a persistent volume mounted on remote storage. From the results shown on the bottom left, we find that the local memory and local disk solutions exhibit similar performance on the initial read and write operations. In the legend, "first" denotes the initial read and write operations, "next" denotes all the subsequent read and write operations, and "remote" denotes data transfers between containers for tasks that are not co-located.
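To make this measurement concrete, here is a minimal sketch, assuming mount points such as /mnt/mem, /mnt/disk, and /mnt/nfs for the three volume types; it only illustrates how sequential read and write bandwidth can be timed, and it is not the harness we actually used.

```python
# Minimal sketch (illustrative, not the benchmark from the study): time one
# sequential write and read of a large file at a given mount point.
# The paths and sizes below are assumptions.
import os
import time

CHUNK = 4 * 1024 * 1024          # 4 MiB per write/read call
TOTAL = 1024 * 1024 * 1024       # 1 GiB of data in total

def measure(mount_point):
    path = os.path.join(mount_point, "bench.bin")
    data = os.urandom(CHUNK)

    start = time.time()
    with open(path, "wb") as f:
        for _ in range(TOTAL // CHUNK):
            f.write(data)
        f.flush()
        os.fsync(f.fileno())      # make sure the write actually reaches the volume
    write_bw = TOTAL / (time.time() - start) / 1e6

    start = time.time()
    with open(path, "rb") as f:
        # Note: reading right after writing may already hit the page cache,
        # which is closer to the "next" case than the "first" case.
        while f.read(CHUNK):
            pass
    read_bw = TOTAL / (time.time() - start) / 1e6

    os.remove(path)
    return write_bw, read_bw      # MB/s

for mnt in ["/mnt/mem", "/mnt/disk", "/mnt/nfs"]:
    print(mnt, measure(mnt))
```

Running something like this once per volume approximates the "first" numbers, while repeating the read of the same file approximates the "next" (already cached) behavior.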
We can see that local reads and writes achieve much higher bandwidth than remote reads and writes; the subsequent local read and write bandwidths are up to 4.4 times and at least 2.1 times higher than the remote reads and writes. So keeping data local can significantly improve the achievable bandwidth compared to remote reads and writes. Similarly, we compared the local disk and memory solutions with an anonymized cloud object storage. The effective bandwidth we achieved is over six times higher than the object storage, and we believe this will not vary much across different cloud providers. From this study, we can see that caching data locally will significantly improve read and write bandwidth for your data, and ultimately it will improve your compute resource utilization, because you will load data faster.

Then, assuming you already have the data cached in your cluster, how can you utilize the data locality information to determine your pod placement strategy? We first separated the compute time from the IO time for those tasks and looked at how the IO time is impacted by the pod placement strategy. We can see that the average IO time for local data is significantly lower than the average IO time for data on remote storage; the average IO time for remote data is as much as five times the average IO time for local data. We then compared the end-to-end processing time for the whole pipeline with different pod placement strategies. Here 50% and 100% denote the number of concurrent task executions, i.e., the degree of parallelism. The consolidated solution tries to place tasks node by node in a sequential fashion, trying to saturate the resources available on each node, while the data-centric placement strategy tries to place pods to reduce internal data movement, considering the number of data items and the amount of data shared and exchanged between functions. We evaluated the end-to-end time for both the coarse-grained and fine-grained benchmarks. We observed 4.3 times and 2.6 times improvement for the consolidated and data-centric placement solutions, respectively. For the fine-grained benchmark the gain is smaller, but the data-centric solution still improves end-to-end performance by at least 1.5 times. Besides that, this also tells us that choosing the right granularity when decomposing your pipeline is very important: you do not want to partition your application too thinly, so that you can minimize the data transfers between stages. From this study, we can see how much performance improvement you can get from a data-locality-aware placement strategy.

The third question is: if data passes between different stages in your AI and machine learning pipelines, how can you pass that data? Naturally, there are four ways, and we did an experiment with those four solutions, comparing them against the baseline approach of using remote storage to share the data. The four solutions are: reading and writing data through remote storage such as NFS, reading and writing data through a local memory-backed volume, reading and writing data through a local disk-backed volume, and, as the fourth solution, reading and writing the data between pods through a direct pipe or socket. We can see that the improvement of the different solutions over remote storage varies across data sizes, and the biggest improvement is for direct communication between pods on the biggest data chunk, which is 16 gigabytes.
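To make the direct-forwarding option more concrete, here is a minimal sketch, assuming two pods that share a volume where a named pipe can live; the path /shared/stage-pipe and the buffer size are hypothetical, and this is not the code used in our experiments.

```python
# Minimal sketch (illustrative): a producer streams data to a consumer through a
# named pipe on a volume both pods share. The path and the 1 MiB buffer size are
# assumptions, not measured values.
import os

PIPE_PATH = "/shared/stage-pipe"
BUF_SIZE = 1 << 20  # 1 MiB forwarding buffer: larger buffers trade memory for throughput

def producer(chunks):
    if not os.path.exists(PIPE_PATH):
        os.mkfifo(PIPE_PATH)
    with open(PIPE_PATH, "wb") as pipe:    # blocks until a consumer opens the other end
        for chunk in chunks:
            pipe.write(chunk)

def consumer(handle_bytes):
    with open(PIPE_PATH, "rb") as pipe:
        while True:
            chunk = pipe.read(BUF_SIZE)
            if not chunk:                  # EOF: the producer closed its end
                break
            handle_bytes(chunk)
```

Because the intermediate data never lands on a persistent volume in this scheme, it is gone if either pod dies, which is exactly the persistence trade-off discussed next.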
If we want to enable direct data forwarding, there is a trade-off between the resource utilization of your buffer and the size of the data being forwarded, which is also determined by the buffer size. We discussed the trade-off in choosing the buffer size in more detail in a paper called Flocking, published in a Middleware workshop, so you can check the paper if you need more details. We later tested direct data forwarding with different data-passing patterns, for example one-to-one, one-to-many, many-to-one, and many-to-many. The speedup from direct data forwarding can go up to 60 times for some sizes, and it achieves at least a two-times improvement for the largest data chunk across these data transfer patterns. So direct data forwarding can significantly reduce the end-to-end execution time of a workflow, by up to 60 times, but the disadvantage is that you may lose the persistence of the intermediate data if your compute pod dies. However, for pods that process a huge amount of data with very little processing time, this may not be a bad scenario, as you can always quickly restart the pod. That's all of the experiments we did to answer those questions about data locality, and I will now pass it to Xiaowei, who will talk more about open source solutions and tools for the remaining questions.

Thank you, Chen. In the following session, we want to give some idea of how you can answer these questions and implement a real system with Alluxio. First of all, let me give a basic introduction to what Alluxio is and what Alluxio is for. Alluxio is an open source project that started from the UC Berkeley AMPLab in 2014. Up to now, we have more than 1,200 contributors, and we are also recognized by many different rankings as one of the most critical Java-based open source projects on GitHub. So what is Alluxio? Alluxio is a new layer between compute and storage. It is very hard to couple compute tightly to storage, since there is a lot of evolution on the compute side and also on the storage side. Every time you want to do a data migration, or you want to access data from a heterogeneous architecture, you run into a lot of migration problems, as well as performance and tuning issues. So we built this new layer in between to simplify these things. In this talk, I will mainly focus on data accessibility, which is provided by the Alluxio cache functionality. We had a lot of questions about how we can support a really efficient machine learning workflow and data pipeline in Kubernetes, so I will give a detailed answer on how we can build this data pipeline with Alluxio, following those questions.

Before we dive into the solution, I want to give some idea of what is challenging about machine learning data processing from the data perspective. First of all, when you do machine learning training and data processing, you fetch the data from storage. We can call this part the fetch stall if you hit IO-bound problems there. The main challenge in this part is whether you have an efficient way to fetch the data from storage.
For example, if you fetch from a hard disk drive, maybe the speed is not enough to feed the preprocessing in the next step, which we call the prep stall, running on the hardware, most likely a CPU or FPGA kind of hardware. And in the next part, you want to feed the data into the GPU, so you can also have a GPU-bound problem there. In this talk, we will mainly focus on how you can efficiently fetch the data from storage, no matter whether it is cloud storage, on-premise storage, or local storage, to overcome the fetch stall in machine learning training.

Just imagine where you store the data nowadays, because there is a tremendous amount of data generated every day. Most likely the data you want to use for machine learning training is actually not only for machine learning training: it is also used for data analytics, for ETL jobs, and maybe for many different customized computations before you do the machine learning training. So most likely you will choose the cheapest, most affordable storage. It can be an object store, it can be HDFS, or it can be cloud storage. And when you want to compute on this kind of data, where is your compute cluster? What we see nowadays is that people mainly run their training clusters on Kubernetes, because it gives you good multi-tenancy management, it gives you better deployment, and it gives you a much better ROI for your compute in this kind of environment.

So the gap we see is that we have to fetch the data from the object store or the HDFS cluster into the training cluster; we have to fetch remote data. But just imagine that your data is really small files, images of maybe 10 KB or even hundreds of bytes. It is very inefficient to fetch that kind of data from these storage systems. So we came to the conclusion that if we have to do that, it is not affordable, because it makes the fetch stall very heavy in the machine learning training process, which makes the CPU or GPU utilization very low; it can be maybe 10% or 20% CPU or GPU utilization in this kind of scenario. So we said: let's add one layer there. Once you access a file, we want to cache this file in the training cluster, which means that every time you fetch the data afterwards, it is fetched locally instead of from the remote, slow, permanent persistence layer.

So the first question you want to answer is about the interface. The object storage or HDFS provides a very standard way for you to access data: it can be an HDFS-compatible interface, or it can be an S3 or S3A-compatible interface. But what we see from the machine learning training side is that people don't really implement all of these interfaces; they fetch data through framework-specific loaders, for example a dataset for PyTorch, often in a not-so-efficient way that doesn't cover every corner case of machine learning training. So we did a lot of investigation here, and we found that most of the time people still want to use the POSIX interface, like a local file system, to access this kind of data, so they don't need to change anything in the application layer.
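To illustrate why the POSIX interface is so convenient, here is a minimal sketch, assuming a hypothetical FUSE mount point such as /mnt/alluxio-fuse/train where a cache layer exposes remote object-store files as local files; the training code below is ordinary PyTorch-style data loading and is not taken from Alluxio.

```python
# Minimal sketch (illustrative): training code reads its data through an ordinary
# POSIX path. /mnt/alluxio-fuse/train is an assumed mount point where a FUSE-based
# cache could expose remote object-store data as local files.
import os
from torch.utils.data import Dataset

class ImageBytesDataset(Dataset):
    def __init__(self, root="/mnt/alluxio-fuse/train"):
        self.paths = [os.path.join(root, f) for f in sorted(os.listdir(root))]

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Plain file I/O: the application neither knows nor cares whether the
        # bytes come from a local cache or the remote object store.
        with open(self.paths[idx], "rb") as f:
            return f.read()
```

The application code stays the same whether that path is backed by local disk, a cache layer, or a remote store, which is exactly why the POSIX interface is so popular here.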
The second question is: since access to remote data is not so efficient, how can we provide data locality in this new architecture? First of all, if your storage is cloud storage, it can be in a different region from your organization, and even if it is in the same region, there is throttling there, so it is not so efficient if you want to run many different training jobs at the same time. So in the first part, we have the Alluxio cluster, which includes the Alluxio master and the Alluxio workers. The Alluxio master is responsible for metadata operations and request responses, and the Alluxio workers are responsible for data requests and the data stored in the Alluxio cluster. Every time you access data, we cache the data in the Alluxio workers and cache the metadata in the Alluxio master, and this gives you better performance through cluster-level locality, which means that when you access remote or cross-region data, we keep the data in the same region for you to access.

Afterwards, we found this is still not efficient enough, because in a Kubernetes environment, most of the time you have to launch the Alluxio pods and mount the POSIX interface into the training pod through the Alluxio FUSE pod, which means that when you want to fetch the data from the Alluxio master and workers it is still a remote call; in Alluxio we use gRPC as the transport layer, so there is still some overhead there. In this architecture, in order to provide better performance, we also provide another locality layer, called node-level locality, in the Alluxio FUSE pod: the Alluxio FUSE pod also caches the file metadata, together with the data, on the same host machine as the machine learning pod, which can help you access data faster.

So we saw an overall improvement in performance here. But as we mentioned in the previous slides, the persistence layer is still the object storage or HDFS, so how do we guarantee the persistence and resilience of the data? Alluxio here is just a caching system, which means we do not handle data durability ourselves; but whenever we find the data is inconsistent between the Alluxio namespace and the cloud storage namespace, we fetch the data back into the Alluxio namespace to update it, and for writes, we persist the data into the object store or HDFS to make sure the data is never lost. As we can see from the previous slides, because it is a caching system, the first data load is still very slow. So how can we overcome this problem?
So even for the first time, we don't want the machine learning pod to suffer a lot of slow access, because slow access means money. So we want to prefetch the data into the Alluxio namespace, and we provide one piece of functionality for this in Alluxio called distributed load (see the sketch after this part). First of all, from the client side you can issue a distributed load command to the Alluxio master pod. In most cases you know which model you want to train, and the training data is already there, so you just load a certain path into the Alluxio namespace. The training data will first be loaded into the Alluxio workers, and all of this data will reside in the Alluxio namespace. This overcomes the first point we mentioned before: it provides cluster-level locality and avoids copying the data from the remote cluster to the local cluster at training time. Then, when your machine learning pod really wants to access this data, the Alluxio FUSE pod will passively cache the data from the Alluxio worker pod into the Alluxio FUSE pod, which means that at this point we cache the file metadata and the data in the Alluxio FUSE pod. For the next part, when you finish the data fetching and you want to do the machine learning training, the prefetched data is already in CPU memory, so the training data is fed to the machine learning training pod from memory, through the mounted POSIX volume. So now we have a good understanding of what is going on and how Alluxio can help you build this kind of system to solve these problems.
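As a concrete illustration of this prefetch step, here is a minimal sketch of issuing the distributed load before a training job starts; the namespace, pod name, and dataset path are assumptions, and the `alluxio fs distributedLoad` command follows the Alluxio 2.x CLI, so adjust it to your deployment.

```python
# Hypothetical warm-up step: ask Alluxio to prefetch a training dataset into its
# workers before the training job starts. The namespace "alluxio", the pod name
# "alluxio-master-0", and the dataset path are illustrative assumptions.
import subprocess

def prefetch(path="/datasets/imagenet"):
    subprocess.run(
        ["kubectl", "exec", "-n", "alluxio", "alluxio-master-0", "--",
         "alluxio", "fs", "distributedLoad", path],
        check=True,  # raise if the load command fails
    )

if __name__ == "__main__":
    prefetch()
```

You could run something like this as a warm-up job or an init step in the pipeline, so that the first training epoch already reads from the cache.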
The next question is: how do we manage the whole lifecycle of the data in Kubernetes? First of all, in the beginning we thought about many different ways. For example, at an early stage we used a sidecar mode to do all of this, and afterwards we moved to the Alluxio CSI driver; CSI is the Container Storage Interface in Kubernetes. In the beginning, say you have a PyTorch pod, or call it a business pod, in your environment. Every time the Alluxio FUSE crashed or had some problem, you had to restart the whole pod to recover from that kind of failure. So afterwards we decided it is better to separate them and do better management of the application and the Alluxio FUSE pod, which makes running the service in this environment more reliable. So we introduced the CSI component into the Alluxio Kubernetes story. When you want to use it, the CSI component will create the Alluxio FUSE pod, mount the persistent volume into the Alluxio FUSE pod, and at the same time mount it into the application container. This means that whenever there is a failure, you can restart the application pod and the Alluxio FUSE pod separately, and if you want to upgrade, you don't need to kill the whole business pod together with the Alluxio FUSE pod; you can upgrade them independently.

Afterwards, I want to share a use case, which is what we saw at Microsoft. It is not the best use case to show raw performance, but we really appreciate it because it shows the GPU utilization at the beginning of the model training and at the end of the model training, so we can see what happened there. Overall, Alluxio sped up the end-to-end training time by 18%. As you can see, when it reaches about 11 a.m., it is already no longer an IO-bound application, which means all of the data is already being fed into the CPU memory and into the GPU memory. The maximum GPU utilization for this model training with the existing workflow is about 75%, which means that even if you reduce the IO bound to zero, the GPU utilization will still be capped there. With Alluxio, as you can see in the lower figure, at the beginning you do not have to spend so much time prefetching or warming your data from the underlying storage: you can directly access the data and bring the GPU utilization from zero up to 75% in a very short time, which saves a lot of money on warming the data and also on writing the data back. Another part is that previously, people still had to replicate the data from, for example, the object store to a local storage system, which could be an HDFS cluster, a POSIX-compatible distributed file system, or a network file system, so you still had to do a lot of error-prone work and do the data copy by yourself. After you set up Alluxio, you don't need to worry about that.

Okay, we really appreciate everyone attending, and if you want to contact us, feel free to join our Slack channel and also join our