Good afternoon, everyone. I'm really happy to stand here and share our topic: how to bring data locality into the Kubernetes cluster. As introduced, I'm a committer on Alluxio, and I've been working on this project since 2016. Nowadays I also take on some product management responsibility, building the roadmap for the project.

Hello, everyone, I'm very excited to be here. I'm Chen Wang, a staff research scientist at IBM Research. I contribute a lot to the open source community; I've been involved in the Kubernetes autoscalers, the Kubernetes scheduler plugins, and a sustainability project called Kepler that we'll talk about on Thursday. Besides that, my research interests mainly lie on two sides: one is using machine learning to improve resource management and system management on Kubernetes, and the other is improving the Kubernetes platform to better support machine learning and AI workloads.

I'm going to start with a brief background and why we picked this up as a research problem. Kubernetes is actually a natural fit for running AI and ML workloads, for a lot of reasons. It helps you scale to meet the resource requirements of your AI/ML training and production inference workloads. It gives you good support for continuous development, which AI/ML workloads require by nature. It provides an abstraction layer that allows data scientists to access services without worrying about the underlying infrastructure. It provides high availability and failover protection, which matter for your SLAs and resilience. It allows AI/ML operations to run across public clouds, private clouds, on-premise infrastructure, and even secure air-gapped locations. And it gives you a consistent platform for the different stages of your AI/ML workload, including preprocessing, training, and production inference. In addition, we have all kinds of open source tools available on Kubernetes, including Argo for workflow management, Kubeflow, Ray, TorchX, et cetera.

But when you start running data analytics or machine learning pipelines on Kubernetes, a lot of questions come to mind. Where do we store our data permanently? Where can we cache our data in Kubernetes temporarily? Where do we place pods, given what we know about data locality in the cluster? How do we distribute and load data fast enough to improve overall cluster resource utilization? How do we scale resources when the data volumes become really big? What APIs should we use to get that data into our computing pods? How do we guarantee the persistence and resilience of the data when, for example, a computing pod is gone completely? How do we forward data between different stages of the pipeline? And how do we manage the whole lifecycle of data in Kubernetes? Kubernetes itself doesn't give us much support for these questions, and it doesn't guarantee isolation and data security among tenants in the same cluster either. I will cover several of these questions, and then Shouwei will present what's available in the open source community.
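To make one of those questions concrete, the most common pattern today for getting data into a computing pod is to mount a pre-populated PersistentVolumeClaim and read it through an ordinary file path. Here is a minimal sketch using the Kubernetes Python client; the image name, PVC name, and paths are all hypothetical.

```python
# Minimal sketch (hypothetical names/paths): give a training pod access to a
# dataset by mounting a pre-populated PersistentVolumeClaim at a POSIX path.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="train-job", labels={"app": "ocr-train"}),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="example.com/ocr-trainer:latest",          # hypothetical image
                args=["python", "train.py", "--data", "/data"],
                volume_mounts=[client.V1VolumeMount(name="dataset", mount_path="/data")],
            )
        ],
        volumes=[
            client.V1Volume(
                name="dataset",
                persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                    claim_name="ocr-dataset-pvc"                  # hypothetical PVC
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```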
We started this whole study by building a simple benchmark workload based on the existing monolithic Tesseract application. Tesseract is an open source optical character recognition engine; it's a very popular open source application, but it's monolithic. We decomposed it into microservices in two variants of the benchmark, and then we did all of our evaluations and experiments on that benchmark. As I mentioned, Tesseract is a complete machine learning pipeline. It includes different steps: it reads a TIFF image in, and in the recognition stage it finds the lines and runs the machine learning inference to recognize words and detect paragraphs. We built two benchmarks from this; you can check more details in our paper published at Cloud 2021. The coarse-grained benchmark includes five microservices for the different stages of the task, and the fine-grained one includes eleven microservices. We use two types of input data to generate different input sizes for the whole machine learning pipeline to run on Kubernetes.

Our environment setup includes three worker nodes, each with 32 gigabytes of memory, eight CPUs, and 50 gigabytes of disk space. We also use one NFS server deployed in the same rack as the cluster to emulate the data access pattern of cloud object storage, and we will show some results on cloud object storage as well. The maximum IO speed we have is about 17 gigabytes per second for memory and 175 megabytes per second for disk access, and the network connection is 10 gigabit.

When you want to load data into your computing pods, there are several intuitive options. You can read the data directly from cloud object storage or an NFS server, or you can mount the data in a local volume, either from disk or from memory. We evaluated the speed of getting the data in each case: reading from local memory, local disk, or local persistent volumes, or loading from the NFS server. Across the different data sizes, the maximum bandwidth and throughput improvement over remote read and write goes up to about 4.4x, meaning you need much less time to load the data when it is already in a local volume. We also compared local disk and local memory read/write against object storage: across the different sizes you can get almost a 10x performance improvement at the maximum, and at the minimum you still get at least about 2x for very large data sizes. All of that loading time in your cluster takes up resources, and during that time you are not really using them for compute, so you end up with very low utilization of your expensive GPU resources.
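As a rough illustration of how these loading-speed comparisons can be reproduced, here is a minimal sketch; the mount points and file name are assumptions about where tmpfs, local disk, and the NFS share would be mounted inside the pod.

```python
# Rough sketch: compare sequential read bandwidth from different mount points.
# The paths below are assumptions (tmpfs, local disk, and an NFS mount in the pod).
# A real measurement would also drop the page cache between runs.
import os
import time

MOUNTS = {
    "local memory (tmpfs)": "/mnt/tmpfs/dataset.bin",
    "local disk":           "/mnt/disk/dataset.bin",
    "remote NFS":           "/mnt/nfs/dataset.bin",
}
CHUNK = 8 * 1024 * 1024  # 8 MiB read size

for label, path in MOUNTS.items():
    size = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while f.read(CHUNK):
            pass
    elapsed = time.perf_counter() - start
    print(f"{label}: {size / elapsed / 2**20:.1f} MiB/s")
```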
So the second question: if we assume we can load all the data into the cluster and do the prefetching, what can we do with that data at the platform layer in Kubernetes? In other words, where do we schedule and place our pods? In this experiment we have two setups. One is a local setup, where we try to place all the pods that use the data on a single node. The other is something like the load balancing the default scheduler does, which spreads the pods across nodes in a round-robin fashion. Here we just want to look at the IO time. For example, we ran experiments on one-page and ten-page inputs for both the coarse-grained and the fine-grained benchmark, and you can see the maximum improvement can be up to five times: the time you spend reading the data from local storage can be five times shorter than reading it from remote storage. The next experiment shows the end-to-end processing time of the whole workflow. For the coarse-grained benchmark, the maximum improvement can be four times if you consolidate all the pods that use the data onto the same node. And if you develop a data-centric solution, meaning you place pods so as to reduce inter-node data movement, the improvement is also around three to four times at least.

Next, when you run very large workflows in your clusters, you need to transfer data; there is a lot of intermediate data between stages. You want to take the output of stage one and forward it to stage two. So how can you forward data between stages in workflows and pipelines? Similarly, the simple solution is to read and write through remote storage, and that serves as our baseline, shown as NFS read and write. The other solution is to cache the data locally, on local disk or in local memory. For different data sizes, as the data gets larger, the speedup goes up to 1.4 times. And if we do direct data forwarding, where we don't store the intermediate data at all and just enable data communication between pods, the speedup can reach a further 2.5x. Of course, we also did a study on this: if you want to enable direct data forwarding, there is a trade-off between resource usage and transfer time, and you want to determine an optimal buffer size so you don't use a lot of resources but still achieve good forwarding performance. For this we built a framework called Flocky; if you want more details, check out this publication. The end results of enabling direct data forwarding for different data transfer patterns, like one-to-one, one-to-N, and N-to-N, are shown here, and at the maximum we can improve the speedup up to, for example, 60 times. So enabling data forwarding in the pipeline is important. It can significantly reduce the end-to-end execution time of the whole machine learning pipeline, but you also lose the persistence of the intermediate data once the pod is down. So direct data forwarding is very good for stages that don't take a long time to compute or process but load a large amount of data.
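To make the direct-forwarding idea concrete, here is a minimal sketch in which one stage pushes its output to the next over a plain TCP connection. The addresses, port, and buffer size are hypothetical; this is not the Flocky implementation, only the general pattern and the buffer-size knob that has to be tuned.

```python
# Minimal sketch (hypothetical addresses/sizes): stream stage-1 output directly
# to stage 2 over a TCP socket instead of writing it to shared storage.
# BUFFER_SIZE is the knob that trades memory usage against transfer time.
import socket

BUFFER_SIZE = 4 * 1024 * 1024   # 4 MiB forwarding buffer (tunable)

def forward(src_file, next_stage_addr=("stage2-svc", 9000)):
    """Stage 1: read intermediate output and push it straight to the next stage."""
    with socket.create_connection(next_stage_addr) as conn, open(src_file, "rb") as f:
        while True:
            chunk = f.read(BUFFER_SIZE)
            if not chunk:
                break
            conn.sendall(chunk)

def receive(listen_port=9000):
    """Stage 2: consume the stream as it arrives, without touching shared storage."""
    with socket.create_server(("", listen_port)) as srv:
        conn, _ = srv.accept()
        with conn:
            while chunk := conn.recv(BUFFER_SIZE):
                process(chunk)       # hypothetical per-chunk processing

def process(chunk: bytes) -> None:
    pass  # placeholder for the stage-2 logic
```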
So now I will pass the slides to Shouwei, and he will introduce more of the open source tools available to solve these problems.

Thanks. Chen presented the results of the benchmark, and I will show a real solution for this kind of data locality problem, with open source software called Alluxio. First I want to give a short introduction: what is Alluxio? We are an open source project that started in the UC Berkeley AMPLab in 2014, and nowadays we have more than 1,200 contributors in our community. We're very active on Slack, so welcome to join us and solve problems together. So what is Alluxio? We are a virtualization layer between compute and storage, and at the same time we are also a caching layer that helps you deal with the data locality problem in any stack and any environment. Because today we're talking about locality, we will mostly focus on the caching capability of Alluxio. Chen mentioned a lot of different problems around bringing data locality into the Kubernetes environment, and I will try to answer most of them in my deep dive.

First of all, I want to give some basic background on the performance bottlenecks in machine learning data processing. In general, this is a very typical machine learning data processing pipeline: you load the data from the underlying storage, which can be cloud storage, on-premise storage, an object store, or even local disk; you load this data into CPU memory; then you do some preprocessing, like resizing your images and cleaning the data; and then you feed the data from CPU memory into GPU or TPU memory to do the training with the machine learning algorithm. There are several points here. First, when you load the data from the underlying storage into memory, you hit what's called a fetch stall; most of the time it's IO bound, because the storage cannot provide enough IO bandwidth into CPU memory. After that you see the prep stall, which is bound by CPU performance. And then, depending on your algorithm and the characteristics of your workload, the training itself tends to be bound by the accelerator. If you want to see more analysis, there is a very good reference paper from MSR that gives a lot more explanation.

This is also a very typical way people do machine learning training in their environments. On the left side you can have an object store, disk, or even a very traditional data warehouse with an HDFS cluster. On the right side, because Kubernetes gives you multi-tenancy and fine-grained control of resources, people run machine learning training in a Kubernetes environment. Previously, you would either access the data directly from the object store, or copy it, doing data replication from the storage system to the training cluster. That involves a lot of manual work, including the replication and the data validation, and you also have to guarantee data integrity throughout this process. It causes a lot of difficult problems, because no data scientist really wants to deal with this kind of dirty work in the data pipeline, and it really delays the delivery of the product.
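The manual replication step being described looks roughly like the following; the bucket, prefix, and destination path are hypothetical, and this is only meant to illustrate the baseline, not a recommended practice.

```python
# Sketch of the manual-replication baseline (hypothetical bucket/prefix/paths):
# copy the dataset from object storage to local disk before training starts.
# Every byte is downloaded again for every new training cluster, and the GPUs
# sit idle (or are not yet paid for productively) while this runs.
import os
import boto3

s3 = boto3.client("s3")
bucket, prefix, dest = "training-data", "ocr/images/", "/mnt/disk/dataset/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith("/"):          # skip directory markers
            continue
        local = os.path.join(dest, os.path.relpath(obj["Key"], prefix))
        os.makedirs(os.path.dirname(local), exist_ok=True)
        s3.download_file(bucket, obj["Key"], local)
```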
Another problem, as Chen mentioned before, is that if you directly access the underlying object storage through this kind of file system interface, the performance is very bad. It's not only the IO throughput between the storage system and the training cluster; the metadata operations, like listing and getting file status, are also very slow on these systems. Because there is no data locality guarantee in this kind of architecture, we have to bring it back; otherwise the GPU utilization stays very low, something like 10% or 20% in real production environments. That causes a lot of waste: a single GPU node can cost 30 to 60 dollars per hour, and if you only utilize one fifth of its resources, that's a big waste of money.

So we came up with the idea of bringing an intermediate layer into this architecture. First of all, we thought about what kind of interface we should provide to the computing frameworks, whether it's PyTorch, TensorFlow, or even MXNet. We did a lot of investigation and found that no matter which data loader you are using, it can always read from the local file system. When people do development, they usually don't reach for a data loader wired to cloud storage or to HDFS-style file systems; they always use the POSIX interface. So the first part of the answer is that we developed a POSIX interface with Alluxio FUSE to best serve these training workloads.

The second problem is how to bring data locality back into this architecture. First, we have cluster-level locality, shown on the left side, which means you can ignore where the under storage is: it can be cloud storage in the same region, cloud storage in a different region, or data accessed from the cloud into your own on-premise data center, even a geo-distributed data center with very high latency. That left part is what guarantees this level of locality. The second level is node-level locality, because even after you bring data back from remote storage into the cluster, there is still a big performance difference between accessing it locally and accessing it over a remote distributed file system. So the second part is that we co-locate the Alluxio FUSE pods with the machine learning training pods, to make sure we can cache the metadata on the Alluxio FUSE side and cache the data on the Alluxio client side, to serve the best performance for machine learning training.

Another thing is that because we are a caching layer, Alluxio itself never guarantees persistence of the data. When you write data back or access data, we cannot guarantee you won't lose it from the cache, because we always evict data when the caching space is insufficient. So underneath, we write the data back to the under storage, which can be cloud storage or an on-premise storage system, to guarantee that persistence, while you get the performance because the hot data is stored in the Alluxio cache space.
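To show what the POSIX interface means in practice for the training code, here is a minimal sketch; the mount path, dataset layout, and file format are assumptions. The point is only that a standard PyTorch DataLoader can read from the FUSE mount like any local directory.

```python
# Sketch (assumed mount path /mnt/alluxio and a flat directory of TIFF pages):
# because the cache is exposed through a POSIX/FUSE mount, a standard
# PyTorch DataLoader can read from it like any local directory.
import os
from PIL import Image
from torch.utils.data import Dataset, DataLoader

class PageDataset(Dataset):
    def __init__(self, root="/mnt/alluxio/ocr/images"):
        self.paths = sorted(
            os.path.join(root, f) for f in os.listdir(root) if f.endswith(".tif")
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # First access goes through the cache to the under store; repeated
        # epochs are served from the co-located worker / FUSE cache.
        with Image.open(self.paths[idx]) as img:
            return img.convert("L").tobytes()

loader = DataLoader(PageDataset(), batch_size=8, num_workers=4)
```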
The big problem here is that although we have this caching system, if you don't know the access pattern of your machine learning processing, the first access is always very slow: the first epoch can take forever to load the data from cloud storage or a remote object store. So we provide a way to do this kind of prefetch, which we call distributed load in our system. First, you issue the distributed load command from the Alluxio client to the Alluxio master; the master knows it needs this data from the under file system, so it loads the data from cloud storage into the Alluxio workers first. That overcomes the initial remote fetch from the remote data center or the cloud storage. Second, the Alluxio FUSE pod passively fetches the data: when the machine learning pod accesses it, the metadata is cached at the Alluxio FUSE pod, and the data is also cached there after the first access from the machine learning job. After that, we feed the data into the machine learning training pod, and it is received in CPU memory or GPU memory.

The last problem we want to answer: because a machine learning job itself has lifecycle characteristics, we want to see how we can manage the lifecycle of the data in a Kubernetes environment. At the beginning we used the sidecar mode to manage this data, but we later realized it's not the best way to do it, so we adopted CSI, the Container Storage Interface, to provide better lifecycle management in Kubernetes. The problem with the sidecar mode is that when the Alluxio FUSE side has some problem, you have to restart your application pod together with the Alluxio FUSE to restart the whole service. That's not friendly for your daily maintenance, and when you want to upgrade this service, the two are hard-bound together, so it's not very friendly to the data scientist. With CSI, when you launch the application pod, we also launch the Alluxio FUSE pod from the CSI component the first time, and we mount the PV into these two pods to provide the data on the local machine. If you then launch more application pods on the same host machine, the only thing you need to do is launch them with this persistent volume mounted into the container.
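To make the warm-up path concrete, here is a minimal client-side sketch, assuming the cache is exposed at a FUSE mount such as /mnt/alluxio. Reading every file once through the mount pulls it into the cache; Alluxio's distributed load achieves the equivalent on the worker side without needing a reader pod.

```python
# Client-side illustration only (assumed mount path): reading every file once
# through the FUSE mount warms the cache before training. The distributed load
# command described above does this on the worker side instead.
import os

def warm_cache(root="/mnt/alluxio/ocr/images", chunk=8 * 1024 * 1024):
    total = 0
    for dirpath, _, files in os.walk(root):
        for name in files:
            with open(os.path.join(dirpath, name), "rb") as f:
                while data := f.read(chunk):
                    total += len(data)
    print(f"warmed {total / 2**30:.2f} GiB")

warm_cache()
```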
Afterward, I want to share some results of what happens after you adopt this solution in a production environment. I won't say this is the best case for showing the performance, because we have seen much better results in some environments, but this is a very typical, average use case from a real production environment. As you can see, when you load the data directly from the object store, the GPU usage is very low; the maximum GPU utilization, once the data is fully cached in CPU memory, is about 75%, which means the job is then stuck at the prep stall, bound by CPU performance. But at the beginning, you can see that ramping the GPU utilization from 0% to 75% takes quite a long time, which means it takes a very long time to warm the data, and a lot of resources are wasted there. After you embed Alluxio into this architecture, the GPU utilization jumps directly from 0% to 75%, and the performance is very consistent in this environment. The second point is that previously you had to do the data replication to guarantee performance and warm the data before training, while the GPU cluster was already launched and you were already paying for it. Now you don't need to launch the GPU cluster to do the warm-up; you can just warm up the data first and launch the GPU cluster afterward. That's all from our talk, together with Chen. Are there any questions from the audience?

Great, thank you, Shouwei and Chen. Maybe one quick question and then we'll move on, because I know we're running a little bit behind. Yeah, go for it.

I'm curious, in your research, which was very thorough incidentally, have you looked at some of the Linux kernel features, like cachefilesd for example, that let you cache NFS or CIFS data locally, to see if something that's maybe less intrusive or less complex than the product displayed here still gives you good returns on your performance?

Great. Actually, while the next question is getting miked up, if people do have more questions let's do them. Anyone, hands up? Yeah.

How much work is necessary in enhancing the Kubernetes scheduler for data locality, or is a lot of this handled through the CSI and those items?

I'll also give a quick answer for this one. The CSI itself doesn't provide data-locality-aware scheduling, which means you can expose, say, the IP of the machine holding the data, but the computing framework also has to be aware of that and feed it back into its scheduling. Previously we did a lot of work to enable this, but the difficult part is that you have to push changes into the computing frameworks to make their schedulers aware of it. That's also another reason we want to use FUSE here: with the FUSE approach itself, we can guarantee the data locality through the long-running co-located process.

Great, any final questions? Perfect. Right, another round of applause then. Thank you.