Hi there. Today we will share how we manage data for AI, especially in Kubernetes. We will introduce two solutions together: one is CSI FUSE, the one you may have already seen in the schedule, and the other is a newer one built around fsspec. The names may be a little confusing at the beginning, but I will go through them one by one.

First, let me introduce myself. I'm Lu. I am an open source PMC maintainer of Alluxio, previously named Tachyon, and also the AI platform tech lead at Alluxio. We will quickly go through the AI data management problems in Kubernetes, and then we will introduce the two solutions, CSI FUSE and fsspec.

So, quickly through the problems. Kubernetes hides most of the deployment difficulty, but data scientists still need to think about many problems beyond the AI logic itself, like: where is my data? Why is my data access so slow? Why does it cost so much money? And so on. So how can we take all that storage overhead away from data scientists and let them focus on the AI logic more than before?

Let's talk about the first solution, CSI FUSE. First, a little bit about what FUSE is. FUSE has a kind of magic ability, basically: it can turn your cloud storage into your local file system format. Imagine you open a folder on your local Mac. It may look like just a folder or file in your local file system, but it can be backed by cloud storage, like S3 or GCS. So it has much bigger capacity than your local disk, but you can use it just like your local disk. All the data scientists really like this kind of logic; it's wonderful.

But nothing comes without a cost. One cost that comes with FUSE is that it needs more resources: CPU, memory, and disk. It launches kernel threads to help deal with the operations, it uses some kernel cache, and it uses some disk for local caching for performance. All of that comes with a resource cost. And then our data scientists keep saying: I only need the FUSE mount for the dataset my AI training logic actually uses. If I don't need it, can you just not launch it? CSI makes this work really nicely: it launches the FUSE pods only when the dataset is needed. The FUSE pod deals with the storage logic while the application pod deals with the AI logic, and CSI keeps the two on the same lifecycle. So the FUSE pod will only be launched when your application needs the dataset.

Everything sounds perfect, but data scientists have another question: why is my data access so slow? As we all know, accessing cloud storage is not your local disk speed, but when data scientists are accessing local data, they expect something similar to local disk performance. So how can we deal with the slow performance? One solution is a local cache. Different FUSE solutions like s3fs-fuse have their built-in local cache, and others like gcsfuse also have a built-in local cache to solve this problem. But sometimes the AI application actually needs the data to be shuffled and shared between nodes, and this brings up the need for distributed caching. So on the left side is the training node, and on the right side is the storage system.
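To make the "cloud storage as a local folder" idea concrete, here is a minimal sketch. It assumes a bucket has already been FUSE-mounted at a hypothetical path /mnt/s3 (for example with s3fs-fuse); the mount point and file name are placeholders, not from the talk. The application code just uses ordinary local file I/O:

```python
import os

# Hypothetical mount point of an S3 bucket, e.g. created with
# `s3fs my-bucket /mnt/s3` (s3fs-fuse). Path is an assumption.
MOUNT = "/mnt/s3"

# List "objects" exactly like local files and directories.
for name in os.listdir(MOUNT):
    print(name)

# Read one object with plain file I/O; the FUSE daemon translates this
# into S3 GET requests behind the scenes (and can serve repeats from
# its local cache on disk).
with open(os.path.join(MOUNT, "train/part-00000.parquet"), "rb") as f:
    print(f.read(4))
```

This is exactly why data scientists like FUSE: no S3 SDK appears anywhere in the training code.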
And we kind of break the logic between the training node and the storage system into two parts: s3fs-fuse, which turns cloud storage into a local folder for your training nodes, and, in the middle, another distributed caching layer we can add in between to provide high-performance data access. In this case you can access the cloud storage data like local data, while also enjoying the performance benefit coming from the distributed caching system.

We ran some benchmarks on computer vision data loading, comparing Alluxio, s3fs-fuse, and the boto3 S3 API. Alluxio is five times faster than s3fs-fuse and more than ten times faster than using the boto3 S3 API directly. And sometimes our users show us a graph, like a TensorFlow profile, showing how much time is actually spent in data loading and how much the GPU is actually utilized. The more time spent in the data loader, the lower the GPU utilization rate. With the Alluxio FUSE solution, we reduced the data loader share from 82% to 1%, which directly improved the GPU utilization rate from 17% to 93%.

This sounds amazing, but everything comes at a cost, as I mentioned. There are two problems with FUSE on Kubernetes. The first is that you actually need a separate container for your FUSE pod, because it costs resources and its logic is really complicated. The second is that it requires the SYS_ADMIN security privilege. Many Kubernetes folks have told me: we don't want to grant you the SYS_ADMIN privilege, because we don't know what you would do in our cluster; we don't want you to ruin the host machine or anything else. So these two are real overhead for maintaining FUSE on Kubernetes.

Then we jump to the next solution, built around fsspec. First, what is fsspec? Have any of you heard about fsspec? Please raise your hand. Oh, there are some people here, which really surprises me. Has anybody heard about Arrow? Raise your hand please. Oh, many more hands, I can see that. Thank you.

So starting from Arrow: Arrow connects different data formats together. You can read from one data format in one line in Arrow and then write to another data format in one line. And on the other hand, Arrow connects different frameworks together: you can read data from one framework and write it to another framework with just several lines of code, which makes everything connect together. But when Arrow needs to deal with the storage system, it goes to fsspec. It basically delegates all the cloud storage operations to fsspec.

fsspec defines the file system interface for Python, and it kind of learned from two frameworks: one is Python's io, and the other is FUSE, which we just mentioned. But compared to Python io and FUSE, its interface is much easier, and it especially makes it easy for cloud storage vendors to implement. For example, if you want to implement a storage backend just for read, you only need to implement roughly three APIs: list status, get file status, and ranged read. Basically, that's it. So it's really simple to implement from an engineering perspective. There are also many fsspec implementations on the market already, like s3fs, the Azure filesystems, gcsfs, the HuggingFace filesystem, nearly every storage vendor you can think of. Compared to FUSE, it's easy to implement.
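To show how small that read-only surface is, here is a minimal sketch of a custom fsspec filesystem. The class and its in-memory dict are made up for illustration; its `ls` and `cat_file` methods correspond to the "list status" and "ranged read" calls just mentioned, and "get file status" (`info`) comes for free from `ls` via fsspec's base class:

```python
import fsspec

class ToyReadOnlyFileSystem(fsspec.AbstractFileSystem):
    """Hypothetical read-only backend over an in-memory dict of bytes."""
    protocol = "toy"

    _store = {"data/a.bin": b"hello", "data/b.bin": b"world"}

    def ls(self, path, detail=True, **kwargs):
        # "List status": enumerate entries under a path.
        path = self._strip_protocol(path).strip("/")
        entries = [
            {"name": k, "size": len(v), "type": "file"}
            for k, v in self._store.items()
            if path == "" or k.startswith(path)
        ]
        return entries if detail else [e["name"] for e in entries]

    def cat_file(self, path, start=None, end=None, **kwargs):
        # "Ranged read": fetch a byte range of one file.
        data = self._store[self._strip_protocol(path).strip("/")]
        return data[start:end]

fs = ToyReadOnlyFileSystem()
print(fs.ls("data", detail=False))      # list status
print(fs.info("data/a.bin"))            # get file status (derived from ls)
print(fs.cat_file("data/a.bin", 0, 4))  # ranged read -> b"hell"
```

Real implementations like s3fs and gcsfs follow this same shape, just with HTTP calls to the storage service instead of a dict.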
It's designed for cloud storage. It's easy to use. It does not require a separate container or the SYS_ADMIN permission. Talking about performance, though, it has an issue similar to FUSE: you need some kind of local cache to provide faster data access. And it is also not designed for AI workloads with data shuffling and sharing logic.

So we brought out our cloud-native distributed caching system, the new generation of our distributed cache built for Kubernetes. On the left side is fsspec, which is the client side. On the right side is the Alluxio system cache, which is the server side. For a read, we first check whether the data is local. If the data is already cached locally, great, we can get it directly from local. If not, we go to the distributed caching service. Then we need to know: I have many Alluxio workers that could have cached data for me; which worker is most likely to have the data cached, so that I can get my data faster? So there is some worker selection logic.

To be able to select a worker, we need some information about those workers, and that worker membership information is stored in etcd. The workers periodically say hi to etcd, and etcd provides that membership information to our clients periodically. Based on that worker information, we build a consistent hash ring, so we can know that, say, worker zero is most likely to have the cached data, and we go to worker zero to serve our request for a certain file. And if anything happens, if the data is not cached, if the read goes wrong, if the network has an issue, or if you actually want to do a write rather than a read, then all those operations go directly to the underlying fsspec implementation, like s3fs, gcsfs, or the HuggingFace filesystem. Basically, we add one layer of distributed caching logic on top of the existing UFS, like S3, GCS, et cetera. (A sketch of this selection-and-fallback logic appears at the end of this section.)

One question many people ask is: it's a distributed caching system, but why cloud native? Alluxio had a previous version, and when we ran it in Kubernetes clusters, many Kubernetes folks complained to us: we don't want something stateful. Why stateful? Can you do stateless? Our whole Kubernetes cluster does not allow any stateful data system. That kind of shocked us at first, because although it's a caching system, we thought of it as a data system, and data needs to sit there persistently. But looking at it the other way, it is a caching system: the data is allowed to be lost, just like your kernel cache. The data can be lost, and it can be re-added. So we developed the new system to be stateless by design.

Consider some of the workloads here: you have a Kubernetes cluster, and many services share the same cluster. Sometimes the training jobs get higher priority, and we want to give them more caching resources. But sometimes other jobs get higher priority, and we may want to kill the whole Alluxio system cache, or scale it down; or maybe several nodes are under maintenance, and we want to move the cache to other nodes.
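Here is the promised sketch of the client-side read path: pick a likely worker from a consistent hash ring built from etcd membership, and fall back to the underlying fsspec filesystem on any miss or error. The names (`HashRing`, `fetch_from_worker`) are illustrative, not Alluxio's real API:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent hash ring over the worker list reported by etcd."""

    def __init__(self, workers, vnodes=100):
        # Virtual nodes smooth the key distribution across workers.
        self._ring = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def pick(self, path: str) -> str:
        # The first ring point clockwise from the path's hash owns the key.
        i = bisect.bisect(self._keys, _hash(path)) % len(self._keys)
        return self._ring[i][1]

def read(path, ring, fetch_from_worker, ufs):
    """Try the cache worker chosen by the ring, else fall back to the UFS."""
    try:
        return fetch_from_worker(ring.pick(path), path)
    except Exception:
        return ufs.cat_file(path)  # the UFS is always the source of truth

ring = HashRing(["worker-0", "worker-1", "worker-2"])
print(ring.pick("s3://bucket/train/part-00000.parquet"))
```

Consistent hashing is also what makes the stateless trade-off discussed next tolerable: when a worker is added or removed, only the keys in its slice of the ring move, so only part of the cache is lost.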
In those cases, with this architecture, if you make changes to the Alluxio system cache, it temporarily affects the cache hit rate: the newly launched workers don't have the previously cached data, and because the cache assignment shifts a little, you get some cache misses. So it affects performance a little bit, but as time goes by, once all the data is re-cached, the performance catches up. That's a trade-off we leave to our users: they can decide whether to go stateful, so the data is more stable, or go stateless, which gives more control to the Kubernetes logic, like the scheduler and all the other resource management logic.

On the other hand, this architecture is highly fault tolerant and highly available. For anything that crashes, we basically fall back to the underlying file system; the underlying file system is always the source of truth. Even if the whole Alluxio system cache is killed by the Kubernetes cluster, you are still able to serve requests, because all operations fall back to the underlying file system.

Now let's talk about how to use this kind of architecture in the real world. One example is using Ray with Alluxio. Let me say a little bit about Ray first. How many of you know Ray? Please raise your hand. I can see some of the same folks raising their hands again, and some others too. So, Ray is designed for distributed computing. Ray uses a distributed scheduler to dispatch training jobs to available workers and enables seamless horizontal scaling of training jobs across multiple nodes. And it has a great property: it can run your data loading, data preprocessing, and training logic in parallel. For example, while you are training on partition two, you can be preprocessing partition one and loading partition zero, something like that. It does those things in parallel to fully utilize your GPU and CPU resources.

But there are also some performance and cost implications of Ray. For example, when you use Ray with PyTorch for AI training, in each epoch you need to load the whole dataset. The dataset is reused at the cluster level, but at each node level, for each epoch, you reload the whole dataset again, because Ray doesn't try to cache the data; it only materializes the data needed by the immediate task. Also, you cannot cache the hottest data shared among multiple training jobs, so basically there is no data-sharing logic, and you may suffer from a cold start every time. We went into the Ray Slack channel and did find users talking about their data reuse problems, especially when they are working with a large amount of data.

In the Ray ecosystem, Ray is the unified compute engine for machine learning pipeline orchestration, and it chains different frameworks together, like data preprocessing frameworks and training and inference frameworks such as PyTorch and TensorFlow. Originally it loads data directly from the remote storage. Alluxio adds a high-performance data access layer between the compute layer and the storage system to provide better performance for the compute jobs. And the usage is pretty easy: you create the Alluxio file system, then you read data using the original S3 URL, or the original UFS URL, and set your file system to go through Alluxio. So the usage is pretty simple, as the sketch below shows.
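A minimal sketch of that usage, assuming the alluxiofs Python package and an etcd endpoint for the Alluxio cluster. The hostnames and bucket path are placeholders, and the constructor arguments follow alluxiofs's published examples, so check them against the version you use:

```python
import fsspec
import ray
from alluxiofs import AlluxioFileSystem

# Register Alluxio as an fsspec implementation (assumed alluxiofs API).
fsspec.register_implementation("alluxiofs", AlluxioFileSystem, clobber=True)

# The client discovers cache workers through etcd; S3 is the underlying UFS
# it passes through to on a miss. Endpoint names are placeholders.
alluxio = fsspec.filesystem(
    "alluxiofs",
    etcd_hosts="etcd-host",
    target_protocol="s3",
)

# Same S3 URL as before; only the filesystem argument changes, so reads now
# go through the distributed cache and fall back to S3 when data isn't cached.
# Recent Ray Data versions accept fsspec filesystems here (wrapped via pyarrow).
ds = ray.data.read_parquet(
    "s3://my-bucket/train/",  # placeholder dataset path
    filesystem=alluxio,
)
print(ds.schema())
```

The point of the design is that the application keeps its original UFS URLs; swapping the filesystem object in and out is the only change needed to turn caching on or off.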
And by using this framework, especially with large Parquet files, we are able to achieve two to five times better performance compared to same-region S3. It not only improves performance but also reduces some of the data transfer costs, because the data is cached. Originally, large, highly concurrent AI workloads would hit your storage system directly, and the storage system might say: hey, look at your data access rate, you cannot do that much reading in that short a period of time, and it may just error you out. Or it may say: we can only give you a certain throughput, a certain latency. So that's one thing. The other thing is the data transfer cost, both in terms of the cloud storage vendor cost and the people managing the data transfer.

On the other hand, we found there are many redundant API calls, like the S3 operations for list status and get file status. For the storage system, we usually find that even if you just read a really small file, the read itself is probably really quick, but to be able to read that small file, it needs at least three metadata calls first. So consider reading something like ImageNet, with 1.3 million files: the read calls alone are over a million, and the metadata calls can be about four to five million. That is a huge request rate for the storage system; it may be under heavy load, or it may give you bad latency.

And that's it for our talk. I think we still have several minutes to answer questions. Feel free to scan the QR code if you have any follow-up questions. We have all our engineers on our Slack channel, and the slides and learning materials will be available online. Thanks so much. Any questions? Thank you.