So today, we will be sharing our journey toward data locality on the cloud for machine learning and AI workloads. My name is Sean, and I'm a software engineer at Alluxio. My colleague Lu, who will present the second half of the deck, is a machine learning engineer at Alluxio, and she is also a PMC member of the Alluxio open source project.

Here's today's agenda. First, we'll talk about the benefits of data locality and some of the existing solutions. Then we'll introduce a new design along with its implementation, and finally we'll cover a production use case and the integration between Alluxio and Ray.

First, the advantages of bringing data locality to the cloud. There are two main reasons. The first is performance gain: compared to remote storage like S3, Azure, or GCS, you get faster access to the data, so less time is spent on data loading in data-intensive applications, especially machine learning and AI workloads. The second reason is cost saving: by reading from the cache, fewer API calls are made to cloud storage, and that includes both data and metadata API calls. And because of the performance gain, you get higher utilization of your GPUs; less GPU time also means less cost.

Now let's look at some of the existing solutions. There are mainly four. The first is to read data directly from remote storage on the fly. The second is to copy data from remote to local before training. The third is to use a local cache layer for data reuse. And the last is to use a distributed cache system. Let's go over them one by one.

First, if we always read data from remote storage, there is no data locality. This is the easiest setup, but every epoch of your training needs to read all the data from remote, and multiple epochs are always needed for better accuracy; nobody trains for only one epoch. Reading from remote can then take more time than the actual training. Here's a screenshot of a test we did, a PyTorch training run on a subset of ImageNet: 82% of the time is actually spent in the data loader instead of the actual training. By the way, the test data was on S3.

The second approach is to copy data to local storage before training. Now the data is local, so you get all the benefits of data locality, including cost savings and performance gains. But management is hard, because you must manually delete the training data after use since you have limited disk space. If you don't delete it, the next person, or even you yourself, has no place to put the next training dataset on the local disk. And because we are just staging data locally, the space is limited. Nowadays, as we all know, datasets can be huge, so only a partial or even small amount of the data can be stored locally, and you only get limited benefits of data locality.

Now, about a local cache layer for data reuse: some examples include the S3FS built-in local cache and the Alluxio FUSE SDK local cache. Reused data is local as well, because after the first read, that part of the data is cached locally. A local cache system also helps you with data management, so you don't have to do manual deletion or manually supervise that everything is cleaned up. But the same problem remains: cache space is limited.
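As a side note, here is a minimal sketch of what such a local cache layer looks like in code, using fsspec's filecache wrapper (fsspec comes up again later in this talk). This is not the S3FS or Alluxio FUSE cache itself, just the same reuse idea; the bucket path and cache directory are made up for illustration, and it assumes the s3fs package is installed.

```python
import fsspec

# First open downloads the object from S3 and keeps a copy under cache_storage;
# subsequent opens of the same object are served from the local copy instead of
# going back to remote storage.
with fsspec.open(
    "filecache::s3://my-bucket/train/part-000.parquet",   # hypothetical path
    mode="rb",
    s3={"anon": False},
    filecache={"cache_storage": "/tmp/fsspec-cache"},
) as f:
    data = f.read()
```

The limitation described above still applies: the cache lives on one node's local disk, so its capacity is bounded by that disk.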
Now, the fourth approach we mentioned earlier is to use a distributed cache system. This is what Alluxio 2.x looks like. We followed a traditional distributed system architecture with centralized masters; having multiple masters means we use a consensus algorithm to make sure there is high availability. With multiple workers distributed across different nodes, we can store much more cached data compared to a local cache, and the system also provides data management functionality.

However, we all know that on Kubernetes or in the cloud, nodes sometimes need maintenance, and there will be master failovers if anything happens. During that time, the master is not serving requests, so the master becomes a single point of failure. And with AI and machine learning workloads, we are seeing data get larger and larger; it's very common for a training job to involve billions of files, and all the metadata of these files is stored in the masters. That huge number of files makes the master the bottleneck for overall performance.

To summarize the challenges we face with all these existing solutions: first, local cache storage space is limited, and because data is growing fast, that is becoming a real problem. Second is reliability, because in the cloud and on Kubernetes, availability is key for every single service. Third is scalability: the number of files used for training is now huge, on the order of billions, and if metadata handling or some other part of your system becomes the bottleneck, making the system even slower than just reading remote storage, that is certainly not acceptable. The last point is data management: nobody wants any manual work.

Now let's talk about the new design. The new design completely removes the masters, which were responsible for storing all the file metadata. Instead, we use consistent hashing to cache both data and metadata on the worker nodes. The worker nodes have plenty of space for cache because we can scale them horizontally. Because there are no more masters, there is no more single point of failure and no more master bottleneck, and the system still provides data management. It looks great, except that consistent hashing may bring load imbalance: some workers can be very busy while others sit idle. This not only hurts overall performance but may also lead to outages, because one or several workers under very high load can bring down the whole system.

So we use a new soft-affinity caching solution. Instead of storing only one copy at the location determined by the hash ring, after placing the first copy we use that worker's information to hash again and find the next worker, and so on for the following workers; the number of replicas is configurable. Therefore, when a client comes for data and the first worker is under high load, or unavailable because of maintenance, the client can ask the second or the third worker. This avoids depending on a single worker that may be overloaded or unavailable.

Now let's look at the implementation. In Alluxio 3.x, we implemented this soft-affinity data cache scheduling algorithm and achieved much higher scalability: one worker can now support 30 to 50 million files without any performance degradation. We also have much higher availability, with 99.99% uptime and no single point of failure. And we have a Kubernetes operator for easier deployment and management.
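To make the placement idea concrete, here is a small, purely illustrative Python sketch of soft-affinity selection on a consistent-hash ring. It is not Alluxio's actual implementation; the class, hashing scheme, and parameters are invented for this example under the assumptions described above (hash again with the chosen worker's information, configurable number of replicas, fall back when a worker is busy).

```python
import hashlib
from bisect import bisect

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class SoftAffinityRing:
    """Toy consistent-hash ring with soft-affinity fallback workers."""

    def __init__(self, workers, virtual_nodes=1000, replicas=3):
        self.replicas = min(replicas, len(set(workers)))
        self.ring = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(virtual_nodes)
        )
        self.points = [p for p, _ in self.ring]

    def _lookup(self, key: str) -> str:
        idx = bisect(self.points, _hash(key)) % len(self.ring)
        return self.ring[idx][1]

    def preferred_workers(self, path: str):
        # Primary worker plus fallbacks, each derived by hashing again with
        # the previously chosen worker's information.
        chosen, key = [], path
        for _ in range(32):                      # bounded loop for safety
            if len(chosen) == self.replicas:
                break
            w = self._lookup(key)
            if w not in chosen:
                chosen.append(w)
            key = f"{w}:{path}"
        return chosen

    def pick(self, path: str, overloaded=frozenset()):
        # Client-side choice: take the first candidate that is not under high load.
        candidates = self.preferred_workers(path)
        for w in candidates:
            if w not in overloaded:
                return w
        return candidates[0]                     # everyone busy: keep the primary

ring = SoftAffinityRing(["worker-1", "worker-2", "worker-3", "worker-4"])
print(ring.pick("imagenet/train/n01440764/img_0001.JPEG", overloaded={"worker-2"}))
```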
We also newly implemented CSI with FUSE for training. Alluxio FUSE can turn a remote dataset into a local directory for training, so you can access data stored in Alluxio, or data stored remotely, just like a local directory. The CSI driver is responsible for launching a FUSE pod only when the data is needed, so the FUSE pod is not always sitting there taking up your CPU and memory resources.

With this, we have three layers of caching: the FUSE kernel cache, the FUSE local disk cache, and lastly the distributed cache. We can treat these like L1, L2, and L3 cache layers: the kernel cache is faster than the local cache, which is faster than the distributed cache, which is faster than remote storage. At the same time, the capacity grows in the other direction: the distributed cache stores more data than the local cache, which stores more than the kernel cache.

Now let's look at some benchmarks. Here we are testing one single worker with 48 threads reading files of 10 kilobytes each, with three data points; the total file counts are, I believe, 4 million, 24 million, and 48 million. You can see there is no performance degradation even as the number of files grows by roughly 10x.

We also did a data loading performance test. The first is CV training data loading: we use a subset of ImageNet and compare Alluxio FUSE against S3FS FUSE and direct reading from S3 with Python boto3. Alluxio FUSE shows much higher IOPS. For NLP training data loading, we use the Yelp academic dataset, and the three Alluxio APIs all show higher throughput than S3FS FUSE and AWS S3.
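To show what training against such a mount looks like, here is a minimal PyTorch sketch. The mount path is hypothetical (wherever the FUSE/CSI setup exposes the remote bucket), and the point is that the loader code is exactly what you would write for a plain local directory.

```python
import torch
from torchvision import datasets, transforms

# Hypothetical mount point: the FUSE/CSI setup exposes a remote dataset
# (e.g. an S3 bucket) as a local directory, so filesystem-based loaders
# work unchanged and repeated reads are served by the cache layers.
MOUNT = "/mnt/alluxio/imagenet/train"

dataset = datasets.ImageFolder(
    MOUNT,
    transform=transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
    ]),
)
loader = torch.utils.data.DataLoader(dataset, batch_size=256, num_workers=8)

for images, labels in loader:
    pass  # hand the batch to the training step
```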
Now we'll look at some real-world production use cases, and I'll hand it over to Lu.

OK, thank you, Sean, for introducing the design and implementation. Now I'm going to tell you a bit more about how we actually apply this in production, taking a large language model pipeline as an example. This was shared by one of our users who faced some problems in their model training and model inference. They mainly face three problems. They typically have multiple clouds where they run training with different frameworks, like PyTorch and Spark, and they have different storage systems: some datasets are stored in object storage, and some are managed by their own on-premises clusters, like an HDFS cluster. When training needs to load data from those storage systems, they run into two main problems.

First, every time the training job gets data from the storage system it goes through the network, so the farther the data is from the compute, the longer it takes to fetch it for your training job. Second, when training jobs repeatedly fetch data from cloud storage, they put the storage systems under really heavy load. Some object stores, like S3, have request limits and will return errors if you request too much, and a training job can easily generate something like 10,000 queries per second. It also puts on-prem clusters like HDFS under really heavy load, and that's when the storage team starts yelling at the training team for putting them at risk, especially for the write workloads: nobody wants the write workloads to fail, and nobody wants data to be lost.

So that's when they ask whether there is a solution that improves the GPU utilization rate while keeping the storage systems stable. On the inference cluster there is another problem. After your model is trained by your training framework and persisted to the storage system, you want the model to be deployed quickly to your model inference cluster. The quicker you deploy, the sooner you may see your business metrics improve; and if a model deployment doesn't work well, you want to quickly roll it back to the previous version, which keeps everybody happy. So those are the three main problems we see in the LLM pipeline.

One solution is to have a caching layer closer to your model training and model inference. By putting a caching layer between the training jobs and the storage systems, we bring the data and models closer to the training and inference jobs. Instead of going to the storage system to fetch the data on every iteration, you fetch it once, cache it locally, and then serve it to your training jobs. Because Alluxio can be co-located with the training job, the latency is much lower and the throughput can be much higher. Similarly, on the model inference side, we can use Alluxio to cache the different versions of the models, so Alluxio fetches a model from remote storage once and then serves it to the model inference cluster, and the deployment can be much quicker.

Sean already showed this earlier: when training directly with data from the storage system, we sometimes see that a lot of time is spent in the data loader, and data loading time and GPU utilization move in opposite directions; the more time spent in the data loader, the lower the GPU utilization. When we have a caching solution close to the training, we directly reduce the data loading time, which directly results in a higher GPU utilization rate.
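As a quick back-of-the-envelope illustration of that relationship: the 82% data-loader share is the figure from the ImageNet test Sean showed earlier, while the cached figure below is purely hypothetical.

```python
# Toy arithmetic: GPU utilization when data loading is not overlapped with compute.
data_load = 0.82                 # fraction of each step spent waiting on the data loader
compute = 1.0 - data_load        # fraction of each step doing actual GPU work

utilization = compute / (data_load + compute)
print(f"GPU utilization without caching ~ {utilization:.0%}")        # ~18%

# If caching cut the data-loader wait to, say, 5% of the original step time (hypothetical):
utilization_cached = compute / (0.05 + compute)
print(f"GPU utilization with caching ~ {utilization_cached:.0%}")    # ~78%
```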
This is also a chart shared by one of our users about how long it takes to deploy their model; the higher the number, the longer the deployment takes. On the left, the blue and green bars show how long it takes to deploy the model when there is no caching solution involved, where you always need to go to your storage system to fetch the model you need. The middle part is when they first onboarded a caching solution to cache the models; you can directly see that deployment takes only a fraction of the time. And the right part is after we did some code optimization on their workloads to get rid of unneeded calls and improve overall performance; after the optimization, deployment takes only about one tenth of the original model deployment time.

Lastly, we also want to talk about our new integration with Ray. Several other talks have already introduced what Ray is, and there is plenty of material online, so from our perspective I want to focus on the Ray training part. Ray uses a distributed scheduler to dispatch training jobs to available workers, which allows you to seamlessly scale training jobs horizontally across multiple nodes. It also provides streaming data execution for machine learning training, for parallel and distributed processing; I'll talk about Ray streaming next.

Ray does a really good job with its streaming execution. Instead of downloading the full dataset, then doing the full preprocessing, then doing the full training, it breaks the work into smaller pieces. While your GPUs are busy training on batch zero, your otherwise idle CPUs can do the preprocessing for batch one and the data loading for batch two. Once training on batch zero is finished, you can move directly to training on batch one, so you fully utilize both the GPU and CPU resources. On the other hand, especially with a large dataset, it may not even be possible to download the full dataset, preprocess it, and then train, because you may not have enough space on each node to store the full dataset; you want to overlap downloading the data with preprocessing and training to fully utilize the CPU and GPU resources. We also see some workloads that don't use the full dataset at all: they read different random subsets of the dataset, to the point that the maintainers of the datasets don't even know which parts of the data will be used. With Ray streaming, they only need to load the data that is actually needed.
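For concreteness, here is a minimal Ray Data sketch of that streaming pattern. The bucket path and column name are hypothetical; the point is only that the read, preprocess, and train stages are pipelined rather than each being run to completion over the whole dataset.

```python
import ray

ray.init()

# Hypothetical dataset; the same code works whether this points at S3 directly
# or at data exposed through a cache layer closer to the compute.
ds = ray.data.read_parquet("s3://my-bucket/training-data/")

def preprocess(batch):
    # CPU-side preparation; runs in parallel tasks while the GPU is training.
    batch["features"] = batch["features"].astype("float32") / 255.0
    return batch

ds = ds.map_batches(preprocess, batch_size=1024, batch_format="numpy")

# Streaming execution: reading batch N+2, preprocessing batch N+1, and training
# on batch N proceed concurrently instead of materializing the whole dataset.
for batch in ds.iter_torch_batches(batch_size=256):
    pass  # training step on `batch` goes here
```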
The story sounds great, but there are some problems; otherwise we wouldn't be here. We actually talked to many Ray users, because we think the idea is amazing and we sensed that we can help, especially on the data side. We talked to users in their Slack channel, searched the history of Slack posts, and spoke with people directly. What we found is that, while the story sounds great, you may still load the entire dataset again and again for each epoch. One really important case is when your memory is much smaller than the dataset you actually need. For example, you may have batch zero and batch one, but only enough memory to hold batch zero. Once you finish training on batch zero, the batch-zero data has to be evicted to make space for batch one, and when you run the next epoch you need to reload the batch-zero data from your storage system again. The larger that ratio, the more data you have to re-read from your storage system. Also, some of our users don't have just one Ray pipeline; they have multiple Ray pipelines, or they have teams working with PyTorch, TensorFlow, and other frameworks alongside Ray, but they all share the same data, especially the hottest data inside the company. How can we cache that hottest data for all of those jobs together? It becomes a problem. Some users also want their models stored in a kind of shared volume so they can be shared by all the Ray nodes. Basically, our users don't want to suffer a cold start every time; they don't want to re-download and re-process data just to train on those datasets. That's what brings Alluxio into the Ray ecosystem.

Ray does a really good job in the machine learning pipeline: it abstracts the model training and inference frameworks so you can run the different stages together and mix different machine learning frameworks. Alluxio sits between the model training and inference frameworks and the storage system. It helps fetch the data from remote storage, caches it locally, and provides high-performance data access for model training and inference.

Here is a simple benchmark we did using the nightly test that Ray provides; it's in the public Ray GitHub repo. We ran the test to compare Ray plus Alluxio against Ray plus direct S3 reads, and with Alluxio the performance improves noticeably. Note that this is with same-region S3; if your storage system is farther away, or if you have network congestion issues, the benefit can be even larger.

Another reason many users onboard Alluxio is not just performance but also reducing their storage cost. Every byte you transfer from your storage system to your training job, especially from cloud storage, costs you egress or data transfer fees, and when you are training on a large dataset those fees can be really significant. We also found that some operations are really metadata-heavy. For example, when training on the ImageNet dataset, each file is really small, only around 100 KB, so you can get the whole file with one call to your storage system. However, we see that even a single image read usually comes with two or three metadata calls: where is my file, is it a regular file or a directory, and only then the actual data read. So one image read costs two or three metadata calls plus one read call, which is really heavy. If those calls can be resolved by a caching system closer to the compute, it reduces latency and improves performance. Ray also delegates some of the data loading logic and the format translation logic to other projects like Apache Arrow and fsspec, the Python filesystem interface, and we are working on that part as well to see whether we can reduce the metadata cost safely, keeping the current semantics while improving performance.

I think that's it for our talk today. Feel free to leave feedback via the QR code; I'm not sure how I'll receive it, but if you have any questions, feel free to join our Slack channel. That's the easiest way to find all the engineers in our company. And if you run into issues, go to the Alluxio GitHub and create an issue there. Does anybody have any questions? Actually, there are more people here than I imagined. Okay, thank you. Thank you.