Thank you, everyone, for coming. I know it's a pretty late session; I thought you'd all be gone by now, so thank you very much for joining our session at the end of the day. We're going to talk about data caching strategies for LLM training and serving with Alluxio. My name is Jasmine; I run the open source community and the DevRel team at our company. With me is my partner Lu; she's one of our machine learning engineers and also a member of our open source project community.

For those of you who don't know us, let me explain a little about Alluxio. We are an open source project that started at the UC Berkeley AMPLab back in 2014, so it's been about seven or eight years now. We have about 1,300 contributors in our community (the slide says 1,200) and that number is still growing, with more than 11,000 members on our Slack community. If you're interested, the Slack link is there as well. We were recognized by Google and the OpenSSF as one of the ten most critical Java-based open source projects (the project was written in Java back then and is still Java-based), and we were also named one of the most valuable repositories on GitHub. If you'd like to check out our community Slack, I'm usually on there. We also recently won an open source software award earlier this year, I believe one of the first of its kind.

So what do we do? I've laid out the architecture here. This is what we used to do in the hybrid cloud and multi-cloud era. At the bottom you see all the storage systems where your data can live: S3, Azure, Hadoop, MinIO, and so on. On top you see all the compute engines. Alluxio sits between the storage and the compute. We basically build a virtual distributed layer that virtualizes across all of the data sources to serve data to the applications, regardless of where your data lives and regardless of the storage system. That solution is applicable across environments, whether it's in the cloud or on-prem, bare metal or containerized.

Like I said, we started as an open source project, but we are also a product company, so the company has both an open source team and an enterprise team, and we offer both kinds of product. For example, Uber and Meta are using our open source software; they use the Alluxio local cache a lot to improve their performance. If you're interested, you can check out the RaptorX article from the Presto team at Facebook (now the Meta team), and there are also recently released engineering blogs on how they use the Alluxio local cache to build their AI data platform. Those are the kinds of companies we serve, and that includes both open source users and enterprise customers. We do have some differentiation between the open source and enterprise offerings; enterprise customers get different features and support.
That's handled by another team. All right, so now we're moving to AI. You might be wondering: what does Alluxio do now? We used to serve the hybrid cloud era; now, in the AI era, we see a lot of training over large volumes of data, learning from those data sets. That's where we come into this space, because at the end of the day, data management is under the spotlight as companies seek to outpace their competition. In the new AI era, training large language models requires data readiness: you have vast amounts of data, and storing, processing, and protecting it can be costly. So what do we mean by data readiness? It requires high scalability, high availability, and high performance, and those happen to be the things we can do. At the end of the day, the technology Alluxio builds is a large distributed cache. Data caching in this case can help boost performance, save cost, and prevent network congestion: fetch data once from remote storage for repeated data accesses, and offload the under storage from bursty, highly concurrent AI workloads that would otherwise put pressure on it. That's where we come in to help.

So that's today's agenda. Let me introduce my partner Lu; she's one of our machine learning engineers, working on LLM caching strategies.

Okay, thank you, Jasmine. Today I have three main agenda items. The first is our caching strategies based on the different traffic patterns we collected from our users, that is, their actual production data sets and traffic patterns, and how, based on those traffic patterns, we arrive at caching strategies that maximize performance. We also have the LLM caching strategies that we collected from production workloads. And beyond sharing our experience with you, we also want to discuss some future directions for integrating caching into the full AI and machine learning lifecycle.

So the first topic is traffic patterns and the corresponding caching strategies for AI and machine learning. We know that caching plays a crucial role in accelerating AI computation, but as we work with different users, we see really different traffic patterns, which lead to different performance-oriented cache strategies and recommendations.

First, what are the data access patterns in our users' production workloads? We found that there are mainly two scenarios. In the first scenario, the data is stored in large structured files, for example Arrow files. In this case we found much more positioned and random reading than sequential reading. For example, you may want to read certain columns of a whole file, or you may want to access certain blocks, and we found that the block accesses are almost evenly distributed and each read is quite small; it can be a four-kilobyte read in Arrow. And we know that for small positioned random reads, even a local disk doesn't perform that well. On the other side, we also found another pattern where the data is stored in many small semi-structured or unstructured files, which is pretty common in computer vision training. The file count can be very large, more than 10 billion files, and the file accesses are almost evenly distributed; for example, each training client wants to read a batch of files.
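To make these two access patterns a bit more concrete, here is a minimal sketch in Python; the file paths, read sizes, and counts are made-up values for illustration, not our users' actual workloads:

```python
import os
import random

# Pattern 1: small positioned / random reads inside one large structured file.
# Offsets are roughly evenly distributed and each read is only a few kilobytes.
def random_small_reads(path, num_reads=1000, read_size=4 * 1024):
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        for _ in range(num_reads):
            offset = random.randrange(0, max(1, file_size - read_size))
            f.seek(offset)
            _ = f.read(read_size)

# Pattern 2: many small files; each training client reads a batch of whole files.
def read_file_batch(paths):
    batch = []
    for p in paths:
        with open(p, "rb") as f:
            batch.append(f.read())  # each small file is read end to end
    return batch
```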
Based on the traffic patterns we just talked about, what cache strategies do we recommend? First we target performance. On the performance side, it's quite similar to what we do in computer architecture, where L1 and L2 caches boost performance. Similarly, on the training side we also use hierarchical caching. First we utilize the operating system's file buffer cache, which leverages memory and can serve the training data with the best performance. But we know that memory is a limited resource, especially in training; I've heard many people say they only have really limited CPU memory, and that becomes their bottleneck. On the other hand, many of our users actually have local NVMe, so by utilizing a local-disk page cache we can put pages, for example one-megabyte pages, onto the local disk to boost performance. And if the local disk is still not enough, a remote cache, a separate cache cluster close to training, can serve data much faster than remote storage while still having a much bigger capacity.

Another recommendation is about how we optimize positioned and random reads. One suggestion we give our users is to preload the data: instead of loading only part of the data, you can preload and cache the whole file. Another suggestion is to read data in chunks. For example, if we want to read 100 kilobytes of data from offset one megabyte, instead of reading only 100 kilobytes we can read a one-megabyte chunk. If the client then continues to read the file sequentially, this lowers the request count from about ten requests to just one for the following data. However, if the client only wants that 100 kilobytes and then jumps to offset two megabytes and reads the next 100 kilobytes, then in this case we actually have roughly ten times read amplification. So it's a trade-off between whether we want to reduce the request count or whether we care more about the read-amplification issue. (A small sketch of this chunk-aligned read idea is shown below.) That's why, when we work with our users, we usually want to know their actual traffic pattern, to see which caching strategy and read strategy is better.

On the other hand, because of the different access patterns, our users may not know at the beginning what capacity they want for their cache. So scalability and elasticity of the cache capacity are important: they should be able to add more cache or remove cache when they need to.
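Here is the small sketch of the chunk-aligned read idea from above; the one-megabyte chunk size, the tiny in-process chunk cache, and the helper name are just illustrative, following the numbers in the example:

```python
CHUNK_SIZE = 1024 * 1024  # 1 MB read chunk / cache page

# Tiny illustrative chunk cache: maps chunk-start offset -> chunk bytes.
_chunk_cache = {}

def read_range_chunked(f, offset, length, chunk_size=CHUNK_SIZE):
    """Serve an (offset, length) request from whole aligned chunks.

    Fetching whole chunks lowers the request count when the caller keeps
    reading sequentially, because later requests hit chunks we already
    fetched.  But if the caller only wanted `length` bytes and then jumps
    elsewhere, the extra bytes were fetched for nothing: that is the
    read-amplification side of the trade-off.
    """
    first = (offset // chunk_size) * chunk_size                 # first aligned chunk
    last = ((offset + length - 1) // chunk_size) * chunk_size   # last aligned chunk
    pieces = []
    for chunk_start in range(first, last + chunk_size, chunk_size):
        if chunk_start not in _chunk_cache:                     # one fetch per chunk
            f.seek(chunk_start)
            _chunk_cache[chunk_start] = f.read(chunk_size)
        pieces.append(_chunk_cache[chunk_start])
    data = b"".join(pieces)
    return data[offset - first: offset - first + length]

# Example from the talk: a 100 KB request at offset 1 MB fetches one 1 MB
# chunk.  Ten consecutive 100 KB requests all land in that chunk (one fetch
# instead of ten), while a lone 100 KB request followed by a jump to offset
# 2 MB pays roughly 10x read amplification for each request.
```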
There is also another pattern beyond the data access pattern. In the last five years we've seen growing demand for cloud strategies: many users are moving their machine learning infrastructure from on-prem clusters to cloud clusters, or adopting hybrid or multi-cloud environments to serve their machine learning infrastructure. Here we present the general idea of a hybrid or multi-cloud machine learning infrastructure. On one side is the offline training platform, on the other side is the online serving cluster, and in the middle is the storage system. On the training platform there is a unified caching layer responsible for fetching data from the cloud storage and serving it to the training cluster. After training is done, the model is written back to the storage, and on the online serving cluster there is also a caching layer to help fetch the model and quickly deploy it to the serving nodes.

Based on this hybrid-cloud pattern, what suggestions do we have for this scenario? The first is that, because it's a hybrid and multi-cloud environment, we want to be cloud-friendly: we want to be able to switch between different training platforms and also switch between different storage systems. We also want a configurable cache admission and eviction policy, so we can cache the data when we need it and evict it when we don't need it anymore. On the other side, a training job can take weeks, so any single point of failure in data serving can cause the whole training job to fail. Any caching strategy therefore needs to be able to fall back to the under storage. This means that even if some of the cache is unavailable, or the whole cache is offline, we are still able to serve data to the users from the under storage. And from a cost perspective, many users actually complain about storage cost. Many storage services like S3 basically charge you based on API calls, plus a data transfer cost, which is basically how much data your job reads from the cloud storage. A training job may need the same data set again and again, so if we go to the storage to fetch the data every time, it will be a high cost in both API calls and data transfer.

Here is the general idea of how we can integrate a cache with an AI training platform. We have an AI training node running some training jobs, with a cache client deployed on the training node, and in the middle we have a remote cache cluster with multiple cache workers serving the cache. When the training node wants to read a specific data set, it talks to the local cache client. If the data is in the local cache, it is returned directly. If there is no local cache hit, the client talks to the remote cache worker node to get the data. If the worker node has the data, that's great, it returns it; if the worker node does not have the data, it goes to the under storage to get it. That's the general idea of how caching can be integrated with AI training.
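To illustrate that read path, here is a minimal sketch in Python; the CacheClient class and the local_cache, remote_cache, and under_storage objects are hypothetical stand-ins for this talk, not Alluxio's actual client API:

```python
class CacheClient:
    """Illustrative read path: local cache -> remote cache cluster -> under storage."""

    def __init__(self, local_cache, remote_cache, under_storage):
        self.local_cache = local_cache      # cache on the training node itself
        self.remote_cache = remote_cache    # separate cache cluster close to training
        self.under_storage = under_storage  # S3 / HDFS / other persistent storage

    def read(self, key):
        data = self.local_cache.get(key)
        if data is not None:
            return data                     # fastest path: hit on the training node

        try:
            data = self.remote_cache.get(key)
        except ConnectionError:
            data = None                     # cache cluster unreachable: fall back

        if data is None:
            # Fall through to the under storage, so a cold or offline cache
            # never fails a (possibly weeks-long) training job.
            data = self.under_storage.read(key)
            try:
                self.remote_cache.put(key, data)
            except ConnectionError:
                pass                        # best effort; caching stays optional

        self.local_cache.put(key, data)
        return data
```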
We also have some evaluation results based on the patterns we just discussed. Remember the data access patterns from earlier: the first one is large structured files. For this case we got a production machine learning data set from our users and replayed their traffic pattern with different read strategies to see which one actually gives them better performance. We evaluated two read strategies: positioned read and streaming read. The blue bars are the positioned read latency and the red bars are the streaming read latency, and in this case the streaming read latency is much higher than the positioned read latency. So properly applied positioned reads outperform streaming reads when reading large machine learning data sets.

On the other hand, we also evaluated the small unstructured files. We again collected a production data set from our users; the original data set is pretty large, so we sampled about 10,000 files from it and replayed the traffic pattern. Here we see a pretty different pattern. The blue bars are still the positioned read latency and the red bars are the streaming read latency, and in this case the streaming read latency is actually slightly better than the positioned read latency. So basically, streaming reads outperform positioned reads when reading small unstructured files.

Previously we talked about caching strategies based on different traffic patterns and how a cache can be used in AI training. Now we want to talk about the actual caching strategies we collected from different users, like Microsoft, Shopee, and Zhihu. First, what challenges do they face in their large language model pipelines? The picture is basically this: many of our users actually have different clouds. They have their training cloud, they have their offline cloud (their on-prem cluster), and they also have their online cloud for serving. These clouds may be far away from each other, and they have different storage systems: some data is in object storage, some is still in traditional HDFS.

For the training cloud, they want to do model training with training data that lives in the object storage. At first they fetch the data directly from the object storage, and the problem they face is that GPU utilization during model training is lower than expected. They also have their own offline cloud, where they still run machine learning training on HDFS data sets, and in this case they found that their HDFS is pretty overloaded. Training typically reads data at dozens of QPS, and this puts really heavy pressure on the persistent storage, which is not designed for bursty training workloads. In this case, when some write job wants to write to HDFS, it may be blocked by those requests or see really high latency. And for the online cloud, where they do model serving: after the model is trained on the training cluster, it is written back to the persistent storage, and the model needs to be deployed to the model serving platform as quickly as possible; time really matters here. But consider deploying the same model to a large number of nodes: every node needs to read the model from the storage, and it is really easy to run into network congestion.

The strategy with distributed caching is to add a caching layer between the training, the model serving, and your storage system. The training data can then be fetched once from the under storage and provided to model training again and again. For example, your training may run a large number of epochs, a large number of iterations over the data set: you load it once from the object storage or HDFS, cache it in the caching layer close to your training cluster, and then serve it from the caching layer to the training cluster, which offloads the storage and reduces the API cost. On the other hand, it also improves the model serving and deployment phase: Alluxio loads the model from the storage system, keeps a replica of it in the caching layer, and then quickly deploys the model to the serving cluster.

So basically, with a distributed cache you get high-performance metadata and data access from the cache, and you reduce the pressure on the persistent storage and the network. The caching solution also needs to provide industry-standard access interfaces: a POSIX API that presents the cache as a local folder, an S3 API so you can access the cache just like an S3 data set, and an HDFS API so you can access the cache just like HDFS, so that applications can access data transparently without needing to modify their code.
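As a toy illustration of the POSIX-style access, here is a sketch of a PyTorch dataset that reads training files through a hypothetical local mount point such as /mnt/cache (the mount path and file layout are assumptions; the point is that, with a POSIX interface, the training code is just ordinary file I/O):

```python
import os
from torch.utils.data import Dataset

CACHE_MOUNT = "/mnt/cache/training-data"   # hypothetical POSIX mount of the cache

class CachedFileDataset(Dataset):
    """Reads raw sample files through the cache's local-folder interface.

    Because the cache is exposed as a normal directory, this is plain file
    I/O: no storage-specific SDK calls, and nothing changes when the under
    storage behind the cache is S3, HDFS, or something else.
    """

    def __init__(self, root=CACHE_MOUNT):
        self.paths = sorted(
            os.path.join(root, name) for name in os.listdir(root)
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        with open(self.paths[index], "rb") as f:
            return f.read()   # decoding / augmentation would happen here
```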
Here is an evaluation we did. Compared to training directly from storage, where data loading takes a really long time, about 82% of the total duration, and the GPU utilization rate is less than 20%, having a caching layer largely reduces the data loading time, which directly results in a higher GPU utilization rate.

And this one we collected from one of our users; it is basically the latency of how long it takes to deploy a model. On the left side of the chart is the deployment latency without a caching solution, when they fetch the model directly from the storage; in that case it normally takes about 15 minutes to deploy a model. The middle part is when they started onboarding the caching layer, and you can see the model deployment time is reduced from 15 minutes to around 3 minutes. And after some performance improvements that we made based on their traffic pattern and their model deployment needs, the time is reduced to less than one minute. Basically, as in our shared blog with NVIDIA, the benefits of GPU acceleration are limited if data access dominates the execution time.

Okay, we also want to share with you some of our future directions, the future items that we want to work on. This is the general idea. We have already talked about the role a data cache can play in the full AI and machine learning lifecycle, and we have discussed the data access patterns and how to integrate a cache with AI training and serving. But maybe we can go a little further: what if we integrate a unified data caching layer with the different components of the machine learning lifecycle? The components can be data processing, feature engineering, model training, and model inference. The caching layer can help fetch the data from the under storage and give the raw data to data pre-processing; the processed data can be written back to the caching layer, and the caching layer can synchronize the data with the storage you want to persist it to. The processed data can then be given to the feature engineering engine, so the features can also be written back to the unified layer. Similarly, model training can load the data and features from the caching layer and write the model back, and model inference can read the model and write back some of its results for future caching.
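Purely as a sketch of that future direction (the cache object, the path names, and the stage functions here are all hypothetical and passed in by the caller), the idea is that every stage reads its inputs from and writes its outputs back to the same cache-backed namespace:

```python
# Hypothetical sketch: every lifecycle stage reads from and writes back to
# the same cache-backed namespace, and the cache layer synchronizes those
# writes with the under storage behind it.

def run_pipeline(cache, preprocess, engineer_features, train, infer):
    raw = cache.read("raw/dataset")                    # fetched once from under storage
    cache.write("processed/dataset", preprocess(raw))  # persisted back through the cache

    features = engineer_features(cache.read("processed/dataset"))
    cache.write("features/dataset", features)

    model = train(cache.read("features/dataset"))
    cache.write("models/latest", model)

    # Inference reads the model through the same layer and writes results
    # back so they can be cached and reused later.
    results = infer(cache.read("models/latest"), cache.read("serving/requests"))
    cache.write("serving/results", results)
```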
So thank you for your attention and for joining us today to learn about caching strategies for AI and machine learning. If you have any further questions or would like to learn more about Alluxio, please visit our website or join our Slack channel. We'll be happy to answer any questions you have, provide more information, and talk about how Alluxio can actually help with your AI and machine learning workloads. So, does anyone have any questions?

Oh, thank you, I have a question. Compared to other caching systems, what benefits does Alluxio provide? For example, compared to Redis, which is a pretty popular open source one. Can you shed some light on that? Thank you.

Thank you. Yes, many users, when they first onboard, will try Redis as an in-memory cache, but Redis has the limitation of being a memory cache. For Alluxio, our main target scenario is actually memory plus disk, and mainly the local NVMe, because nowadays local NVMe can have a much bigger capacity while the performance is still good. Any other questions? That's it, thank you. It looks like this is the last session of the conference. I hope you enjoyed it. Thanks.