My name is Jasmine, and I run the community and developer relations team at Alluxio. My email is on the slide, but in case you need it, it's jasmine@alluxio.com. And here's my partner Lou, the engineer who has been working on this project. She's a machine learning engineer as well as an Alluxio open source PMC member. Her email is lou@alluxio.com. We've uploaded these slides, so if you need to revisit them later, you can download them.

Here's the agenda we'll walk you through. We'll start with an introduction to the Alluxio open source project in rough detail, then elaborate a little on Alluxio as an access layer for analytics and AI. Then Lou will cover Dora, our next-generation architecture for AI and machine learning, show Dora's integration with different AI and machine learning use cases, dive a little deeper into the technical side, and then give some projections on upcoming and future work.

The big picture of what Alluxio open source is about: we are an open source project that started at UC Berkeley's AMPLab in 2014. Back then we were called the Tachyon project, for some of you who've been in data science and data engineering. For those of you who are not familiar with the AMPLab, that was also the lab that gave birth to Spark, Mesos, and Ray, now Anyscale the company. It's been a pretty productive lab. Back then a PhD student started the project called Tachyon, which slowly grew into this big project and company; when we commercialized, we did have to change the name due to some copyright issues.

Now, as a project, we have more than 1,200 contributors and still growing. On our Slack channel, we have more than 11,000 active people. We are a Java-based open source project, named one of the top 10 critical Java-based open source projects by OpenSSF, and one of GitHub's top 100 most valuable repositories. I put our Slack channel on the slide, so for those of you who are interested in checking out this open source project and discussing with other contributors or users, feel free to find us there. I'm there all the time, so get on there, send me a message, and don't hesitate to say hi.

All right, what is Alluxio? To give you an idea before Lou dives in: the top layer is a whole bunch of compute engines that you're familiar with, like Spark, Presto, Trino and other SQL engines, PyTorch, and TensorFlow. On the bottom layer are all the storage systems you may have, whether local, on-cloud, or on-prem. Alluxio sits in the middle: a virtualized layer between the compute and the storage. Our layer provides virtualization across all of your data sources, regardless of where they come from, and serves the data to the applications on the top layer. By sitting in the middle, our solution is applicable across all environments: bare metal, on-prem, on-cloud, hybrid, multi-cloud, or a mix. What we're best at is analytics and AI, so that's what we highlighted there. (A small sketch of this unified-access idea follows.)
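To make the "one layer between compute and storage" idea concrete, here is a minimal sketch. It assumes a hypothetical Alluxio FUSE mountpoint under which different backends (an S3 bucket, an HDFS cluster) appear as subdirectories; the paths and layout are illustrative assumptions, not details from the talk.

```python
import os

# Hypothetical unified namespace: each subdirectory is backed by a
# different under-storage system, but the application reads them all
# the same way through the Alluxio layer.
MOUNT = "/mnt/alluxio"  # illustrative FUSE mountpoint (assumption)
paths = [
    os.path.join(MOUNT, "s3-warehouse/events/part-0000.parquet"),    # backed by S3
    os.path.join(MOUNT, "hdfs-lake/features/day=2023-01-01/f.bin"),  # backed by HDFS
]

for p in paths:
    with open(p, "rb") as f:   # plain POSIX I/O, regardless of backend
        header = f.read(16)
    print(p, len(header))
```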
So that's what this data platform, Alluxio, does. All right. I'll give you a snapshot of the different kinds of industries and companies currently using Alluxio. These are a mix of both CE and EE users. CE is the Community Edition; that's free, the open source we offer. As a company, we also offer an EE version, the Enterprise Edition; we are enterprise software, a B2B business. There are feature differences, support differences, SLAs, et cetera, between the CE and EE versions. So this is a mix of both our community users and our commercial buyers.

On the internet companies, to give you an example: Uber and Meta use the Alluxio local cache, which helped them improve the performance of their SQL engines, especially Presto. RaptorX is one of the projects we collaborated on with Meta a few years ago, back when it was still named Facebook, and it made their Presto SQL up to 10 times faster. Uber also deployed the Alluxio local cache at large scale; I put a link to the Uber engineering blog on the slide, and if you're interested in speeding up your SQL engine, you can look into the Alluxio local cache and all of those engineering blogs will come up. Another case is Expedia. They use it in a slightly different way: they deployed Alluxio to federate cross-region data lakes in AWS, so they don't need to worry about data migration when building their machine learning data pipeline on top of their data lake infrastructure.

All right, let's go over the architecture diagram of Alluxio as an access layer. I'll draw your attention to the left side, the offline data platform and data warehouse. Imagine your training is co-located with your storage system, not decoupled. In this case, what Alluxio is good for is speeding up data access from your storage system for highly concurrent training workloads, increasing your GPU utilization rate and reducing your storage system's pressure and access cost.

Now look at the lower-left box, where training data flows to the offline training platform. When the training or inference cluster is not co-located with your storage system, or when your GPUs are limited (GPUs are a scarce resource, so most likely your data warehouse and training platform are not co-located), data locality becomes an issue. Alluxio is best at resolving that data locality issue. Before you deploy Alluxio, as an engineer you would need to manually migrate data as needed: copy it to local disk, or copy from one HDFS cluster to another. After you deploy Alluxio, as the lower-left box shows, Alluxio can access, cache, and manage your data on demand without manual migration. Your offline training and inference can just tap into Alluxio to get all the data they need. The same logic goes from the right box to the upper-left box, except instead of training data migration it's model migration for the online inference cluster. (A small sketch of this no-migration access pattern follows.)
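To illustrate the "no manual migration" point, here is a minimal sketch of a training-side dataset that reads samples straight through a hypothetical Alluxio FUSE mount instead of from a pre-copied local dataset. The mount path and file layout are assumptions, not from the talk; the idea is that Alluxio fetches and caches each file from the under storage on first access.

```python
import os
from torch.utils.data import Dataset, DataLoader

class AlluxioBackedDataset(Dataset):
    """Reads raw samples through the Alluxio mount; no copy step first."""
    def __init__(self, root="/mnt/alluxio/training-data"):  # illustrative path
        self.files = sorted(
            os.path.join(root, name) for name in os.listdir(root)
        )

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # First read is a cache miss (Alluxio pulls from the under storage);
        # later epochs are served from the Alluxio cache.
        with open(self.files[idx], "rb") as f:
            return f.read()

loader = DataLoader(AlluxioBackedDataset(), batch_size=32, num_workers=8)
```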
So basically the gist of the idea, looking from the right box to the left two: we help with migration and we speed things up, because most of the time your GPU resources are limited and most likely your data warehouse and training platform are not co-located. I went over the problems we solve earlier, but in case you want a snapshot of what we do: compared to data migration, Alluxio as a unified access layer reduces the engineering overhead of customizing that data migration, so you eliminate the manual work. We also eliminate the process of manually deleting outdated data in your persistent layer, and we reduce the high pressure on the data nodes and the data throttling caused by the underlying storage. You also don't need to worry about the difference between your online platform and your offline platform: you basically just access Alluxio as a unified API. That's what we provide; in the commercial version it's called a transparent URI, and you just tap into it.

And compared to direct access, some people ask: what if I just access the storage directly? The problem there is GPU utilization. To give you an example, ResNet-50 will get about 20% to 30% GPU utilization with direct access. With Alluxio you will get about 80% to 90% GPU utilization: we overcome the I/O bottleneck and reduce the I/O waiting for CPU and GPU, and thus we increase utilization. We also save money by limiting API calls, and each one costs quite a bit, especially in machine learning. At machine learning scale you're talking about tens of thousands of API calls; in a single machine learning training run your API calls can easily run into the thousands or tens of thousands. So that's accessing through Alluxio compared to direct cloud access.

All right. This slide gives examples of companies already using Alluxio. In Microsoft's case, they use it for training to enable the high GPU utilization we talked about in cloud training. Shopee is a similar case: GPU clusters without data migration. There are other big tech companies whose names we cannot mention because they are customers; we're seeing a 250% performance gain for inference on their online platforms. And some other companies use Alluxio for Spark machine learning pre-processing: they have the data in the cloud and GPUs on-prem, so the Spark pre-processing goes through Alluxio and then loads back to the cloud to continue the training journey. Our customers do customize to make the best use of it.

All right, I'm going to pass the torch to Lou, the engineer working on the Dora project.

Okay, hey everyone. Thanks, Jasmine, for introducing Alluxio's data access layer. I'm going to introduce Dora. Dora is Alluxio's next-generation architecture for AI and machine learning; the name stands for our decentralized architecture. Why did we come up with the Dora architecture? Basically, AI and machine learning are pushing their limits, and our users are pushing our limits, so to fulfill their requirements we came up with a next-generation architecture. So, the motivations and benefits.
Our users already use Alluxio for training with two billion files in a single training job, and they tell us: okay, I want 10 billion files in the near term. User demand keeps increasing, especially with this fierce competition: all the giant tech companies are competing to build giant models like GPT-4, and they want to join the game and want their models to be as accurate as possible. So they want to feed more and more data into their models, and they keep asking the infrastructure to support that. Basically, they want unlimited scalability.

For the reliability part, our users keep requiring us to provide a really reliable data infrastructure. One really concrete example: a user directly asked us to provide 99.99% availability, which allows only about 50 minutes of downtime per year, while even the top storage system vendors only provide something like 99.9%, which allows around 40 minutes of downtime per month in their SLAs. They told us that at the application level they don't want to care about data errors: you handle the data errors yourselves. That gave Dora a strong motivation to build something with no single point of failure that is highly fault tolerant.

Performance-wise, different applications have different performance requirements, but basically they want the data I/O to not be the limit on GPU utilization. As a result, Dora implements all kinds of performance optimizations, from zero-copy network transmission to support for highly concurrent reads in training workloads. Our users also want some fancy features for data governance: they want to use Alluxio with multiple tenants, manage how users consume Alluxio and its resources, have quota management, and bring their own specialized security models. Many of our users don't use a generalized security model; they have their own customized ones, and we need to be able to support all the different kinds, not just a certain popular one. That's the motivation for us to build a whole new architecture.

To understand how Dora provides high scalability and high availability, I will first introduce the whole architecture. Dora sits in the middle, between the AI/machine learning applications on the left and the under storage on the right. It fetches data from the under storage and caches it in places closer to our AI/machine learning applications. The middle part, the Alluxio client and the Alluxio cluster, is the Dora architecture, and it has three major components. The Alluxio workers are responsible for executing the actual data tasks: they fetch data from the under storage, cache it locally, and serve it to the client. The service registry manages all the Alluxio worker information. And the Alluxio client gets the cluster information from the service registry periodically and the task information from the AI/machine learning application. It has its own algorithm, based on an affinity block location policy and consistent hashing, that uses the cluster information and task information to calculate which worker should execute a given task. (A sketch of the idea follows.)
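As a rough illustration of the client-side worker selection just described, here is a minimal consistent-hashing sketch. It assumes each worker is placed at multiple points on a hash ring and a file path is routed to the next worker clockwise; the hashing details and names are illustrative, not Alluxio's actual implementation.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps file paths to workers; adding or removing a worker only
    remaps a small fraction of paths, so no central master is needed."""
    def __init__(self, workers, vnodes=100):
        self.ring = []                        # sorted (hash, worker) points
        for w in workers:
            for i in range(vnodes):           # virtual nodes smooth the load
                self.ring.append((self._hash(f"{w}#{i}"), w))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def worker_for(self, path):
        # First ring point at or after the path's hash (wrapping around).
        i = bisect.bisect(self.keys, self._hash(path)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["worker-1", "worker-2", "worker-3"])
print(ring.worker_for("/datasets/imagenet/train/part-00042"))
```

Because every client computes the same mapping from the same membership list, each client can go straight to the owning worker with no lookup service on the read path.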
With the worker determined, the client goes directly to that worker without contacting anything like a master: it calculates the worker, goes to it, and asks that worker for the data. So in this whole structure there is no single point of failure, and it's highly scalable. I will talk about more details in later slides.

This architecture supports tens of billions of files. Its high scalability comes from two highly scalable components. The first one is the service registry; remember from the architecture before, it keeps track of all the worker information. The service registry is not on the critical I/O path, which means that when you read or write data, the client does not need to contact the service registry, so it will not be the I/O bottleneck. We also utilize highly scalable services for it: for example etcd, as used by Kubernetes, which supports tens of thousands of Alluxio workers, and with fine tuning it can support even more. On the other hand, the Alluxio worker side is designed for easy addition and removal of nodes. It's isolated and decentralized, so you can simply add more nodes when needed, and each worker node supports tens of millions of files. With the two scalable components together, we are able to support tens of billions of files.

This architecture also comes with no single point of failure. For the service registry, we use highly available services, which have mechanisms like consensus algorithms and data replication to ensure the service is highly available with no single point of failure. On the Alluxio worker side, killing any of the Alluxio workers has limited impact: we fall back to the under storage or relocate the job to other Alluxio workers, so even if anything happens unexpectedly, the whole system keeps functioning. Users may experience a little delay because the cache hit ratio drops a bit, but the whole system is still functioning and their I/O requests are still fulfilled.

On the other hand, Alluxio has extensive experience in the AI/machine learning industry, and many company users work closely with us, providing us with really valuable information like data patterns and traffic patterns, especially read traffic patterns. Analyzing those patterns, we came up with our performance optimization approach. On the traffic patterns: there are some similarities between analytics and AI workloads, like sequential streaming reads. But for AI workloads, we see much higher concurrency of positioned reads: the reads have a much higher degree of concurrency and are much more random. For example, AI/machine learning deals with much more unstructured data, and for structured data like a table, a job may only want to read one column and may distribute the reads for that column across multiple threads, which from the storage side looks like a pretty random traffic pattern. That's why we need good support for highly concurrent positioned reads. And because a job may be reading only one small cell from the whole table, each read can be quite small. (A minimal positioned-read sketch follows.)
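To show what a highly concurrent positioned read looks like from the application side, here is a minimal sketch using os.pread from several threads, each fetching a small byte range at a known offset without any seek or shared file-position state. The file name, offsets, and chunk size are made up for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor

PATH = "/mnt/alluxio/tables/events.parquet"   # illustrative path (assumption)
CHUNK = 4096                                  # small 4 KB column chunks

def read_range(offset):
    fd = os.open(PATH, os.O_RDONLY)
    try:
        # pread reads exactly at `offset`; threads don't contend on a
        # shared file position, so the reads run fully in parallel.
        return os.pread(fd, CHUNK, offset)
    finally:
        os.close(fd)

offsets = [i * 1_000_000 for i in range(64)]  # scattered, random-looking
with ThreadPoolExecutor(max_workers=16) as pool:
    chunks = list(pool.map(read_range, offsets))
print(sum(len(c) for c in chunks), "bytes read")
```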
For Arrow files, one user-provided traffic pattern showed reads as small as 4 KB. Based on the read patterns and data patterns provided by our users, we were able to come up with two optimizations: one for sequential streaming reads and the other for highly concurrent positioned reads. For the streaming read part, we developed zero-copy data transmission. Basically, it reduces data copies as much as possible and improves memory efficiency so that reads have higher throughput. In fact, this optimization has been shown to improve large-file sequential read performance by 34% to 50%. On the other hand, for concurrent reads, we implemented highly concurrent positioned reads throughout the whole I/O path, with many smart read optimizations. With this approach we can solve the up-to-100x read amplification issues you hit with traditional sequential reads, and we improved unstructured file parallel reads by up to nine times and structured file positioned reads by two to 15 times.

As we talked about earlier, Alluxio provides the caching ability to speed up data access from cloud storage for different applications. But Dora also enables us to connect the different machine learning and AI stages together and move data from one kind of application and data platform to another: from data pre-processing and feature engineering to model training and model inference. The reason Alluxio can connect different applications together is its easy-to-use data access API. It's compatible with the mainstream interfaces: the POSIX, S3, and HDFS APIs. So data scientists and data engineers can use their familiar techniques, for example accessing their local file system or accessing data in S3 or HDFS, to access Alluxio. Alluxio also supports connecting to different storage vendors, so when you're changing from one storage vendor to another, you also do not need to modify your AI applications or training scripts.

One more thing about the Dora architecture: you can actually use the Dora client itself as a local cache solution, a standalone solution. We call it the Alluxio FUSE SDK. It's a POSIX local cache for remote data: it allows you to access remote data sets, an S3 data set, an HDFS data set, et cetera, just like local data sets in your local file system. It also has built-in local cache management, so it takes the data from your data source, caches it locally, manages data eviction, and only caches the hot data needed by the AI/machine learning applications. It also benefits from many of the Dora optimizations I just mentioned: zero-copy to reduce data copies and improve throughput, and highly concurrent positioned reads. Compared to S3FS-FUSE, the most popular S3 POSIX solution that lets you use S3 data as local data, in training cases with millions or billions of small files we can achieve two to 20 times I/O throughput improvement and up to nine times metadata latency reduction. (A usage sketch follows.)
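As a usage sketch of the FUSE SDK idea: assuming a hypothetical mountpoint where an S3 bucket has already been mounted via the Alluxio FUSE client (the mount step and paths are illustrative assumptions, not exact commands from the talk), remote objects are then read with ordinary file operations.

```python
import os

# Assumed: an S3 bucket has already been mounted at this path via the
# Alluxio FUSE client, so its objects appear as ordinary local files.
MOUNT = "/mnt/alluxio-fuse"                   # illustrative mountpoint

# List the "directory" (really S3 prefixes/objects) with normal POSIX calls.
for name in sorted(os.listdir(MOUNT))[:5]:
    path = os.path.join(MOUNT, name)
    size = os.stat(path).st_size              # metadata served by the SDK
    with open(path, "rb") as f:               # data cached locally on read
        first = f.read(4096)
    print(f"{name}: {size} bytes, first chunk {len(first)} bytes")
```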
We always try to expand Dora's use cases across the different AI/machine learning stages and leverage its architecture, scalability, availability, and advanced features. One collaboration we have going with a giant tech corporation is to see whether we can use Alluxio to cache large language model responses. In case you're not familiar with large language models, ChatGPT, for example, is one. We are working with these large tech corporations to see whether we can reduce the response latency and computational costs: actually speed up the large language model responses while reducing GPU usage and latency.

Okay, that's it for our talk. Do you want to talk about the QR code? Okay, yeah; since we didn't get it running earlier: there is a QR code for a community survey, so if you're interested in getting in touch with us, just scan that and fill in your name and email. I also have our Slack channel there; that's usually where we're at. And at the bottom is our GitHub repo, if you're interested in checking it out. Once you're done with the QR code, I'll flash back to the slide with our two emails, in case you're interested in reaching out. All right, and then we're open for questions. We'll change to that slide; I moved too fast. Anybody have any questions? Yeah.

I'm curious about the source of the optimization compared to S3FS-FUSE. You're also using a FUSE layer, so the inefficiencies aren't from FUSE; I'm assuming it's coming from the cache. Is that the main source of the improvement?

For this part, we did a lot of investigation. There are three parts in this workload: the FUSE part, the caching part, and the fetching part. We didn't initially set out to compare with S3FS-FUSE; we built this architecture and then looked at whether we could do some comparison. We do have a different cache mechanism: we use a page cache, storing the file as pages, with bucketing and other mechanisms, whereas S3FS-FUSE caches it as a single file. That gives them inefficiency, especially when you're reading large files. The reason I didn't show large-file comparisons is that those are usually bottlenecked by your NVMe or disk performance. For small files, we did find they have some inefficiency, but I'm not quite sure exactly where S3FS-FUSE's inefficiency comes from. Maybe it's the metadata part: for getFileStatus I did find there seems to be some synchronization mechanism that makes it much slower. We mainly target the training part, and training is highly concurrent; that's why we put a lot of effort into improving concurrency and reducing synchronization while maintaining correctness.

Thank you. I wanted to give somebody else a chance, but I've got lots of questions. Would you say that you're optimized for read-only workloads, or can you write results back as well, so the same data set can be shared by other consumers?

So your question is whether we support writes, and whether you can write and then immediately read.
Right. For example, if the efficiency partly comes from metadata, from you caching metadata, then maybe the backend storage doesn't have the correct metadata immediately available for other clients.

For any storage system, especially a caching system, a big issue we need to solve is that the under storage has the source of truth, your metadata and your data, and we have to be able to synchronize with it. In the Dora architecture, one good thing is that the application can provide us with some information: basically the file name and also the timestamp of when the file was last modified. Our caching system will then check whether the cached copy is the version of the data you want; if it's not, we will invalidate it and re-cache it from the under storage system. For writes, we write directly to the under storage. You can optionally write to the cache as well; for the caching part, we also write a version with it, so that on read, if the application already knows the version of the data it's going to use, it can check against that version.

Okay, so it's write-through for writes, and cached with invalidation for reads. Yeah, yeah.
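As a minimal sketch of the invalidation logic described in that answer, here is a toy write-through cache keyed by path, validating cached entries against the under-storage modification time. The class and method names are invented for illustration and are not Alluxio's API.

```python
class WriteThroughCache:
    """Toy model: the under storage is the source of truth; cache entries
    carry the mtime they were fetched at and are invalidated on mismatch."""
    def __init__(self, under_storage):
        self.us = under_storage          # object with read/write/mtime
        self.entries = {}                # path -> (mtime, data)

    def read(self, path):
        mtime = self.us.mtime(path)      # cheap metadata check
        cached = self.entries.get(path)
        if cached and cached[0] == mtime:
            return cached[1]             # cache hit, still valid
        data = self.us.read(path)        # stale or missing: re-fetch
        self.entries[path] = (mtime, data)
        return data

    def write(self, path, data):
        self.us.write(path, data)        # write-through: under storage first
        self.entries[path] = (self.us.mtime(path), data)

class DictStorage:
    """Stand-in under storage for the example, with a logical clock."""
    def __init__(self):
        self.blobs, self.times, self._clock = {}, {}, 0
    def read(self, p):
        return self.blobs[p]
    def mtime(self, p):
        return self.times[p]
    def write(self, p, d):
        self._clock += 1
        self.blobs[p] = d
        self.times[p] = self._clock

us = DictStorage()
cache = WriteThroughCache(us)
cache.write("/models/v1.bin", b"weights")
assert cache.read("/models/v1.bin") == b"weights"
us.write("/models/v1.bin", b"new-weights")             # out-of-band update
assert cache.read("/models/v1.bin") == b"new-weights"  # mtime mismatch -> re-fetch
```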