Hello and welcome to this learning session. Today I'm going to share how we use a decentralized file system to optimize your AI/ML workloads on Kubernetes. My name is Sean, and I'm a software engineer at Alluxio.

First we're going to talk about the benefits this file system brings, then the current solution we have, then a new design that we came up with and its implementation, and finally some benchmarks on the new implementation.

The most important advantage this file system brings is data locality, and there are two main benefits. The first is the performance gain: by bringing your data from remote storage to your compute node, you get faster access to your data than you would from remote storage like S3, so less time is spent in your data-intensive applications. It is also very cost saving, because fewer API calls are made to the cloud storage, for both data and metadata. Plus, because of the performance gain, you get higher utilization of your GPUs, which means less GPU time and faster training, and we all know GPUs cost a fortune.

Now, the existing solution we have, which is Alluxio 2.x, follows a traditional distributed-system design with centralized masters. I say masters, plural, because we want high availability, so there is an odd number of master nodes. But that high availability is only theoretical, because we all know nodes in Kubernetes get maintenance and there will be failovers on the masters, and during those times the system is still not serving data. So here, the masters are a single point of failure. Also, because there is so much data out there now, it's pretty typical to see a training job that needs billions of files, and that number of files makes the master the bottleneck of overall performance because of all the metadata operations. So we really need a system with better reliability, availability, and scalability, which is why we came up with this decentralized file system that uses consistent hashing for caching.

Here, we completely removed the master nodes, and instead of using a master to determine where the cache goes, we use consistent hashing on the client side to decide which worker the data and metadata go to. Because the masters are removed, there is no single point of failure and no more performance bottleneck on the master. And if any worker is down for any reason, the client talks directly to your data source.

In Alluxio 3.x, we implemented this consistent hashing algorithm for caching data, so now we have much higher scalability. One worker is able to support 30 to 50 million files, so with a 50-worker cluster we can easily cache billions of files. It also has higher availability: we can guarantee 99.99% uptime, and there is no single point of failure. We also have a cloud-native Kubernetes operator for deployment and management on Kubernetes.

We did an end-to-end CV training with PyTorch on a subset of ImageNet. By reading the data from Alluxio, we reduced the data loader's share of the training time from 82% to only 1%, and we increased the GPU utilization rate from 17% to 93%, which is a huge improvement. We will share more details about our journey with data locality on Thursday; we also have a talk at 4 p.m. Thank you so much.
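
To make the client-side placement mentioned above a bit more concrete, here is a minimal sketch of how a consistent hash ring can map a file path to a cache worker. The worker names, the virtual-node count, and the hash function are illustrative assumptions of mine, not the actual Alluxio 3.x implementation.

```python
# Minimal sketch of client-side cache placement with a consistent hash ring.
# Worker names, virtual-node count, and the hash choice are assumptions for
# illustration only.
import bisect
import hashlib


def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class ConsistentHashRing:
    def __init__(self, workers, virtual_nodes=1000):
        # Spread each worker over many virtual nodes so load stays even and
        # only a small fraction of keys move when workers join or leave.
        self._ring = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(virtual_nodes)
        )
        self._keys = [h for h, _ in self._ring]

    def worker_for(self, path: str) -> str:
        # Walk clockwise to the first virtual node at or after the hash of
        # the file path, wrapping around at the end of the ring.
        idx = bisect.bisect(self._keys, _hash(path)) % len(self._ring)
        return self._ring[idx][1]


# The client hashes the file path itself, so no master is consulted.
ring = ConsistentHashRing(["worker-0", "worker-1", "worker-2"])
print(ring.worker_for("s3://bucket/imagenet/train/n01440764/img_0001.JPEG"))
```

If the chosen worker is unreachable, the client can simply fall back to reading from the data source directly, which is the failure behavior described above.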
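
And for the training side, here is a hedged sketch of what reading the cached ImageNet subset can look like in PyTorch, assuming the cache is exposed to the pod as a local mount. The mount path "/mnt/alluxio-fuse/imagenet" is an assumed example, not a detail from the talk.

```python
# Hedged sketch: PyTorch reads ImageNet from a local mount backed by the
# cache instead of fetching from S3 directly. The mount path below is an
# assumed example path, not one taken from the talk.
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])

# ImageFolder expects the class-per-subdirectory layout used by ImageNet.
dataset = datasets.ImageFolder("/mnt/alluxio-fuse/imagenet/train", transform=transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=256, shuffle=True, num_workers=8)

for images, labels in loader:
    # The training step would go here; the point is that each read hits a
    # local cache worker chosen by consistent hashing, not remote storage.
    pass
```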