Good morning, everyone. Let me start our talk; we are the first talk this morning. I'm Jeongkyu Shin from South Korea, working at Lablup, and Joongi came from the same country, South Korea, and works as CTO at Lablup. Thank you. We will introduce Sokovan. It's a very interesting implementation of a completely new container orchestrator, especially for accelerated machine learning and AI. So let me start. We will explain the problem and our approaches to it. After that, we will introduce Sokovan and explain some of its characteristics, and then we will explain some practical cases and show a short demonstration. So let's start.

You know the Sokoban game: you just push containers to the appropriate locations. Machine learning and AI workloads are getting bigger and bigger, and the pace is very fast. But containers have some problems with these workloads nowadays. High-performance computing, in contrast, has very different characteristics from machine learning workloads, and it has a very high barrier to using containers. And even though containers make it very easy to build your own computation environment, containers are actually not intended for long-running jobs or batch workloads, especially with current orchestrators.

We tried to solve this problem in 2015, maybe eight years ago, because we were getting addicted to containers, and there was Docker. We tried to build a research platform on Docker. At the time, Slurm was widely used, Docker had come to the public only a couple of years earlier, and there was no Kubernetes; it was just Google's Borg. And of course, there were no NVIDIA container runtimes or GPU-specific container systems yet. So we tried to combine the pros of Slurm and of Kubernetes (at the time, just Borg) to solve the problem with the good points of each system.

When we started our project, we first tried to build a plugin architecture or modifications on top of Kubernetes; at the time that looked better. But we faced a lot of problems, because Kubernetes is a really great tool, but it was not intended to run AI workloads, especially batch workloads. Nowadays it's getting better and better, but eight years ago there was no way to use Kubernetes for long batch jobs and long-running computation tasks. So we thought: let's just build a new container system. With a new container system, we could make batch jobs and interactive jobs run well together. We could abstract the hardware behind placement APIs, and we could adopt many new technologies into our system. Our new container system could be super fast, because we could build it ourselves with many tweaks, and super customizable, because we would write it in Python. The trade-off is that the container ecosystem is quite large and there are many, many improvements happening in it; by building our own, we could not directly adopt the progress coming from the Kubernetes ecosystem. But anyway, we started to build our own system eight years ago. We named it Sokovan: Sokoban is a game where you push containers into the right positions. And the biggest difference between our Sokovan and Kubernetes is that there is no pod. We don't have any kind of pod abstraction in our system, so it's very flexible.
Every node is just a container node, or a VM, usually an OpenStack node. We can run any kind of container runtime on the system; it can be Docker or containerd or others, for example through OpenStack Magnum. We also apply a customized scheduler to specific computation nodes, so we can choose among many scheduling patterns. We focus on multi-tenancy, because we face a lot of GPUs and nodes, and on GPU acceleration, and we combine a two-level scheduler: a cluster-level scheduler for the whole system and an internal per-node scheduler. We also built an agent server system to manage the container management software, such as Docker or containerd, and even Kubernetes itself. Finally, we made a very straightforward integration for many kinds of hardware, including GPUs, NPUs, and many accelerated network systems.

We have been open source since 2017 and we run a monorepo. Our system runs on the x86, Arm, and RISC-V architectures, on Linux, Windows, and macOS, and on bare metal or OpenStack, with Docker or Podman and many kinds of container-based systems; nowadays we even support direct management of OpenStack VMs. Our system is written in Python, the newest Python version, and we use PostgreSQL, Redis, and etcd. We started our project in 2015, we open-sourced it in 2017, and we announced that our system was OpenStack-ready in 2018. Sokovan is part of the backend of Backend.AI, the open-source AI and machine learning platform; it is the core component of Backend.AI, as we explain today. Now it operates on many AI clusters, and you are probably already using some results from Sokovan: for example, if you are using Samsung smartphones or some LG appliances such as washers, you are already using AI models trained on our systems. Nowadays we are running about 10,000 enterprise GPUs around the world, and the number is increasing; it does not even count the GPUs of open-source deployments.

Our component architecture is quite simple. Sokovan consists of the manager and related server components, storage-specific proxies, and agents. The manager receives requests from users and distributes them to the agents; each agent drives its own container engine and runs containers or VMs. You can start right now by using the developer setup; it's a kind of DevStack for Sokovan. Just like installing DevStack on your system, you can run a single script to install the whole system on your computer. For a production-level setup, you can just use the Python packages. Now I will toss my mic to Joongi for the characteristics.

So let me introduce a bit more detail about our Sokovan orchestration. Actually, we were given a 15-minute talk, but the next talk will begin at 10 a.m., so we have a couple more minutes, and I think I'm going to explain more details. We have many key features embedded inside the Sokovan orchestrator, which are tailored for HPC- and AI-oriented GPU-accelerated workloads, as well as the latest NPU- or ASIC-based workloads. We have many kinds of abstractions for resource groups, accelerators, storage backends, and so on, but in this talk I'm going to focus on the scheduler-related parts. So let me go through here.
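[Editor's note] As context for the scheduler discussion that follows, here is a minimal sketch of the request path described above, using hypothetical names rather than the actual Backend.AI code: the manager accepts a session request, a (here trivial) cluster-level scheduler picks an agent, and that agent drives its own engine to launch the workload.

```python
# Minimal sketch of the manager -> agent flow; all names are hypothetical.
import asyncio

class Agent:
    def __init__(self, agent_id: str, engine: str):
        self.agent_id = agent_id
        self.engine = engine  # e.g. "docker", "containerd", or a VM driver

    async def launch(self, session_spec: dict) -> str:
        # The real agent would call its container engine or hypervisor here.
        await asyncio.sleep(0)  # stand-in for the engine API call
        return f"{self.agent_id}/{session_spec['name']}"

class Manager:
    def __init__(self, agents: list[Agent]):
        self.agents = agents

    def schedule(self, session_spec: dict) -> Agent:
        # Placeholder policy; the cluster-level schedulers described next
        # replace exactly this decision.
        return self.agents[0]

    async def create_session(self, session_spec: dict) -> str:
        agent = self.schedule(session_spec)       # cluster-level decision
        return await agent.launch(session_spec)   # node-level execution

async def main():
    mgr = Manager([Agent("agent-1", "docker")])
    print(await mgr.create_session({"name": "train-llm"}))

asyncio.run(main())
```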
So we have a multi-level scheduler: one level works across the entire cluster, and a secondary level works per computing node. The cluster-level scheduler runs inside the manager component, which administrates the whole cluster, monitors the computing node status, manages the user databases and access controls, and so on. The manager scheduler controls the density and priority of the main workloads, and it performs iterative two-phase scheduling per resource group. A resource group is a logical unit over a set of agents where the actual computation runs, and you can map any resource group to any user or project that should have access. Inside a resource group we have a set of agents, and we feed the scheduler a structured input describing which computing sessions are already running, which agents exist, what their capacities are, which accelerators they have, and so on. The scheduler then decides which pending session will be scheduled first, and then which agent node will host that session. We also support multi-node and multi-container sessions; in that case, multiple agents may be mapped to a single session.

We have a scheduler plugin interface, so each plugin may define these two steps separately, and we ship the Sokovan orchestrator with two schedulers: one is a heuristic FIFO (with LIFO), and the other is DRF, I think. Heuristic FIFO works mostly like a FIFO scheduler, first in, first out, but it takes special care to prevent head-of-line blocking problems. For example, if there is a pending session request that requires too large an amount of resources, it cannot be scheduled on the cluster, and all the subsequent pending requests would be blocked because that session can never be scheduled. In that case, we automatically boost the priority of the subsequent smaller sessions that can be scheduled right now, so that the cluster keeps continuously working on new sessions. We also have many detailed configurations, like pending timeouts and so on, so that we can automatically cancel a session creation request that waits too long, and things like that. And we have also implemented DRF, Dominant Resource Fairness, which tries to distribute the overall load fairly across the different types of resources in the cluster (a rough sketch of both schedulers follows below).

The node-level resource scheduler runs inside the agent, a small daemon running alongside the containers; you can think of it as a kind of DaemonSet in Kubernetes, maybe. It allocates the actual computing resources to individual containers, and it considers many factors. For example, if there are multiple NUMA nodes in an agent machine, with different GPUs installed on different NUMA nodes and so on, it considers such layouts to optimally assign the compute devices. We also take advantage of NVIDIA's NCCL to automatically configure the overlay networks and the interconnects between the containers when workloads are distributed. I will go into more details about this. For example, NUMA awareness is one of the key scheduler features required for HPC-oriented workloads. We provide two different policies that can be configured when you run a new compute session comprising multiple containers: one is interleaving, and the other is prefer-single-node.
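[Editor's note] Stepping back to the cluster level: as a rough illustration of the two-phase plugin interface and the head-of-line blocking avoidance described above, here is a minimal sketch with hypothetical names (the real Backend.AI plugin API differs). Phase one picks the next pending session, letting smaller schedulable sessions pass a blocked head; phase two picks the hosting agent. A small helper also shows the dominant-share idea behind DRF.

```python
# Minimal sketch of two-phase scheduling; all names are hypothetical.
from dataclasses import dataclass

@dataclass
class Session:
    id: str
    requested: dict[str, float]   # e.g. {"cpu": 8, "mem": 64, "cuda": 2}

@dataclass
class Agent:
    id: str
    available: dict[str, float]

def fits(req: dict[str, float], avail: dict[str, float]) -> bool:
    return all(avail.get(k, 0.0) >= v for k, v in req.items())

class HeuristicFIFO:
    # Phase 1: FIFO order, but skip a head-of-line blocker that fits nowhere,
    # so smaller sessions behind it keep the cluster busy.
    def pick_session(self, pending: list[Session], agents: list[Agent]):
        for sess in pending:
            if any(fits(sess.requested, a.available) for a in agents):
                return sess
        return None  # nothing fits now; retry later or hit a pending timeout

    # Phase 2: choose the hosting agent (simple first-fit here; the real
    # scheduler may pack or spread depending on configuration).
    def assign_agent(self, sess: Session, agents: list[Agent]):
        for a in agents:
            if fits(sess.requested, a.available):
                return a
        return None

def dominant_share(used: dict[str, float], capacity: dict[str, float]) -> float:
    # DRF core idea: a user's dominant share is their largest fractional
    # usage across resource types; schedule the user with the smallest one.
    return max(used.get(k, 0.0) / capacity[k] for k in capacity)
```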
Regarding those two NUMA policies: for example, WEKA.io and some other GPUDirect-style storage systems often require having active CPU cores on the same NUMA node where the assigned GPUs reside. In that case, we need to apply the interleaving policy so that we can ensure CPU cores come from the same NUMA nodes as the set of GPUs we are allocating. In other cases, we may instead want the prefer-single-node policy, which maximizes performance by concentrating the workload into a single NUMA node to eliminate cross-node memory access delays. We also support an arbitrary number of NUMA nodes; some of our deployments have four NUMA nodes, and it works well there as well. (A small sketch of these two policies appears below.)

Yeah. And we also support multi-node and multi-container clustering. This is getting particularly popular nowadays because of the rise of large language models like the GPTs: the model sizes are too big to fit inside a single GPU node, so we are forced to use multi-node training in that case. We support multi-container workloads in two modes, multi-node and single-node, and for each mode we apply a different networking scheme. We also support NCCL-based RDMA interconnects between the GPUs, like NVLink and NVSwitch, and we support GPUDirect Storage, which accelerates GPU-to-storage transfers by using RDMA directly between the GPU and the storage without going through the CPU. We have many environment variables automatically configured by the Sokovan orchestrator to let the user programs or scripts know the configuration of the current cluster.

We also support heterogeneous agent backends: each agent may have a different implementation, so the unit of work may be a plain container, a virtual machine, or just a native Linux process, depending on the implementation. We currently have three implementations of the agent backend. The first one is the native Docker-based one we are using. We also have a Kubernetes backend that treats a whole Kubernetes cluster as a single computing node: the agent reports the entire capacity of the Kubernetes cluster and makes it look like a single computing server, so all scheduling happens on the Sokovan side instead of in Kubernetes, and we can enforce our own scheduling policies and configurations. And we also have an OpenStack agent backend, which is still at the alpha stage; it runs virtual machines instead of containers, though it can also run containers on top of OpenStack.

We also have MLOps application integrations; I think Airflow and MLflow are currently the most well-known ones. For Airflow we have two sides of integration: we can run an entire Airflow or MLflow application inside a container on top of Backend.AI, or we can use our session management API as a kind of backend scheduler framework for a standalone Airflow service. Those are the two ways.

So now I'm going to show a small demo. This is our web GUI, and you can see the several menus that show the current list of running sessions and how to create sessions with the various environment and configuration options. You can choose batch sessions or interactive sessions, which storage folders to mount, and how to set the resource amounts, and then you can just start the sessions; it will pop up and you can choose the application, interactive applications in this case.
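[Editor's note] Returning to the NUMA policies mentioned above, here is a minimal sketch of the two allocation strategies, assuming a hypothetical GPU-to-NUMA-node map rather than the agent's real topology detection: prefer-single-node packs the request onto one NUMA node when possible, while interleave spreads it round-robin across nodes.

```python
# Minimal sketch of NUMA-aware GPU allocation; the topology is hypothetical.
from collections import defaultdict

# Hypothetical topology: GPU device id -> NUMA node it is attached to.
GPU_NUMA = {0: 0, 1: 0, 2: 1, 3: 1}

def allocate_gpus(count: int, policy: str, free: set[int]) -> list[int]:
    by_node: dict[int, list[int]] = defaultdict(list)
    for dev in sorted(free):
        by_node[GPU_NUMA[dev]].append(dev)
    if policy == "prefer-single-node":
        # Concentrate on one NUMA node to avoid cross-node memory hops.
        for node in sorted(by_node):
            if len(by_node[node]) >= count:
                return by_node[node][:count]
        # No single node has enough free GPUs; fall through to interleaving.
    # "interleave": round-robin across NUMA nodes, so every node that gets
    # a GPU also keeps CPU cores nearby (as GPUDirect-style storage such as
    # WEKA.io tends to require).
    picked: list[int] = []
    nodes = sorted(by_node)
    while len(picked) < count and any(by_node[n] for n in nodes):
        for n in nodes:
            if by_node[n] and len(picked) < count:
                picked.append(by_node[n].pop(0))
    if len(picked) < count:
        raise RuntimeError("not enough free GPUs for the request")
    return picked

# Example: two GPUs spread across both NUMA nodes -> [0, 2]
print(allocate_gpus(2, "interleave", {0, 1, 2, 3}))
```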
You can also directly access the shell inside the container, and Jupyter notebooks right away. All of these are provided through secure tunneling with per-user authentication, automatically. Yeah. So this is just a plain workload here, and you can also use code-server and so on; I'm just going to skip these parts. This is the administration interface, where you can configure, add, or remove users and their resource policies, browse through the container images, and so on. So I'm going to skip this part. We also have a new extension called FastTrack, which is a GUI-based MLOps framework. You can use a graphical designer to compose your machine learning pipelines, such as data preprocessing, model training, and deployment, by adding modules and so on; you have the same configuration setups, like resource amounts and which container images to use. After that, you can run this on top of Backend.AI and the Sokovan orchestrator, and you can also see the YAML configurations to share the pipelines with others. So you can run it, and yeah, each computing job starts after its dependencies finish with success, and so on. This is still at an early stage, but I think it will be a great enhancement to the Sokovan orchestrator, so that you can use all these nice features. That is a quick summary of the features, and I think now Jeongkyu can introduce our field cases.

Alright, yeah. Okay, let me share the field cases. We have many practical cases, maybe more than 70, and some systems have more than 1,000 GPUs and more than 500 OpenStack VMs. This is just an example of a practical configuration with high availability, and some of these tests were done in the United Kingdom. With those different abstractions, we could achieve optimal performance from the hardware. For example, this is an example of training a large language model. The theoretical peak is 150 teraflops for these GPUs, and we achieved almost... yeah, reached that limit. We compared the teraflops numbers and found less than a one-point difference between the hand-optimized workload and our automated workload on containers. Also, with the GPU abstraction, we could adopt GPUDirect Storage, made by NVIDIA, with WEKA.io, a network-attached storage. So we could make the world-first implementation of GPUDirect Storage on a container-based AI cluster. It achieved more than 150 gigabits per second, so it could feed the GPUs well enough.

Let me summarize. We designed a new orchestrator based on a completely different abstraction: there is no pod and no such limitation, and everything is very fluid. It's easily hackable because it's written in Python. We optimized the allocation and deployment of acceleration hardware, including GPUs, NPUs, and networks, especially InfiniBand. We explored the full potential performance in multi-node situations with hundreds of GPUs, and we achieved performance comparable to bare-metal workloads running with Slurm or other bare-metal setups. We made GPU-to-GPU networking and GPUDirect Storage automatically configured, comparable to hand-tuned setups, with just a single click in the UI, and we reached the theoretical maximum performance with GPUs and GPUDirect Storage too. There are many more stories, but we have very limited time.
So if you have a question, you can just visit us after this event; we'll be right here. Give any kind of question to us. Thank you for listening, and enjoy the rest of the event. Thank you.