Okay, I think we are ready to start, so let me introduce myself. I'm Michał, a software engineer working at Google. Today I'm joined by Yuki, who is a software engineer at CyberAgent. We are going to talk about Kueue, a project that we are maintainers of. We will show Kueue's capabilities for building a platform that runs AI/ML workloads, and in particular we will focus on its quota and resource management capabilities.

So what is Kueue? Kueue is a job-level queueing system: put simply, it decides when a job should start and where it should run. It operates one level higher than the kube-scheduler, which operates on pods. With Kueue, pod creation is delayed until the job is admitted to start, and this is very important, because with this capability we can offload the API server and the kube-scheduler. Another important characteristic of Kueue is that it provides all-or-nothing semantics. For machine learning jobs it is very important that all of the pods run at the same time. As you can imagine, without such semantics, if two large jobs start at the same time they can starve each other, with neither able to run all of its pods. So this is one of the most important responsibilities of Kueue. And of course Kueue helps with quota management: it lets us specify quotas per queue, and jobs may have different priorities. We also support specifying quota for different resource types, like different models of GPUs. And we also give you control of preferences for different machines, like how they are provisioned. Is it on demand, or a reservation that you have some discounts on? Because there may be different prices depending on how the nodes come to be.

So how does Kueue achieve its goals? The main design principle of Kueue is actually very simple: we want to be cloud native. What does that mean? It means, for example, that we don't need any external database: everything is stored in etcd and accessed through the kube-apiserver. We are also compatible with, and reuse, the standard Kubernetes components, such as the cluster autoscaler, the kube-scheduler, and the job controller; these components work as they are when used with Kueue. Of course, sometimes we find gaps in the Kubernetes components, but then our policy is to contribute the needed improvements upstream. One of these success stories is the suspend semantics: this is what Kueue needs in order to control when pods are created. It was added to the batch Job API upstream and later adopted by other job CRDs, allowing us to integrate them with Kueue. Today many job frameworks support it, so you can use them with Kueue as well.

To understand how Kueue works, let's start with where it fits among the Kubernetes controllers and walk through the lifecycle of a job. The lifecycle starts with a user creating a job. Kueue suspends the job, using the suspend field I just mentioned, and it stays suspended until quota is reserved. Once quota is reserved for the job, there is one more optional check: the admission check.
So this is basically a mechanism that lets you plug in your own logic before admission. When the admission check passes, the job is admitted, it gets unsuspended, and the job controller creates the pods. These jobs can still be preempted, for example stopped by Kueue, if a higher-priority job comes in. But in the happy path, the cluster autoscaler scales up and creates nodes for the pods, and finally the kube-scheduler binds the pods to the nodes.

Now let's look at the other API objects that are important here. The most important object is the Job, and it is essentially the only object the user interacts with. Jobs are created by users, and this is where the user specifies the number of pods that the job requires, as well as the pod template that specifies the amount of the different resources required to run the pods. And as I was saying, Kueue supports various types of jobs, so in order to abstract them out and add some extra details, Kueue creates the Workload object and maintains a one-to-one correspondence between the Workload and the job created by the user. Jobs are sent to LocalQueues, indicated by a label on the job added by the user. All of the quota management is possible thanks to the ClusterQueue API, where the admin configures the quotas. The main concept for configuring quotas is the resource flavor. A resource flavor is an abstraction for a set of machines with a common characteristic. One dimension of that characteristic may be, say, the model of GPU you want to use; another may be how the machines are provisioned: reservation, spot, or on demand. It is basically a partition of the cluster into sets of machines with common characteristics. Another important concept in Kueue is the cohort. A cohort sets the scope for preemption and borrowing across a set of ClusterQueues. And we also have the concept mentioned before, admission checks, which let you define additional conditions under which a job is admitted.

Now I would like to present the batch reference architecture project. The goal of the project is to help cloud architects or system architects build platforms for running AI/ML workloads. It is inspired by feedback from GKE users, and it collects the best practices for building such systems. It doesn't necessarily correspond one-to-one to any existing system; it is rather an abstraction, a recommended starting point for building such systems, which can still be customized per user preferences. If you are interested in more details of the project, here is the link, and I recommend taking a look at it.

In this section, I would like to show how Kueue works based on a simplified example inspired by the batch reference architecture. In this setup we have two teams: the blue one is the higher-priority team and the green one is the lower-priority team. They send their workloads to the corresponding LocalQueues, pointing to ClusterQueues that are wrapped into a cohort. And at the end of the day, we have the physical resources in the cluster, split into three resource flavors: reservations, on demand, and spot. For simplicity, you can think of a single model of GPU. A sketch of what such a setup could look like is shown below. So in this scenario, let's say the low-priority workloads are sent by the green team.
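To make that setup concrete, here is a minimal sketch, in Kueue's v1beta1 API, of what the green team's side of such a configuration could look like: two of the resource flavors, a ClusterQueue in a shared cohort with no nominal quota on reservations but a borrowing limit, the LocalQueue the team submits to, and a user Job labeled for that queue. All names, labels, and quota numbers are hypothetical and not from the talk; the on-demand flavor and the blue team's queue would be defined analogously.

```yaml
# Resource flavors: partitions of the cluster by how nodes are provisioned.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: reservation
spec:
  nodeLabels:
    example.com/provisioning: reservation   # hypothetical node label
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: spot
spec:
  nodeLabels:
    example.com/provisioning: spot
---
# ClusterQueue for the lower-priority (green) team; both teams' queues
# would share the "teams" cohort so they can borrow from each other.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-green
spec:
  cohort: teams
  namespaceSelector: {}            # accept workloads from any namespace
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:                       # flavors are considered in this order
    - name: reservation
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 0            # no quota of its own on reservations...
        borrowingLimit: 8          # ...but it may borrow up to 8 GPUs
    - name: spot
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 16
---
# Namespaced queue that the green team's users submit to.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: green-queue
  namespace: team-green
spec:
  clusterQueue: team-green
---
# A user Job pointing at the LocalQueue; Kueue keeps it suspended until
# quota is reserved and any admission checks pass, then unsuspends it.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model
  namespace: team-green
  labels:
    kueue.x-k8s.io/queue-name: green-queue
spec:
  suspend: true
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest   # hypothetical image
        resources:
          requests:
            nvidia.com/gpu: 1
```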
So for the first workload, what Kueue does is consider the resource flavors defined for the ClusterQueue in order. We first consider the reservation resource flavor. In this case we don't have nominal quota, but we have a borrowing limit that allows us to run one of the workloads, because we can borrow from the other ClusterQueue. So this allows us to put the first workload on the reservations. For the remaining workloads we can no longer borrow, so we land them on the spot machines. Now let's say the high-priority workloads come in. Again, Kueue considers the flavors in order. First we try reservations, but what we see is that the reservations are already taken, so we would need to preempt in order to run there. This is a configuration option, but by default we prefer not to preempt, so the first of the workloads lands on the on-demand resource flavor. For the second one, we can no longer put it on demand in this example, because we don't have more ClusterQueues to borrow from, so we start preempting. The preempted workload is requeued, the high-priority workload lands on the reservations, and finally the requeued low-priority workload lands on spot. So with this relatively simple example you could see the main concepts of Kueue at work and how borrowing and preemption work. And with that, I will now hand over to Yuki, who will show how Kueue is used in production at CyberAgent.

The next topic is production use cases. First of all, let me introduce my company. We are often asked: are you a cyber security provider? The answer is no. My company is actually a content provider, with a blog site, a streaming platform, and smartphone games, as well as an internet advertising agency in Japan. So in this topic, let me introduce Kueue at CyberAgent. My company has an internal on-premise ML platform with static computing resources. In this infrastructure, we use bare-metal machines for the GPU nodes, and the environment is heterogeneous, with seven types of GPUs, as shown here. Additionally, as shown on the left side, our Kubernetes cluster is built as a single large multi-tenant cluster. My company has over 300 namespaces, and all namespaces are created per user. Also, we have been operating this cluster for over four years, so we did not create a new cluster to install Kueue; we just installed Kueue into the cluster already in operation.

Next, let me explain the kinds of workloads and frameworks. As you can see here, we primarily have three types of workloads: training models, notebooks, and serving models. Basically, in all workloads we use open-source frameworks, like Kubeflow, KServe, the batch Job API, and so on, but we manage only the notebooks by ourselves. As mentioned before, Kueue can easily be adapted to in-house Kubernetes resources, so we implemented a small in-house Kueue job controller for notebooks. In the next slides, I will show the training workloads in more detail. One of our training workloads is building an LLM, and the training jobs are managed by upstream Kueue and Kubeflow MPIJobs. Please check our Hugging Face repository for more details. In this slide, let me explain how we guarantee simultaneous allocation for the MPI job's pods. A Kubeflow MPIJob is constructed from two roles: launcher and worker. The launcher role is responsible for starting the MPI processes, using mpirun and so on. The worker role is responsible for performing the actual training. So we need to allocate resources to the pods of all roles at the same time. In general, such behavior is called all-or-nothing scheduling. A minimal sketch of such an MPIJob is shown below.
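As a rough illustration (not taken from CyberAgent's actual manifests), a Kubeflow MPIJob submitted through Kueue might look like the sketch below: the queue-name label hands the whole job to Kueue, and the launcher and worker replica specs are the two roles whose pods all need resources at the same time. Names, images, and sizes are hypothetical.

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: llm-pretrain
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue   # managed and queued by Kueue
spec:
  slotsPerWorker: 8                           # MPI slots (GPUs) per worker pod
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: registry.example.com/mpi-train:latest   # hypothetical image
            command: ["mpirun", "python", "train.py"]
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: worker
            image: registry.example.com/mpi-train:latest
            resources:
              requests:
                nvidia.com/gpu: 8
```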
By default, Kueue does not guarantee that the admitted job's pods actually become ready; therefore, resources may in the meantime be allocated to the pods of subsequent jobs. You can imagine a situation in which GPU utilization is fragmented across nodes: for example, five free GPUs exist in the cluster, but each node has only one free GPU. In this situation, when we submit a job requiring two GPUs in every pod, the job's pods cannot start even though Kueue admits the job. The waitForPodsReady feature gives us a way to resolve such issues. We can also configure where the job is put back, at the head or the tail of the queue: if the admitted job's pods cannot become ready before the timeout, the job is evicted and pushed back to the head or the tail of the queue.

In this slide, I would like to introduce issues specific to long-running operational clusters. Ideally, we should manage all computing resources and workloads with Kueue. However, there are some gaps in the real world, because in general some kind of quota and workload management system already exists, and the existing system and Kueue often conflict. Actually, there are such gaps in my cluster: the existing quota management system depends on the core ResourceQuota. Initially, we were planning to switch from the existing system to Kueue all at once. However, we found that we need to migrate step by step, because we need to avoid stopping services. So we considered two approaches to migrate from the existing system to Kueue. The first approach is waitForPodsReady, and the second approach is an admission check for ResourceQuota. After evaluating the two approaches, we selected the second one, the admission check for ResourceQuota. Let me explain the result of the evaluation. As I mentioned before, waitForPodsReady can be introduced easily, because we can enable the feature just by modifying the Kueue configuration. But when a job admitted by Kueue is blocked by the existing ResourceQuota, the job controller in kube-controller-manager keeps trying to create the job's pods, and the job keeps being requeued. This repeated creation increases the load on the kube-apiserver. In general, the capacity of the kube-apiserver is limited and valuable, so we gave up on migrating with waitForPodsReady alone. With the second approach, the admission check for ResourceQuota, we can avoid increasing the kube-apiserver load. As mentioned before, Kueue provides admission checks for extensible admission decisions, so we can add our own admission logic easily by implementing a small controller.

In the next slide, I will show some situations and how the ClusterQueues can be configured for them. In our environment we have conflicting demands. The first one: all GPUs should always be allocatable to user workloads. The second one: cluster admins want to verify new features using GPUs. So we defined an admin ClusterQueue that entirely overlaps with the user ClusterQueue, and we set only a borrowing limit (no nominal quota) in the admin ClusterQueue. Once user jobs are submitted to the queue, the admin jobs are preempted. The second pair of conflicting demands: first, important projects want to reserve GPUs so that they can use them whenever they need to; second, no GPUs should be left idle, for efficient usage. So we created dedicated ClusterQueues for the tenants that are important from the business perspective, and the ClusterQueues of all same-priority tenants belong to the same cohort. Setting the reclaimWithinCohort preemption policy to Any makes it possible to satisfy both of these conflicting demands. A minimal sketch of the first setup is shown below.
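Here is a minimal sketch, with hypothetical names and numbers, of how the first configuration (admin queue fully overlapping the user queue) could be expressed. The user queue owns the nominal quota and is allowed to reclaim it within the cohort, while the admin queue has no nominal quota and can only borrow idle GPUs, so its jobs are preempted as soon as user jobs need the capacity. The a100 flavor is assumed to be defined elsewhere.

```yaml
# User ClusterQueue: owns the nominal GPU quota in the cohort and may
# reclaim quota that has been lent out (i.e. preempt borrowing workloads).
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: user-queue
spec:
  cohort: gpu-pool
  namespaceSelector: {}
  preemption:
    reclaimWithinCohort: Any
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: a100
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 64
---
# Admin ClusterQueue: no quota of its own, it can only borrow idle GPUs,
# so admin verification jobs run only on otherwise-unused capacity.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: admin-queue
spec:
  cohort: gpu-pool
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: a100
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 0
        borrowingLimit: 64
```

The second configuration (dedicated queues for business-critical tenants sharing a cohort) follows the same pattern, with each important tenant getting its own ClusterQueue and nominal quota inside the shared cohort.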
The next topic is new features. Recently we have made significant progress on improving resource utilization and increasing scalability. As Michał mentioned earlier, the first features I will introduce are partial admission of jobs and dynamic reclaiming of resources. First, partial admission. By default, Kueue admits a job only when all of its pods can be allocated enough resources. But we understand the demand that users may be satisfied if only a minimal number of pods can start. As shown in the diagram on the left, when we enable partial admission, we can deploy the job with only the currently available parallelism, out of the requested parallelism; it means that only part of the job is deployed. The next feature is dynamic reclaiming of resources. This feature has similar use cases and a similar concept to partial admission. As I mentioned, partial admission affects the admission decision. When some of a job's pods have completed while others are still running, by default Kueue keeps marking all of the allocated quota as in use, even though some pods have already finished. As shown in the diagram on the right, dynamic reclaiming of resources allows us to release the quota as soon as the job's pods finish.

Okay, so I would like to continue with the other features that we are passionate about. They are currently either in alpha or in the design or implementation phase, so they are coming soon. The first one is the integration with the ProvisioningRequest API. Maybe let me first introduce the ProvisioningRequest itself. This is a new API developed in collaboration with the cluster autoscaler team, and its aim is to provide all-or-nothing semantics to Kueue. The current problem with the cluster autoscaler integration is that the cluster autoscaler only creates nodes based on existing pods, so it requires us to create the pods first. For large machine learning training jobs, the scale-ups can take very long, so some pods are running and some are not, and we are in a weird situation; but the scale-up can also simply fail due to GPU stockouts, and having 99% of the pods is not enough. The ProvisioningRequest API aims to solve this problem without the need for the pods to be created. This is basically the API that is exposed. First, we have the pod sets: a single pod set lets you specify the number of pods that are required and the pod template that contains the requirements on the amount of resources; we have a list of pod sets in order to support heterogeneous jobs. Second, we have the provisioning class: this is a string that lets you indicate the semantics of the provisioning request. There are a couple of semantics built into the cluster autoscaler, but you can extend the list by using cloud-provider-specific classes. And finally, we have parameters that let you configure the desired behavior. Let's take a look at how Kueue integrates with ProvisioningRequest. As before, we have the job created by the user; the job is suspended and the quota is reserved, and at this point Kueue creates the ProvisioningRequest object, and now it is the cluster autoscaler's job to provision the nodes. A sketch of such an object is shown below.
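For illustration, a ProvisioningRequest created at this step might look roughly like the sketch below, using the fields just described. The API group/version and the provisioning class name are the ones used by the cluster autoscaler at the time of writing and may differ; the object and pod-template names are hypothetical, and in practice Kueue creates this object for you rather than the user writing it by hand.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: ProvisioningRequest
metadata:
  name: train-model-provreq
  namespace: team-green
spec:
  # Semantics of the request; "check-capacity" is one of the classes
  # built into the cluster autoscaler, and cloud providers can add more.
  provisioningClassName: check-capacity.autoscaling.x-k8s.io
  # One entry per homogeneous group of pods; multiple entries support
  # heterogeneous jobs (e.g. a launcher pod set plus a worker pod set).
  podSets:
  - count: 100
    podTemplateRef:
      name: train-model-workers   # PodTemplate carrying the resource requests
  parameters: {}                  # class-specific tuning, empty here
```

On the Kueue side, this integration is wired up as an admission check referenced from the ClusterQueue; a sketch of such a check, again with hypothetical names, could be:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: provisioning-check
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: prov-config
```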
Once the cluster autoscaler is done, the corresponding admission check is marked as ready, and then it is back to Kueue to inject all of the necessary information, like node selectors, labels, and so on, so that the pods can be bound to the newly created or reserved nodes.

The next feature that we are passionate about is multi-cluster job dispatching, aka MultiKueue. The main goals for this project are two. The first one is to improve GPU obtainability: by using clusters in different regions, you may have GPUs available at different times, because peak hours differ across regions, and with this feature you can also probe GPU obtainability from different cloud providers. The second use case is to scale up big computational clusters by offloading to smaller execution clusters. In this design, as you can see, we have a set of execution clusters and a single management cluster, which is the one the user interacts with. So in this design it is fully transparent to the user that there are multiple clusters behind it. In order to achieve that, in the management cluster we don't create pods, so there are no actual computations there, and the status of the job created by the user is live-updated by the MultiKueue controller based on the progress in the execution cluster. To achieve this, we pushed an upstream improvement to the job controller; if you are interested in the technical details, here is the KEP.

And the next feature is fair sharing. Let's say we have a cohort like this, and team X is currently not using its resources because it is temporarily working on another project, so team A and team B compete for the resources. As it is currently in Kueue, the workloads are admitted in FIFO order, and this can lead to imbalances which you may feel are not fair. With this feature, we will resolve the quota imbalances by preemption. The next feature, motivated by feedback from users of Kueue at larger scales, is that we need to introduce hierarchical cohorts in order to reflect deeper organizational structures. For example, you may want different rules for quota management at the team level than at the department level; this is what will be achieved with this feature. Another nice thing is that, with a deeper organizational structure, we can prioritize borrowing at close distances: as shown in this picture, team A borrows from team B, and this is prioritized by Kueue.

So now let me conclude by saying that if you are interested in using Kueue, we recommend you just use the latest release or wait for the next one, or you can collaborate with us on the features so that we know your use cases better. If you are interested in getting involved, one good option is to contact us over Slack. The project is developed by the batch working group, so you can also attend one of the regular meetings; you can find more information on the batch working group in the link. And if you are interested more in what the batch working group is doing, we invite you to our presentation on Friday. And with that, I'm happy to take some questions.

Hi, thank you for the great session. You mentioned initially, for the CyberAgent use case, that you are using a single large cluster. My question is: have you considered multiple clusters, and what was the reasoning behind this choice?

So does it mean: why don't we need multi-cluster?
You have a single large cluster, I guess, and you are establishing multi-tenancy using the namespace boundary. Have you considered, since there was a later slide talking about MultiKueue and multiple clusters, have you considered that in the future? I'm also interested in how or why you chose a single cluster versus distributing jobs across multiple clusters.

Yes, it's a good question. Actually, we are considering using MultiKueue, but MultiKueue is an alpha-stage feature, so we are not using it yet.

That sounds good.

Hello. Some features, like the multi-level queue, are a little similar to frameworks like YuniKorn. How do you compare Kueue with frameworks like YuniKorn or Volcano? Or maybe it's possible to integrate them together? Just some thoughts, thank you.

How does Kueue compare with other schedulers such as Volcano or YuniKorn, right? So we have a slightly different approach. We think that one of the most important benefits of Kueue is delaying the pod creation, because pod creation really creates a lot of load and complications on the API server and also on the kube-scheduler. There was a great talk yesterday comparing the different approaches, which I recommend very much; you can find it in the schedule, it was excellent, and I will not do as good a job of comparing the approaches here. But because we don't create pods, we don't have co-scheduling or gang scheduling built into Kueue; however, you can actually use the scheduler from Volcano together with Kueue, so this is one way you can achieve that. Generally, for all-or-nothing semantics we have a slightly different approach: the ProvisioningRequest API.

Yeah, thank you, great talk, and I have one question regarding the GPU machines. What kind of special strategy did you apply via Kueue, compared to regular CPU machines? Can you talk in a little detail about how to schedule GPUs versus CPUs?

GPUs are maybe a little bit special. First of all, they are scarce resources: there are not a lot of them provided by the cloud providers, so you often hit stockouts. Another special thing about machine learning jobs is that they very often require all of the pods to run at the same time. These are the important considerations that differentiate running training jobs from other batch workloads. So with efforts like ProvisioningRequest we really aim to improve both the obtainability and the all-or-nothing semantics, so we can better support running jobs that require GPUs.

Thanks for your presentation. I have one question: does Kueue have a plan to support a feature like QueueingHint, similar to what the upstream Kubernetes scheduler has?

So Kueue is not a pod scheduler, and as far as I remember QueueingHint is a feature of the kube-scheduler's queueing, so we don't have any plans to implement QueueingHint.

I can also add that we are considering using scheduling gates for jobs in the future. This is an upstream enhancement, and it can be useful when you have DAG dependencies between jobs and you want to start a given job only once all of its dependencies are completed. It's hard for me to assess when this will happen, but we are definitely considering using scheduling gates for jobs in the future, because they landed upstream; it's somewhere in the back of our heads.

Hi, this is
Abishit from IBM Research. The question is regarding the ProvisioningRequest integration in Kueue. As I understood from the slides, you reserve the quota and allow the machines to come into the cluster, and that may take some time if you are requesting hundreds of machines from the cloud provider. The question really is: while that is booting up, would you support a backfilling kind of functionality inside Kueue? Here is an example: you wait for 100 machines, 10 of them arrive, then 50 of them arrive, and there are pending jobs in the queue that could utilize the machines that are already in the cluster while the 100 machines come up.

So with ProvisioningRequest we have a certain philosophy, at least in Kueue. First of all, ProvisioningRequest is still, I think, beta in the cluster autoscaler, so if you have such use cases then maybe you should open an issue or discuss it in some forums. But for now the basic philosophy is that we create one ProvisioningRequest per job, so it is responsible for provisioning nodes only for that job; there is a one-to-one correspondence, so other jobs cannot land on the nodes provisioned for that job. And also, we just wait: for example, with some cloud-provider-specific APIs, when you want to provision, let's say, 1000 GPU nodes, the cloud provider may not give you the nodes immediately because it doesn't have them, but maybe it can see where they are available, or predict when they will be available, so that you don't kick off the scale-up immediately; the request is queued and you wait, let's say, a couple of hours, but once you get them, you are guaranteed to have the 1000 machines.

I'll open an issue. Thank you.

Do we have any more questions? Thank you once again. Thank you.