 So, for the next session, let's welcome Aldo, senior software engineer from Google, and also Yuki, software engineer from CyberAgent. Thank you. Hello. Oh, yes. Good morning, everybody. Well, we are going for a little bit of a turn. We're not talking about AI directly; we're talking a little bit about infrastructure right now. We want to introduce you to this batch system that we call Kueue. Yuki and I are the maintainers of this project. We started around two years ago, but surprisingly, we just met face to face today. So, first of all, why are we talking about this topic? Why do we need to do job queueing? Or, if you prefer, why do we need to queue training jobs? Well, in this world, we have too many jobs, and of course, as we have heard in the previous talks, resources are limited: GPUs are scarce, TPUs are scarce. If you're running on-prem, you have a fixed amount of resources, so you cannot scale up, right? And if you're in the cloud, you are, of course, competing with other customers who all want GPUs and accelerators. You might get discounts, so you want to manage your discounts, and of course, you want to scale.

So, what is Kueue? Well, Kueue is a system designed to solve these problems, and it does so by offering resource quota management. By defining quotas for your cluster, the goal is to maximize the utilization of these scarce resources that you worked hard to obtain. Kueue offers a model with a two-level hierarchy, so that you can define guaranteed quotas for certain tenants, for your teams, but also allow borrowing, so that when one team is not using its resources, other teams have the chance to borrow and use them. We also need to establish preemption, some priority-based ordering, because not all jobs have the same importance, and when a higher-priority job arrives and everything is full, you might want to preempt. All of this is offered. Another aspect of Kueue is fungibility. What does this mean? Your jobs might be able to run on a variety of resources. Maybe your jobs can run on different models of GPUs, or maybe your cloud provider offers different tiers of availability. Maybe you have reservations, you have on-demand machines, or you have spot VMs. Your jobs might be able to run on all of these different resources, but you might want to say, well, I want to consume my reservation first, then my on-demand, and so on and so forth, or the same across GPU models. All of these concepts can be expressed in a Kueue cluster. So Kueue is a job queueing system, right? When you use Kueue, you don't queue pods per se. You queue the overall jobs, because your jobs can have 1,000 pods, 10,000 pods, et cetera. We don't want to queue all those pods individually; we want to queue them as a unit. To accomplish this, Kueue supports a variety of APIs, CRDs. The most prominent one is, of course, the Job API, but we also support the JobSet API. We support all of Kubeflow, thanks to Yuki here. We also support RayJob, and in the last release, we also support plain pods, because some users are still on their journey to migrate from bare pods to other higher-level CRDs, so we want to let them use Kueue in the meantime. Kueue is extensible. If you have a custom CRD that defines your job, you can use a library that we provide to integrate your CRD with Kueue. And there is a new feature as well in the last release, admission checks, which allows you to customize certain rules for when your jobs should be admitted.
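For reference, here is a minimal sketch of what this looks like from the user's side, with a hypothetical namespace, queue names, and quota values (a sketch only, not a production setup): a ResourceFlavor, a ClusterQueue holding the quota, a LocalQueue in the team's namespace, and a batch Job labeled with the queue name so that Kueue queues and admits it as a single unit.

```yaml
# Minimal, illustrative Kueue setup; names and quota values are made up.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-cq
spec:
  namespaceSelector: {}                  # empty selector: any namespace may use this queue
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 100
      - name: "memory"
        nominalQuota: 400Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a
  namespace: team-a-ns
spec:
  clusterQueue: team-a-cq
---
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model
  namespace: team-a-ns
  labels:
    kueue.x-k8s.io/queue-name: team-a    # the whole Job is queued as one unit
spec:
  suspend: true                          # Kueue unsuspends the Job once quota is available
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest   # hypothetical image
        resources:
          requests:
            cpu: "4"
          limits:
            nvidia.com/gpu: "1"
```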
Yuki will explain this in a little more detail in a bit. Why choose Kueue? Well, first of all, Kueue is a Kubernetes project. Kueue is sponsored by SIG Scheduling, the Special Interest Group on Scheduling, but the roadmap is also discussed in the Kubernetes Batch Working Group. What does this mean? It means that when we are developing the roadmap for Kueue, we talk to SIG Apps, we talk to SIG Autoscaling, and of course SIG Scheduling, and we influence each other's roadmaps to advance Kueue and to advance the batch ecosystem. With this, one important thing to note is that Kueue is not a scheduler. Kueue is a separate component that is compatible with all the existing components of Kubernetes. It is, of course, compatible with kube-scheduler, and that means it is also compatible with any other scheduler; we've seen people in the community using YuniKorn with Kueue and using Volcano with Kueue. It's compatible with the controller manager, the Job API, and so forth, and also compatible with the Cluster Autoscaler. Kueue was built with autoscaling in mind from the beginning. And we have adopted a release cadence of around three months, so our latest release, 0.5, was just two weeks ago.

I want to explain the differences in the Kueue operation flow between v0.4 and v0.5, the latest version. First, let me explain the flow in the previous version, v0.4. Kueue assumes a situation where there are cluster admins and cluster users. In this slide, I call the cluster admin just "admin" and the cluster user "researcher". As preparation, the admin needs to define ClusterQueues, ResourceFlavors, and LocalQueues. Through this preparation, the admin sets the capacity of the queues. After the admin's preparation, researchers can create jobs. Once a job is submitted by a researcher, Kueue admits the job if capacity is free. The job is unsuspended and node affinity is injected into it. We can specify node affinity in the ResourceFlavors to schedule a job onto nodes with specific features, for example a particular hardware accelerator or a specific lifecycle, like preemptible or spot instances. After that, the job controllers, such as the batch Job controller, the Kubeflow training operator, and KubeRay, create the pods for the jobs. Then the created pods are scheduled onto nodes by kube-scheduler. Finally, when there isn't enough room on the nodes and pods are left in Pending status, the Cluster Autoscaler scales the cluster out. This next slide shows v0.5. The main difference is in steps 2.1 and 2.2 on this slide. We can now add arbitrary checks before jobs are admitted by implementing our own custom controllers. For example, we can implement a cloud cost checker: if cloud usage exceeds the budget, we can stop admitting queued jobs. We also improved the Cluster Autoscaler integration based on this external checker mechanism. Up to v0.4, the Cluster Autoscaler would provision more nodes triggered by the pending pods of jobs that Kueue had already dequeued, so there were situations where a job was dequeued even though the cluster could not actually schedule its pods onto any nodes. However, since the latest release, v0.5, Kueue can ask the Cluster Autoscaler to provision more nodes via a ProvisioningRequest. With this feature, Kueue dequeues a job only when the cluster resources are actually sufficient. Next, I want to talk about Kueue use cases in production. We will introduce two use cases. The first one is my company CyberAgent's use case. My company has an internal on-premise ML platform with a static amount of compute. The cluster is multi-tenant, with more than 200 user namespaces. The cluster also has multiple kinds of NVIDIA GPUs, including the H100; in general, such a cluster is called a heterogeneous environment.
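To make the node-affinity injection concrete: in a heterogeneous cluster like this, one would typically define a ResourceFlavor per GPU model or node lifecycle, each pointing at node labels that Kueue injects as a node selector when it unsuspends a job. The label keys and values below are illustrative assumptions, not the actual platform configuration.

```yaml
# Illustrative ResourceFlavors for a heterogeneous cluster; label keys and values are assumed.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-h100
spec:
  nodeLabels:
    example.com/gpu-model: h100          # hypothetical label on H100 nodes
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-spot
spec:
  nodeLabels:
    example.com/lifecycle: spot          # hypothetical label on spot/preemptible nodes
```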
The main workloads are training machine learning models, Jupyter notebooks, and serving machine learning models. For these workloads, we are using open source frameworks, such as Kubeflow for training jobs and KServe for serving. Also, almost all of the workloads need GPUs. Regarding Kueue usage, only training and Jupyter notebooks are managed by Kueue; serving is being evaluated in a test environment. In the next slides, I will show you details of each workload. First is training. On our platform, there are many kinds of training workloads, like computer vision, voice recognition, and so on. One of the most impactful workloads is training large language models. We are building LLMs trained on English and Japanese open datasets, and the training jobs are managed by upstream Kueue and Kubeflow MPIJobs. An MPIJob creates two pod roles, launcher and worker. The launcher's role is to run the mpirun command; the workers run the actual training processes. In other words, the training processes are the MPI ranks. Both the launcher and the worker roles must start successfully; if either role fails to start, the whole MPIJob fails. So I introduced sequential admission with ready pods, which is a built-in Kueue feature. When we don't use this feature, Kueue dequeues the next job regardless of whether the previously dequeued job's pods are ready. However, when we enable this feature, Kueue blocks dequeuing the next job until the pods of the previously dequeued job are ready. The next workload is Jupyter Notebook. Jupyter Notebook is a very popular tool for developing models and analyzing data. My company provides Jupyter Notebooks to researchers based on a single-replica StatefulSet without autoscaling. However, Kueue doesn't support queueing StatefulSets as a built-in feature, so I developed a minimal Kubernetes custom controller. The controller implements Kueue's GenericJob interface and expresses suspend semantics based on an annotation and the number of replicas. Also, Kueue started to support plain pods in the latest version, 0.5, so I'm evaluating the plain-pod queueing feature with the Jupyter gateway provisioner. In this slide and the next, I want to show some cases of setting capacities with ClusterQueues. The first case is the coexistence of user and admin capacities. In general, the researchers' workloads have higher priority than the platform developers' workloads, so we define the admin capacity as entirely wrapped by the user capacity: only a borrowing limit is defined in the admin capacity. Once user jobs are submitted to the queue, admin jobs are preempted. The next case is semi-guaranteed capacity for important projects. I defined two levels of user capacity with ClusterQueues. The first is capacity for projects that are important from a business perspective; that capacity is defined separately from the shared ClusterQueue in order to guarantee it. Every other project and user is assigned to the shared ClusterQueue. The shared ClusterQueue can borrow resources from the project ClusterQueues when they are free.
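A rough sketch of this semi-guaranteed pattern might look like the following; the names, cohort, GPU counts, and the reclaim setting are invented for illustration. The project ClusterQueue has a guaranteed nominal quota and, in this sketch, can reclaim it within the cohort, while the shared ClusterQueue borrows whatever the project queue leaves idle.

```yaml
# Hypothetical semi-guaranteed setup: a project ClusterQueue and a shared ClusterQueue in one cohort.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: important-project
spec:
  cohort: research
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: gpu-h100
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 16                 # guaranteed to the important project
  preemption:
    reclaimWithinCohort: Any             # take borrowed GPUs back when the project needs them
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: shared
spec:
  cohort: research
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: gpu-h100
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 8
        borrowingLimit: 16               # may borrow idle GPUs from the project queue
```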
Next, I want to show a different case study. This case is not about a particular customer. As you know, GKE is a cloud provider, so we've talked with different customers about how they are using Kueue or how they are planning to use it, and we've come up with a few common ways, a few common architectures, in which they can serve their users. We've built this reference architecture. What is the goal of this reference architecture? Again, we have scarce resources and we want to maximize utilization; on the other hand, we want to control the cost. In this environment, of course, we have multiple research teams. Each research team might have different priorities and also different sizes. From the infrastructure point of view, we have VM instances with different availability levels. First of all, you can have a reservation. A reservation is the idea that you buy multiple VMs in bulk at a discounted price, and that capacity is already guaranteed for you as a customer. Then you can also purchase on-demand VMs, which might be subject to stockouts. Or you can buy spot VMs, which are cheaper, but then you are subject to preemption at the VM level. So you have all these different levels of availability for your resources. Then there is more of a decision from the customer side: they want two levels of quality of service for their jobs. For example, they would have standard jobs, which should run on reservations; but if the reservations are full, they want to be able to burst to on-demand. Then they also have best-effort jobs. These jobs should run on the unused reservation, but if they run out of space, they can burst to spot. What makes a job best effort is that it can be preempted if a standard job comes in. So this is the environment, and this is how it maps to our APIs in Kueue. The VM availability levels, so reservation, on-demand, and spot, can be represented with the concept of a resource flavor. Each combination of research team and job quality of service is a cluster queue. So if you think of a single team, that team will have two queues, one for standard and one for best effort. The ability for all these entities to borrow resources from each other comes through the concept of a cohort, which I'll show you in a bit.

So here I'm zooming in on one team. First of all, on the left side, you can see that we have the different resource flavors, the different VM availabilities: reservation, on-demand, and spot. And then we have two cluster queues. Both cluster queues are for the same team, or let's just call them queues, since maybe that's a more familiar concept. So we have two queues, standard and best effort. And now here is where you can define the fungibility concept we were talking about earlier. First of all, these resource flavors are ordered, so because reservation is first, reservations are consumed first, followed by on-demand, then followed by spot. Now, team A standard has a nominal quota of 20. This means that 20 units are reserved for this customer. You define the units for each of the resources, so this might be 20 CPUs, or 200, or whatever you need, or maybe 20 GPUs, or a combination of CPUs, memory, GPUs, and whatever resources you have. I'm just using one number here for simplicity, but nominal quota means that this amount is reserved for you. Then there is a borrowing limit of zero, which means that this cluster queue cannot borrow. On the right side, the best effort queue actually has a nominal quota of zero. It doesn't have any allocated, any reserved capacity; it can only borrow from the resources of the standard queue. So whenever the standard queue is not in use, the best effort queue can use those resources. But whenever standard needs those resources back, it can preempt any jobs that are running from best effort. Then you can have a backup resource flavor: if your reservation is full, you still want some bursting onto on-demand VMs. In this case, we are setting the bursting to five for on-demand. On the best effort side, they can also burst, but to cheaper nodes, in this case to spot, so they also have a limit of five. And lastly, at the bottom, you can see the configuration for preemption. On the left side, standard says reclaim within cohort: any. These two cluster queues belong to the same cohort, and this setting allows the standard cluster queue to kick out any jobs on the best effort cluster queue when it needs the resources. The next setting is flavor fungibility. When you have a queue with multiple resource flavors, you have an option. Let's say the reservation is full. If you have the option to borrow, do you want to borrow? If you have the option to preempt, do you want to preempt? Or do you just want to let that job move on to the next flavor, so on-demand? In this case, we've seen that customers prefer to preempt, so that's the setting: whenever your reservation is full but you have a high-priority job coming in, you want to kick out any best-effort jobs on the reservation before you move on to on-demand. This setting allows you to do that. Best effort doesn't have this setting because, well, it cannot preempt in most scenarios. So this is team A.
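Put together, team A's two queues might look roughly like this. This is a sketch only: the flavor names stand in for reservation, on-demand, and spot node pools, the quotas count GPUs, and the cohort name is invented.

```yaml
# Illustrative only: team A's standard and best-effort ClusterQueues sharing one cohort.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-standard
spec:
  cohort: org
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:                             # flavors are tried in this order
    - name: reservation
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 20                 # reserved capacity for team A
        borrowingLimit: 0                # standard never borrows
    - name: on-demand
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 5                  # burst capacity on on-demand VMs
  preemption:
    reclaimWithinCohort: Any             # may evict best-effort jobs to reclaim capacity
  flavorFungibility:
    whenCanPreempt: Preempt              # preempt in the reservation before trying on-demand
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-best-effort
spec:
  cohort: org
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: reservation
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 0                  # no reserved capacity; only borrows what standard leaves idle
    - name: spot
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 5                  # cheap burst capacity on spot VMs
```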
And similarly, we can define team B with two queues. As you can see, we have added some borrowing limits so that the best effort cluster queue, in the spot flavor, can borrow from the best effort quota of team B. All of these queues are in the same cohort, and of course, as you add more teams, you can expand further. And if you want to isolate a group of teams, you can create a separate cohort; queues can only borrow resources within their own cohort. That's the concept of the two-level hierarchy. So with that, I want to talk a little bit about what's next for Kueue. We have a few things top of mind, and nothing here is final, as we just finished the 0.5 release two weeks ago, but these are the things we are discussing. First is on-demand visibility of pending jobs. We have some visibility today about where your jobs are in the queue, but users are requesting more fine-grained data, and we are building a new API that is more performant and allows us to do this. The next thing is support for groups of plain pods. We already support plain pods, but as individual jobs; we are working on pod groups. We are also thinking about extra policies for requeueing: when you send a job somewhere and it doesn't run, because there is a stockout or something like that, users want to be able to reclaim that job and put it back into the queue, or put it in a different availability level. We are working through those ideas. Then, hierarchical cohorts. As I said, we have a two-level hierarchy. Users want to run larger organizations, and they want to be able to run what we call a hero job, which is a job that can basically use the entire cluster if needed. For that, we need extra levels of hierarchy, so that's also in discussion. And to a lesser degree, we're also talking about multi-cluster support, which we plan to implement through admission checks. Workflows, so think of Argo and things like that: we don't want Kueue to implement workflows, but we want workflow orchestrators to work with Kueue. So Argo, or Tekton, or Snakemake, if you're familiar with those; we're working on a way to integrate these tools so that they just work with Kueue. And serving is also something on our minds.
So if you have any thoughts about these topics, feel free to reach out through the issues, or you can find us in the batch working group on the Kubernetes Slack. And yeah, thank you. Thank you for listening. Thank you very much for the great talk. I think we have time to take one question. Does anyone have a question? There's a mic in the middle of the room if you have a question. There's one question from here; since no one is using the mic in the middle of the room, I'm going to bring the mic to you. Thank you. Great talk, thank you very much. I have a question about how this compares with traditional schedulers like Slurm. How do Kueue and Slurm compare in functionality? Yes, there are a few different ways to look at this. First of all, Kueue is Kubernetes native, so it uses all the components, CRDs, and all of these tools that Kubernetes users are familiar with. Now, there is another aspect to it. Again, Kueue is not a scheduler, so it's not directly comparable with Slurm. At the same time, for example, there was a related session earlier today, and the Livermore National Laboratory is also working on, like, Slurm in a box, right? They have this other framework, the Flux framework, which they are packaging in a box, so that each researcher can have their own mini cluster, a cluster within Kubernetes, and run in this closed environment. But then if you have multiple researchers, you want to be able to distribute the quota of the cluster among those researchers. So there are ways to think about this in which Kueue is not a competitor of Slurm, but could be complementary to it. Now, in terms of tooling, yes, Slurm has a long history. We still don't have a CLI, for example, which is something Slurm users are very familiar with, so that's also on our minds to address. Same with a dashboard; these things are still on our minds, but we are working on core features at the moment, so we haven't looked into those yet. I hope that answers your question. Thank you very much.