Good afternoon. Wilfred Spiegelenburg and Peter Bacsko here. We're part of the Apache YuniKorn PMC and designed Apache YuniKorn from scratch. I've been working on Apache YuniKorn since its inception, about three years ago. And Peter?

Hi, my name is Peter Bacsko. I joined Cloudera in 2016, worked on Oozie and Hadoop YARN, and then joined the YuniKorn team in 2021.

So what we're going to talk about today is how we use the scheduling framework to give you batch extensions in the Kubernetes environment. Like I said, we're coming out of the Hadoop YARN area, so batch processing and large-scale data processing, and we're trying to give you similar capabilities within Kubernetes.

What are the things that we look at when we look at scheduling on Kubernetes? We want to do workload queuing: when batches get generated, often more than one at a go, we want to keep them around and start them whenever they can or need to be run. We want to do an all-or-nothing kind of scheduling: gang scheduling. We don't spin up one pod; we want to schedule a set of pods, where one is the driver and the others are the workers. Spark is a good example of that, but there are others, like PyTorch and MPI; there are a number of frameworks that work that way. These are things that are not available out of the box.

The other extra bit that we want to give you is application sorting. Instead of looking at just one pod, a Job, or a DaemonSet, we say: no, there's a mixture of pods, a group of pods, that together form the application, and we want to schedule based on the requests that come from that application. It could be one pod to begin with, scale up to a thousand pods, and then drop down again to a couple of pods, like what we do with data processing: at the point in time that we need it, we scale up the pods, we do what we need to do, and the pods go away. So we've got a really bursty kind of deployment, but we still want to see all these pods being scheduled as part of one thing, not every single pod separately.

These kinds of schedulers, these kinds of facilities, are there when you look at the HPC world and at batch processing. When you come from a Slurm or a YARN kind of setup, these are the same kinds of things you want to do from Apache YuniKorn within Kubernetes. But Kubernetes, from its origins, was always a services-based setup, and we want to do both at the same time. There are a number of schedulers that will give you some of the batch things you want to do, but then they don't do the services, or the other way around: when you use the default scheduler you get the services, but you don't get the batch features that you want. Hence Apache YuniKorn.
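To make the gang scheduling idea concrete before we go on, here is a minimal sketch of a gang-scheduled pod. The annotation keys and the JSON shape follow the YuniKorn documentation as we understand it; the talk itself doesn't show them, so treat the exact names, sizes, and the image as assumptions.

```yaml
# Sketch: gang scheduling via pod annotations. Keys per the YuniKorn docs;
# the image name and resource sizes are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver
  labels:
    applicationId: spark-app-001           # all pods with this ID form one application
  annotations:
    yunikorn.apache.org/task-group-name: "driver"
    yunikorn.apache.org/task-groups: |
      [{
        "name": "driver",
        "minMember": 1,
        "minResource": {"cpu": "1", "memory": "2Gi"}
      }, {
        "name": "workers",
        "minMember": 4,
        "minResource": {"cpu": "1", "memory": "4Gi"}
      }]
spec:
  schedulerName: yunikorn                  # hand the pod to YuniKorn
  containers:
    - name: driver
      image: my-spark-image:latest         # hypothetical image
```

The idea is that room is reserved for all task groups before the first pod starts, which is the all-or-nothing behavior described above.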
We schedule whatever you give us: we put an application concept on top of the existing pod objects, the Deployments, the DaemonSets, whatever you want to run on the Kubernetes layer. And we also allow you to easily run Spark jobs, TensorFlow jobs, MPI jobs, whatever you want, without needing to change the underlying framework to submit your jobs. We don't want you to need a specific TensorFlow setup, or to compile things from scratch, just because you want to do batch processing, and the same goes for Spark. We give you a simple integration based on a minimal amount of code changes, preferably no code changes, in the way you submit an application or a job: you purely use annotations and labels on the pods, DaemonSets, or whatever you submit, and that gives you the batch scheduling features.

On top of that, when you start looking at data processing, there's always the question: where do I run it, and how do I share the resources that I've got in my cluster? When you look at the default scheduler, you've got one scheduling queue. A pod submitted by the first user or by the tenth user, it doesn't matter: they all come into one queue and get scheduled based on a priority kind of setup. Most data processing that you want to do is not looking at one user or five users; you're looking at hundreds of users. You want to share the quotas, you want to share the system nicely. So YuniKorn provides a hierarchical queue structure to place these applications in and to make the scheduling decisions. Instead of having one single queue that contains all the pods for the whole cluster, you can now subdivide and schedule subsets of the pods based on their own rules and their own priorities: FIFO, fair, whatever you want to do.

Within that hierarchical queue system we give you handles to say: these queues can only run X number of pods, or X amount of resources. But we also want you to be able to say: with a guarantee, this queue gets half of the cluster, or it always gets ten of the nodes. So at every single point in that hierarchical queue we can give you guaranteed resources, quotas, or different scheduling policies. Going from one possible policy for the whole cluster, we now give you a set that you can define and configure yourself across the whole cluster.

When we started working on Apache YuniKorn, there was no plugin architecture in Kubernetes. Back then, the only way you could really change things in the scheduler was by completely replacing it. So we wrote our own scheduler, and we implemented all the functionality for binding pods and all that kind of stuff ourselves, because that was the only way we could do things. The next step was the extenders, which gave you HTTP call-outs so you could customize some of the behavior. That didn't perform well.
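Before moving on to the scheduler internals: in practice, the "annotations and labels only" integration described above comes down to something like the following sketch. It assumes the `applicationId` and `queue` labels and the `yunikorn` scheduler name from the YuniKorn documentation; none of this was shown verbatim in the talk.

```yaml
# Sketch: no framework changes, just labels that tell YuniKorn which
# application and queue this pod belongs to. Names are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: etl-worker
  labels:
    applicationId: etl-job-42    # pods sharing this ID are scheduled as one application
    queue: root.dev              # hierarchical queue that the pod is charged to
spec:
  schedulerName: yunikorn        # use YuniKorn instead of the default scheduler
  containers:
    - name: worker
      image: busybox:1.36
      command: ["sh", "-c", "echo processing && sleep 3600"]
```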
Those extenders were eventually pulled out of the Kubernetes code, and the next step they came up with was the plugin architecture. Within the plugin architecture they gave you a number of extra points that you could work with. So YuniKorn went from the standard deployment, like what we're running now, which is a complete custom scheduler, to a plugin architecture. We started off with a simple setup: the YuniKorn core, which makes all the scheduling decisions; a shim that integrates with the API server, pulls the information in, and does all the things with the pods that we need to do; and an admission controller sitting on top of that to do some of the more advanced stuff, making sure that it is as easy as possible for the end user.

Moving to the plugin version, we changed over from doing everything within the core to having a shim that now completely includes the default scheduler. So instead of writing all the code ourselves to bind pods to the node and do the volume binding, we now rely on the default scheduler to do all these things for us. We hook into the scheduling framework, we implement certain points in there, and we augment or replace whatever we want to change in the default scheduler.

This is the picture that has probably shown up before, for people who have been looking at scheduling and the scheduler in Kubernetes. On the right-hand side, the binding cycle is the last point in the whole cycle; that's where we leave things. The scheduling cycle has been there for a long time; that's what does the real checking: which node do we want, how do we place the pod on it, all that kind of stuff. And part of what has just come out in 1.26 and 1.27 is the PreEnqueue plugin side of things. We worked with the scheduling SIG and we said: look, we want to be able to do a little bit more. Instead of having all the pods flow into the scheduling cycle immediately, we want to be able to gate the pods in a PreEnqueue step. That has been delivered as part of 1.26.

So how does YuniKorn use these plugins, and where do we sit? We've kept the YuniKorn core, so our core scheduling code has not changed. Whether we deploy in default mode, the standard mode, or with the plugin framework, the YuniKorn core still makes all the scheduling decisions. The only thing that has changed between the plugin mode and the default mode is the way we interact with Kubernetes, and what we need to do ourselves versus what the default scheduler does for us.

The first part that we implement is the PreEnqueue plugin, because we want to be able to decide which pods the scheduler looks at and which pods get put onto nodes and processed. Without PreEnqueue, a lot of overhead flows through the whole system, because we can't stop the default scheduler from looking at a pod; that's just not built into the scheduling framework. That meant that, without the PreEnqueue hook, pods that YuniKorn thought could not be scheduled yet had already been looked at by the default scheduler and marked as unschedulable, and then the autoscaler kicks in and says: I need a new node, because this is marked as unschedulable.
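As an aside, this gating corresponds to the PreEnqueue extension point in the upstream scheduling framework. Here is a minimal sketch of enabling a plugin there, using the generic KubeSchedulerConfiguration API rather than YuniKorn's actual deployment config; the profile and plugin names are hypothetical.

```yaml
# Sketch: enabling a (hypothetical) PreEnqueue plugin in a scheduler profile.
# Generic Kubernetes 1.26+ configuration, not YuniKorn's real manifest.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: my-batch-scheduler
    plugins:
      preEnqueue:
        enabled:
          - name: MyGatePlugin   # gates pods before they enter the scheduling cycle
```

A pod held back at PreEnqueue never reaches the scheduling cycle, so it is never marked unschedulable and never triggers the autoscaler.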
That is where the PreEnqueue hook came from. The second point that we implement is PreFilter. PreFilter runs over all the nodes and decides which nodes to use and which nodes not to use. That, in combination with the Filter hook, allows us to select the nodes and make sure that the node that gets selected and used is the one that we, from the YuniKorn core, have decided on. The last bit that we implement is PostBind. That is the last point in the cycle, and when PostBind comes back we know that the pod is scheduled, completely accepted by the kubelet, and getting started by the kubelet. So PostBind is a housekeeping point for us, because we know that everything has gone through and that we are at the final stage. The input from the core scheduling cycle goes into the first three plugins: we decide what goes through and which node we put the pod on, and for that we interact with the three plugins, PreEnqueue, PreFilter, and Filter.

With all of this we want to do quotas and all these other things, and Peter will now go further into the quota tracking.

Yeah, so let's talk about quotas. As Wilfred mentioned, we have a hierarchical model: a hierarchy of so-called resource queues. That's what we use to calculate the resources available to the running applications. These queues can be created automatically, but also in configuration files, which right now means YAML. Applications or pods have to be submitted to the leaf queues; you cannot run anything in a parent queue. Then, whenever there's an allocation or a resource request, we do the accounting in the leaf queue: we increase counters, and that propagates up all the way to the root queue. This essentially means that at any point in the hierarchy we always know the resource usage; we know the usage for every subtree. You can put quotas on leaf queues and parent queues, and since we know the usage all the time it's very easy to enforce; we enforce it in every scheduling cycle.

Here is a very simple example. There's a cluster for developers and testers, and the leaf queues are named after the users. There are some running pods: Alice is running two pods and Mallory is running one pod. This is the current resource usage that can be observed. Root sees the total, which is 50 GiB of memory and 5 CPU; dev has 20 GiB of memory and 2 CPU, and qa has 30 GiB and 3 CPU. Now we want to put a resource limit, a quota, on the dev queue: we want to limit resource usage for the developers to 60 GiB of memory and 4 CPU. It's also worth mentioning that for the root queue this is calculated automatically: when YuniKorn starts, the nodes are registered and we just retrieve the capacity, and later on we update it when a node joins or leaves the cluster.

Okay, and then Bob wants to run a pod. But unfortunately for him, this pod cannot run, at least not at the moment; it is pending. Why? Because of the limit. This pod is asking for 4 CPUs. There's nothing wrong with the memory, but the current usage in the dev queue is 2 CPU, the limit is 4 CPU, and that's not enough. So Bob has to wait until the pods that were started by Alice terminate. And it's important to note that Bob's pod is pending.
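For reference, this is roughly what such a quota looks like in a YuniKorn queue configuration file. A minimal sketch following the layout in the YuniKorn documentation; the key names and units are assumptions, and the values just mirror the example above.

```yaml
# Sketch: a max quota on the dev queue (60 GiB memory, 4 CPU), matching the
# Bob/Alice example. Layout per the YuniKorn docs; keys and units illustrative.
partitions:
  - name: default
    queues:
      - name: root               # capacity calculated from the registered nodes
        queues:
          - name: dev
            resources:
              max:
                memory: 60G      # Bob's 4-CPU pod must wait: 2 CPU used + 4 > max
                vcore: 4
          - name: qa             # no explicit quota in this example
```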
The pod is not rejected, and you will see why this is important.

Okay, so: namespace quotas versus YuniKorn quotas. We can also use YuniKorn to manage namespace quotas. Why would we do that? If you use namespace quotas and the quota is exceeded, the pod is rejected, end of story. If you want to run your workload, and probably you want to run it some time later, you need some retry logic, a script, or you retry manually. Also, other users might be competing for resources too, and there is no ordering between the users, so you might lose this race against the others. This is not an ideal scenario.

In YuniKorn you can have a setup where the queues are auto-created based on existing namespaces. Here in this example, sales, finance, dev, and test are existing namespaces, and YuniKorn created the queues when a pod was submitted. In this case the pods are not rejected if there is no more room; when the quota is exceeded, the users just have to wait, and when there is enough resource they will be picked automatically, based on the ordering that is set. You have to set so-called placement rules in order to have these queues created, and you also have to update the namespace objects themselves; there are two annotations that you can use, the quota itself and the parent queue. This arrow says that the development and production queues are an optional grouping; that's not necessary, all the leaves can appear under root directly, so that's not an issue, but sometimes this is desirable.

So this is how it looks when you want to configure this. This is, by the way, from the upstream documentation, where it's explained really well. On the left side there's the YuniKorn config, the YAML. In green you can see the static queues, production and development. This is what I called optional; we sometimes call them configured queues or managed queues. These queues always exist; they don't disappear until you remove them. In yellow there's the placement rules section. If you have used Apache Hadoop YARN before, with the Fair Scheduler or the Capacity Scheduler, it's the very same idea; it's also called placement rules there. You tell YuniKorn how to name the queues that don't exist yet, and that's it. There are all kinds of rules: name after the user, name after the namespace, maybe some labeling. So there are different kinds of rules. And in orange, these are the two annotations that you have to put on the namespace object: the first is the quota, which is a tiny JSON, and the second one is the parent queue, which tells where the namespace queue should be created.

Finally, we also want to put quotas on users and groups. This feature consists of three parts. First, we have to determine who submitted the actual workload, and it turns out that this info is only available inside the admission controller; it's not present on the Pod or Deployment or any other kind of object. At least, that's the only way that we found. You need a mutating webhook. We have a mutating webhook; we extract this info and we modify the pod spec of the workload, and not just for Pods, also for Deployments, Jobs, CronJobs, ReplicationControllers, and so on.
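What gets injected is a small annotation carrying the user and group information. A sketch of what that could look like, assuming the `yunikorn.apache.org/user.info` key from the YuniKorn documentation; the talk only calls it "a tiny JSON", so the exact key, shape, and values are assumptions.

```yaml
# Sketch: user/group info as the admission controller might inject it into
# the pod spec. Key and JSON shape per the YuniKorn docs; values hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: etl-worker
  annotations:
    yunikorn.apache.org/user.info: |
      {"user": "alice", "groups": ["developers", "system:authenticated"]}
spec:
  schedulerName: yunikorn
  containers:
    - name: worker
      image: busybox:1.36
      command: ["sleep", "300"]
```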
This extra annotation is also a tiny JSON, and later on we deny changes to this annotation. You can fine-tune this behavior and change it, but that's the basic idea.

Then there's the tracking itself. It's the accounting: increasing and decreasing counters when there's an allocation or a pod termination, for the different groups and users. We have a nice REST API where this is visible, with separate endpoints for users and groups. If you open the user endpoint you can see the user's total, and then the per-queue statistics, showing where the user is running applications and what the usage is. There will be a demo of that.

And there is the third part, which is not ready: the actual enforcement. This is in progress, and we targeted it for 1.3.0; the latest released version is 1.2.0. There's the JIRA link; you can check it out, that would be great. Yeah, so that's it from me.

I skipped over preemption. One of the main things that we were missing in our current setup is: how would you do preemption in this kind of a setup? If you only want to run one single pod you can preempt, but in this case we've got a full queue hierarchy, so how do we decide what we do with preemption? Like the default scheduling cycle, we do preemption based on priority classes. The only thing is that we've added an opt-out from a YuniKorn perspective. So instead of only saying that something is allowed to preempt during the scheduling cycle, we can now also say: don't preempt me when I'm already running. If you look at jobs that are running for a couple of days, for instance a Spark job or some other job, killing a pod that belongs to it could be really costly, so you want to be able to opt out, and we allow that; a sketch of that opt-out follows below.

We've got the queue configuration that we need for this: max is the quota, and guaranteed is a certain amount of resources, which is the basis for preemption. We've also got some other really nice, complex things that we can do with fencing, so that you can say certain parts of the hierarchy can't preempt other workloads: you don't want production preempted because somebody started something up in a dev or a test environment. I'll skip over all of that; there's a presentation from the HPC and batch day on Tuesday that goes into it really elaborately and shows a full demo. At the bottom of the slide is the link to that presentation.

Preemption within the system happens, again, as part of the normal scheduling cycle, and we use guaranteed resources for that. So, within the setup of the queue system, we allow you to specify at certain points in the hierarchy a guaranteed amount of resources. That gives you things like multi-tenancy fencing, priority offsets, and other nice, fancy things; there's a lot of documentation on the site, and it goes a bit too far to cover them all here.

What we can show you, and we're going to try a live demo here based on a kind cluster, is how we redistribute some of the workload between the different queues, and what we do.
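Here is that opt-out sketch, assuming the `yunikorn.apache.org/allow-preemption` annotation from the YuniKorn documentation and assuming it is set on the PriorityClass; the talk describes the behavior but not the exact mechanism, so treat both the key and its placement as assumptions.

```yaml
# Sketch: a priority class that opts its pods out of being preempted once
# they are running. Annotation key per the YuniKorn docs; treat it and its
# placement here as assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: long-running-no-preempt
  annotations:
    yunikorn.apache.org/allow-preemption: "false"   # "don't preempt me when I'm running"
value: 1000
---
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver
  labels:
    applicationId: spark-app-002
    queue: root.prod
spec:
  schedulerName: yunikorn
  priorityClassName: long-running-no-preempt
  containers:
    - name: driver
      image: my-spark-image:latest                  # hypothetical image
```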
I've set up a small kind cluster to do this, a three-node cluster, and I'm going to submit a simple application. The application is just a normal Job with a couple of extra annotations on it, for the user, what we want to do, and where we want to run it; it's too long to show in full. Let me create the application.

So we create the application and we start up ten pods. Because of the quotas that we've set up, we can't run all of them: we can run eight out of the ten, and the other two pods will just stay there, pending, in the YuniKorn web UI. And now... we could see that kind of information in the web UI, but the web UI is gone. Okay, so much for the live demo.

We've set up the queues and we've got this all running; it runs in a low-priority setup. Now what we're going to do is submit a high-priority job against it, on the same system. It runs in a different queue, and it creates a number of high-priority pods. These pods at first stay pending, because we're not going to directly preempt things: we wait a little bit to see what's going on, and after the wait time is done we go looking for enough resources to allow us to do what we want to do. Thirty seconds is the setup here.

Now, coming back, in the meantime we'll see that we have redistributed some of the load. So here's the YuniKorn web UI. We have created the queues: we've got a root.low for low priority and a root.high for high priority. Within the queues we've got the setup that we had at first, and we have redistributed some of the load from root.low to root.high. When we reload we should get the final state. Over time we have moved things: we've killed some of the root.low pods and moved that capacity into root.high. We've got a guaranteed quota sitting here, and that means we are going to preempt whatever we want and need, but we are never going to preempt more than we are allowed to. We've guaranteed that root.low will always have these pods running, so we don't go below that. And that also means that root.high, which had a higher guarantee, but we've only got eight CPUs, can't get allocated up to the maximum of its guarantee. So we've still got pending pods, but we've redistributed: some of the high-priority pods are running, some of the high-priority pods are still pending, and we've got low-priority pods pending because they're still not being scheduled; they're just sitting there waiting for things to happen.

To hook back into what Peter mentioned around the quotas: we've got the quota tracking going on. The load for the user Peter runs in root.low, and we see exactly what's happening here. The second user is also tracked, in root.high: we get the memory, we get everything. And for the group: we were both assigned the same group, and we see that both applications are tracked under that same group, with root.low and root.high, so the total usage is there for everybody to check.

Now, if I delete the first application, we will see that the leftover pods that were pending for the scheduler get scheduled and become running after a short amount of time.
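For readers following along, here is a hedged reconstruction of roughly what such a demo Job could look like. The actual manifest was not shown in the talk, so the names, queue, and resource sizes are all assumptions.

```yaml
# Sketch: a ten-pod Job charged to a low-priority queue; with roughly eight
# CPUs free, two pods stay pending under the quota. Everything here is a
# reconstruction, not the talk's real manifest.
apiVersion: batch/v1
kind: Job
metadata:
  name: low-priority-batch
spec:
  completions: 10
  parallelism: 10
  template:
    metadata:
      labels:
        applicationId: batch-low-001   # all ten pods count as one application
        queue: root.low                # the low-priority queue from the demo
    spec:
      schedulerName: yunikorn
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox:1.36
          command: ["sleep", "600"]
          resources:
            requests:
              cpu: "1"
              memory: 256Mi
```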
So we pick up where we were, and we schedule whatever we have got as the leftover amount.

That shows what we do within the scheduling framework. We've got full control over what we do with pods: what we assign, what gets scheduled, where we place them. In the kind cluster that's a little bit more difficult to see, but we have full control over what gets placed where, and we are also able to do preemption based on the quotas that you set up. From a batch processing perspective, that gives you full control over your environment. You can say: I want certain users to have priority over other users. You can even say: Peter is not allowed anything in one queue or the other. So you can run a full-blown multi-tenant environment within the system that you set up.

So, this was the cluster that we used: a prepared cluster with quotas and preemption. The applications that got submitted were one for the user Peter and one for the other user, and we saw the redistribution of the quota to the other user. We hopefully have some time left over for Q&A, if there are any further questions. There's a microphone to the side, because I was told that there were a couple of people streaming, so if you could go to the microphone, please.

How do you, say with a driver and executors in Spark, stop the driver from being preempted?

The main thing is the priority class that you set on the driver, and within the priority class you opt out from preemption; that's the first thing. The other thing that we do: when you submit a Spark driver and the other pods, we create an application object, and within the application object we mark the first pod, the one that gets created first, as the originator pod. The originator pod, even if it doesn't have the allow-preemption flag set, always gets put at the back of the queue for preemption. So we try our best not to preempt the driver pod. It's not a guarantee: we always put it at the back of the queue, but if we can't do anything else, then we will still kill the driver pod.

Thank you. And in the demo you showed, were those two applications running with the same priority class?

No, they were running with different priority classes. One was running with low priority, that was the one in the first queue that we set up, and the second one was with high priority. If I had submitted a third application while the first low-priority one was running, and that third application was also low priority, the second application would have been scheduled first. Not only because it's a higher priority, but because we also use FIFO on the system: first in, first out. But you can decide that anywhere in the queue hierarchy. If you say: I want to schedule a certain part of the queue first-in-first-out, but the other part of the queue I want to schedule fairly, or purely based on priority, you can do that. That's all set up and all available within the queue hierarchy. So it's not just one policy; it's a policy per part of the queue that you set up.
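A minimal sketch of that per-queue policy choice, assuming the `application.sort.policy` queue property from the YuniKorn documentation; the queue names are made up for illustration.

```yaml
# Sketch: a different application sort policy per part of the queue tree.
# Property key per the YuniKorn docs; queue names are hypothetical.
partitions:
  - name: default
    queues:
      - name: root
        queues:
          - name: batch
            properties:
              application.sort.policy: fifo   # strict first-in, first-out
          - name: adhoc
            properties:
              application.sort.policy: fair   # fair share between applications
```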
Okay, so one more question. If I had two applications with the same priority class running, and I ran one that had a guaranteed resource of X but filled the cluster, and then I ran my second one, which also had a guaranteed resource, with the same priority class: would resources be preempted from application one and given to application two, such that they both have at least their guarantee?

Yeah. If they're exactly the same priority, yes, we will still do that; we redistribute again over the two applications.

All right, thank you.

I have two questions. One is how this compares with Kueue, which was also being shown; if you can broadly describe the advantages and disadvantages of both. And the second one is about this part: it was not clear to me whether you have some kind of implementation for fair usage, and if it has something fancy like the classical job schedulers do, which can even out how much you have used in the past, with a kind of lifetime.

We don't look at lifetime, so to say, because when you run, for instance, a Spark job or whatever, we've got no idea what comes in. You can ask for gang scheduling, where you ask for a certain amount of resources. We try to do our best and give you all the resources that you ask for, and if we can't, you as the application submitter can say: do I want to proceed with whatever I can get, or do I just want to fail? That's soft or hard gang scheduling. So that's the only thing you've got when you look at that side of things.

Compared to Kueue: I think Kueue is not hierarchical. It has a set of queues, but not the hierarchy and the way that we distribute with guarantees. The other thing is that when Kueue does preemption, it preempts the whole job; it can't do one pod at a time, it can't do redistributions like that. It does the whole job or nothing, so that's not as flexible. And sharing quotas: in a setup like you see here, we can share quota from qa all the way to the other side of the hierarchy. That is not possible with Kueue; you've got a group of queues that you've set up beforehand, so the flexibility is not there as much within the Kueue setup.

Thanks a lot. Any further questions?

When you leave the pod pending, is the application able to determine that the pod is pending because there's not enough resource available and it's effectively queued, as opposed to pending because some secret it needs isn't there, or some other reason why it might be pending?

Yeah, so we use the event system, and we put events on the pods to show you that the pod is waiting for resources to become available. Yeah, we do that, for sure.

Thank you. That was it, then.