Good evening, everyone. My name is Sunil Govindan. I am a YuniKorn PMC member, and I lead the data engineering and scheduling team at Cloudera. My co-speaker Craig, who could not join us here today, is also a PMC member on the YuniKorn project and the lead developer for YuniKorn at Cloudera. In today's session we will be talking about SLA-aware batch scheduling in YuniKorn with multi-tenant preemption.

Here is a brief agenda. Let us first look at some of the core capabilities of the YuniKorn scheduler. YuniKorn can schedule both batch workloads and long-running services, and it can run either in plugin mode or as a standalone scheduler. YuniKorn is also capable of making fast scheduling decisions for batch workloads where the resource demand is very high. It is multi-tenant aware and supports hierarchical quotas modeled on business requirements. YuniKorn can be deployed both on-prem and in the cloud.

One of YuniKorn's core differentiating features is workload queueing. There can be many queues, and a resource request can be placed into any one of these queues based on the use case. YuniKorn schedules these resource requests based on resource availability and on scheduling policies such as fairness, FIFO, or gang scheduling. YuniKorn also adheres to the quota limits set on each of the queues, and the resulting pods get scheduled across the various nodes in the cluster. YuniKorn has been open source since 2020 and is a top-level project at Apache; we are at version 1.2.0 at this point in time.

Now let us look at batch workloads themselves. Preemption is a mission-critical feature for most batch workloads, and we will dig into why. Consider static queues, where the guaranteed resources and the max limits are the same. This definitely helps to guarantee that in a multi-tenant cluster deployment the resources are scheduled within the quota boundaries. However, there is a caveat: it causes underutilization in most of the queues, because most of the time the entire queue capacity will not be used when jobs run for a long time. Such issues can be resolved by making the queue elastic, meaning you configure a guaranteed resource and, separately, a max limit. But this model has a caveat too: when a legitimate application wants its resources back, it has to wait for other applications to finish their execution. This usually causes delays, because we do not know how long the other applications may take, and in such scenarios it turns out to be much more trouble. Preemption is very handy in such scenarios: by enabling preemption we can quickly restore the balance across multiple queues. It needs to be application-aware as well, and we will discuss that in the upcoming slides.

Let me give a detailed example of the previous scenario. An application, app 1, is submitted to queue 1, and it has taken the entire cluster capacity, that is 100%, because there were no other applications running in queue 2 at that point in time. Now, when we submit another application, app 2, to queue 2 around time T1, you can see that there was not enough capacity available; some twenty-plus Spark executors were running as part of app 1. Some of those executors completed around T1, and app 2 was able to grab those resources.
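Roughly, such a static/elastic queue pair could be declared like this in YuniKorn's queues.yaml; the queue names and figures are illustrative, and the resource units should follow whatever notation your YuniKorn version accepts:

```yaml
partitions:
  - name: default
    queues:
      - name: root
        queues:
          # Static queue: guaranteed == max, so no elasticity and no borrowing.
          - name: queue1
            resources:
              guaranteed: {memory: 100G, vcore: 100}
              max:        {memory: 100G, vcore: 100}
          # Elastic queue: can borrow idle capacity up to max,
          # but is only assured its guaranteed share.
          - name: queue2
            resources:
              guaranteed: {memory: 100G, vcore: 100}
              max:        {memory: 200G, vcore: 200}
```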
Around the T2 time frame app 2 got some more resources, because app 1 was almost finished with its execution, and around T3 it got the full resources because app 1 completed. As you can see, it took a good amount of time for app 2, running in queue 2, to get its desired capacity. If we enable preemption, we can cut that delay by a lot: at time T1 itself we can give the full guaranteed capacity to the applications running in queue 2.

With that, I would like to move on to the next segment, where we discuss where things stand in Kubernetes with respect to these multi-tenant scenarios. In Kubernetes, all pods in the entire cluster are sorted based on priority. That means it is one big queue, and pods are considered for eviction in that order. Opting out based on the context of the application is not really possible. Take Spark as an example: the Spark driver pod is very critical, and if you kill the driver pod, the entire job fails. In today's Kubernetes that distinction cannot be made, and it can be very costly. Imagine a job that has been running for eight or nine hours; if the driver is lost, you lose the whole job and end up with a huge penalty.

Coming to PriorityClass, which is used for defining priorities across jobs or pods in Kubernetes: these are cluster-wide objects, and there are no limits you can set on the priority a pod requests. Because of this, any rogue user could set a very large priority on a pod and disrupt the entire system. There are a few configuration options from Kubernetes 1.24 onwards, such as non-preempting priority classes; however, in a public cloud setup they are difficult to configure because they are mostly admin- and API-server-level settings.

Now let us look at preemption in YuniKorn itself, at a very high level. To understand preemption in YuniKorn, we first need to be familiar with how YuniKorn handles things. YuniKorn works with hierarchical queues, as I mentioned earlier. Resource limits, configurations, and ACL controls can be set on queues at any level, and they are inherited by the child queues. Preemption makes use of the fact that guaranteed and max limits can be applied to queues at any level. As you saw in the earlier example with queue 1 and queue 2, we can find an over-utilized queue at any given point in time and then perform the preemption necessary to ensure that the under-committed queue, queue 2 in that example, gets its desired guaranteed quota.

For YuniKorn, large batch workloads are a major focus: to run engines like Spark or Flink, we need to ensure that YuniKorn handles that volume of jobs, and that also means we cannot lose jobs. So we look at each application in detail and make sure that the originator pod, say the Spark driver pod, is always kept safe so that it does not cause that kind of disruption. We also give pods an option to opt out of preemption. This opt-out option can be easily configured in the pod spec itself, and that pod will then not be preempted. But it is just a hint to the scheduler saying, in effect, "please don't preempt me."
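As a concrete example of the Kubernetes 1.24-era options mentioned above, a PriorityClass can be declared as non-preempting, so its pods get scheduling-order priority without evicting anything; the name and value here are illustrative:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-high-non-preempting   # illustrative name
value: 100000                       # high scheduling priority
preemptionPolicy: Never             # wait in the queue instead of evicting running pods
globalDefault: false
description: "High scheduling priority that never preempts running pods."
```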
However, in scenarios where mission-critical pods need to be scheduled on that node, there is a chance that even these opted-out pods could be preempted as a last resort.

To explain YuniKorn's preemption algorithm, we thought it would be helpful to state it as a series of rules, or laws. That makes it much simpler to explain how the algorithm selects the candidates, or victims.

Rule one: preemption policies are not guarantees. This ensures that in a multi-tenant environment, pods that declare "I want to opt out" cannot monopolize the entire cluster; we could otherwise get into exactly that kind of trouble. We will still preempt such pods, but only as a last resort.

The second law says that preemption cannot leave a queue lower than its guaranteed capacity. This exists so that a queue's guaranteed quota is always honored; we never want preemption to push a queue below its guarantee.

The third one is again simple: a task cannot preempt other tasks in the same application. When I say application, take Spark as an example. You may have thirty or fifty-plus pods running together as executors. Say 50% of them are allocated and you have hit the queue quota; the remaining executors should not preempt the first 50%, because then we would just be going in a loop.

The next one: a task cannot trigger preemption unless its queue is under its guaranteed capacity. Again, the explanation is simple. If there are enough resources in the cluster, or elastic quota is still available in the parent queue, we should go and grab those before even triggering preemption. We should make sure the queue is genuinely starved first.

The next law is that a task cannot be preempted unless its queue is over its guaranteed capacity. All the queues that are behaving nicely in the cluster are left as they are; only when a queue is above its guaranteed capacity and there is a critical demand will we preempt from it.

Rule six: a task can only preempt a task of lower or equal priority. This is very similar to today's Kubernetes preemption model, where a task is considered for preemption only if it has lower priority.

And the final one: a task cannot preempt tasks outside its preemption fence. In YuniKorn's preemption implementation we introduced a concept called fencing: a queue can protect itself by creating a fence boundary, which we will explain shortly.

So this is the core design, which I have mostly covered already: YuniKorn adheres to the queue hierarchy, and preemption makes decisions to bring queues back to their guaranteed resources. Preemption is also application-aware, so we try to save the originator pod as much as possible, and we provide the option to opt out of preemption.

Now let's look at a workflow. This is a simple queue structure: under the root queue you have marketing, finance, and system queues. Under marketing you have sales and sales-ops, and under finance you have payroll and reporting.
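A sketch of how that fenced hierarchy might look in queues.yaml follows; quotas are omitted for brevity, and the preemption.policy property shown here is the fencing mechanism described in this talk, so treat the exact keys as indicative rather than authoritative:

```yaml
partitions:
  - name: default
    queues:
      - name: root
        queues:
          - name: system                  # unfenced: may preempt in any queue
          - name: marketing
            properties:
              preemption.policy: fence    # requests inside cannot preempt outside
            queues:
              - name: sales
                properties:
                  preemption.policy: fence
              - name: sales-ops           # unfenced within the marketing fence
          - name: finance
            properties:
              preemption.policy: fence
            queues:
              - name: payroll             # unfenced within the finance fence
              - name: reporting
                properties:
                  preemption.policy: fence
```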
As you can see, we fenced these two trees, marketing and finance, so that if there is demand on the sales or sales-ops queue, we will not preempt anything from the payroll or reporting queue, and vice versa. Within each tree we also put an inner boundary, around sales and around reporting, so we have done a couple of levels of fencing here.

Now let's look at an example where a 20 GB demand arrives at the sales-ops queue. You can see that there are no available resources in the marketing queue; it is already at its capacity. But sales is above its guaranteed quota, so preemption will occur here: 20 GB is taken from the sales queue, and sales-ops can immediately schedule those 20 GB. I am showing memory as the resource here, but it could just as well be CPU or GPU; for simplicity's sake I will stick with memory.

Next, a new 50 GB request arrives at the payroll queue. It can be scheduled directly; no preemption is needed because available capacity exists under finance. After scheduling that 50 GB, the root is at 100%. Now we try to submit some more jobs to the reporting queue, but reporting is fenced. Since it is fenced, it cannot grab resources from payroll. That is really the core concept here: you are protecting the queues you care about, so in this case there will be no preemption at all.

Finally, we submit 10 more GB to the system queue. The system queue does not have any fencing around it, so it can go and preempt resources from any of these queues. You can see that the finance queue is above its guaranteed capacity, so the preemption will hit payroll: 10 GB is preempted from the payroll queue, and that 10 GB goes to the system queue.

Okay, now on to the configuration support. It is very simple to enable preemption with YuniKorn. Pods can opt out of preemption, and the default Kubernetes priority class mechanism can be used as-is; no additional configuration machinery is needed. You just add an annotation, allow-preemption, and set it to false; that marks the pod as opting out of preemption. However, it is just an indication: the pod can still be preempted if a critical workload arrives, but we will treat that as the last resort.

Now let's look at some other configuration features. The first is the queue properties. We have multiple settings that can be configured alongside the existing queue configuration; these are simple properties with which you can enable and shape preemption in YuniKorn. First, there are the guaranteed and maximum resources for workloads in a queue; the guaranteed amount must be lower than the max limit for preemption to have any room to work. Coming to fencing, a queue-specific preemption policy can be configured with two options: you can fence the queue, or you can disable preemption for that queue altogether. When you create a fence around a queue and its sub-queues, preemption requests originating within the fence boundary are locked within that segment; they will not go outside and preempt something else.
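On the pod side, the opt-out is a single annotation. A minimal sketch, assuming the fully qualified key is yunikorn.apache.org/allow-preemption (verify the key against your YuniKorn version):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver-example        # illustrative
  annotations:
    # Hint to the scheduler: please do not preempt this pod.
    # It can still be preempted as a last resort for critical workloads.
    yunikorn.apache.org/allow-preemption: "false"
spec:
  schedulerName: yunikorn
  containers:
    - name: main
      image: alpine:3.18
      command: ["sleep", "3600"]
```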
We also added a preemption delay, which controls how soon preemption will be attempted. In an SLA-driven environment, this delay defines how quickly or how slowly you need to preempt. We have a few other configuration features as well, such as priority-based fencing. This helps when exposing priorities across a fence: within the fence you can use any priority you like, but queues outside the fence see only the fenced value, so internal priorities cannot leak out. That is a detailed topic which we will cover another time.

With that, I would like to go to a quick demo. Craig will be taking over this part.

I will explain the basics of YuniKorn preemption, show a sample configuration, and demonstrate how YuniKorn applies this functionality in practice. For this demo we will use the queue setup depicted in this diagram. YuniKorn schedules workloads using a set of queues arranged in a hierarchy, from the root queue on down. Queues shown in purple on the diagram are parent queues and are used for organizational structure; queues shown in green are leaf queues and are where workloads are submitted and run. YuniKorn supports setting a preemption fence boundary at any queue level, which limits the scope of preemption to that queue and its child queues. These fences are one-way: requests outside a fence boundary may preempt tasks within another fence boundary, but not the reverse. In the setup shown here there is a system queue which exists outside of any fence; this queue is permitted to preempt workloads in any other queue. Under the org parent we have two queues, sales and marketing. Each is fenced, which means workloads running in sales can never preempt those in marketing, and vice versa. Finally, both sales and marketing have leaf queues for dev and prod workflows. The dev queues are fenced but the prod queues are not; this allows prod workflows to preempt dev workflows within each organization, but not the reverse.

Shown here is the YuniKorn queue configuration for our demo. At the top level we have a root queue containing child queues system and org. System is unlimited, but org has a maximum capacity of 400 megabytes of RAM and 400 millicores of CPU. The org queue has two child queues, sales and marketing, each with a 200 megabyte maximum and 200 megabytes guaranteed. Finally, each of sales and marketing has two child queues, dev and prod; both have a maximum capacity of 200 megabytes each, but only 100 megabytes guaranteed. We also define a preemption policy of fence at the sales, marketing, and dev queues. The prod queues are unfenced, which means they are allowed to preempt from their siblings.

Shown here is an example pod for this test (a sketch appears below). For YuniKorn to schedule it successfully, three attributes need to be present. First, the scheduler name needs to be set to yunikorn; this can either be done explicitly, as is done here, or automatically by the YuniKorn admission controller. Second, YuniKorn requires an application ID to organize groups of pods together. Finally, a queue must be specified, either directly, as in this case, or via placement rules within the queue configuration. For the purposes of this test all of the example pods will be of this form; the only differences will be the application ID, queue, and name. The resource requests will be constant at 100 millicores of CPU and 100 megabytes of memory.

To begin, let's launch four system pods, which will remain running for the duration of this demo.
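Since the slide itself is not reproduced here, a sketch of that example pod follows; the name, application ID, queue path, and image are illustrative, while the applicationId and queue labels are the conventional YuniKorn way to attach this metadata:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: system-pod-1                 # only the name, applicationId and queue vary per pod
  labels:
    applicationId: system-app-1      # groups related pods into one application
    queue: root.system               # target queue; could also come from placement rules
spec:
  schedulerName: yunikorn            # hand scheduling over to YuniKorn explicitly
  containers:
    - name: sleep
      image: alpine:3.18
      command: ["sleep", "infinity"]
      resources:
        requests:
          cpu: 100m                  # constant across all demo pods
          memory: 100M
```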
This demonstrates that pods in the system queue will never be chosen as candidates for preemption. We can see from the YuniKorn UI that four applications and four containers have been launched. Additionally, if we navigate to the root.system queue, we can see that the allocated memory in this queue is 400 megabytes.

Next, let's launch a set of pods in the sales dev queue. Although we have launched three pods, there is only room in the queue for two of them to run, so one will remain pending. We can see from the UI that seven applications, but only six containers, have been launched. Navigating to the sales dev queue shows that the allocated memory in this queue is 200 megabytes, which exceeds the guaranteed capacity of 100 megabytes. This means that tasks running in this queue may be considered for preemption by other processes.

Now let's launch a similar set of pods in the marketing dev queue. Again, although we have launched three pods, only two can fit in the queue, so the third will remain pending. However, no preemptions occur: the queue is already over its guaranteed limit, and the sales and marketing queues are fenced from each other. Navigating to the marketing dev queue again shows the allocated memory in the queue is 200 megabytes, exceeding the guaranteed amount of 100 megabytes.

Now that we have added some running pods to the cluster, let's try to schedule a pod in the sales prod queue. Because the sales queue is already at its maximum utilization of 200 megabytes, the pod cannot be scheduled directly. However, because the prod queue is below its guaranteed resource amount of 100 megabytes, preemption can occur. By default, YuniKorn waits for a newly created task to have been waiting for at least 30 seconds before attempting preemption. Once this time elapses, YuniKorn will attempt to find a suitable preemption candidate. Since we have a fence around sales, only the sales dev queue is eligible for preemption, and preempting a single task from the dev queue will satisfy the requirements. Either the first or second pod scheduled in the dev queue would work; however, YuniKorn prefers to preempt tasks which have been running for a shorter period of time. As we can see, the second sales dev pod is terminating, and the prod pod starts in its place.

When we attempt to launch a second pod in the sales prod queue, we see that it remains in the pending state because the parent sales queue is already at its maximum capacity. Additionally, since the first pod in the prod queue was already scheduled, the prod queue is now at its guaranteed resource capacity, and so no further preemptions will be attempted.

Finally, we can see that our original four system pods have remained running throughout the demo, illustrating that they were protected from preemption. Our sales and marketing organizations had no impact on one another, and our dev queues were not able to monopolize resources guaranteed for our prod queues. I hope you can see from this demo that YuniKorn provides a simple yet powerful framework for managing diverse workloads across a multi-tenant environment.
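The 30-second wait in the demo is the preemption delay discussed earlier; in an SLA-driven setup it is the main knob for how quickly rebalancing kicks in. A sketch, assuming a preemption.delay queue property that takes a Go-style duration:

```yaml
queues:
  - name: prod
    properties:
      preemption.delay: 10s   # attempt preemption after 10s instead of the default 30s
    resources:
      guaranteed: {memory: 100M, vcore: 100m}
      max:        {memory: 200M, vcore: 200m}
```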
We look forward to announcing this feature as part of the upcoming YuniKorn 1.3 release. A quick acknowledgement: Craig and Wilfred from the YuniKorn community completed the entire design and development of this feature, and it will be available as part of the YuniKorn 1.3 release, which is due very shortly. Please join us and share your feedback through the mailing list and the Slack channel; suggestions and thoughts are always welcome. So thanks again, and we'll now take some questions and answers.