Good afternoon, I'm Wilfred Spiegelenburg. Apache YuniKorn is a scheduler plugin for batch workloads. I was supposed to be here with my colleague Craig, but he's stuck in the US with some personal things he needed to take care of, so it's just me here today. As we've just announced, Apache YuniKorn has graduated to a top-level project at Apache; that came out yesterday, so that's a big thing to hear.

Let's look at what we're going to do today. We'll look at why we built Apache YuniKorn: why we designed it, why we're running it, and why we want this for batch workloads. Then we'll dive into a bit of the architecture. We've been around for about three and a half to four years, which was before the plugin framework existed, so we've come from an old architecture into the new plugin framework architecture. Then we'll have a short demo, and I hope I've got some time left over at the end for questions; if not, we'll sort things out along the way. The plugin framework support has only just been released in 1.0, so it's in tech preview at the moment, and we'll dive into that a little bit more.

This must start sounding familiar by now. All the talks we've had today say that big data and batch workloads need to be treated differently; they don't work like services on Kubernetes. From the perspective we came from, the YARN and Mesos world, we look at the pods of an application that gets scheduled as a number of different pieces that belong together. It's a logical grouping; it's not specific to any one framework.

Not all workloads are equal. Sometimes you want first-in, first-out: with financial data, you first want to process yesterday's data before you do today's. But for interactive workloads you want fair sharing: you want to make sure that everybody gets their resources according to what they've submitted.

The other thing is that for batch workloads you want queuing; it's been mentioned by everybody. You don't want to need something external to resubmit a task if it doesn't fit within the quota. You want to hold on to it, schedule it later, and work through all of that. You also want workload queuing because your demand is not constant. The standard batch jobs often run during the nights and evenings, while the interactive workloads run more during the day. You've got completely different demands on your cluster over time, so you want to be able to postpone certain workloads, hold on to them, and kick them off whenever you want to.

So why YuniKorn for batch? We looked at what the default scheduler did, what the demand from our customers was, and what we were doing on YARN and Mesos. The default scheduler has no application concept. You can't say: group all these things together and run them as one unit, schedule them, place their resources, do whatever you need to do. The scheduling algorithm is also limited. There's one queue; we sort that queue based on the pods that are in it, and once those pods are sorted, we run them. But that one queue serves the whole cluster. I can have multiple types of workloads running at the same time, and I'd really like to have multiple sorting algorithms: for instance, one namespace running first-in, first-out while another does fair sharing or priority-based queuing.

Then there are limits and quotas. Resource quotas are not part of the scheduler on standard Kubernetes; they're enforced before you submit things. So somebody comes along and says, I want to run this batch job. Oh, there are no resources available. Now I need to come back and retry submitting that same job, again and again, until the resources are available. That's a problem. Because the quotas are hard enforced, they don't allow me to do any queuing.
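To make that concrete, here is what a plain Kubernetes ResourceQuota looks like; the namespace and values are just illustrative. It is enforced at admission time, not by the scheduler, so a pod that exceeds the remaining quota is rejected outright instead of being held and scheduled later:

```yaml
# Standard Kubernetes ResourceQuota; namespace and values are hypothetical.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: batch-quota
  namespace: batch-jobs
spec:
  hard:
    requests.cpu: "50"      # once this is used up, new pods are rejected
    requests.memory: 50Gi   # at admission, not queued for later
```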
So what do we do in Apache YuniKorn? We take a different approach. We schedule based on applications: whatever you decide is an application, we'll group together and schedule as one application. That can be anything from one pod to thousands of pods, pods that are created dynamically or pods that are submitted all in one go. You decide what belongs together.
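As a rough sketch of what that grouping looks like on a pod: pods carrying the same application ID are treated as one application. The label names follow the YuniKorn conventions, but check the docs for your version; the pod name, queue, and image below are purely illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-job-driver            # hypothetical name
  labels:
    applicationId: batch-job-0001   # pods sharing this ID form one application
    queue: root.tenant.limited      # the queue to schedule the application in
spec:
  schedulerName: yunikorn
  containers:
    - name: main
      image: busybox                # placeholder image
      command: ["sleep", "300"]
```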
We create a hierarchical queue system: a root queue, queues underneath it, and so on, and we enforce the quotas in that hierarchy. Later on we'll look a bit more at how we do that and what real extras we get out of it. Within that hierarchical queue system we can do flexible quota distribution. I saw that come back in one of the other presentations too, as quota sharing; I think it was in the Volcano talk: one group is using only half of its quota, and another group within the same team wants to use more, so we can share that quota. Flexible quota sharing and flexible quota distribution can both be set up in the hierarchical queue system.

We've also got configurable sorting policies. Instead of having one pod sorting policy that runs out of one queue, we can sort at different levels. We can sort the queues: which queue needs to go first, which one later, and how the quotas are distributed between them. Then we can also sort applications: priority, first-in first-out, fair sharing, and all of these can be set up per queue. So I can have queues doing fair sharing next to queues doing FIFO, again based on the applications.

And last but not least, we've got node sorting. In the area we come from, we've got customers that want to run either in the cloud or on premise, and on premise you've got a static cluster: you don't scale up, you don't scale down, at least most people don't. That means you want to be able to sort your nodes differently. In the cloud, you want to pack everything together to get your costs under control: you put all the pods you've got onto one node, you do bin packing. In an on-premise cluster, you want to spread everything out as much as possible so you can use as much of the CPU as you've got available and do bursting and all that kind of stuff.
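As an illustration, this is roughly what that choice looks like at the partition level of the YuniKorn configuration; a minimal sketch, and the exact keys may differ between versions:

```yaml
partitions:
  - name: default
    nodesortpolicy:
      type: binpacking   # pack pods tightly onto few nodes, typical for public cloud
      # type: fair       # spread pods across all nodes, typical for on-premise
```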
There are also a couple of other, more advanced scheduling requirements that come up. Apache Spark has been used as the example a number of times already, so let's look at advanced scheduling: we've got gang scheduling. You want to create an application, and you don't know beforehand exactly how much it's going to use, but you want to reserve a certain set of resources for it. So you specify on the Spark application, or on any application, what gang resources you want. You can have multiple gang definitions, so multiple pod prototypes, multiple combinations of things. And we only start scheduling the real workload once those resources are available.

The other thing that comes up with gang scheduling, and somebody mentioned it earlier, SLA scheduling I think it was called, is soft versus hard gang scheduling. If you say: after five, ten, or fifteen minutes I don't care whether I have all my gang resources available, I want to start the job anyway, then we say, okay, let's go and schedule, and we'll figure out where the rest of the resources come from. In other cases you say: no, I really need all of these resources, and if I don't get them, stop the application, just fail it and move on. All those combinations are possible and you can set all of that up.
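Here is a sketch of what those gang hints look like on a pod, using the YuniKorn task-group annotations; the names, sizes, and timeout are illustrative, and the exact annotation set may vary by version:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver-0                       # hypothetical name
  labels:
    applicationId: spark-app-0001
  annotations:
    yunikorn.apache.org/task-group-name: spark-executors
    # One or more gang definitions; scheduling of the real pods starts
    # once placeholders for the gang have been placed.
    yunikorn.apache.org/task-groups: |-
      [{"name": "spark-executors", "minMember": 10,
        "minResource": {"cpu": "1", "memory": "2Gi"}}]
    # Hard style: fail the application if the gang cannot be placed in time.
    yunikorn.apache.org/schedulingPolicyParameters: "placeholderTimeoutInSeconds=600 gangSchedulingStyle=Hard"
spec:
  schedulerName: yunikorn
  containers:
    - name: driver
      image: busybox                         # placeholder image
      command: ["sleep", "600"]
```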
Now look at application sorting again: we sort per queue. Multiple applications are in there; there could be streaming applications, there could be a Spark application. A Spark application could also be a SQL query that somebody wants to run with a direct answer. So depending on where we run and what we do, we can use different sorting there. And the third point was node sorting: bin packing, fair sharing, whatever you want to do. The top one, bin packing, is primarily for the public clouds: AWS, Google, wherever you want to run. The bottom one, spreading, is for when you run your private cloud: you share the nodes and you let your containers burst. Application sorting, gang scheduling, and all that kind of stuff can be set up per queue; you can do whatever you want there. But there's only one node sorting policy allowed for the cluster, because there's no way to sort nodes differently per queue when all the queues share the same nodes.

In the previous slides I stepped really quickly over what's possible in the hierarchical queue system, so let's dive into that a little bit more. I'm going to take a really simple example; you can make it as complex as you like. Again, coming from the YARN world, we've had customers running with hundreds of queues, five or six levels deep, with all kinds of different combinations and setups. All of that is also possible within YuniKorn, but it's not something I want to go into too deeply here. So we take a simple setup: a root queue with one tenant queue underneath it, and we schedule from there.

At the Kubernetes level, we create three namespaces. We've got an "unlimited" and a "limited" namespace; those two will be scheduled completely according to YuniKorn's logic, with whatever we want to do around application sorting and all that kind of stuff. Beside those sits a non-YuniKorn namespace. That one is not going to be scheduled by YuniKorn, because we're running as a plugin, as an extension of the default scheduler. So we say: the unlimited and limited namespaces follow YuniKorn and whatever you've set up there; for the non-YuniKorn namespace, we don't care, we don't touch it. It just goes through the logic of the default scheduler with whatever you've configured there.

In all these namespaces we run a number of pods. The pods in the non-YuniKorn namespace are just separate pods and get handled as such. The pods under the YuniKorn namespaces are handled as logical groupings, as applications.

The root queue has a dynamic resource quota: it's the total size of the cluster. Whenever you register a new node it gets picked up and the root quota is adjusted; the quota grows and shrinks with what's available in the cluster. That's important, because we need that quota a little later for the next step. The non-YuniKorn namespace we've created doesn't have a quota within YuniKorn, but it still uses nodes and resources in the cluster. So whatever is used in the non-YuniKorn namespace gets deducted from the root quota that YuniKorn manages. We don't use the quota that is set on the non-YuniKorn namespace; we use the real usage that is there. We track the pods and deduct their usage. Again, that makes the quota at the root level dynamically adjusted.

Within YuniKorn we can then set different quotas on different queues. In this case, as an example, we've set a resource limit on the tenant queue of 75 gigabytes and 75 CPUs; whatever you want to set is fine. We've also limited the "limited" queue to 50 gigabytes and 50 CPUs. Because "limited" corresponds to a namespace, the pods running in that namespace will be limited by YuniKorn to those 50 gigabytes. "Unlimited" does not have a quota set directly on it, but its parent is the tenant queue. So when we try to schedule things in the unlimited queue, we don't only check whether this queue has a limit; we also look up to the parent, and the parent of the parent, to make sure we keep within the quotas that are there. We do a recursive check. So effectively the unlimited queue has a 75 gigabyte, 75 CPU limit.

However, checking the parents and the interaction between the levels is a bit more complex, because the usage within the tenant queue is really the sum of all the queues below it. Whatever resources are used in unlimited and limited are combined into the resource usage of the tenant queue. So if I use 50 gigabytes in limited, then effectively unlimited can only use another 25 gigabytes. How we share that 75 gigabyte tenant limit over the two queues depends on the sorting algorithm I've set between them. If I do fair sharing, the quota gets distributed nicely over the two queues based on what each is using. If there's no load running in unlimited, limited can pick up its maximum of 50 gigabytes and gets capped at that point. So looking at all the different ways of configuring things: we've got different queue sorting algorithms we can set up, and we can sort the applications within the unlimited queue as FIFO while we do fair sharing in the limited queue. The combinations are endless; you can set up whatever you want.
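Pulled together, the hierarchy from this example might look roughly like this in a YuniKorn queue configuration; this is a sketch only, and the exact keys and resource units depend on the version you run:

```yaml
partitions:
  - name: default
    queues:
      - name: root                  # quota tracked dynamically from the cluster size
        queues:
          - name: tenant
            resources:
              max: {memory: 75Gi, vcore: 75}
            queues:
              - name: unlimited     # no max of its own; the recursive check caps it at tenant
                properties:
                  application.sort.policy: fifo
              - name: limited
                resources:
                  max: {memory: 50Gi, vcore: 50}
                properties:
                  application.sort.policy: fair
```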
Then let's have a bit of a look at the architecture. What did we do, and how did we get to where we are? When we started, there were no usable extension points on the default scheduler; we couldn't do anything there. There was no plugin framework, and the extenders were probably still a little bit in development, but that path was abandoned halfway through. So we really didn't have any options, and we built a simple architecture of our own, based on a shim and a core scheduler.

Because we came out of the YARN world, we thought: there are multiple resource managers, Mesos, YARN, Kubernetes, all these kinds of things, and we want to be able to run the scheduler on top of whatever resource manager is there. So we built a core that does all the scheduling work: it handles the queues, it handles the quota checks, it enforces all of that. And we've got a shim that runs below it and hides all the resource-manager-specific things. The YuniKorn core doesn't really understand what a pod is; it doesn't know what that is. It just knows: I've got a resource request, I've got an allocation that I need to schedule. So we can remove the Kubernetes shim and put a YARN shim or a Mesos shim in its place that does the same thing. The Kubernetes shim talks to the API server and converts what the API server gives it, pods, into what we understand in the core: applications and all that kind of stuff. That was the design we had when we started about three and a half years ago, and that's what we implemented.

Over the time we were running, the plugin framework became mature and we started using it. So how did we change our design, from implementing the whole scheduler ourselves, doing everything the default scheduler did, to something more in line with the plugin framework? We pulled the shim apart. Instead of our own code for handling the binding of the pods and all that kind of stuff, we integrated with the plugin framework and let the default scheduler handle those things for us. The YuniKorn core hasn't changed; it's exactly the same as it was before. We just replaced a part of our shim with callouts and the plugin integration. In the current version we provide you with both options: you can run the old model or the new model. Both work, and both are generated from the same code; it's just a different docker image that we provide for running it.

Now the plugin architecture. For people who have looked at the scheduler and at extending the scheduler before, this is a familiar picture: the scheduling framework with its extension points. So what have we implemented? We picked up the PreFilter. In the PreFilter, we pick out the pods that YuniKorn needs to handle: everything that belongs to the namespaces, or similar groupings, that we want YuniKorn to manage, we pick out and handle ourselves. Anything else we just let go; we don't bother with it, we let the PreFilter say: yep, it's all fine, go and do your thing. So in the PreFilter we filter out what we want to handle, and we release what has passed. Anything that passes through the PreFilter and needs to be handled by YuniKorn has passed YuniKorn's internal quota checks. We know we're in the right state, we know it's an application, and the application should be running. It runs within its queue, because we can not only limit things on resources, we can also say: you can only run ten applications in that queue. So we know the application is running, that we need to assign pods to it, and that it fits within the quota.

After the PreFilter has passed, the default scheduler starts looking for a node to assign to the pod. YuniKorn, the core scheduler, does the same: we also pick a node. In the Filter, those two things come together. The default scheduler says, I've got a node to run this pod on, and we keep rejecting nodes up until we see the node that YuniKorn has decided this pod needs to run on. Then we say: yep, okay, and we release the pod. At that point we let the default scheduler take over; it does all the rest of the work, all the way through to the PostBind. Internally, within the shim, we have our own accounting, our own tracking, our own filtering and logging. Between the Filter and the PostBind we do nothing, and the PostBind updates our tracking.
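To show how those extension points hang together, here is a minimal sketch of a scheduler configuration enabling a plugin at exactly those three points. YuniKorn ships this wiring prebuilt in its plugin docker image, so the plugin name here is made up for illustration, and the API version depends on your Kubernetes release:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: yunikorn
    plugins:
      preFilter:
        enabled: [{name: YuniKornPlugin}]   # gate pods on application, queue, and quota state
      filter:
        enabled: [{name: YuniKornPlugin}]   # accept only the node the core has chosen
      postBind:
        enabled: [{name: YuniKornPlugin}]   # update the shim's internal accounting
```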
For this, we've got a short demo that shows you the two sets of pods: a set of pods that is tracked by YuniKorn and a set of pods that isn't, with a quick, two-minute overview of how that looks from the front end. I hope the sound works.

[Recorded demo narration] "And now for a short demo of YuniKorn's new plugin interface. We first create two namespaces, each having a quota of one CPU and one gigabyte of RAM. However, YuniKorn has been configured so that namespace two will use the internal default scheduling logic rather than the queuing logic that YuniKorn normally uses. Now we create three sleep pods in the first namespace. There's only room for two of these to run given our queue settings. So when we list the pods, we can see that in fact only two of the pods have been approved for scheduling and the third remains in a Pending status. We can also see that the scheduler name has been set to yunikorn and that a default generated application ID has been assigned to these pods. We can also check the web interface and see that a queue has in fact been created automatically for this application; it is in a running state and two of the pods are active. Now we clean up our pods so that we can create them in our second namespace. We create three pods in namespace two. This namespace has been configured to bypass YuniKorn queuing, so all three pods are allowed to start at the same time, and we can see that here. We can also see that yunikorn has been assigned as the scheduler, so we haven't completely bypassed YuniKorn; however, no application ID or queue has been associated with these pods. We can see this in the web interface by the fact that there is no queue for namespace two. Thank you for watching our demo. Have a great day."

That demo was done by Craig, who couldn't be here today. What we saw is that we pick up the pods, we schedule a part of them through YuniKorn, and a part of them we don't.

In summary: as a community we've released 1.0 with the tech preview of this plugin mode. We're still missing a number of things; scale and performance testing we haven't done yet. We've started working with other communities. Apache Spark was already named by the Volcano guys; we're working with them on the same Spark JIRA to get things going, and we're trying to integrate with other projects as well and do a little bit more.

We've also got some improvements and future work. The PreFilter gives us a lot of unschedulable pods, which affects autoscaling badly. There's a PreEnqueue hook that just popped up about two weeks ago, a week and a half ago, that looks like a really good solution for what we do with batch scheduling. The other thing we need to look at is the impact of the over-quota pods on the autoscaler; that's really a limitation of what we do. If you don't use the autoscaler, you don't have that problem.

I think we've got some time left over for Q&A; I don't know how well I kept to the time.

Audience: Can you handle GPU resources?

Yes. From a scheduling perspective, from the core's perspective, we are resource agnostic. Whatever you define as a resource, even if you say, I want to schedule pods that use a resource called "license", we don't care. If you can define it on the pod, we can schedule on it and see it as a resource.

Audience: You mentioned there are customers using deep hierarchies. Do you think that was a good idea, or is it being abused? Should you perhaps discourage them from doing that? I imagine it's really hard to reason about six or seven levels.

Yes, those deep queue hierarchies are really difficult to talk about and really difficult to understand. But the companies that use them have a logical grouping behind them. Say you've got a multinational that says: the company is the top level, then the US is one branch at the second level, and the other regions around the world sit beside it. They've got a really logical setup and they spread things around in that way. So yes, it's unbelievably hard to troubleshoot and to understand, but there's a logical reason for it at those customers.

Thank you, Wilfred. I hope you stick around for more questions. I'll be around for more questions, no problem.