Hello everyone, welcome to KubeCon Europe 2021. We're happy to have you here. Today we are going to talk about SIG Scheduling, an introduction and deep dive, with Mike Dame and me as presenters. We are both from Red Hat, working on the OpenShift project, and we are very interested in scheduling and everything around it. In short, we will summarize what SIG Scheduling is and where you can find us and talk to us, then we'll give you an update about what's new in the scheduling area, and at the end we'll talk about updates in the descheduler project. Currently we have two co-chairs, Abdullah from Google and Wei Huang from IBM. Meetings take place every second week on Thursday at 1 p.m. Eastern time, and if you want to reach us, you can find us in the #sig-scheduling channel on Slack. SIG Scheduling is home to several projects. There's scheduler-plugins, which is a home for out-of-tree plugins that can be used alongside the in-tree plugins provided by the scheduling framework. There's the descheduler project, whose goal is to make sure that pods violating scheduling constraints are evicted. Then there's cluster-capacity, whose goal is to estimate the remaining scheduling capacity in terms of pods; for example, if you want to know how many more instances of your database pod can still be scheduled in your cluster. There's the kube-batch project, which is an implementation of gang scheduling. And there's Poseidon, which is an implementation of the Firmament scheduler. In case you are interested in more details about these projects, just take a look at the SIG Scheduling wiki. We are not going to talk in detail about how scheduling works or what it is, because the last two talks given at KubeCon covered those parts: the talk given at KubeCon North America discussed scheduling from the perspective of a user, an admin, and a developer, and the talk given in Europe covered the scheduling framework and what the scheduler is.
So, in short, how scheduling works: you have a queue of pods which still need to be scheduled, which means they have no node assigned to them yet. Then there's a list of nodes which are suitable for scheduling. And there is the scheduler component, and what it does is take every pod which needs to be scheduled and find the most suitable node for it. The scheduler component consists of a scheduling algorithm, queues, a cache, and plugins, which are provided by the scheduling framework. The scheduler was designed with the goal of providing a very simple implementation and maximizing throughput. Basically, how it works is that you get a list of nodes, you run the filter plugins, which give you the list of feasible nodes, and in case there are two or more feasible nodes, the score plugins give you the node with the highest score. Once a pod is scheduled, the scheduler no longer knows what's going to happen with that pod, so eventually some of the pods may diverge from the original scheduling plan. That's where the descheduler component comes in. Its goal is to make sure that the scheduling criteria are still respected. It contains a list of strategies, each strategy responsible for one scheduling constraint. Once it runs, it goes through every pod on every node, and in case a pod violates a constraint, the pod is evicted from its node. For example, some nodes may be overutilized, so the descheduler evicts as many pods as needed to make sure that resources are freed from the overutilized node, with the hope that those pods will land on a node that is underutilized. Right, so now let's talk more about what's new and what's been improved. There have been some improvements in the documentation for developers. A new document has been created which describes how the scheduling queues work. In short, there are three scheduling queues: the active queue, the backoff queue, and the unschedulable queue. The active queue keeps pods which can be scheduled.
The backoff queue keeps pods which are put to sleep for a while. And the unschedulable queue keeps pods which cannot be scheduled for some reason. The document describes the lifecycle of a pod and how it transitions between the individual queues based on certain conditions or events. So if you are interested in more detail, you can take a look at the new document. There's also a new document describing the code architecture of the default scheduler and how it's built on top of the scheduling framework. In short, there are basic building blocks like the scheduling algorithm, the cache, the queues, and the scheduling framework. And because the scheduler allows you to configure various profiles, each profile corresponds to an instance of the scheduling framework. Once configured and assembled together, you get a functioning scheduler component. Again, if you are interested in more details, just take a look at the new document. And then there's the scheduler-plugins repository. In case the default in-tree plugins are not enough, you can bake in additional plugins from the scheduler-plugins repository, the home for out-of-tree plugins, some of which you might find useful. Currently there are seven plugins, for example the coscheduling plugin, which adds the capability of gang scheduling. And if you have a new plugin which you want to share with the community, scheduler-plugins is a good place to start. Yeah, now I will let Mike talk. Mike? Yeah, thanks, Jan. I'm Mike Dame, like you mentioned, and I'm going to go over some other things that are going on in the SIG and some improvements that we have for the scheduler. The first one has been kind of an ongoing project of refactoring out some of the internal dependencies in our default plugins. A lot of the plugins depend on parts of code that are internal to k8s.io/kubernetes.
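Going back to the profiles mentioned a moment ago, here is a rough sketch of what a multi-profile configuration can look like. The API version and plugin names are from the v1beta1 era of the component config and may differ in your release; the second profile name is made up for illustration:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
  # The default profile, used by pods that don't set spec.schedulerName.
  - schedulerName: default-scheduler
  # A second profile that prefers bin packing: pods opt in by setting
  # spec.schedulerName: bin-packing-scheduler.
  - schedulerName: bin-packing-scheduler
    plugins:
      score:
        disabled:
          - name: NodeResourcesLeastAllocated
        enabled:
          - name: NodeResourcesMostAllocated
```

Each profile is backed by its own scheduling framework instance, but all profiles run inside the same scheduler process.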
Being connected to this large code base makes it difficult for the plugins to be modular, which would be helpful if they could be used by other projects like the descheduler or any scheduling sub-projects that you might have. So we've been working on removing those dependencies. One of the great outcomes of this has been the new component-helpers staging repo within k8s.io. Jan was a big help getting that going, and it really gives us an opportunity to move common helper functions out of internal packages and into a place where any project, even one not specifically related to scheduling, can import them. For example, we had a couple of helpers that just checked whether a pod has tolerations for the taints on a node. There wasn't really any reason for that to be strictly internal, so we put it into an external package where people can use it. And the packages in that repo are all owned by their respective SIGs, so you know that you're getting good maintainability: if I'm using a function from there, it will always have a semantic meaning behind it that the team is supporting. If you check out this issue, you'll see that it's been going on for a while, and you'll see links to the component-helpers repo; I think it's just github.com/kubernetes/component-helpers, but you'll see it in there too. If you want to help out, this is a great thing to get involved with. It's a good project for new contributors, because you'll learn a little bit about the code base as you move these functions around and update their references in other places. Some other things we improved on: we had some runtime improvements for certain plugins. A big push that's going on right now, which I think just got some changes in for 1.20, is event-based re-queuing of unschedulable pods.
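To make the toleration helper example concrete, here is a self-contained Go sketch of that kind of check. The real helpers live in k8s.io/component-helpers and operate on the core/v1 API types, so the type and function names here are illustrative stand-ins, not the actual API:

```go
package main

import "fmt"

// Minimal stand-ins for the corev1 types; the real helpers work with
// k8s.io/api/core/v1 Taint and Toleration.
type Taint struct {
	Key    string
	Value  string
	Effect string
}

type Toleration struct {
	Key      string
	Operator string // "Exists" or "Equal"
	Value    string
	Effect   string // empty matches all effects
}

// tolerationToleratesTaint mirrors the kind of check that moved out of
// k8s.io/kubernetes: does a single toleration tolerate a given taint?
func tolerationToleratesTaint(tol Toleration, taint Taint) bool {
	if tol.Effect != "" && tol.Effect != taint.Effect {
		return false
	}
	// An empty key with operator Exists tolerates every taint.
	if tol.Key != "" && tol.Key != taint.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "Equal", "":
		return tol.Value == taint.Value
	}
	return false
}

// tolerationsTolerateTaint reports whether any toleration in the list
// tolerates the taint.
func tolerationsTolerateTaint(tols []Toleration, taint Taint) bool {
	for _, t := range tols {
		if tolerationToleratesTaint(t, taint) {
			return true
		}
	}
	return false
}

func main() {
	taint := Taint{Key: "dedicated", Value: "gpu", Effect: "NoSchedule"}
	tols := []Toleration{{Key: "dedicated", Operator: "Equal", Value: "gpu", Effect: "NoSchedule"}}
	fmt.Println(tolerationsTolerateTaint(tols, taint)) // true
	fmt.Println(tolerationsTolerateTaint(nil, taint))  // false
}
```

A generic check like this has no reason to live in an internal package, which is exactly why it was moved somewhere importable.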
Previously, the unschedulable queue would just re-queue pods kind of indiscriminately whenever events came in. The work that's been going on here allows certain plugins to register for the specific events that should trigger their pods to move back out of the unschedulable queue. So this will be more efficient re-queuing of those pods, and hopefully more reactive to things that are actually going on in the cluster. We also made some improvements to the node affinity plugin and improved its throughput; there's some discussion in that issue about how that was found and how it was taken care of. The last thing isn't so much a performance improvement, but it is a big improvement for pod affinity: you can now widen the namespace scope for the pod affinity plugin. Previously, pod affinity only considered pods within the same namespace. In this picture here, you can see we have three different nodes, and if we wanted pods from different namespaces, for whatever reason, to be co-located on the same node, there wasn't really a way to do that. But now you can broaden the scope of pod affinity to look at multiple namespaces, which will hopefully be useful for some people. We also have ongoing work on the scheduler's component config. If you've been following the scheduling framework in particular, the scheduler is now heavily configured through component config; we are moving away from flags for some things, and definitely away from the old policy API. If you're still using the policy API, you should seriously start looking at plugin configuration through the component config. There are still improvements to be made there, including some of the things I've listed here. Some of the plugins that have been replaced by other, more featureful plugins are being deprecated. That's things like NodeLabel, ServiceAffinity, and NodePreferAvoidPods.
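The widened pod affinity scope looks roughly like this in a pod spec. The pod, namespace, and label names are invented for illustration; the `namespaces` field on the affinity term is what broadens the search beyond the pod's own namespace:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
  namespace: team-a
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: cache
          # Without this field, only the pod's own namespace (team-a)
          # is searched for matching pods.
          namespaces: ["team-a", "team-b"]
          topologyKey: kubernetes.io/hostname
  containers:
    - name: web
      image: nginx
```

With this, the `web` pod can be co-located with a matching `cache` pod even when that pod lives in a different namespace.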
Those are gonna be deprecated and moved away from, and there are replacements for them available. We also have the various node resources plugins, such as balanced allocation, least allocated, which tries to spread pods out, and most allocated, which is for bin packing, when you want to prefer nodes that already have the most resources allocated on them. We're gonna be trying to unify those into a single plugin, because there are a bunch of similar plugins that are adjacent to each other, and it will be a lot easier for people to understand if they can be configured from a single place. We've also been exploring the possibility of fully qualified plugin names, especially for our in-tree plugins, the ones that are part of core Kubernetes. We'd like to provide well-defined name semantics for them so that you can at least tell an in-tree plugin apart from a custom plugin. This is a big part of the framework ecosystem: trying to have out-of-tree plugins coexist well with in-tree plugins. It's just an organizational setup for them. And then, in this issue, there are various backwards-compatibility shims that we currently have in place. We don't need to get too much into the details; those are a couple of little hacks just to make older versions of the policy API work with newer versions. But as this config evolves, obviously we're gonna have to move away from that. So if you are heavily using the current scheduler config, it's important to stay up to date on some of these changes and be aware of them. Nothing will be removed outside of the allowed deprecation timeline, so you always have that, but eventually some things that you might be relying on could be removed, and you wanna be prepared for that change. And lastly, for the scheduler framework, we have two unique use cases that were brought up; I think Wei actually sent those to me, and I wanted to talk about them a little bit. These are real-world use cases for the scheduler framework.
These are links to some blog posts; the slides will be shared, so you'll be able to check them out. But they're very interesting. The first one is from Cockroach Labs, basically discussing how they came up with their own custom filter plugin and how they use it to solve the problem of scaling CockroachDB up and down. One of the interesting things they talk about is how they actually considered using pod topology spread constraints, which, if you're not already using them: as of Kubernetes 1.18 they are enabled by default. I think this is one of the most helpful plugins available right now in the scheduler for evenly distributing pods among nodes, especially together with the descheduler, and this is kind of jumping ahead a bit. I see a lot of people who want to distribute, say, a deployment of pods across nodes, and they'll run the descheduler for resource balancing, and they'll say, well, why are two or three of these pods still on the same node? It's because the descheduler has to consider the resource usage over the entire node. But with topology spread constraints, you can specifically say: I want these pods to be distributed evenly, no matter what. And that's just very helpful for making sure you can get those spread out. The second blog post is from OpenAI, and it is less specifically about what they're doing with the scheduler; it's about scaling Kubernetes up to 7,500 nodes, which is very interesting. They do talk about some scheduling aspects. They have a controller that manages taints automatically, and sets tolerations for those taints based on which team is creating certain pods. They also talk about a concept they call balloon deployments. It's really interesting: these are dummy deployments to keep the cluster autoscaler from automatically scaling nodes up and down.
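Before moving on, the topology spread constraints just mentioned can be sketched like this. The pod name and label are invented for illustration; `maxSkew`, `topologyKey`, `whenUnsatisfiable`, and `labelSelector` are the actual fields of the API:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spread-me
  labels:
    app: spread-me
spec:
  topologySpreadConstraints:
    # Keep the count of matching pods per node within a skew of 1,
    # regardless of resource usage on the nodes.
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: spread-me
  containers:
    - name: app
      image: nginx
```

Using `ScheduleAnyway` instead of `DoNotSchedule` turns the constraint into a soft preference rather than a hard requirement.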
And I wanted to draw attention to this because we do something similar in some of our scheduling end-to-end tests, where we create dummy pods that balance out the cluster so that the tests can run consistently. They also touch on using pod anti-affinity to get even spreading, which was pretty much the way to get even pod spreading before pod topology spread constraints, with the caveat that when you set a pod's anti-affinity to itself, you can only have one copy of that pod per node; with pod topology spread constraints, you can have two or three per node within a certain skew. But this was an older way to kind of hack the scheduler into a nice even spread, no matter what, regardless of resource usage. They ultimately ended up going with the coscheduling plugin from the scheduler-plugins repo, which Jan mentioned a couple of slides back. So it's great to see that people are using this, and these posts set some examples for new users to go by when they want to adopt the framework. So finally, we're gonna get to some updates on the descheduler. The descheduler is one of the bigger projects under SIG Scheduling; like we talked about earlier, it removes pods from nodes following general scheduling logic. The first thing to mention is that we've updated our reviewers and approvers to add some of our great contributors. Besides myself and Jan, we have Sean and Lee, who have been really active and are doing great stuff for this project. With these people added, if you're a new contributor, your PR should get assigned to a relevant maintainer of the project, and if you have any questions, these are some of the active contributors to work with. We also moved some of our existing maintainers to emeritus status, so that's Avesh, Ravi, Klaus, and the folks who were great for getting the project started and establishing it.
And that's a good way to recognize the contributions they've made, even though they might not be able to actively review PRs right now. So these are the people to get in touch with. A couple of new features we added this release: Prometheus metrics, which was a big ask, so the descheduler now reports metrics, like the number of pods that are evicted, so you can track how it's running in your cluster. Label selector filtering for different strategies, to be able to select pods more specifically. Eviction based on soft pod topology spread constraints: we originally added this just for the hard constraints, but people asked about soft constraints, so we went ahead and did it, and that is an option now. Also, the ability to ignore pods with persistent volume claims: the descheduler by default can ignore pods with local storage, but some people wanted to specify that just for pods with a PVC instead, so now you can prevent evicting pods that are attached to a PVC. Something else that's going on is work on making the descheduler more reactive to events in the cluster, similar to what I talked about before with the scheduler being reactive to events for moving pods from unschedulable to schedulable. The descheduler right now only runs on a periodic loop, or you can run it as a CronJob or a Job; but having it actually listen to what's happening in the cluster and then run the relevant strategies based on that could give people a more instant reaction and balancing in the cluster, rather than having to wait for the descheduler loop to come back around. A lot of these have been, like I said, requests from people. If you have anything that you're interested in helping out with on the descheduler, or questions, or features that you'd like to see or contribute, we're also on #sig-scheduling to talk about it there. That's our main Slack channel.
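A few of the descheduler options just mentioned can be combined in a policy file. This is a sketch against the v1alpha1 policy API from roughly that release; field names like `ignorePvcPods` and `includeSoftConstraints` may vary between descheduler versions, so treat this as illustrative:

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
# Skip pods that mount a PVC (in addition to the default
# local-storage check).
ignorePvcPods: true
strategies:
  "RemovePodsViolatingTopologySpreadConstraint":
    enabled: true
    params:
      # Also evict for soft (ScheduleAnyway) constraints,
      # not just hard ones.
      includeSoftConstraints: true
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        # Nodes below "thresholds" are underutilized; nodes above
        # "targetThresholds" are overutilized and get pods evicted.
        thresholds:
          cpu: 20
          memory: 20
        targetThresholds:
          cpu: 50
          memory: 50
```

The policy is mounted into the descheduler pod and passed via its `--policy-config-file` flag when run as a Job or CronJob.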
Or just feel free to hop onto the GitHub page and open an issue to start a conversation about what you're seeing and what you'd like to see. We are always looking for new contributors for this project as well. So, yep, that is our presentation, and I want to thank everyone for coming. At this point, we'll open up for live Q&A. If you have any questions about something we talked about, or something we didn't talk about that you were hoping we could get to, now is the time to ask. Thank you.