Yes, so for this week's OpenShift Commons, we're really happy to have Abhishek Gupta from the OpenShift Engineering team, and it's definitely a V3 topic. We're going to talk about scheduling pods for high availability; he's going to do a bit of a deep dive, then a little demo afterwards, and then we'll do some Q&A. And this week the recording appears to be working, so we're going to cross our fingers and hopefully the demo will work too at the end. So take it away, Abhishek.

All right, thanks Diane. My name is Abhishek Gupta, I'm on the OpenShift Engineering team, and today we're going to talk about the scheduler within OpenShift V3. For those of you who have been following Kubernetes and OpenShift V3 development from the beginning, the scheduler we have works on a three-step process. Starting with the scheduler overview: the first step is to take the list of all the nodes in the system that are available and schedulable, and filter out the ones that do not fit the requirements of the pod. The requirements a pod could state are things like: a certain host is needed, certain ports are required, certain disk volumes are required, or certain resources are required. So that's the basic filtering: nodes that don't fit the pod's requirements are taken out of consideration. In the second step, the filtered nodes are prioritized to find the best fit. Each priority function gives a score between 0 and 10 to each of these nodes, and you can also assign a weight to a priority function. So you can have multiple priority functions, and we're going to talk about predicate functions for filtering and priority functions for prioritization in detail a little later.
The priority functions give a score of 0 to 10, and you can assign a weight to each to balance your prioritization algorithm a little better and have granular control over it. Finally, once you have your prioritized list of nodes, you just select the best fit, that is, the nodes with the highest score, and if there are multiple of those, you select one at random from that list.

So let's talk about the predicate functions in detail. These are the filtering mechanism that removes the nodes that do not fit the pod's requirements, and you can specify multiple of them. Each predicate function deals with a specific constraint, requirement, or condition that is specified as part of the scheduler configuration, and each node has to pass through all of these predicate functions to be considered a fit for scheduling a given pod.

Taking a look at some of the predicate functions that come out of the box with OpenShift: the first one is PodFitsPorts, which ensures that if you require a host port in your pod, we will not schedule multiple pods that require the same port on the same host. You can specify any port within the pod itself, but if you require that port to be mapped onto the host, we make sure multiple pods don't conflict on that port requirement. Second, we have PodFitsResources, which takes care of the requirements the pod specifies for various resources, namely CPU and memory. A pod can specify requirements for CPU and memory, and we make sure those can be satisfied by the node in question based on its remaining available capacity, discounting the resources already consumed by the pods hosted on that particular node.
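The filtering step just described can be sketched roughly as follows. This is an illustrative model only, not the actual OpenShift/Kubernetes source; all the function and field names are assumptions made for the example.

```python
# Sketch of the scheduler's first step: run every node through all
# predicate functions; a node must pass every predicate to remain a
# candidate for hosting the pod.

def pod_fits_ports(pod, node):
    """Reject the node if the pod wants a host port already taken there."""
    wanted = set(pod.get("host_ports", []))
    taken = set(node.get("used_host_ports", []))
    return not (wanted & taken)

def pod_fits_resources(pod, node):
    """Reject the node if its remaining CPU/memory can't cover the pod's request."""
    free_cpu = node["cpu_capacity"] - node["cpu_requested"]
    free_mem = node["mem_capacity"] - node["mem_requested"]
    return pod.get("cpu", 0) <= free_cpu and pod.get("mem", 0) <= free_mem

def filter_nodes(pod, nodes, predicates):
    """Keep only the nodes that pass every predicate."""
    return [n for n in nodes if all(p(pod, n) for p in predicates)]
```

For example, a pod requesting host port 80 would be filtered away from any node where another pod already claimed that host port, regardless of how much free capacity the node has.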
Moving down, we have MatchNodeSelector, which gives users granular control over which nodes they want the pod to be hosted on. Within the pod spec itself, you can specify a node selector, which is essentially a label selector that identifies a target set of nodes, and your pod will be hosted on one of the nodes matching that selector. If you want even greater control, there is the HostName predicate, where you specify the exact host name and your pod will be scheduled on that particular host, assuming, obviously, that the other predicates also pass that node as a good fit. And finally we have ServiceAffinity, which is of special interest to us today with regards to our HA discussion. This is the predicate that ensures that all pods belonging to the same service are hosted on a given set of nodes identified by the configuration for this predicate. We'll go into the details of this one a little later.

Now, coming to the priority functions: these come into play in the second step, the second phase, of the scheduler's operation. They determine the relative suitability of the nodes for hosting a given pod. As we mentioned, each priority function can assign a score of 0 to 10 to each node. You can further refine your policy by allocating a weight to a particular priority function, to give greater importance to one algorithm and de-prioritize another without removing it completely. Multiple priority functions can obviously be specified, and the weighted scores they give to the nodes are aggregated; that is how you get the final score for each node for hosting a given pod.
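The weighted aggregation just described can be sketched like this; the example priority function mirrors the least-requested idea that comes up in a moment, and every name and data shape here is illustrative rather than taken from the real source.

```python
# Sketch of the second scheduler phase: every priority function scores
# each node 0-10, each score is multiplied by that function's weight and
# summed, and one of the top scorers is picked at random.
import random

def least_requested(pod, node):
    """Example priority: more free CPU means a higher score (0..10)."""
    free = node["cpu_capacity"] - node["cpu_requested"]
    return 10 * free / node["cpu_capacity"]

def prioritize(pod, nodes, weighted_priorities):
    """Aggregate weighted scores from all priority functions per node."""
    totals = {node["name"]: 0 for node in nodes}
    for weight, fn in weighted_priorities:
        for node in nodes:
            totals[node["name"]] += weight * fn(pod, node)
    return totals

def select_host(totals):
    """Pick the best fit; ties between top scorers are broken at random."""
    best = max(totals.values())
    return random.choice([name for name, s in totals.items() if s == best])
```

Raising a function's weight amplifies its influence on the final score without switching the other functions off, which is exactly the granular balancing described above.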
Some of the priority functions available in the system today out of the box: the first is LeastRequestedPriority, which favors nodes where fewer resources are requested. What that means is that for the pods already hosted on a particular minion, we calculate the resources they have requested, and hence deem those consumed on that node. We then prioritize the nodes with greater available capacity and de-prioritize the ones with low available capacity. As a side note, pods do need to specify their resource requirements for this to matter. It's not mandatory for a pod to specify its CPU and memory requirements, but for this particular priority function to have any meaningful impact, the pods need to specify them.

The second one is ServiceSpreadingPriority, which ensures a good spread of the pods belonging to the same service among the available nodes in the system; and again, we're talking about the filtered nodes that passed through the first phase. Taking that concept a little further, we have ServiceAntiAffinity, which ensures a good spread of the pods belonging to the same service across a certain group of nodes. While the previous one just spreads across all nodes, this one identifies individual groups of nodes, as defined by label selectors, and spreads the pods among those groups. We'll go into the details of this particular priority function when we talk about HA.

Now, taking a quick step back to 2.x: some of you have followed the scheduling algorithms in OpenShift 2.x. There we had a similar concept with regions and zones, but those were a little prescriptive.
They essentially forced you to categorize your nodes into certain regions and zones, and while those capabilities are really helpful in achieving a good spread and HA, with our current focus in OpenShift 3 and its design we aim to be a little more flexible.

Having said that, let's look at some of the requirements we began with for HA in the scheduler. First, we wanted the ability to define multiple infrastructure levels. Infrastructure levels are things like your zones, racks, power bars, and so on. These are completely flexible, and you can have multiple levels, which can be nested. Secondly, we wanted the ability to restrict all pods belonging to a particular service to a particular infrastructure level. What that means is that if my application has low-latency requirements, I definitely don't want its pods to span entire regions and great distances; I'd want them located within the same rack or the same zone, and we wanted to be able to specify that at any level, or at multiple levels. Third, spreading the pods of a given service among a particular set of nodes. This is the crux of anti-affinity: having a good spread at a certain level. Let's say I want all of my service's pods within a zone to be spread across all the available racks; this is the priority function we'd use, and again, you can specify a good spread at multiple levels.

Now, talking about the infrastructure topology in a bit of detail: as I said, we've made this fairly flexible, so administrators can define multiple levels, and these could be any levels really.
As examples, you could have zones, racks, and power bars, and the way we let administrators define these is with labels on the nodes. You specify simple labels on the nodes, for example zone=Z1, rack=R1, powerbar=B1, and that is what the scheduler picks up and treats as infrastructure topology levels. Level names and the number of levels are completely flexible; there is no fixed specification of what names you give your levels or how many you have, so you're free to specify things like city, building, server room, and rack as your labels. Finally, levels are usually nested, and that has generally been sufficient, but nesting is not built in as a requirement, so you can have orthogonal categorizations of your node clusters. For example, you could have prod, dev, and test as one categorization, and an orthogonal one of zones and racks, where a zone could contain nodes from prod, dev, and test alike. It depends on your requirements; there could be use cases that benefit from this.

The next thing we wanted to talk about is service affinity. This is what ensures that all the pods within the same service end up being created and hosted on minions within the same topological level. Again, if the topological level is a zone and you define your affinity at the zone level, you essentially ensure that all the pods within the service end up within the same zone, on different nodes obviously, but within the same zone. The way to do this is simply to configure the ServiceAffinity predicate in the scheduler configuration by specifying a label, and the label is the same one we used on the nodes to define that particular level.
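As a concrete illustration, topology labels like the ones just described might appear on a node object roughly as follows. This is a sketch: the exact `kind` and API version fields varied across V3-era releases, so only the shape of the metadata is shown.

```json
{
  "kind": "Node",
  "metadata": {
    "name": "minion1",
    "labels": {
      "zone": "Z1",
      "rack": "R1",
      "powerbar": "B1"
    }
  }
}
```

The scheduler never needs to know what "zone" or "powerbar" mean; it only matches and groups nodes by whatever label keys the policy configuration names.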
As an example, I have here a zone affinity, which takes a label of zone; if your scheduler is configured with this particular predicate, you ensure that all the pods within a given service are located within the same zone. Multiple levels can be specified for affinity as well. If you have multiple levels, you might want affinity at several of them; if you have really strict latency requirements, this would be one thing to use. In the example here, where we define affinity at the zone and the rack level, all the pods within the same service would essentially be located not only within the same zone but also within the same rack; they will be distributed across the different nodes within that rack, but all of them will be in the same rack.

The next one is anti-affinity. This is the one that finally ensures a good spread of your pods and becomes the crux of the highly available functionality of the scheduler. It ensures that the pods within your service are spread across different nodes at a certain topological level. Let's say you define a scheduler configuration that says: I want a good spread across the different racks within my zone. You would define your priority function and provide a label of rack, which again is a label the minions have, and the scheduler will ensure that as new pods come in, a good spread is maintained across the different racks available within that particular zone. The way this works is that the priority function gives the same score to all nodes within the same group. So if a pod already exists on one node in a particular rack, all the nodes within that rack are given the same score.
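The group-scoring behavior just described can be sketched as follows; all nodes in the same group (say, a rack) get an identical score, and groups that already host pods of the service score lower than empty groups. The names and data shapes are assumptions for illustration, not the real implementation.

```python
# Sketch of a service anti-affinity priority: score nodes by how few of
# the service's pods their group (identified by a label) already holds.
from collections import Counter

def service_anti_affinity_scores(nodes, group_label, pods_per_node):
    """Return a 0..10 score per node; every node in a group scores the same."""
    group_pods = Counter()
    for node in nodes:
        group_pods[node["labels"][group_label]] += pods_per_node.get(node["name"], 0)
    max_pods = max(group_pods.values()) if group_pods else 0
    scores = {}
    for node in nodes:
        pods = group_pods[node["labels"][group_label]]
        # fewer service pods in the group => higher score
        scores[node["name"]] = 10 * (max_pods - pods) / max_pods if max_pods else 10
    return scores
```

With one pod already sitting in rack R1, both R1 nodes score 0 while an empty rack R2 scores 10, so the next pod of the service lands in R2.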
The nodes in another rack, which do not have any pods of that service, are given a higher score, and hence the nodes in the other rack are essentially prioritized over all the ones in the rack where a pod for that service already exists.

Moving on to the scheduler policy configuration, here we have two things. First, we have the default configuration that comes out of the box, with which you do not have to do anything; the predicates and priority functions are predefined for you. The default does not include the affinity and anti-affinity predicate and priority functions, but for most basic purposes, conflicts and constraints are all taken care of by the default policy configuration. However, if you want a custom policy for your scheduler, you can mix and match the different available predicates and priority functions, specify a configuration file, and reference it in the master configuration in OpenShift. Here we have a snippet of an example policy file. It is versioned, so as we make changes to the structure of the file, we will hopefully ensure backwards compatibility and conversions. In this example, I have added two predicates: PodFitsResources, and affinity defined at the zone level. As far as my priorities are concerned, I am ensuring a good spread of resource consumption across my nodes by using LeastRequestedPriority, and I am specifying a spread at the rack and the power bar level.

To see what happens with these priority functions defined, let's look at a case with affinity at the zone level and anti-affinity at the rack and the power bar level. In the chart we have two zones: the top zone has 12 nodes, and the lower zone, Z2, has another 12 nodes.
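Before walking through the zone chart, here is a sketch of what a policy file like the one being described might look like. The predicate and priority names for the custom entries (ZoneAffinity, RackSpread, PowerBarSpread) are made up for this example, and exact field names may differ between releases.

```json
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "PodFitsResources"},
    {"name": "ZoneAffinity",
     "argument": {"serviceAffinity": {"labels": ["zone"]}}}
  ],
  "priorities": [
    {"name": "LeastRequestedPriority", "weight": 1},
    {"name": "RackSpread", "weight": 1,
     "argument": {"serviceAntiAffinity": {"label": "rack"}}},
    {"name": "PowerBarSpread", "weight": 1,
     "argument": {"serviceAntiAffinity": {"label": "powerbar"}}}
  ]
}
```

The built-in functions are referenced by name alone, while the label-driven affinity and anti-affinity entries carry an argument naming the node label they group by.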
Within each zone we have three racks, within each rack two power bars, and each power bar has two nodes. Now, what happens when a pod belonging to a service comes in? Since the pod does not have any requirements for a particular region, zone, or host, the scheduler will just schedule it on any available node within the cluster. Let's say it gets scheduled on a node within zone Z2, in rack R22, on one of the power bars there. Once this first pod lands in a particular zone, what happens when new pods are created for this service? Let's say we create five more pods. As the new pods come in, the affinity policy at the zone level ensures they all end up scheduled in the same zone, so all the new pods land on the minions within Z2. In terms of anti-affinity, we configured a good spread at the rack and the power bar level, and this ensures that the second pod coming in goes onto rack R21, the third pod goes onto the third rack, and then as more pods arrive they're placed on different power bars within each rack, and so on. Eventually, with six pods, each power bar within each rack has one pod. If you go on to add more pods, you will get duplication, essentially multiple pods within the same power bar, but even then we will try to place them on different nodes within the power bar to get as good a spread as possible at that level.

Finally, before moving on to the demo, I just wanted to say a quick word about the scheduler's extensibility. As with everything in OpenShift and Kubernetes, we follow a general model of pluggability, such that things are based on a plug-in model.
Having said that, there are essentially two ways to extend and enhance the functionality of the scheduler. First and foremost, the current scheduler implementation is definitely extensible and configurable, and the way to extend it is to add new predicate and priority functions, both to handle more constraints and to enhance the prioritization algorithm. With the available predicates and priority functions, plus new ones that can be created, the scheduler can deal with most use cases reasonably well. If, however, you want much greater functionality that cannot fit within the model of the current scheduler implementation, the entire scheduler itself is built on a plug-in model; the current implementation is one plug-in within that model, and other schedulers can easily be plugged in and integrated with Kubernetes and OpenShift. So, having said that, I'll move on to the demo portion. Do we want to take questions before the demo, Diane, or after?

There's one from Luke right now, and you may have answered it for him already. I'm not sure whether it's about cross-service affinity: he doesn't want his front-end pods and database pods to be in different regions, so maybe a label affinity instead of a service affinity. Is there such a thing? Or maybe, Luke, you can add to that conversation if you unmute yourself.

If I understood the question correctly, we are talking about cross-service anti-affinity, I presume. One way to do that would be to specify different regions within your pods themselves, or your replication controllers, and things like that. That itself will ensure that the pods for your different services end up at different levels, or have a good spread among them.
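As a sketch of that suggestion, a replication controller's pod template could pin a service's pods to a region with a node selector. The `region` label, its value, and the field layout here are hypothetical; only the fragments relevant to the idea are shown.

```json
{
  "kind": "ReplicationController",
  "spec": {
    "template": {
      "metadata": {"labels": {"name": "frontend"}},
      "spec": {
        "nodeSelector": {"region": "east"}
      }
    }
  }
}
```

A second service's template would carry, say, `"region": "west"`, and the MatchNodeSelector predicate would then keep the two services' pods apart without any scheduler policy changes.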
But out of the box, as far as the scheduler configuration is concerned, we currently do not have a direct mechanism where you can link multiple services and define anti-affinity among them; what I described would be the typical way to go about something like this. To take this a bit further: today we have affinity and anti-affinity defined at the service level. It would be pretty easy to define an anti-affinity priority function that works at a project level, or a namespace level, and ensures a good spread not only of one service's pods but also of the pods belonging to different services. Whether you want affinity or anti-affinity for that, both would be possible by simply writing a new predicate or priority function.

Okay, so there's a couple of questions. I've unmuted everybody, so hopefully, Luke, you can ask this second one that you've just followed up with, about whether the pluggable scheduler functions and scheduler require recompiling or reconfiguring. Go for it, Luke. Did I get that right, Luke? The question? Let me just scroll up. Yeah, on the side here you can see some of these. Is it at the top? All the way down at the bottom; there are three questions at the very bottom. Okay, so: when you say the scheduler functions and the scheduler itself are pluggable, does this require recompiling or reconfiguring?

[Background conversation from an unmuted participant.]

Sorry about that, Abhishek, I had muted somebody who was having a conversation in the background. Right, no problem. So, to answer the first question, or the ones that I'm looking at right now: the answer is both.
If you want a different scheduler configuration based on the existing available predicates and priority functions, that is a matter of merely reconfiguring the scheduler policy and restarting the scheduler. If you want to add new scheduler predicates and priority functions, that would definitely require compiling those predicates and priority functions, and then obviously reconfiguring the scheduler policy file to incorporate them. As for completely ripping out and replacing the scheduler, you would just stop the scheduler and provide your own, and that does not necessarily need anything to be done on the OpenShift side of things, because the way the scheduler works is that it consumes the master APIs to watch for any new pods coming into the system, uses other data from the master, and does its processing; finally, as far as scheduling is concerned, it again makes an API call to create the binding for the particular pod.

Ryan, do you want to try to ask your question now? Unmute yourself. Okay, can you hear me? Yes, I can. All right. So, yeah, I missed a couple of parts of the discussion earlier, but I didn't hear anything about deployment strategies, and I was wondering if you had any quick comments on how those might relate to this discussion.

So, deployment strategies, if you're talking about the ones we refer to as deployment strategies within OpenShift itself, just define how the pods for a given service get spun up. If you have a deployment strategy of, say, rolling deploy, then when you make changes to the underlying image for your pods, a new deployment is created, your existing pods are torn down, and new pods are created.
The deployment strategy really refers to how that transition takes place: whether it's a tear-down-and-recreate; a rolling deployment, where one pod is torn down and a new one is added at a time; or an A/B deployment, where both are stood up and then switched over, things like that. If that is what you meant by deployment strategy, then that is really an orthogonal concern as far as the scheduler goes, because the deployment strategy essentially only dictates how new pods get created and how old pods get torn down, while the scheduler comes in to figure out where new pods get scheduled. There might be a little overlap, in that you would potentially want new pods to be created either on the same nodes, on similar nodes or groups, or on different nodes than the outgoing ones, and those are things you can have some control over through the scheduler configuration. That also goes to how you define your replication controllers for the deployment strategy, which services are being used, whether the service is handling both the set of outgoing and incoming pods, and things like that. But for the most part, these are two orthogonal concerns. Okay, thank you.

Just looking at the next one, where deployments were mentioned earlier, and how those strategies relate; yep, we went through that. Next: "I'm curious about how pods can be rebalanced. If the policy is updated with new predicates, will it check to rebalance pods, or would it only affect new pods?" Yes, the latter. As of now there is no automatic rebalancing; rather, when new pods come in, they will follow the new policy and ensure a good spread.
If you wanted to simulate rebalancing, and you have a replication controller in place, you could simply tear down a couple of pods; new ones will be created to satisfy the replica count requirement of the replication controller, and those new pods will follow the new scheduling policies.

So why don't we switch over to the demo, in the interest of time, and then we'll do any Q&A at the end of the demo. Sure. What I have over here is a bunch of nodes that I created for this demo. Well, it's a long list as such, but I have managed to create 81 nodes in my cluster. The way I've done it is that I have defined three levels: a zone, a rack, and a power bar. As you can see, my nodes are labeled with zones Z1, Z2, and Z3, and my racks and power bars are similarly labeled, moving through the zones, racks, and power bars in order from the first minion to the last, just so it's easy to track. So we have three zones, each zone has three racks, each rack has three power bars, and each power bar has three nodes; that's how we get a total of 81 nodes in the cluster.

Taking a quick look at the service: first off, I'm going to create a service with a label selector for a front end. If you look at this particular one, we have a front-end service labeled name=hello-openshift, whose selector is name=frontend. With this service in place, I'm going to create a replication controller, and initially that replication controller has, I believe, a single... oops, let's change that. I'm going to start with a single replica for my replication controller, to see how it gets affected by the scheduling policy, and we'll increase the replica count as we progress through the demo.
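A labeling scheme like the demo's (3 zones, 3 racks per zone, 3 power bars per rack, 3 nodes per power bar, assigned in order from the first minion to the last) can be sketched as below. The exact label values used in the demo aren't shown on screen for every node, so the naming convention here is an assumption for illustration.

```python
# Sketch: derive zone/rack/powerbar labels for the i-th of 81 minions,
# walking the topology in order (27 nodes per zone, 9 per rack, 3 per bar).

def topology_labels(i):
    """Labels for 0-indexed minion i in range(81)."""
    zone = i // 27            # which of the 3 zones
    rack = (i % 27) // 9      # which rack within the zone
    bar = (i % 9) // 3        # which power bar within the rack
    return {
        "zone": f"Z{zone + 1}",
        "rack": f"R{zone + 1}{rack + 1}",          # e.g. R21 = zone 2, rack 1
        "powerbar": f"B{zone + 1}{rack + 1}{bar + 1}",
    }
```

With 3 nodes sharing each power bar, the 81 minions fall into 27 distinct label triples, which is exactly the structure the anti-affinity spread in the demo exercises.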
So first we create this replication controller, and as we check the pods, we see that a single pod is created and it is hosted on minion 8, so essentially we have this created in the first zone. To take a very quick look at the scheduler policy we have: we've defined affinity at the zone level, a spread at the rack and the power bar level, and finally, for good measure, we also use the service spreading priority, which ensures a good spread among all minions. With this in place, we're now going to modify the replica count in the replication controller to three. With this, we're trying to verify that the three pods that get created end up on three different racks. If we take a look at the pods in a couple of seconds, we see that the pods are created on minions 10, 8, and 24; everything from minion 1 to 27 is a node within the same zone, and these are all on different racks. Next, let's go further and change the replica count to nine. The objective here is to verify that there is a good spread of these pods across the different power bars. Looking at the pods now, a couple of them still remain unassigned, but if we look at the labels on these minions, we see that each of them is, as you can see there, sufficiently well spaced out in terms of numbers; each of them is hosted on a particular power bar. Finally, let's go all out and create 27 replicas, so we'll have 27 pods. The objective is to see that we've ensured a good spread of these pods across all the available nodes within that particular zone. So what we've done here is modify the replica count to 27, and if we now look at the different pods, well, it's hard to see over here.
But if we sort the pods by the specific nodes they're assigned to, we can see that there is one pod on every node. Now, if you further increase the number of pods in your service, you will get duplication, but even for the new ones coming in, we will ensure that if you add three more, they get added to one of each of the racks, and so on and so forth. So essentially, HA is achieved by ensuring a good enough spread across the levels your scheduler policy is configured to have anti-affinity against. That's sort of the demo.

One thing I'd like to mention before we open up the floor for additional questions: while the current implementation is completely flexible in terms of how you define your topological levels and where you define your affinity and anti-affinity requirements, a lot of this has to be a balancing act between HA requirements on one side and latency requirements on the other. If you specify anti-affinity at the zone level, the pods for your service will actually get spread across the different zones, but it is up to the application developer and the deployer to decide whether that application can actually handle the level of latency that comes with spreading at that level. You can get the highest level of HA by spreading across zones, but other considerations, like latency for your application, should also govern how you define your scheduler policies.

All right. And with that, I'm going to unmute everybody and see if there are any questions. If you want to ask a question, you can either throw it into chat or unmute yourself and just vocalize it. It looks like that's working, as long as Luke turns on his microphone.
Right, so let's see. "If nodes drop, do pods get relocated?" That is a different mechanism, but the short answer is yes; that is functionality we're working on, so if nodes drop or are no longer available, pods will be recreated. There is also a mechanism, again being built, for preemptive evacuation of nodes. So either nodes are already dead and the pods are already lost and we want to recreate them, which is being worked on, or, secondly, pods are preemptively evacuated from a node, which is also being handled. In either case, the scheduler comes into play: the pods that were created on a node that has either gone dead or is being evacuated will all be recreated afresh, and the scheduler will then decide where the best fit for those pods is and schedule them onto those nodes.

Is that work still ongoing? The mics are working fine; go for it, Brian, ask your question. Is that work still ongoing at this time, the evacuation and identification of failed nodes? I believe so. That is actually something that folks from another team, John Hans and Ruby, are looking at, but if you have specific questions, feel free to reach out on the list, and I can point you to the particular Trello cards and any current progress, things like that. But that is definitely something that is in progress. Yes.

There's a guest who's just asked another question, Abhishek. Alrighty: "Could we get an admin tool that could show us which pods could be better balanced?" So yes; we don't have that right now, obviously, but a simple tool could figure out, given the pods in the system, where they would end up if you deleted and recreated them today, and compare that to where they are right now.
That is the comparison that would be required, and it could pretty much be a dry run of "recreate or redeploy my service pods". So yes, it can be done. Having said that, there is nothing preventing you from actually going ahead and doing that by simply modifying or updating your deployment config. If your pods, services, and replication controllers are all managed and controlled by a deployment config, merely updating the deployment config, or its version, will ensure that all your pods are recreated anew, and the new pods will get balanced based on the current scheduler policies, available nodes, and best fits.

And as Mike mentioned, some of that work on the visualization side is being looked at by the ManageIQ team. All right, that's interesting; we could definitely pass this particular thought or requirement on to them, because it does seem interesting. Yep. Sure; like I said, just a dry run of how a new deployment would manifest itself would be a simple way to go about it. Cool. There are also a few companies involved in looking at visualizing the Kubernetes cluster. All right, good to know.

All right, well, if there are any other questions, the mics are open. If not, really, Abhishek, thank you very much for your time today. This was great and gave some good insight, and I think it cries out for some visualization; maybe we'll get the ManageIQ guys on to this and have them demo something for us soon. Thanks again, and we'll be back next week. All right, that's good. And finally, some last words: we've looked at the scheduler and its extensibility capabilities.
I would definitely encourage folks to try it out, play with the scheduler configuration, and try out your different use cases, and we would love contributions of new predicates and priority functions, as well as any integrations of your own schedulers that you might have. Any of those things, new predicates and priority functions extending the current functionality, are very welcome. We are, as we speak, already working on adding new capabilities to the existing scheduler, and it will only get better with time.

Yeah, and I think the other interesting thing is that the Mesosphere folks just joined the OpenShift Commons, so I'm thinking Mesos as a scheduler might have some resources to do some alternatives. Definitely; Mesos as a scheduler plugin, even YARN as a scheduler plugin candidate. Those things, which are really fully featured and highly scalable, would be very interesting to get as plugins for Kubernetes and OpenShift. Yeah, and Ivan just threw out a URL to a YouTube video showing that Kubernetes already has some visualization tools, so we might take a look at that. Thanks again, everyone.