Hello everyone, and thanks for attending this talk. I'm sure the last few days have been a very enriching experience for all of us, and today is going to be even better, at least for me, because my evening flight has already been canceled. So with that, let's start. Today I'm going to talk about a scheduling simulator for capacity estimation of Kubernetes clusters. Although it says simulator, it's a real utility that you can run against your cluster; simulation is just part of the design. A little bit about myself: I work at Red Hat. I've been a core contributor to Kubernetes and OpenShift for several years, and for the last several months I've been working on two new tools, the descheduler and cluster-capacity. Cluster-capacity is the one I'm going to talk about today. This has been a completely community effort; several people have been involved in its design and development, so I'm not the only one. Here's the agenda of my talk. I'll first talk about why we wanted to build this tool. Then I'll talk about why we designed it the way we did. Then I'll cover some of its features, and then the best practices we have learned by running it on real clusters. In the end I'll give a demo, which I think is the most important part of this talk. So, a bit of history. Before OpenShift v3, which is based on Kubernetes, we had an entirely different implementation of OpenShift in v1 and v2, and our operations team had a utility where they could input something called gears. You might think of them as similar to pods, though they are not really the same. That utility used to output a number of instances, and based on that number our operations team would decide whether or not to add more nodes or more capacity. So that's how it started.
When we moved to OpenShift v3, which is entirely based on Kubernetes, that tool was no longer useful, and we wanted to build something similar for v3. With this tool we basically wanted to answer two very simple questions: how much capacity is left in a cluster, and when to add more capacity to a cluster. If you have been running a Kubernetes cluster and doing capacity planning, you might be wondering about many other questions, but we just started with these two. Another thing we considered was workload type: we wanted a tool that could handle both uniform and variable workloads. Uniform means your applications have the same requirements; variable means your applications have different requirements. Another thing we had in mind while designing and working on this tool: we wanted to start with a very simple, lightweight utility. We didn't start with a very ambitious goal of having every capacity-planning feature in the world; we just wanted a very simple tool that can do its job and answer the questions we were looking to answer. Now, since we have been talking about capacity, what does it mean in the context of Kubernetes? When you have an application you want to deploy on Kubernetes, it runs as a pod, although you don't usually use pods directly; you generally use some higher-level construct like a Deployment. So when you think about capacity, the first thing that comes to mind is resources. What kinds of resources are provided in a Kubernetes environment? They could be CPU, memory, storage, or special devices like GPUs. And since we are talking about these resources, another question came to mind: when we are thinking about capacity planning, can we just consider these resources individually?
So basically, in the context of a Kubernetes environment, can we think about capacity in terms of individual resources? Maybe not, and let's see why. As most of you familiar with Kubernetes know, the pod is the smallest schedulable unit. That means when you specify your application's resource requirements, the pod carries all of them, and when the pod is scheduled those requirements are considered collectively, not individually. That is the main reason we need a design that considers all those resources together. So, as I said, we wanted a tool where you can input a pod spec that mimics your application, run it against your running Kubernetes cluster, and get back a number of pods. If we know how many pods of your application can be scheduled, that gives some idea about capacity, and if the number drops below some threshold, you may want to add more nodes and take action based on that. That, I think, is why we wanted this utility. Now let's talk about its design. Initially we considered two approaches: a resource-based, or you might say resource-centric, approach, and a scheduler-based approach. First I'll talk about the resource-based approach, then I'll show you its disadvantages and why we didn't go with it, although it might seem a little simpler, and then why we went with the scheduler-based approach. So consider your input pod that mimics your application, with some resource requirement, say R. And say your cluster has N nodes, each already running some pods; the slide shows the remaining capacity on node one.
The computation is very simple: on any node, you take the total capacity, sum the requests of all the pods already running there, and subtract to get the remaining capacity. You do that for all the nodes, which gives you the remaining capacity for that particular resource across the whole cluster. Now it's very simple: you divide by your input pod's requirement, and that gives a number of instances. But as I said, when we consider pods in the context of Kubernetes, we need to consider all the resources. That means you repeat this computation for each resource, and once you have done that, you take the minimum among the numbers you got: that minimum is the maximum number of pods that can be scheduled. So with this raw resource-based approach we get a count of how many pods can be scheduled. But let's see the disadvantages. To put it another way: is it enough to consider resources only, in the context of Kubernetes? What are the other factors in a Kubernetes cluster that impact scheduling? Take fragmentation, with CPU as an example. Say your input pod requests 500 millicores, and you have several nodes, with the slide showing the remaining CPU capacity on each. If you sum them, it might seem that several pods of that request can be scheduled. But you know the way scheduling works: if no individual node can satisfy the requirement, the pod cannot be scheduled at all. So this is one disadvantage of that approach.
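The resource-centric computation above can be sketched in a few lines. This is a minimal illustration, not part of the actual tool; the function name and the node numbers are made up for the example.

```python
# Resource-centric estimate: for each resource, sum the remaining
# capacity across all nodes, divide by the pod's request for that
# resource, then take the minimum across resources.

def max_pods_resource_based(nodes, pod_request):
    """nodes: list of dicts mapping resource name -> remaining capacity.
    pod_request: dict mapping resource name -> requested amount."""
    estimates = []
    for resource, request in pod_request.items():
        total_remaining = sum(n.get(resource, 0) for n in nodes)
        estimates.append(total_remaining // request)
    return min(estimates)

# Example: two nodes, a pod asking for 500m CPU and 256Mi memory.
nodes = [
    {"cpu_m": 1200, "mem_mi": 2048},   # node one, remaining capacity
    {"cpu_m": 800,  "mem_mi": 4096},   # node two, remaining capacity
]
pod = {"cpu_m": 500, "mem_mi": 256}
print(max_pods_resource_based(nodes, pod))  # CPU is the bottleneck: 2000 // 500 = 4
```

Notice how this example also exposes the fragmentation problem just described: the aggregate estimate says 4 pods, but per node only 1200 // 500 = 2 and 800 // 500 = 1 pods actually fit, so a real scheduler could place only 3.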
Then, if you are aware of how actual scheduling works in Kubernetes: when you create your pod with kubectl, it goes through several stages, and one of those stages is the admission plugins. Depending on how you have configured your cluster and set up your admission plugin policies, the pod you input and the pod that comes out might be different, because there are two kinds of admission plugins. One kind modifies your input pod based on how you have configured your cluster, and the other kind does something else but does not modify your input pod. So if you just consider your input pod without running it through all the admission plugins the way your cluster does, you might not get the actual, or at least a more accurate, answer; it might in fact be wrong. That is another reason. What other factors impact scheduling? If you are aware of the features we have in Kubernetes for multi-tenancy and for segmenting your cluster, you must have heard about node selectors and node affinity. When you use node selectors, you have effectively segmented your cluster into several parts, and if your application specifies a node selector, you cannot consider the whole cluster; you only need to consider the nodes carrying that label. So that is one thing. Then we have node taints and pod tolerations. Taints also segment your cluster, because only the pods that tolerate those taints can be scheduled there. And if your application specifically targets those tainted nodes, you again only need to consider those nodes, not all the nodes in your cluster.
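A node selector and a toleration look like this in a pod spec. This is only an illustrative fragment; the label, taint key, and image are invented for the example, not taken from the clusters in the demo.

```yaml
# Illustrative pod spec: nodeSelector and tolerations restrict which
# nodes the scheduler will consider for this pod.
apiVersion: v1
kind: Pod
metadata:
  name: capacity-probe
spec:
  nodeSelector:
    type: compute            # only nodes labeled type=compute qualify
  tolerations:
  - key: dedicated           # tolerate a hypothetical dedicated=batch taint
    operator: Equal
    value: batch
    effect: NoSchedule
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
```

With a spec like this, a raw resource-based estimate over all nodes would be wrong on two counts: nodes without the label, and nodes with untolerated taints, should not be counted at all.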
That is exactly what we were missing with the resource-based approach: we only had a view of resources, not of all the other factors that impact scheduling. Another thing I want to mention: although it might seem that all the nodes in your cluster are homogeneous, I think they are not. Even if you have similar resources on each node, you might still have different node selectors, taints, and other attributes, and that makes your nodes heterogeneous. Then you have pod affinity and anti-affinity: depending on your applications, one application might have an affinity to another, and that also impacts scheduling; again, a raw resource-based approach cannot capture it. Another very simple thing is the maximum number of pods on a node. You might have all the resources still available on a node, but if you have hit that limit, you cannot schedule more pods there. Then there are the pod's storage requirements: depending on how you have configured your cluster and which applications you are running, maybe not all nodes can satisfy your application's storage requirements, so for that particular application you again cannot consider all the nodes, only those that satisfy its storage requirements. All these things impact scheduling, and when you are doing capacity planning for your application, you really need to consider all these factors. Another question that might come to mind: couldn't we implement all these factors as part of the resource-based approach? Yes, that's possible, but if we had done that, what we would have ended up with is the default scheduler, the native scheduler we already have in Kubernetes. So why reinvent the wheel when we already have that?
So that's the main idea behind going with a scheduler-based approach, and now I'll talk about how exactly it works and what its architecture is. First, it creates an instance of the default scheduler; basically, it uses the same code. So you are creating the same scheduler, but how is it different from the default scheduler we have in Kubernetes? It maintains its own local resource store. Why? So that it does not impact the actual cluster: when it runs its analysis, it runs it only against this local resource store. Initially it fetches the cluster state from your running cluster, and, as I said, it runs its analysis based on that; after that, it does not touch your real cluster. Then, whatever input pod you have provided, it schedules copies of it one by one, the way the default scheduler does, and it keeps doing that until it gets an error saying no more pods can be scheduled. Another thing it does: it does not bind those pods to nodes, because it is only doing an analysis to find the maximum number of pods that can be scheduled. If it bound pods to nodes, it would really be impacting your cluster, so it does not do that. And, as I said, it stops when no more pods can be scheduled. This is the core of the cluster-capacity tool. As I said, we really wanted a very simple tool that can answer the questions we wanted answered. So what features does it have? I would call it a proactive approach: you have your cluster, you have this utility, and you run it against the cluster. It's not that you have to run the utility manually; I'll show you how you can automate that, either with traditional scripting or anything else.
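The simulation loop just described can be sketched as a toy model: clone the cluster state, place copies of the input pod one at a time, and stop on the first scheduling failure. The real tool runs the actual default-scheduler code (predicates and priorities) in Go against its local store; this sketch uses an invented first-fit predicate purely to show the shape of the loop.

```python
import copy

def simulate_max_pods(nodes, pod_request, max_limit=None):
    """Schedule copies of pod_request against a LOCAL copy of the node
    state until no node fits another copy (or max_limit is reached)."""
    local = copy.deepcopy(nodes)          # local resource store; real cluster untouched
    scheduled = 0
    while max_limit is None or scheduled < max_limit:
        # Stand-in for the scheduler's predicate phase: first node that fits.
        fit = next((n for n in local
                    if all(n[r] >= req for r, req in pod_request.items())), None)
        if fit is None:                   # "no more pods can be scheduled"
            break
        for r, req in pod_request.items():  # assume the pod in the local store only
            fit[r] -= req
        scheduled += 1                    # counted, but never bound to a real node
    return scheduled

nodes = [{"cpu_m": 1200, "mem_mi": 2048}, {"cpu_m": 800, "mem_mi": 4096}]
pod = {"cpu_m": 500, "mem_mi": 256}
print(simulate_max_pods(nodes, pod))      # 2 fit on node one, 1 on node two: 3
```

Unlike the aggregate resource arithmetic, this per-node simulation naturally respects fragmentation, and the `max_limit` parameter mirrors the tool's option to stop once a target pod count is reached.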
It's a proactive approach in the sense that you regularly run this utility against your cluster, it shows how much capacity is left for your application, and based on that you can take action. It can be run via the command line or as a pod; I'll show you the advantages of running it as a pod. And if you have been running a Kubernetes cluster, sometimes you modify the default scheduler policy, maybe to customize the default scheduler a bit more. Since this tool is also an instance of the default scheduler, you can provide the same custom scheduler policy you are using in your actual cluster, so that the simulation exactly mimics it. By default, as I said, it outputs the maximum number of pods that can be scheduled. But maybe you are not interested in that; maybe you just want to know whether it is possible to schedule, say, 100 or 500 pods of your application. You can specify that as a maximum limit, and the tool will stop as soon as it finds that that many pods can be scheduled and output that number; if it's not possible, it will output some smaller number. Then there are the pod's resource requests and limits: you must specify some resource requirement, like CPU or memory, because if you don't, you are asking for the number of pods for a best-effort pod, and without any raw resource requirement there is no way to run the analysis. So basically you must specify that. But if you don't specify it in the pod spec, there is another thing you can do in your cluster: create a namespace and specify default limit ranges there. If you're familiar with those, that's one way of defining default resource requirements.
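A LimitRange that supplies such defaults might look like the following. This is a sketch; the namespace name and the values are illustrative, not from the demo cluster.

```yaml
# LimitRange providing default resource requests (and limits) for
# containers created in this namespace without explicit requirements.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-requests
  namespace: capacity-test    # hypothetical namespace
spec:
  limits:
  - type: Container
    defaultRequest:           # applied when a container omits requests
      cpu: 100m
      memory: 128Mi
    default:                  # applied when a container omits limits
      cpu: 200m
      memory: 256Mi
```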
The tool will then fetch the defaults from there, if the application is also going to run in that same namespace or project. Another good thing it does: it shows not only the total number of pods, but also how the pods would be distributed across nodes. Say it reports that 100 pods can be scheduled; it also shows that maybe 10 can be scheduled on this node and five on that one. That distribution lets you see on which nodes you are able to schedule more pods and on which ones fewer, which might give you some more insight into your cluster. Currently it only considers uniform workloads. By uniform workload I mean that at one time you can only specify one input pod; if you have multiple applications, you can of course run the tool multiple times. I'll show you soon what complexity is involved if we consider variable workloads; it's not that it cannot be solved, but I'll discuss that shortly. So let's talk about how to actually use the tool. As I said, it can be run as a command-line tool, and you just have to provide two parameters: a kubeconfig to authenticate with your cluster, and the input pod spec. One of the goals we had from the beginning was a very simple utility, and if you can run it by providing just these two parameters, I think that's simple enough. And, as I said, the output is the number of pods that can be scheduled; again, resource requests must be specified either in your pod spec or as part of the limit ranges in your namespace. It also has a verbose flag; without the verbose flag, it just outputs the number. The reason we did that is so you can drive it with automated scripting or tooling that reads the number and takes action.
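An invocation looks roughly like this. The flag names follow the upstream project's documentation, but treat this as an illustrative sketch and check them against the version you are running.

```shell
# Basic run: only a kubeconfig and an input pod spec are required.
./cluster-capacity --kubeconfig ~/.kube/config \
    --podspec examples/pod.yaml

# Verbose run: also print the per-node pod distribution.
./cluster-capacity --kubeconfig ~/.kube/config \
    --podspec examples/pod.yaml --verbose
```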
But if you really want to see more, like how many pods can be scheduled on each node, you can use the verbose flag. So what are the best practices? Yesterday you might have seen the keynote by Clayton Coleman, where he talked about the OpenShift Online clusters; we have been deploying this tool on our OpenShift Online clusters to get more experience, and based on that we have found the best practices for running these kinds of tools. First of all, it is better to run this tool as a pod, either as a Kubernetes Job or a CronJob, because then you get all the advantages of running a Job or CronJob. For that, you need to set the environment variable CC_INCLUSTER to true, so that the code knows it is going to run inside a pod; it then uses what Kubernetes calls the in-cluster config. We might have other ways of doing that, but I think this was the easiest way to achieve it. Then, when you are running it, and I think this is true for any application you run in a pod, you don't want to give it all the permissions in your cluster, right? You always minimize the permissions required by your application, and that is true for the cluster-capacity tool too. So the better approach is to create a dedicated service account, so that you can limit the permissions granted to it; then, if it somehow gets compromised, the harm is limited. What are those minimum required RBAC permissions? They are very similar to what the default scheduler needs; the only thing it does not need is the bind permission, because, as I said, it does not bind pods to nodes. So these are the only permissions you need to run this tool.
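A minimal ClusterRole along those lines might look like this. This is an approximation modeled on the scheduler's read permissions minus "bind"; the exact resource and verb lists should be taken from the project's own example manifests.

```yaml
# Sketch of a dedicated service account and a minimal ClusterRole for
# cluster-capacity: read-only access to scheduling-relevant objects,
# and notably no "bind" verb on pods, since the tool never binds.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-capacity-sa        # hypothetical name
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-capacity-role      # hypothetical name
rules:
- apiGroups: [""]
  resources: ["pods", "nodes", "services", "replicationcontrollers",
              "persistentvolumes", "persistentvolumeclaims"]
  verbs: ["get", "list", "watch"]
```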
Then, for easier usability, it's better to provide your input pod spec as a ConfigMap, because if you have been running clusters and applications, ConfigMaps are a really easy way to update your configuration, and once you have the spec in a ConfigMap, you can mount it as a volume inside your cluster-capacity pod. These are the best practices we have been using to run this tool on our real production clusters. Here is the current status. We have been testing with Kubernetes and OpenShift clusters; it can be run against any Kubernetes cluster, or any Kubernetes-certified cluster, because it works with the stable Kubernetes APIs. Currently it is rebased to Kubernetes 1.7, but soon we are planning to rebase to 1.8, and then definitely 1.9. This tool has been in OpenShift since 3.6 as a tech preview; we just wanted to gain more experience with it. Although I have been talking about OpenShift as well, this is a completely upstream project: it is part of the kubernetes-incubator repo, so everything is upstream in Kubernetes. Now let's talk about what we can do in the future. I was talking about admission plugins, but right now we don't have them implemented and integrated in the current code; we definitely want to do that in future versions. Right now you can only input a pod spec, but when you run your application, you usually don't run it directly as a pod; you use some higher-level construct like a ReplicaSet, Deployment, Job, or StatefulSet. So we want some mechanism where, instead of a pod, you can also input these. And again, as I said, right now we only consider uniform workloads, but we really want to support variable workloads, where you can input a sequence of pods.
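Putting the pieces together, a Job wrapping the tool might look like the sketch below: the pod spec comes from a ConfigMap mounted as a volume, CC_INCLUSTER switches the tool to the in-cluster config, and the pod runs under the dedicated service account. The image name, paths, and object names are illustrative, not the ones used in the demo.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: cluster-capacity-job
spec:
  template:
    spec:
      serviceAccountName: cluster-capacity-sa    # dedicated, least-privilege account
      restartPolicy: Never
      containers:
      - name: cluster-capacity
        image: example.com/cluster-capacity:latest   # hypothetical image
        env:
        - name: CC_INCLUSTER      # tell the tool to use in-cluster config
          value: "true"
        command: ["/bin/cluster-capacity"]
        args: ["--podspec=/config/pod.yaml", "--verbose"]
        volumeMounts:
        - name: podspec
          mountPath: /config      # input pod spec arrives via the ConfigMap
      volumes:
      - name: podspec
        configMap:
          name: cluster-capacity-configmap
```

Running it as a CronJob instead of a Job is the same idea, with the Job template nested under a schedule, which gives you the regular, proactive runs described earlier.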
As for the complexity involved with that: say you have two applications, you input both of them, and you want some information about capacity. The problem is that when you ask for the maximum number of pods that can be scheduled across those two applications, there is no single answer. Maybe you can schedule 30 pods in total, but that could be five pods of the first application and 25 of the second, or 20 of the first and 10 of the second. So if you just specify your applications, the tool cannot give you a definitive answer, because there is no definitive answer. A better question to ask would be: can we schedule X pods of the first application and Y pods of the second? But this is just my understanding; if you have use cases, I would really like to hear how we can extend this utility, and if you have ideas about these things we want to do in the future, I would really like to hear them too. Another thing we want to consider is multiple schedulers, because Kubernetes now supports multiple schedulers; in our scheduling meetings we have been discussing how to support them. So that is another thing we want to consider. So, now time for the demo. I'm going to demo it on our production cluster. This is one of our online clusters, and it has around 150, I think more than 150, nodes. Some of the things I have already done: for example, I have created the service account, I have created the cluster role I was talking about, and I have created the binding between the two. So those things are already done.
Let's look at the ConfigMaps we have. These are the two ConfigMaps; I think you can see that this one is the input pod definition, written exactly the way you specify a pod spec. Here it specifies requests of 50m CPU and 128Mi memory, and the same limits, but the main thing the tool looks for is the request, because scheduling in Kubernetes clusters is based on requests, not limits. Now I'm going to create a Job. This is the Job specification I'm going to use: you can see the environment variable I mentioned, the cluster-capacity ConfigMap I created for the input pod spec, and the dedicated service account it uses. So now I'll run this Job and let's see what happens. Again, this is a real production cluster, so it takes a few seconds because there are around 150, sorry, more than 150 nodes. If you look at this number, it says we can schedule 3441 pods of that pod spec, and it also shows the distribution, how many pods can be scheduled on each node. And if you look at the reason why it stopped: it got insufficient CPU on some nodes and insufficient memory on others. Oh, it seems right now the cluster has 136 nodes. One week ago we had more than 180 nodes; somehow they have been reducing, I don't know why. In fact, I want to show another demo. Right now the input pod spec does not have a node selector, so it is considering all the nodes in the cluster. Now I want to show you what difference it makes if we have a node selector as part of the input pod spec. So let's first delete this job and create another one. I'm going to delete this job. Well, it's deleted. Yes. Now let's create another job; first I'll show you the ConfigMap for it. So this is the ConfigMap.
It's exactly like the previous one; the only difference is that it also specifies a node selector, type=compute. In our clusters we have two types of nodes: infra nodes, where we only run critical system services, and compute nodes, where all the user workloads run. In the previous demo the tool was considering both infra nodes and compute nodes, but you don't really want that, because when you deploy your application, you deploy only to the compute nodes. So now let's try that and see what difference it makes; generally, infra nodes have more capacity. I just need to make one change to point at the right ConfigMap, and let's create this again. Again, the whole idea behind this tool is a very simple utility that can do its job. So you see, now we can only schedule 2247 pods, because we have restricted the set of nodes where we can schedule our application. And again it shows the distribution on each node: on some nodes you can schedule almost 81 pods, but if you go down the list, on some nodes you can schedule only one, two, or under ten. So that's it for my demo, and the other thing I wanted to show, oops, sorry. Yeah, so these are the resources. As I said, this is a community effort; several people have been involved in its design and development, and it is a kubernetes-incubator repo. The project falls under SIG Scheduling. You are always welcome to contribute, and if you have more ideas or use cases, you are always welcome to communicate either via GitHub, or by contacting me directly, or by attending the SIG Scheduling meetings. You can also contact us on the Slack channel, and there are container images available if you want to test it. So I think that's all I have. Thank you. Now, if you have any questions.
Actually, cluster resource quota is enforced as part of the admission plugins, right? As I said, we don't consider admission plugins here, so the tool might not honor it, but we definitely want to consider all the admission plugins. Yeah. Even when it works by default, when we are scheduling, the scheduler itself doesn't consider cluster resource quota; it goes through the admission plugins, and as I said, we are not considering admission plugins here. We definitely want to add that. No, it's not auto-scaling; we have another project for that, the cluster autoscaler. As I said, this tool is more of a proactive approach, while the cluster autoscaler is a reactive approach: it waits until there is a problem and then adds more nodes. We do want to integrate this with the cluster autoscaler as well, but right now we don't have that. Yeah, that's a very good question. We don't have that yet, but as I said, you're welcome to join us and contribute, and we can definitely work on it. I'm sorry, what are you not using? Okay, okay. What we can do is discuss that offline; I can definitely talk about it. I think we are running out of time, so if you have any more questions, we can discuss them offline. Thank you, everyone.