Hi everyone, thank you very much for joining us today, especially in person. The last few years have been a bit weird, and at least from my point of view, I haven't seen this many people in one room in the last two or three years, so it's a bit different. So thank you very much for being here, and thank you to the people attending virtually. My name is Madalina. I'll be one of your co-hosts today. I'm a software engineer in Intel's resource management space in Ireland. My background is in web services and distributed systems, but I recently moved into the cloud native and Kubernetes space. I won't be giving this presentation alone; I'm here with my teammate Denisio, and he's going to say a couple of words about himself.

Hello. My name is Denisio Togashi. I'm a cloud native engineer at Intel. My background is in science; a couple of years ago I got the opportunity to fast track a degree in software development, and then I ended up at this huge, incredible company that is Intel. We're both based in Ireland and we work on the resource management team. On this team we have many different projects, each with its own challenges, but we're always looking at different ways to make improvements and innovate, with the sustainability mindset that was talked about this morning. This is Madalina's and my first time at KubeCon, so it's really great to be here. Today, thank you for giving us this opportunity, we're going to talk about one of those projects, which is related to smarter scheduling. So I'll stop talking and hand back to Madalina.

The aim of this talk, smarter scheduling decisions for your workloads, is to show you how you can leverage telemetry from your own cluster to make better scheduling decisions. We're going to start off with why resource state and scheduling matter, how you would go about combining them, and what the benefits of doing so are. We're going to touch on one of our projects, called Telemetry Aware Scheduling, or TAS — it's a bit of a mouthful. It's an open source project; Denisio spoke a bit about it. We're going to look at a high level intro, the system design, and the basic building blocks, followed by a quick demo, and we're going to conclude the session with a bit of Q&A.

I'm guessing most of you here today run your workloads on clusters, whether they are on premise or hosted by a cloud provider. Especially if you work with big clusters, you have a large number of nodes, and with that comes the problem of how you deal with failures: the more hosts you have in a cluster, the harder it gets to pinpoint with precision when a host or node becomes unhealthy. So when you design, or even just think about, a system, you look at how you will deal with these failure scenarios. Some ideas that come to mind: whenever you want to schedule a workload, you can avoid scheduling to an unhealthy node; or, once you know a node has become unhealthy, you migrate your workloads away. One level above this, you can think: I know my workload, so I know what I want to do; and I know my cluster, I know my hardware configurations.
What if I take the metrics of interest, like temperature or power or load, make a ranking of my nodes, and pick the best outcome? Especially in Kubernetes, you also don't want to reinvent the wheel, so you're going to think: how can I make my system work with out-of-the-box tools like the Kubernetes scheduler or the GPU Aware Scheduler?

The next topic I want to talk about is how we, at least, thought about handling telemetry and smarter scheduling decisions, and this is where I'm going to talk a bit about one of our projects, Telemetry Aware Scheduling. As I said before, it's an open source project, and as the name says, it uses telemetry to help make scheduling and descheduling decisions. It's an extender of the Kubernetes scheduler, and that comes with a couple of perks. The first thing you think of is that, because it's an extender, it has the capability of filtering and scoring nodes. The spice that I think goes a bit beyond that is that it can also utilize node affinity rules, via fixed and also custom labels; we're going to look at this in a bit.

When you're talking about scheduling, at least in Kubernetes with the default scheduler, one way you can influence where a pod is recommended to be scheduled is via policies. That is pretty much how you tell the scheduler how to react, what actions to take, and when. In our specific case, we work with something called telemetry aware scheduling policies, and we're going to look at examples. Structurally, this policy is based on rules, which in turn are based on metrics that come from your own cluster. If you have more complex scenarios, TAS also supports multi-metric rules: you can build rules that contain multiple metrics and link them together with operators such as anyOf or allOf.

To set Telemetry Aware Scheduling up, you don't need much. The first component you need is a metrics pipeline, because we're talking about telemetry; that's the top left corner of the diagram. You need it to expose, collect, store, and then make metrics available to the Kubernetes custom metrics API. In our specific case, we use the Prometheus node exporter to expose telemetry, Prometheus for collection and storage, and the Prometheus adapter to make our metrics available to the custom metrics API. The second part is the telemetry aware scheduler itself. It works together with the default scheduler: every time the default scheduler wants to make a decision, it reaches out to TAS, and if the workload has a policy that TAS recognizes, TAS returns a suggested outcome for that placement to the default scheduler. The last piece of the puzzle is the telemetry aware policy. As I said before, this is how you control the actions; it's a custom resource known by the telemetry aware scheduler. The way you work with it is that in a high level workload type, like a deployment, you just add a label saying which policy you want to link it to. But don't be afraid, we're going to show you examples. As I said before, structurally this policy is made up of rules, and TAS itself supports four types of strategies.
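To make that policy linkage concrete, here is a minimal sketch of the workload side, following the pattern in the project's public examples; the policy name demo-policy and the nginx image are illustrative placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
        # Links these pods to a TASPolicy named "demo-policy" (hypothetical
        # name); the telemetry aware scheduler looks this label up when the
        # default scheduler consults it about placement.
        telemetry-policy: demo-policy
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
```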
And I'm going to ask Denisio to come up and show you what these four types are, how they behave, and what they actually look like.

Thank you, Madalina, for the nice intro. I'm going to talk a little bit more about the structure of another building block: the TAS policy. The TAS policy can compose up to four strategies. The first two strategies relate directly to the native, or default, scheduler, with the telemetry aware scheduler working as a standard extender. The last two strategies communicate with the default scheduler indirectly, via node affinity rules.

Let's take a look at the first strategy, dontschedule. Each of these strategies is composed of an action and rules, each with a metric name, a target value, and an operator. We say the strategy is violated — in the case of dontschedule — when its metric rules are broken. What that means in this case: we have a health metric, as an example, and if that metric's current state has a value equal to one, that particular node is taken out of the scheduling process. That's the equivalent of the filtering phase of the default scheduler.

The second strategy is the scheduleonmetric strategy. This is the one that gives a kind of priority among the nodes that remain available to the cluster. Once you only have the suitable nodes left, the ones you consider more adequate get priority over the others. That's what you can define with this strategy, using another metric. In this case it's temperature, which is a good opportunity to save if you consider temperature as power related. In this example, the strategy just says: for the temperature metric, the node with the lowest value gets the highest priority.

Moving on, we're now looking at the deschedule strategy. The deschedule strategy, unlike the scheduleonmetric way of scheduling, uses node affinity rules, so it doesn't communicate directly with the default scheduler. If a node eventually breaks those rules, the node whose metric rule is broken gets labeled. In this case it's a hardcoded label: the node simply receives a label saying "violating". Once that label is present on a node, we can use other, external components — in this case the Kubernetes descheduler — which identifies that there is a node whose pods now break their node affinity rules, and proceeds to evict those pods so they can be rescheduled.

The last strategy is a recent feature added to our application: labeling. Basically, it works like the deschedule strategy, but the main difference is that it's customizable: the user can customize the labels they want a node to receive when the rules break. You can think of it as transferring information about some metric onto the node.
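Putting the four strategies side by side, a TASPolicy looks roughly like this. This is a sketch modeled on the project's public examples; the metric names (health_metric, temperature_metric, memory_metric), targets, and label value are illustrative, and exact field spellings should be checked against the repo:

```yaml
apiVersion: telemetry.intel.com/v1alpha1
kind: TASPolicy
metadata:
  name: demo-policy
  namespace: default
spec:
  strategies:
    dontschedule:            # filtering: drop nodes that break these rules
      rules:
      - metricname: health_metric
        operator: Equals
        target: 1
    scheduleonmetric:        # scoring: rank the remaining nodes by this metric
      rules:
      - metricname: temperature_metric
        operator: LessThan   # lower temperature means higher priority
    deschedule:              # mark violating nodes with the fixed "violating" label
      rules:
      - metricname: health_metric
        operator: Equals
        target: 1
    labeling:                # like deschedule, but with user-defined labels
      rules:
      - metricname: memory_metric
        operator: GreaterThan
        target: 80
        labels: ["memory=violating"]
```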
So why is this kind of flexibility important? It came out of a collaboration across our company: we developed it together with a team in Finland who were interested in a use case related to GPUs. They also maintain GPU Aware Scheduling, another open source project that works together with Telemetry Aware Scheduling. Let's imagine we have a cluster, and this cluster has a node with, let's suppose, two GPUs attached. Now imagine we write a telemetry aware scheduling policy with a rule on a metric like, say, power consumption. Suppose one GPU breaks the rule, and you have a workload where each pod requests one GPU: one GPU has a problematic metric — it consumes a lot of power — but the other GPU is working fine. If you use the deschedule strategy, the node that has those two GPUs gets marked as violating, and all the pods on it get evicted. That's not a suitable situation. So we created this new feature with customizable labels. That way we can transfer the information: we just label the node that has the two GPUs with which GPU has the high power consumption. We can write a label that says something like "GPU one disabled". Once that information is on the node, GPU Aware Scheduling can pick it up, do its business, and make Kubernetes deal with just the pods that use the GPU with the high power consumption.

I think that's enough, so we can go to the demo. For the demo we make a deployment, and these are the three files we normally use when we work with the Kubernetes descheduler. You can see all the relevant sections on the board. First, the deployment file: the application is simply nginx, and we're going to deploy five pods of it. It carries the link to the telemetry aware scheduling policy, so each of those pods is associated with a TAS policy. The next section is where the node labels will come from, because we have a TAS policy with the labeling strategy. And the final one is a simple policy for the Kubernetes descheduler that just tells it to evict pods that are violating node affinity rules.
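That third file, the descheduler policy, can be as small as this — a minimal sketch using the descheduler's v1alpha1 policy format, enabling the strategy that evicts pods sitting on nodes which no longer satisfy the pods' required node affinity:

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeAffinity":
    enabled: true
    params:
      nodeAffinityType:
      # Only required-during-scheduling affinity is re-checked for eviction.
      - "requiredDuringSchedulingIgnoredDuringExecution"
```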
I apologize in advance if some demo effect happens here, even though it's recorded. Let's start. In this example we have a cluster composed of three workers and one control plane, so only the worker nodes receive pods. On the right side you have the metrics associated with each of the workers, and on the far right the labels associated with each of those nodes. The top panel is where you're going to observe the resources being created during the demo — the pods being deployed and all the operations that happen to them. The middle is where you see the events at each step, and at the bottom you're eventually going to see parts of the logs.

We already have the telemetry aware scheduler deployed, along with the policy. The policy, again, is shown on the left side. For this demo it composes three strategies — dontschedule, scheduleonmetric, and labeling — each with its own metric rules. dontschedule is going to tell TAS which nodes have violated their rules and filter them out. We can see that worker node three has the health metric equal to one; that's the metric's current state, and if you look at the TAS policy, that's exactly the value that should not be there. In that case, the node is put out of the scheduling process: worker three is out, filtered out.

The next one is the scheduleonmetric strategy. As I mentioned before, it's based on priority, and the metric we use here is simply temperature. Among the remaining workers, the one with the lowest temperature is selected and will receive the majority of the pods; in this case we only have five, so they're all going to land there. The node with the highest priority, the lowest temperature, is worker two. So all five get deployed, we can see the containers created and so on, and all five end up running on worker two. And here is an example of the TAS logs, which show each step that happened: it finds the policy, runs the filtering part, selects which nodes are available that don't break the dontschedule rule, then goes to the prioritization part, which is based on the scheduleonmetric strategy, builds the ranking, selects based on that rank, and sends the result back to the default scheduler.

OK, so now we have our five pods running on worker two. Now, there is a reason we have one metric related to memory use: we expect the value of that metric to increase on worker two, because it has more pods running now. We push it a little, just enough to go above the threshold related to the labeling strategy, and we see the labeling happening. If we take a look at the logs, they say the metric value on worker two crossed over the threshold for the labeling strategy, breaking its rules, and TAS writes the label we described in the policy onto the node that broke the rules. That's exactly what the log describes.

So this is where we are at the moment: we now have two nodes that are not able to receive pods, one because it breaks the dontschedule rules, and the other because it violated the labeling rules. So we expect just one node left: if we increase the number of pods to be deployed on the cluster, they scale up and end up on the remaining worker, worker one. Again, you can see the cluster state, and you have the label on worker two. But notice that we still have pods on worker two — no eviction is happening. Depending on the situation and on your policy, we would like those to be migrated away to more suitable nodes. TAS, as we mentioned before, is not capable of evicting pods. In this case we have to call in the help of the Kubernetes descheduler, and once we have its configuration ready, we just deploy it.
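For this eviction chain to work, the demo pods carry a required node affinity rule keyed on the policy label, so that when TAS marks a node as violating, the pods on it stop satisfying their affinity and become candidates for the descheduler. Roughly, in the pod template (demo-policy again being the illustrative policy name):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        # TAS sets this label to "violating" on nodes that break the
        # deschedule rules of the policy named demo-policy.
        - key: demo-policy
          operator: NotIn
          values:
          - violating
```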
And you see immediately that the Kubernetes descheduler looks at the pods associated with the nodes that break the node affinity rules, and evicts them. The descheduler works in cycles: once you deploy it, it immediately does those checks, the evicted pods go back into the queue, and on the next cycle the default scheduler finds the best available node. Now we don't have any more pods running on worker two. Again, with no pods, memory should go down: the memory usage metric should go back to roughly the previous levels. And when that happens, the labeling strategy rules are not broken anymore, so we expect the label to be removed — and that's what happens. Once it does, worker node two is back alive, ready to receive pods again. But worker one now has ten pods, and it's the same story: memory use increases, we expect it to break the rules by going above the threshold, and TAS starts writing the labels on that node. And once the label is there — remember, the Kubernetes descheduler is now up and running on the cluster — it looks at the node list, sees there is one node breaking the affinity rules, evicts all the pods on that node, and they get rescheduled to the workers available in the cluster at that moment. So yeah, that's it.

Before we conclude the session, I just wanted to give a few more details. As I said, Telemetry Aware Scheduling is an open source project, and above you can find the link to our public repo. If you're interested, you can look at the repo and push PRs or open issues if you see anything you would want to work on or fix. If you're interested in features, you can also reach out to us via the Kubernetes Slack or email; you have our emails in the bottom right. You can also find us at our booth, so feel free to stop by and talk to us. And if you're interested in learning a bit more architecture-wise, you have links to our white papers, so feel free to check them out. That's it in terms of the presentation. Thank you very much for your attention, and I'm going to open it up for questions if you have any. Thank you.

Hello? I'm here. Over here. Sorry, the lights are out. No problem. From what you showed us, the scheduler now has a dependency on the metrics pipeline. If the metrics pipeline fails, will scheduling be blocked, or is it able to ignore the telemetry and keep scheduling? Is it a hard dependency, or is it optional?

I would say it might not be able to react as fast, so you might need to wait a bit until the metrics come back online. But I think I'd need to test this scenario a bit more; this is my opinion. So I think you need to wait until it comes back. Do you want to add more details to this? Sorry, can you come to us once we finish? We can talk about this more, because I can't hear you well and I need to know where to look.

Yeah, I was going to ask: what happens when the telemetry scheduling you have configured for a deployment doesn't find any suitable nodes, but you still have available nodes? What happens in this scenario?

I think it will just look continuously for nodes.
So if something happens and maybe your node comes back online, it might just pick it up as it continuously looks for nodes. If not, you might need to go back and alter your policies; at least that's what I think it would do. With TAS you get a suggestion for a placement, so when no node satisfies the policy, you would just fall back to whatever the default scheduler wants to do, and your TAS policies might not take effect.

Hi. The metrics pipeline: does it generate new metrics, or do you also work with default metrics? Can the telemetry policy also be applied to the default metrics from the nodes, or to custom metrics that are generated, say, via a script or manually? Can you do that?

Basically, you can use any metrics, because it depends on your exporter. In this example we use the node exporter for Prometheus; you can also use kube-state-metrics. It's up to you to select what type of metrics you're interested in. collectd can be used as well. OK, thank you.

Now I'm going to read out a question from the online users: does TAS work well with topology spread constraints to ensure an even distribution of application pods? Even distribution of the deployment, I assume; I guess they're asking about a deployment or a stateful set. Yeah, we haven't thought about that — that's a good question. Maybe follow up in Slack afterwards: if whoever asked can ping us on Slack with the question, we'll try our best to answer. OK, thank you. We've got a couple more minutes in the room. I think he was first.

Hello, I have a small question. What will happen after a pod is scheduled? During scheduling, the pod is placed on one of the nodes, and then the metric changes. Will the pod be rescheduled during execution, or is the change ignored during execution? Your question is what happens to the pod after scheduling? Yes, exactly. The default scheduler usually checks these rules only when it decides where to schedule a pod, and ignores changes afterwards. But in this case the metric is always changing; if it changes during execution, what will happen?

I think there is a timing aspect to that; it depends on how long the process takes. Eventually, if you hit a window where pods have been descheduled while the default scheduler is acting, you might get that kind of conflict. It's something we haven't fully explored, because we basically used the descheduler just for this demo, to show the intended behavior of the application. But thank you, we will take a look at that as well. OK, I think we have time for one more.

Hello. My question is: is this scheduler going to respect the PDB constraints that we have for the pods of a deployment? Can you repeat the question? Yes: is this scheduler going to respect the restrictions set by a PDB, a pod disruption budget? How is this going to work together with a PDB set for a deployment or a pod? I think we need to look a bit more into this, because I'm not sure I understand what you're asking, sorry. I mean, a pod disruption budget: it's a configuration, or a policy, that you define in Kubernetes in order to restrict how Kubernetes evicts pods from a node, or the minimum number of pods that have to keep running for a certain application.
So I'm asking because, if the scheduler is going to move pods around, it could actually violate the restrictions set by the PDB. Does that make sense? Not quite; I think I need to follow up on that, sorry. Thank you. OK, well, thank you very much, and we're at time. Thank you.