Okay. Welcome to the talk on cutting climate costs with Kubernetes. Hi, I'm Shiva Rezaei. I'm Iranian; I migrated to the U.S. 11 years ago, and I just graduated with a CS and psychology degree from the Harvard Extension School. I've worked in many different industries, but this is my first tech job, doing developer relations for Sidero Labs. So you've been warned; just adjust your expectations.

Hi, I'm Steve Francis, the CEO of Sidero Labs. I graduated with a CS degree and a law degree, a lot longer ago than Shiva did, in Australia. I've spent most of my career as an SRE working for SaaS companies, and then about 15 years ago I started my own SaaS company, LogicMonitor.com, a data center monitoring SaaS used by PayPal, Netflix, and E*Trade, at least at the time I was there. While at LogicMonitor, I worked with Andrew Rynhard, who was doing our migration onto Kubernetes eight years ago, something like that. Andrew went on to start an open source project, Talos Linux, which makes deploying and managing Kubernetes easier, based on the work he was doing. Then he started a company around that about four years ago, and I've been CEO of it for the last three years. So now I can confidently say how little I know about Kubernetes, but a little more every day.

Okay, so today we're going to talk briefly about our company and our products, and then go into how we can cut climate costs with Kubernetes. We'll do a canned demo and take questions after that.

So Sidero Labs, as Steve just mentioned, was founded around Talos Linux, a Linux distribution written from scratch just for Kubernetes. It's as if Linux and Kubernetes had a baby and sent it to Avengers training, and it came out very fabulous, but it only does Kubernetes. There's no SSH, no bash, no package manager. It's API-driven and configured declaratively, just like Kubernetes (an abbreviated sketch of a machine config follows at the end of this introduction). It's very minimal, with little attack surface.

Then for our other products: Sidero Metal is a Cluster API (CAPI) provider for the creation and management of bare metal clusters. The other product we have is Sidero Omni, a SaaS that simplifies cluster creation and management on bare metal, edge devices, or cloud instances. It works by booting machines off ISOs, AMIs, or other images, and then you use the API or UI to create clusters in a few clicks. If you haven't used it, you should try it; it's a few clicks, it's much easier.
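To ground the "declarative, just like Kubernetes" point, here's an abbreviated sketch of the shape of a Talos machine config. Real configs are generated with `talosctl gen config` and include certificates and tokens omitted here, so treat this as illustrative only.

```yaml
# Abbreviated Talos machine config sketch: the whole machine is described
# declaratively and applied over the API; there is no SSH or shell.
version: v1alpha1
machine:
  type: worker                       # this node's role in the cluster
  install:
    disk: /dev/sda                   # disk Talos installs itself onto
cluster:
  clusterName: demo
  controlPlane:
    endpoint: https://10.0.0.1:6443  # illustrative control plane endpoint
```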
So I know from my days working as an SRE that the easiest way to solve scaling problems is to overprovision. Especially if you're talking about bare metal: back in my day it was, well, we're going to need more servers, so you order them from Dell and they show up in two months, and then it takes another two months to get power and networking provisioned in your data center. To avoid all that, you just deploy more than you need at any given point in time. I know when I ran data centers for an ad tech company, we had peak demand during US East Coast morning hours, and during that time several hundred servers were almost all busy running a pretty high CPU load. But outside those times, we could have gotten by with probably a quarter of the servers, maybe even fewer.

But we didn't. We left all those hundreds of servers running all the time, all consuming energy, all generating heat, and that heat required more cooling, which consumed more electricity. Not very efficient. Studies show US data centers contribute between one and two percent of all US greenhouse gas emissions, so that's a sizable target and a pretty big opportunity.

This talk, as it appears in the program, is about using Cluster API to solve climate change. We changed that. We initially thought this was going to be a CAPI talk, because our largest customers, like Nokia, do in fact run Sidero Metal, which is a CAPI provider, for their bare metal, and having Nokia be able to power down part of their fleet is a great idea. But we also have a whole bunch of customers that use Talos by itself, without CAPI, to run relatively static clusters. And then, as Shiva mentioned, we have Omni, our SaaS for Kubernetes on edge devices, with customers running hundreds of clusters on it. So we wanted something more universally applicable than just CAPI: if we can find a way to shut down extra servers, we want to do it for everyone. What we came up with is not CAPI-specific; it's not even Talos-specific. It can be used by everyone, whether they're running Sidero Metal, Omni, Talos, or even, foolishly, one of those other operating systems that is not Talos. If they're not running Talos, you're going to ask: what are they doing?

Now, there are various systems in Kubernetes that make things adjust dynamically to the workload. However, none of them met our design goals for this project, which are being emissions aware and broadly applicable. Those options are the horizontal and vertical pod autoscalers and the cluster autoscaler, so here's a quick review. The horizontal pod autoscaler adds or removes pod replicas in response to the workload: if there aren't enough, it adds some; if there are too many, it removes some (a minimal example follows at the end of this section). The vertical pod autoscaler changes the CPU and memory requests and limits of a pod; it doesn't schedule more or fewer replicas, it just resizes that specific pod. Neither of these affects the infrastructure, which is what we want to do. The cluster autoscaler does, kind of: it automatically adds or removes nodes in a cluster based on the resources all the pods request. It looks for unschedulable pods and scales the cluster up if needed, and it also tries to consolidate pods that are currently spread across only a few nodes so it can free nodes up. That's a large part of what we're doing here, but it's not emissions aware: it doesn't check what type of energy we're currently using. It also doesn't consider pod priorities in scheduling, which we have to consider. And on bare metal it only works with CAPI-managed clusters, like Sidero Metal's. We wanted something more broadly applicable, so that's what we did.
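For reference, here's a minimal sketch of the horizontal pod autoscaler just described, using the standard autoscaling/v2 API; the workload name is hypothetical.

```yaml
# Minimal HPA sketch: scale a hypothetical "batch-worker" deployment
# between 1 and 10 replicas to hold average CPU utilization near 70%.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: batch-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: batch-worker          # hypothetical target deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```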
Yeah, so we would have had to write a scheduler anyway to work with the cluster autoscaler, so we decided to make it simpler and just write an emissions-aware scheduler. Kubernetes allows you to run multiple schedulers, which is a good thing, because you can apply specific schedulers to specific deployments. That means you can use the generic Kubernetes scheduler to run your critical workloads, your control plane, and everything like that, and then apply a custom scheduler to the workloads that are less critical. The default scheduler, yeah, we just left alone.

So at a high level, what this project does is check the emissions on the grid in the area where your data center is, and get back an answer from zero to 100, where zero means the energy is all renewable and clean and sunlight, and 100 is the worst: it's made by burning baby seals and owls, and tearing down some redwoods just for fun. It then compares the current emissions to the pod priority, and only runs pods whose priority is higher than the emissions.

The scheduler has three components: the scheduling logic, a pod manager, and a node manager. The scheduler looks at pod priorities and emissions. We'll go through the actual process in the demo, but it decides whether a pod is worth running in the face of dirty energy. If your pod is time-critical, you'll want to make sure it runs even if it costs an owl or two. But a lower-priority pod could start while the energy is clean, and then the energy changes; what do you do then? That's what the pod manager is for: it evicts those pods. And then, to save energy, the node manager checks what's happening. If nodes are needed, it powers them on; if they're idle and not needed, it turns them off.

The way you create a scheduler that behaves differently from the default Kubernetes scheduler is with plugins that can change the behavior at different extension points: sorting the queue, pre-filter, filter, and so on. If you don't override an extension point with a plugin, the upstream scheduler code just runs as the default, and that's perfect; that's what we want. We only affect the points we care about, and the rest works as usual. In this project we affect the pre-filter stage: our plugin filters whether a pod should run by comparing the emissions level with the priority of the pod. A configuration sketch along these lines follows below.
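As a rough sketch of what wiring a custom pre-filter into a second scheduler can look like, here is a KubeSchedulerConfiguration of the kind described above. The scheduler and plugin names are hypothetical, not the project's actual registration names, and the MostAllocated scoring strategy shown here is the bin-packing option we come back to later in the talk.

```yaml
# Sketch only: a scheduler profile enabling a hypothetical emissions-aware
# pre-filter plugin. Every extension point not overridden here falls
# through to the upstream default plugins.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: emissions-aware-scheduler   # hypothetical name
    plugins:
      preFilter:
        enabled:
          - name: EmissionsPreFilter           # hypothetical plugin
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated                # pack pods onto busy nodes
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```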
Yeah, so just a note about the cluster topology in use in this demo. We're going to show off a Talos Linux feature, not just because we think it's cool (though it is), but because it's actually applicable to the energy-saving use case. The way this scheduler powers nodes on and off is by connecting to the BMC, the baseboard management controller, using IPMI. So for that to work, wherever the scheduler is running, it needs network access to the BMC cards of the servers that are going to be powered on and off. Normally that's not a problem, because a cluster tends to run in a single data center. What we often do at Sidero Labs is not that. Talos Linux has a feature called KubeSpan, which transparently encrypts traffic within a cluster when it needs to, and that's great because it means you can run your control planes and workers in multiple locations. One thing we often do is run big, beefy nodes on, say, Equinix Metal; they're large and powerful and can do a whole lot of CI/CD work very powerfully, but they're way overkill for a pretty small cluster that doesn't see a lot of dynamic changes. So we run our control plane nodes somewhere cheap, as small virtual machines in Amazon or Azure, or in this particular case Vultr, and the workers somewhere else on bare metal.

But this also gives rise to an issue: KubeSpan takes care of all the traffic within the cluster, but the BMCs aren't within the cluster, so they're not part of the KubeSpan mesh, the WireGuard overlay. So in this case we actually had to run the scheduler on a worker node in the same data center, behind the firewall, so it could actually reach the BMCs. That's why it's not just a simple deployment on the control plane, which is what you would normally do.

So we will now deploy our new scheduler and see what it does. Basically, these are the steps you need to go through to deploy this. It's all up on GitHub, with a readme and documentation and so on, but these are the high-level steps, and we'll go through them one at a time.

First, you need to create a WattTime account to access real-time grid emissions data for your area via an API. WattTime is a nonprofit; on their home page you can enter your zip code and it will show you the current emissions, which we did for Chicago this morning. Chicago is not that bad, not that great either.

The next step is creating priority classes, because Kubernetes loves complication, or abstraction, depending on how you look at it. You can't set a priority on a pod directly; you have to create a priority class and then reference that priority class in your pod creation manifest. In this case we're creating two priority classes: one called high-priority with a value of 100, and one called low-priority with a value of zero. The high-priority one is going to run basically no matter what the emissions are, owls be damned, and the zero-priority one is only going to run if the energy is completely renewable, coming from the sun and wind and nothing else. So we create the priority classes we defined with the usual kubectl apply, and that goes through (sketches of these manifests and the node annotations follow at the end of this section).

All right, so now we've created our priority classes. We next need to say: this particular BMC is associated with this particular piece of hardware. The way we do that is by annotating the nodes. Here we're adding three annotations, three values, for each node: the BMC's IP address, and the username and password to access that particular BMC for that particular node. You do this for each node in your cluster that you want the controller to manage.

We also need to edit the DaemonSet to configure it with the WattTime credentials, which you'll see here, and to set the region, in this case San Bernardino. And that is done. Then on to the next step: we can apply our DaemonSet, and we get all the fun output showing that it is configured.
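For reference, here's a minimal sketch of the two priority classes just described, plus the shape of the per-node BMC annotations. The PriorityClass fields are standard Kubernetes; the annotation keys are hypothetical placeholders, so check the project's readme for the exact names it expects.

```yaml
# The two priority classes from the demo: high-priority runs no matter how
# dirty the grid is; low-priority runs only on fully clean energy.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100
globalDefault: false
description: "Run regardless of grid emissions"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 0
globalDefault: false
description: "Run only when the grid energy is fully renewable"
---
# Hypothetical annotation keys for the BMC details the node manager needs;
# the real key names are in the project's documentation.
apiVersion: v1
kind: Node
metadata:
  name: worker-01
  annotations:
    bmc.example.dev/ip: "10.0.0.21"
    bmc.example.dev/username: "admin"
    bmc.example.dev/password: "not-a-real-password"
```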
Okay, so now we've deployed it. What happens? We've got our scheduler, and as we've said, the scheduler checks the current emissions from WattTime for the area you've said your data center is in. Pods in the scheduling queue with a priority higher than the current emissions pass the pre-filter and continue on with the scheduling process. Pods with a priority lower than the current emissions don't pass the pre-filter. That's all the scheduler does.

And then the pod manager basically checks any pods that are running now: if their priority is less than the current emissions, they get evicted. And the node manager does one of two things: it turns nodes on or it turns nodes off. It'll turn a node on if there are pods that have passed the scheduler and are waiting to run but have nowhere to run. If that's the case, it says: all right, the scheduler has decided these pods are worth running given the current emissions level, so I should turn on another machine, and it does. And conversely: if the pod manager has decided everything on a machine is worth evicting, it can shut that machine down.

To apply the scheduler to a deployment, we set the scheduler and the pod priority in the deployment YAML file: we're just adding a schedulerName and a priorityClassName. The schedulerName is the scheduler we defined before, and the priority class in this case is high-priority (a sketch of this fragment follows at the end of this walkthrough). Then when we deploy it, the high priority causes it to be scheduled regardless of emissions, and since there's nowhere to run it, the node manager powers on another node. And that's exactly what we just explained: you can see one of the nodes at the bottom, right down here, switched from NotReady,SchedulingDisabled to Ready in response to the need to run this pod.

Okay, another one. Now a low-priority pod just got deployed: the workload was created, and when we look at the pod status, we see the high-priority one is running while the low-priority workload is still pending. So now we describe that low-priority pod. Why is it still pending, you may ask? On to the next slide: we can see it failed scheduling. It's stuck in pending because zero of the nodes are available, because the pod priority of zero is lower than the emissions index of 44. Had the pod had a priority of 50, it would have run; with anything less than 44, it doesn't pass the scheduling pre-filter, so it fails scheduling.

And here we see that if we change the priority of the running deployment to low-priority, it is evicted, and kubectl get pods shows both pods are now pending. And because that freed up a node, after a few moments the node is powered down and sits idle: it switches from Ready to NotReady,SchedulingDisabled. It just changed here. That node no longer had anything running on it, so the node manager said: great, I can shut you down and save energy.
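Here's a sketch of the deployment change used in the demo. The schedulerName and priorityClassName fields are standard Kubernetes pod spec fields; the workload name, image, and scheduler name are illustrative.

```yaml
# Deployment fragment: opt this workload into the emissions-aware scheduler
# and give it the high-priority class so it runs regardless of emissions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: important-batch                        # hypothetical workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: important-batch
  template:
    metadata:
      labels:
        app: important-batch
    spec:
      schedulerName: emissions-aware-scheduler # illustrative name
      priorityClassName: high-priority         # from the classes above
      containers:
        - name: worker
          image: busybox
          command: ["sleep", "infinity"]
```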
Okay, so this is a part of the project that still needs work: we used a poor man's bin packing system. We take advantage of two attributes of the Kubernetes scheduler here: the default queue sort, and a non-default option for the NodeResourcesFit plugin. Our scheduler uses the MostAllocated option for NodeResourcesFit, which basically means it puts things on machines that already have other things on them. Normally, by default, Kubernetes will spread the workload out among all the workers in the cluster. The MostAllocated option says: if you've got, say, five nodes and one of them has a job running on it and you get another deployment, put it on that same machine if it fits. So it tries to put things where they're most allocated. The other attribute is that the default scheduling system schedules high-priority jobs first. We didn't do anything for that; that's just default Kubernetes, the default priority sort.

So basically, if you turn on the cluster and there's a variety of high-priority jobs, they all get scheduled first, and with MostAllocated they all land together on the same nodes. This means you'll tend to get some nodes holding all the high-priority jobs, others the medium-priority jobs, others the low-priority jobs, and when the low-priority jobs get evicted, that tends to free up entire machines. So that's kind of a poor man's bin packing. It's clearly not perfect, and it's not dynamic: it doesn't take account of taints and tolerations, resource constraints, some machines having GPUs, or different machine classes. But if your workloads are fairly consistent and your cluster is consistent, it'll work. So it's definitely not perfect.

So that's kind of it. Things to improve? Everything. Bin packing, yes, certainly. We don't have full integration with Omni at the moment, or Cluster API, or the cluster autoscaler, but these are things we can add in, and we certainly plan to do things like that, and we'd love to get ideas from the community.

Yeah, so good use cases for this would generally be any bare metal workload that has periodic demand: things that can be time-shifted, things that aren't time-critical. Not things dealing with interaction with humans, because those tend to need to happen when the humans need them. But anything like a batch job, say bank financial processing that happens after hours: if it needs to happen within a 12-hour window, you'd rather schedule it in the part of that window when solar energy is online, which in California is usually the daytime, not late at night.

That's all we've got. Now we can take questions. We'll change the slide so you have access to all of these links, and we have a happy hour after, but we'll take questions now. The GitHub repository is there; that's where the project lives. We would love you to rate our session highly. You can rate it lowly too, but higher would be better. And we're doing a customer mixer after this at Simone's Bar in an hour, which is not far away. But we'd also love to open it up and hear questions, feedback, discoveries, and thoughts on the talk we just gave. I don't know how the microphones work; do you go to a microphone or do the microphones come to you? Two: one over there, one over here.

Does your current implementation have any sort of mechanism for ensuring a maximum wait time, like if the local grid goes down? Well, if the local grid goes down, then your data center's offline. The clean energy grid, that is. Okay, yeah, no, it doesn't. That is actually exactly a point of improvement I had in there and took out because I thought it was a bit complicated, but apparently it's a real use case. What we were thinking is having the priority of a job increment over time. So if you schedule a job at priority ten, it'll only run when the energy's clean, but if the grid never gets that clean, your job's never going to get scheduled. So you may want a process that says: for jobs using this scheduler that have a low priority, every ten minutes their priority goes up by one, and eventually they'll get run. But that's not implemented at the moment. Yeah, thank you. Oh, I should look at the Slack too.
What sort of audit trail exists, so I can see what the impact was and try to understand that we were able to save this much energy, some metric I can show back that says: yeah, we actually did something and it worked? Yeah, that's a great idea. We'd only know part of that information from within Kubernetes, but if you're monitoring your power draw within your data center, you can definitely do it from that side. The system can report that this number of nodes was powered off at this time, but it doesn't know whether each one is a 10,000-watt server or a 500-watt server. And it's not polling, it's event-based, so yes, the question was whether, since we're logging events, those metrics are things we can use, and they are.

Did you consider using Karpenter for the whole node lifecycle, shutting nodes down and bringing up new ones? Uh, no. But I can't tell you why. We just said we'd take questions; we didn't say we were going to answer them. It sounds like there's a huge overlap with Karpenter's consolidation; it's an AWS thing, and it takes money as the input parameter. Make it cheap, make it green. Yeah, yeah. Like we said, there are overlaps with the cluster autoscaler, overlaps with Karpenter, overlaps with Gardener. But we tried to avoid adding excess complications, so this is just looking at emissions and scheduling in that way. There's certainly room to integrate it into these other systems, and I think that would actually be really helpful, but at the moment this is a standalone project.

Sorry, I'm on the other mic; I jumped over here because it was easier. I was also going to say, about Karpenter: I work on the Karpenter team, and there aren't actually hooks available for something like a data center, like API endpoints. You'd have to have something like CAPI there to be able to call out for a fleet of machines. You'd need both extensions available before you could integrate this with Karpenter. My question about the shutdowns was about racks: say I actually have 50 machines that can go away and there's headroom available. I would love to be able to shut down a whole rack in the data center so I can turn off the AC to portions of the building, because that extends how much power I'm saving for something like a batch job. If my batch jobs only use 50% of the data center at a time, is any kind of server co-location awareness available in there?

Not as it is, but that would be doable if you annotated your servers with their rack location and then had the node manager prioritize common rack locations first. I was thinking rack position and server height too, because the cooling cost of lower versus higher positions in the rack would also be valuable: turn off the high ones first, because they take more to cool. That would be talk number two for the next North America, or when we all meet in Paris in six months. Give us some time.

Are there data sources that support energy grids outside of North America? I was messing with it and threw in some UK zip codes just for fun, and WattTime doesn't cover them; I don't know if there's an equivalent API outside of the US. Probably each country and continent also has its own metrics and different systems. It would be possible to find the individual ones and then aggregate them all...
I would actually think it's probably easier in European countries, because those countries tend to have unified power grids, as opposed to the US, which has fragmented, crazy, private-enterprise power grids. I think maybe GreenPixie in the UK might be the supplier there. And some places probably don't even have anywhere to report; you don't get information if you're in Texas. I was going to say something there, but let's not make it political. Anyone else?

I would like to ask: who uses Talos Linux now? Show of hands. Some. Good. What are the rest of you doing? You're doing it wrong. I guess that's it. You get five minutes of your day back, and you can use that time to commute over to Simone's Bar or wherever your next event happens to be. But if you're joining us: Simone's Bar at 6:30. Thank you for coming. Thank you very much.