Okay, I think we're going to get started if everyone's okay with that. Hi everyone, thank you for coming to our talk today on operationalizing sustainability in Kubernetes. In other words, how to care about the planet from behind your desk.

Just to give a brief intro: my stunning colleague on the right here, oh, I see two stunnings, is Gabby. Gabby is an amazing platform engineer, an incredible systems engineer, and a very mediocre snowboarder. That's actually why she's interested in saving the planet; she needs all the practice she can get. Well, he's not wrong. I'm better than him. Brendan is a very versatile engineer: front-end, back-end, networking, telecom, you name it, he's probably done it. He has a real passion for knowledge and for bettering himself and those around him, which feeds into the work we do at Resync and into his personal blog, GreenCoder.io.

So that's a little about us. We work for a company called Resync, which is why we wanted to give a talk on this specific topic. As part of our work at Resync we've been researching the impact IT has on the climate. Resync is a consultancy looking to reduce IT-related impact on the environment, mostly by consulting on best practices.

Great. Before we dive in, we're going to get on the same page and provide a bit of context: why are we bothering to talk about this? What is CO2e? And how do you measure sustainability, both in a full system and in a virtualized system? So bear with us through the theory; we do have some pretty cool experiments at the end.

To start off: what are carbon equivalent emissions? Generally we talk about carbon equivalence, which is an overarching term for any kind of greenhouse gas: methane, nitrous oxide, and obviously carbon dioxide. (Sorry, first time on stage, just FYI.) We call it carbon equivalent because we measure all of these gases in terms of their warming effect compared to carbon dioxide, and obviously everybody's heard of the big bad carbon dioxide. If you see either of these two symbols, they designate any greenhouse gas, not only carbon dioxide.

Right. Why should anybody care about this? The IT sector currently uses around 5% to 10% of the world's electricity. Translated into greenhouse gases, that's 2% to 5% of global emissions. At scale, that's a monstrous amount. When we moved to the cloud, we got this idea of the infinitely scalable cloud: you can set up pretty much any application in minutes, and we stopped caring about optimization as much. Obviously it does come into effect at some point, but with this mindset we've slowly seen a huge creep of redundancy and wasted server utilization. We're hoping that IT can be part of the solution, and definitely not part of the problem. And we have to face the facts: paper straws just aren't going to save us.

So great, now we know what carbon emissions are, but how do we measure them in our systems? Without the ability to measure, it becomes challenging to optimize and reduce, since we won't have the means to quantify our changes. Total carbon emissions are composed of two parts: the embodied emissions and the operational emissions. We're aware of the SCI, the Software Carbon Intensity specification; what we use is very similar to that, just without the rate, because we didn't require a rate in our calculations. So we focused on these two components.
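As a rough sketch of what that means in practice, this is essentially the SCI formula from the Green Software Foundation without the per-functional-unit rate R; the function and parameter names here are our own:

```python
def total_emissions_g(energy_kwh: float,
                      grid_intensity_g_per_kwh: float,
                      embodied_g: float) -> float:
    """Total carbon = operational (E * I) + embodied (M).

    This mirrors the SCI formula ((E * I) + M) per R, minus the
    rate R, since we didn't need a per-unit rate in our work.
    """
    operational_g = energy_kwh * grid_intensity_g_per_kwh
    return operational_g + embodied_g
```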
And Brendan's going to talk about what embodied emissions are.

All right. We've gotten into this mindset that carbon emissions are a synonym for electricity or power usage, and unfortunately that misses a huge part of the puzzle. When we look at IT hardware, we're missing a large percentage of the emissions, the part that comes from the manufacturing life cycle and the supply chain that gets the hardware where it needs to be. Take a general-purpose server used by a lot of cloud providers, the Dell PowerEdge R740, and shout out to Dell: they publish great research on every piece of hardware they ship and the emissions it produces. In this case, roughly 15% to 16% of the emissions come from the manufacturing and end-of-life phases of the server. Obviously the question is how we reduce that, because we don't really have direct control over these emissions.

That percentage is based on a four-year server life, which was the default a couple of years ago for most data centers. What a lot of cloud and data center providers have started doing is extending their server lifespan by two years; Google Cloud, AWS, and Azure are all on the same six-year life cycle at the moment, and this has reduced the embodied emissions of those servers by about 4% to 5%. AWS specifically has reported that just increasing the lifespan of their servers yielded around a $900 million benefit to profits, because they're not spending that money replacing servers.

Well, in any industry you get overachievers, and there's actually a company with a booth downstairs you can visit if you have questions: Scaleway. They've figured out how to push server lifespan up to 10 years. Scaleway did a big study across their data centers and found that in most instances the component that failed first was the RAID controller, because RAID controllers contain batteries. Nowadays a lot of servers don't actually require hardware RAID cards, so by removing the RAID controller and retrofitting these servers, you can push their lifespan up. They did this to around 14,000 of their servers, and there's an excellent blog post about it. It's also one of the strategies they've adopted to get ahead of the worldwide chip shortage we're seeing thanks to our friends in the AI department.
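To see why a longer life cuts embodied emissions, amortize the one-time manufacturing carbon over the service life. A minimal sketch, with a made-up embodied figure purely for illustration; real numbers come from manufacturer life-cycle assessments like Dell's:

```python
# Hypothetical embodied figure, for illustration only.
EMBODIED_KG_CO2E = 1200  # one-time manufacturing + transport + end-of-life

for lifespan_years in (4, 6, 10):
    per_year = EMBODIED_KG_CO2E / lifespan_years
    print(f"{lifespan_years}-year life: {per_year:.0f} kgCO2e/year")

# 4 years -> 300/year, 6 -> 200/year, 10 -> 120/year: the longer a
# server stays in service, the thinner its manufacturing carbon spreads.
```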
Okay, and now that we know what embodied emissions are, let's talk a bit about operational emissions, since these are what you probably have the most power to optimize. Operational emissions come from the workload you're running on your server, and they can be calculated as the energy consumption multiplied by a grid carbon intensity coefficient. Yeah, that's a mouthful; we'll get to the coefficient later.

First, energy consumption: the amount of power drawn over a period of time. You can get that for a server in a couple of ways. I'll show you two, but there are definitely more, like eBPF if you've been to the Kepler talk. The first is smart plugs. They're probably the most precise measurement, because they capture the power consumption of everything plugged into them, so you get the consumption of the entire system, and they come with built-in Wi-Fi and an API.

The second is Running Average Power Limit, or RAPL for short. This was developed by Intel; it's a hardware feature on your chip that can expose real-time power metrics for different power domains like CPU, GPU, and RAM. It's exposed through the powercap framework, which is an interface between the kernel and user space, so if you want to know whether your system supports it, just look in the powercap directory. If it's present, it can be read by various tools, including the Prometheus node exporter and Linux perf.

As you can see, these are some of the domains you can measure. We're going to do a little measurement on my laptop, and we'll focus on the package domain, because that covers the energy consumption of the entire socket: the cores, the integrated GPU, and the uncore components like the L3 cache and the memory controller. That's about as close to measuring the whole system as RAPL gets. So I ran this on my laptop the other day, using stress-ng to force a load of 85% over 10 seconds, and as you can see, it reported 313.49 joules of energy consumed.
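This is roughly what that looks like on a Linux machine; a sketch assuming an Intel CPU with RAPL support, and the exact flags may differ from what we ran on stage:

```sh
# Check whether the powercap framework exposes RAPL on this system
ls /sys/class/powercap/intel-rapl*

# Measure package-domain energy while stress-ng holds ~85% CPU load
# for 10 seconds (perf needs root, or a relaxed perf_event_paranoid)
sudo perf stat -e power/energy-pkg/ -- \
    stress-ng --cpu 0 --cpu-load 85 --timeout 10s
```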
So now that we know how to get energy consumption, let's look at the grid carbon intensity coefficient. This coefficient represents how many grams of carbon dioxide equivalent are released into the atmosphere per kilowatt-hour of electricity generated. A higher number typically means the electricity comes from a coal plant or another not-so-great source; a lower value usually means wind or solar, and that's what you want to aim for. One way to get that value is a service called Electricity Maps, which provides real-time carbon intensity data for a specific region at a point in time. On Monday, when I was running that perf command, the carbon intensity here in France was about 40 grams per kilowatt-hour.

Great, so let's put it all together and get the total operational emissions from the RAPL reading and the carbon intensity. We first convert joules to kilowatt-hours, then multiply by the coefficient, and we get roughly 3.5 × 10⁻³ grams of CO2e from just that little run.
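The arithmetic, as a quick sanity check, using the numbers from the run above:

```python
JOULES_PER_KWH = 3.6e6          # 1 kWh = 3.6 million joules

energy_j = 313.49               # RAPL package-domain reading
intensity_g_per_kwh = 40.0      # Electricity Maps, France that Monday

energy_kwh = energy_j / JOULES_PER_KWH       # ~8.7e-5 kWh
emissions_g = energy_kwh * intensity_g_per_kwh
print(f"{emissions_g:.2e} gCO2e")            # ~3.48e-03 gCO2e
```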
So now we know how to do this for a full system; next we need to talk about virtualized systems, because in most instances we're talking about cloud providers and data centers where we don't actually have access to the underlying hardware. We have checked: cloud providers aren't really keen on us walking into their data centers and plugging in smart plugs. We're still waiting for responses. And even though RAPL is quite accurate, most providers have disabled access to it; there's a CVE where users with local access could potentially obtain unauthorized disclosure of information through it.

So with virtualization, we have to fall back to the components that actually draw power in any system: mainly the CPU, the memory, storage, and networking. Storage and networking do play a part in power consumption, but in most systems the majority of the power draw is CPU and memory; this depends on the system and the workloads, of course.

We also have to face the fact that servers are complicated. There are different compiler flags, different hardware components, different energy profiles, and unfortunately cloud providers aren't very forthcoming about the configurations they run underneath. So the tendency for a lot of tooling in this area is to use machine learning models trained on the SPECpower database, a huge database of hardware configurations with their full energy profiles. Combining that with the information we do have about a virtual machine, such as the CPU type, the memory type, and utilization levels, we can run it through these models to get an estimate of the current power consumption of a server. We're seeing different model families in use: k-means, SVM regression, not so much plain linear regression. With all of these models, though, we're still a bit in the dark due to the lack of data from cloud providers. I'll keep repeating that in case anybody here works for a cloud provider; come chat to us afterwards.

To sum up, how do we measure power consumption and emissions in our systems? We make a whole bunch of assumptions, a whole bunch of estimations, and use whatever data we can glean from cloud providers. Hint, hint.
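In spirit, the estimation boils down to something like this: take published utilization-to-power measurement points for a machine resembling your VM's host, and map utilization to watts. A deliberately simplified sketch; the profile numbers are invented, and real tools use learned models rather than straight linear interpolation:

```python
# SPECpower-style profile: power draw (watts) at utilization steps.
# These numbers are invented for illustration.
PROFILE = {0: 45.0, 10: 80.0, 50: 160.0, 100: 210.0}  # util% -> watts

def estimate_watts(util_pct: float) -> float:
    """Linearly interpolate host power from a utilization profile."""
    points = sorted(PROFILE.items())
    for (u0, w0), (u1, w1) in zip(points, points[1:]):
        if u0 <= util_pct <= u1:
            frac = (util_pct - u0) / (u1 - u0)
            return w0 + frac * (w1 - w0)
    return points[-1][1]

print(estimate_watts(15))  # ~90 W; note the idle floor of 45 W at 0%,
                           # the curve is far from proportional
```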
Okay, now that we've chatted at you for a while, let's get to the good part: how you can actually optimize your clusters to reduce carbon emissions.

First up is the idea of time shifting. As an industry, we've been running cron jobs for I don't know how many decades, and the idea behind time shifting is to move our workloads onto schedules that coincide with lower carbon intensity. Carbon intensity fluctuates throughout any given day; this was a Monday morning, and that nice big orange spot over there is otherwise known as the sun, which happens between 7 a.m. and 7 p.m. Not in Amsterdam in winter, admittedly. But even with such a green energy source, we also notice that carbon intensity is much higher during those daytime hours, because that's when most of us are working. So by literally just shifting your workloads into the evening, between 7 p.m. and 7 a.m., you get a drop in the carbon intensity behind your workloads.

Not all workloads can shift in time, though. If you do need to run at a very specific time, you can instead shift your workloads to regions that produce greener energy. In my little experiment, the Netherlands on Monday morning, when I started working, had a carbon intensity of 404 grams per kilowatt-hour, and if I ran my workload there, that's essentially what it would be charged against. If we shift to our northern neighbors, still in Europe, still under GDPR, and yes, I do understand that running workloads in different regions comes with compliance and security concerns, Sweden had a carbon intensity of 24 grams per kilowatt-hour. Exact same workload; all I'm doing is moving it from the Netherlands to Sweden, and I'm getting roughly a 94% reduction in carbon emissions.
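A sketch of how you might automate that choice; this assumes the Electricity Maps v3 latest-carbon-intensity endpoint and an API token, so check their current API docs before relying on the exact shape:

```python
import requests

# Assumed endpoint shape; verify against the Electricity Maps API docs.
API = "https://api.electricitymap.org/v3/carbon-intensity/latest"
CANDIDATE_ZONES = ["NL", "SE", "FR"]  # regions we could deploy to

def greenest_zone(token: str) -> str:
    """Return the candidate zone with the lowest current gCO2e/kWh."""
    intensities = {}
    for zone in CANDIDATE_ZONES:
        resp = requests.get(API, params={"zone": zone},
                            headers={"auth-token": token}, timeout=10)
        resp.raise_for_status()
        intensities[zone] = resp.json()["carbonIntensity"]
    return min(intensities, key=intensities.get)
```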
With these two ideas combined, we get the follow-the-sun methodology, very much stolen from our support engineering colleagues, where you have an engineer in every time zone following the sun; here it's the software following wherever the carbon intensity is lowest. This is what's being called carbon-aware software: your software is aware of the carbon intensity of the grid it runs on. It's one of the front-running ideas in the industry for reducing your carbon footprint, and we didn't feel like going too deep into implementation details because it is very, very, very well documented.

So then we started thinking: how does node utilization come into this? It has been observed that the majority of servers sit at only 10% to 20% utilization. That graph may be from 2009, but we found other data that is very comparable: McKinsey & Company found global utilization of around 15%, and in 2022 IBM found server utilization sitting at only 12% to 18% of capacity. Not only is that low utilization a waste of idle resources, but let's look at how it affects carbon emissions. When comparing utilization against power draw, it's easy to assume the relationship is linear, right? The more workload you have, the more power you need. That's actually not correct. The rate at which power consumption grows actually decreases as utilization increases, which means the higher your utilization, the more value you're getting out of your energy consumption.

We wanted to show how this plays out with a little experiment. To set the stage: we tried to keep as many variables consistent as we could; we wanted to play pretend scientists. We created a single cluster with a single control plane. We used Google's microservices demo, because we felt it's a good representation of workloads running nowadays. We set the region to europe-west1, turned off all autoscalers, and had two node pools with a single node in each: one node with four vCPUs, one with eight. We removed all resource limits on the workloads, so they could essentially grow and shrink as they wanted. We then scaled the load incrementally in steps of ten, leaving around 20 minutes between each step to observe the effects.

To measure, we used a CNCF project called Kepler, an eBPF-based tool that measures power on your Kubernetes cluster. We also decided we needed something that measured the full underlying system. Cloud Carbon is a terrible name, we're looking for a new one and we'd appreciate suggestions, but essentially we created our own tool that measures VM utilization and converts it into emissions as well.

Great. As Brendan said, the Google microservices demo comes with a load generator; we started with one instance and incremented by 10 instances every 20 minutes. As you can see, going from the first instance to the tenth costs about 0.4 watts, while by the jump from the 50th to the 60th instance, a 10-instance step only adds about 0.05 to 0.1 watts. So it's showing that nonlinear rate of change, and the arc is very similar to the energy proportionality graph. In other words: the more utilization, the more value you're getting out of your energy consumption, and the better for the planet.

Now, this is where we brought in both tools. Kepler specifically measures containers and namespaces in Kubernetes, so if you look at the previous graph, the increase was very similar, because it was measuring the exact same workload. What we wanted was the underlying virtual machine of each node pool. The blue line at the bottom is the VM with four vCPUs, and you'll notice it has far lower emissions than the VM with eight vCPUs, because more vCPUs means more energy to run them. As load increases, the growth is again proportional but not linear, and we got to the point where the smaller node flatlined at 100% utilization. We don't recommend this for any workload, ever; this was just an experiment. The bigger node still had a lot more room to grow, so as its utilization increased, its energy consumption could still climb. And again: please don't flatline your nodes at 100% utilization.

Next we wanted to see how autoscaling would change the picture. As you know, autoscalers are well recommended and widely used in Kubernetes, both for scaling up to handle load and for scaling down to avoid wasting resources, and the autoscaler typically makes these decisions based on the resource requests specified in your manifests. So we enabled autoscaling and set the CPU resource request on the load generator deployment to 300m, because that's what Google specified, and we figured they know their application better than we do; let's go with the default. Then we ran the exact same experiment.

These were our results. The bigger nodes are green and the smaller nodes are blue. The small node pool scaled up to 5 nodes and the large pool only to 3, with total energy at about 5 watts for the small nodes and about 3 for the large ones, whereas before, the whole thing only reached about 1.6 watts. So requesting more resources than you use drives up energy consumption, especially since CPU utilization was only sitting at about 20% to 30%, leaving a lot of room for growth.

So then we thought: okay, let's set the resource request to 100m and see how the autoscaler manages. As you can see, it performed a lot better. Far less energy was consumed, the CPUs were utilized much more, at 50% to 60%, only the small node pool scaled up, to 2 nodes at about 2.5 watts, and the big nodes never even had to scale.

So with that: utilize your nodes more. Take the time to analyze your workloads and understand your actual resource requirements. Don't just throw resources at your clusters because you can; really take the time to use them effectively and get more value out of the energy you're consuming.
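For reference, tuning that is a one-line change in the deployment manifest. A sketch of the shape; the load generator in Google's microservices demo is a deployment named loadgenerator, but the image reference and memory figure here are stand-ins:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: loadgenerator
spec:
  selector:
    matchLabels:
      app: loadgenerator
  template:
    metadata:
      labels:
        app: loadgenerator
    spec:
      containers:
        - name: main
          image: loadgenerator   # stand-in; use the demo's real image
          resources:
            requests:
              cpu: 100m          # was 300m, the upstream default
              memory: 256Mi      # hypothetical; tune from observed usage
```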
I'd also just like to point out: when you utilize your nodes more, you get the exact same performance, your applications run just as nicely, but you're using about half the energy you would in an unoptimized environment. Fewer nodes, less carbon; let's hit that home.

Our research is ongoing. If you have any feedback, comments, or questions, please reach out; those are all our handles and LinkedIn profiles, and we're a friendly bunch no matter how we look. Resync.com is our company, and we have a blog there, plus my personal blog, GreenCoder.io, where we're just trying to get this information out so everybody can take control of their climate-related impact. We'd love to collaborate and answer any questions. And a quick thank you: talks are quite intense, so thank you to Anna and Jessica for all their help, and to our team at Resync. And that's it.

Hi. The number you put up, the 5% to 10% of the world's electricity for the IT sector, sounds very high. Do you know the source of that data?

I actually got that from the European Parliament's website, I believe. We have links for any information we've shared, and we will share the slides.

Okay, that would be great. There are multiple sources, sometimes Nature papers, saying different things, so that would be really helpful. Thank you.

Hi, you mentioned your tool, Cloud Carbon. Could you expand a bit more on the shortcomings of Kepler and why you needed the other tool? And is the tool open source, on GitHub or somewhere?

To be frank, we don't actually see any shortcomings in Kepler. Kepler measures Kubernetes clusters: namespaces, containers, pods. What we found, especially in a cloud environment, is that this doesn't necessarily cover all the underlying Linux processes that are also running there. We wanted a bigger overview, including any overhead you need to run your workloads, and that's why we created Cloud Carbon. And on the name, it was the CNCF TAG Environmental Sustainability that mentioned it was a shit name, excuse my language. But yes, it's 100% open source, all of our data is there, and we've really tried to make it as transparent as possible in case anybody wants to reproduce what we've done.

Thank you very much for a great talk. You mentioned time shifting workloads to a more optimal time, like later in the evening, but in our case that would sometimes mean not scaling down to zero, which could be a bad thing. Do you have any numbers on that, or how to calculate the trade-off?

I think the key here is measuring your workload's energy consumption; Kepler is a great project for that. Then do the research on how much the VM actually uses at rest versus what scaling your system up costs. The optimal solution, and I am going to say it, is to just turn everything off. Companies don't like that answer, so it is something we're actively looking into; we're going to take a lot of this research and turn it into blog posts, and this is definitely one of the topics. Thank you.

Done? Yeah? Yeah. I would like to know if you've started any reflection on useful workload. What I mean is that most of the time you have, let's say, your application workload and then the rest, meaning metrics and things like that. Is there any indicator you're reflecting on that captures the percentage of useful workload?

Business value versus business support, yeah. We have started looking into this, and I think there's actually quite a big push to look at it, for example condensing modern tooling down to one or two agents per node, so you're not running five agents for your logging, your telemetry, et cetera. It's definitely an indicator we're looking at, but unfortunately business support and business value go hand in hand, which makes it hard to put numbers on. In general, I think open source tooling needs to come together and lessen its footprint; at scale, that could be the best option for this specific problem.

Yeah. For the experiments that you ran, what OS did you use on the nodes? I'm wondering whether the big chunk of power you saw even without workload had to do with the operating system's own resource usage, and whether you did any research on changing the operating system to find one that's less energy-hungry.

We haven't done any research in that direction. That would be really fun; I want to try that next, maybe. I think we used Google's Container-Optimized OS, COS, with the containerd runtime, the read-only-filesystem one that essentially just runs the containers and nothing else. I forget the exact name, but that's a really good question; I'll get back to you.

Hi, thanks for the great talk. I have a question regarding the approach of shifting workloads to, say, Sweden. Do you have any take on automating that process? It sounds hard to do manually, with provisioning resources at your cloud provider, savings plans, and so on.

The Green Software Foundation actually has a C# / .NET library, I think it's called the Carbon Aware SDK, for incorporating this, and there's a lot of research around it. It's still in its early stages, but it is seeing traction, especially with the European Parliament's new laws coming in; companies are going to want software that can reduce its carbon intensity. So I hope that answers your question.

Okay, thanks. It sounds like the cloud provider should be in charge of this, though; I think we need those features integrated, because if every Kubernetes user in the cloud has to develop their own solution, that would be very hard.

100%, and again, calling on all cloud providers to be more transparent with their data.

Thanks. Cool. Good. I know it's lunchtime, so we can go eat. Thanks for coming, and thanks for listening.