It's very nice to see so many people joining a presentation on a relatively narrowly focused topic, namely minimizing energy consumption in bare-metal Kubernetes clusters. I'm David Miedermauer-Elli, and I work for 1&1 Mail & Media; the name doesn't immediately tell you what we do. We are Germany's largest email provider, with about 40 million active customers. I'm the lead architect for our infrastructure platform, covering the entire lifecycle of artifacts from the first line of code until it runs in production in our Kubernetes clusters. Along the way I became something of an expert in continuous integration and continuous delivery. And as you can see, I've spent quite a lot of time in IT by now. That's my history in a nutshell.

Hi, everyone. I'm Marco, and yes, I'm getting old too; I've been doing IT operations for a while. Currently I have the honor to lead a group of cloud engineers within the company who provide the Kubernetes-based platform David mentioned. But of course, today is not about me or us.

I'd like to start with a question, actually: who in the audience is running bare-metal Kubernetes? Wow, okay. That surprises me a bit, honestly. But then you're in the right talk, and maybe we can chat afterwards about our respective challenges. Today's topic is energy savings on bare metal. To give a bit more context: as David mentioned, we are the largest email provider in Germany, with over 42 million active users. That's quite a lot. In yesterday's keynote I heard someone explain why they do cloud native: it accelerates their development and helps grow the business. That reminded me that we went through the same kind of journey five, six, seven years ago. The company decided to break things down into microservices, and from there you move very easily toward a container runtime.
And so today we run a multi-tenant Kubernetes platform for internal users, on bare metal, on premises, and have been doing so since 2017, by the way. Why on premises? The short answer: because we can. The slightly longer answer: we have data protection constraints, we take security very seriously, and IONOS, the company that operates the data centers for us, is also part of the larger corporation that 1&1 Mail & Media belongs to. To give some numbers from the slides: we have about 70,000 CPU cores in our clusters, and correspondingly a lot of energy consumption. Network ingress is around 600,000 requests per second. Just to give you an idea: this is kind of huge.

This slide illustrates the actual problem with bare metal. When you have your own servers in your own data center, scaling out is not as easy as it is in the public cloud, so you usually have too much hardware in the data center. The four-step process I show here is simplified, of course, but it shows what happens when we decide we need more hardware. That is not an easy thing: we have budget constraints, we talk to the procurement department, they get vendor quotes, then the hardware arrives, with logistics involved, racking, and so on. If you run a data center, you know this. In our case it takes a few months until the boxes have arrived and are really available. And the fourth step, Kubernetes node provisioning, is the only step that my team actually controls. What this means in one simple sentence: we need enough spare capacity available in the data centers at all times.

All right, let's quickly get everybody on the same page: why are we talking about this topic at all? Well, it's trending anyway. But beyond that: save CO2, reduce your carbon footprint, help save the planet. That's the overarching story.
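The multi-month procurement lead time described a moment ago is exactly why spare capacity has to be held continuously. As a rough illustration only (the function, numbers, and safety factor are hypothetical, not the talk's actual planning model), the reserve you need scales with demand growth multiplied by lead time:

```python
# Illustrative sketch: spare capacity needed to survive a hardware
# procurement lead time. All names and numbers are invented for
# illustration, not 1&1's real capacity-planning figures.

def required_spare_cores(monthly_growth_cores: int,
                         lead_time_months: int,
                         safety_factor: float = 1.2) -> int:
    """Cores to hold in reserve so that demand growth during the
    procurement lead time never exhausts the cluster."""
    return int(monthly_growth_cores * lead_time_months * safety_factor)

# Example: 1,000 new cores of demand per month and a 4-month lead time.
print(required_spare_cores(1_000, 4))  # 4800 cores held in reserve
```

The point of the sketch is the trade-off the talk keeps returning to: shortening the effective lead time (through automation) directly shrinks the idle reserve you must keep powered.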
But then, particularly here in Europe, as you probably all know, we have had an energy crisis for the last one and a half years due to geopolitical events. Energy prices skyrocketed and oscillate wildly. I brought a little diagram, bottom right, showing German electricity prices over that period; it's sort of crazy. So any amount of energy you save also saves you a varying but sizable amount of money. Those are all good incentives to work on minimizing energy consumption, and on top of that, if you can measure and quantify it properly, it makes a very good addition to the mandatory sustainability reports we have to publish, of course.

Okay, let's jump to a very quick and simple solution for minimizing energy consumption, as the title says: no servers, no energy. Okay, thanks for being here! No, obviously energy consumption is not the only thing; there are boundary conditions that have to be fulfilled, like providing compute power to our users. So let's start thinking about what we actually need to measure and what we can optimize based on those measurements. We need to measure energy consumption with respect to various aspects. I'll go into that a bit more in a moment, but first let's think about what a KPI needs to do, so that we have a sort of recipe for constructing KPIs. They obviously need to be reliable and repeatable: when I recalculate my KPIs from last year, I need to get the same results. They also need to be robust against changes in any dimensions they are not related to. For example, when my customer base grows, I obviously need more compute power and my energy consumption goes up. A KPI that measures energy consumption per user is robust against that business growth, and shows whether you are getting more efficient with respect to users.
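A growth-robust KPI like that energy-per-user example can be expressed in a few lines. This is only an illustrative sketch with made-up numbers, not the speakers' actual metric definition:

```python
# Sketch of a growth-robust KPI: energy per active user.
# The numbers below are invented for illustration.

def energy_per_user_kwh(total_energy_kwh: float, active_users: int) -> float:
    """Energy normalized by user count: flat under proportional growth,
    falling only when the platform genuinely gets more efficient."""
    return total_energy_kwh / active_users

# If users and consumption both grow by 10%, the KPI is unchanged...
kpi_before = energy_per_user_kwh(1_000_000, 40_000_000)
kpi_after = energy_per_user_kwh(1_100_000, 44_000_000)
assert abs(kpi_before - kpi_after) < 1e-12

# ...but a real efficiency gain shows up as a lower value.
assert energy_per_user_kwh(900_000, 44_000_000) < kpi_before
```

The same normalization pattern works for other dimensions the talk mentions later, such as power per request or per product component.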
So that was just one example, and you can see it opens up the field for an entire set of KPIs, each illuminating a different aspect of your operations. If you go as abstract as I just did, it's clear that a lot of assumptions and simplifications have to go into these more abstract KPIs. But of course there are more basic ones too, so let's quickly look at how you can build up the stack of KPIs.

The baseline, for me, is the server's idle power: rack the server, plug in the power cord, switch it on, install Kubernetes, run no applications on it, and it will still have some idle consumption. I actually brought a plot: a histogram of the idle consumption across all of our servers. The numbers are illegible on purpose, but I can tell you the peak in the middle is around 220 watts or so; that's the average idle consumption of our servers. This gives you the opportunity to look into optimizing the various hardware components and configurations of your servers and tune them to save a couple of watts, and times a thousand servers that adds up quickly.

Okay, idle servers are not very useful in the end, so let's step up a little and put some load on them. The next KPI we propose is the server's power consumption under load. This opens up the opportunity to optimize CPU settings and thermal tuning, because servers get hotter when they do more work, obviously. I brought a plot here too: power consumption on the y-axis versus average CPU load on the x-axis. And the immediate thing you will probably ask is: why are there two groups, two clouds of dots?
Well, the upper left one is the servers with GPUs in them, so the work is done not in the CPUs but somewhere else; I don't have the metrics for that yet, so the dependency on CPU load is a little different from the rest. But you can see there is some correlation in the lower cloud of points: energy consumption goes up with growing load, which is what we expect, of course. The immediate follow-up question is: what other factors does it depend on? That brings me to the side topic of normalization. You probably have a mix of CPU generations and CPU models, maybe even different brands, Intel, AMD, maybe ARM; all of them behave differently, and you can tune the CPU clocks. All of that has an influence, and at some point in the future we need to disentangle it, to see not just one huge blob but a clearer correlation. That will certainly help with tuning individual machine settings.

One step up, look at cluster performance. We all have clusters, so there is a large group of servers interacting with each other, and depending on how the cluster is tuned, the power consumption can perhaps be optimized. Maybe everything gets more efficient when you distribute the load differently, compacting the load onto fewer servers and putting a few more into idle mode, or maybe by spreading it evenly across all servers. We don't know yet, but it's a question that is certainly worth answering, and it has influence on scheduling in Kubernetes.

And as the last example, we can look at the applications. Load is a nice thing, but in the end it's the applications that generate the load, so I would really love to map power consumption to individual applications, and maybe also calculate power per request processed in the various applications. That starts to build a bridge to the business view of our products, so I can talk to business people and help
them make good decisions for the products and the development, to minimize the energy consumption of the applications.

So that was a bit of theory. Of course there are projects I learned about at this KubeCon, like KEDA for scaling and scheduling, and Kepler and other projects for measuring and quantifying energy consumption; that's certainly something we'll evaluate in our next steps. But let's get away from theory a bit and look at what we can do in practice, and what can be measured with these kinds of KPIs.

All right, let's move on to some concrete things we did, and things we learned. Remember I talked about always having too much capacity available in the data center when running bare metal. I think it's important to understand what kind of reserves we actually have, and when we made up our minds about the topic, we realized there are basically three types: scale-out reserves, for business growth for example; geo-redundancy reserves; and peak performance reserves. I'll go into each of these in a bit more detail now.

Scale-out reserves: scale-out is, as I mentioned, growth in users, for example, which is great because in the end it pays my salary, so I like that. But this capacity is for future use, and essentially that means maybe we can shut some nodes down now. David said before: zero servers, zero watts. That's great. No, it's not great; it would upset our users, and they would post terrible things on Twitter and on sites like downdetector.com. But how many servers can we actually shut down? As David said, we identified idle consumption as one of the most expensive things, so we want to power down machines that are not actually used, that have zero or low usage, because the cluster-specific infrastructure you have in Kubernetes, the kubelets and all the daemon sets you see up here, consume resources and
energy in the end. So one important question is: when we need the hardware, how fast can we re-enable it? There we learned that automation, as so often, is key: the better our automation, the longer we can wait before reactivating hardware. What also helps in our setup is that we have immutable infrastructure on all Kubernetes nodes. That means no SSH onto the nodes; you cannot change anything there. You can put a node into debug mode and change things, but then you have to redeploy it from scratch into a defined state. So there is no configuration drift, no Puppet agent or anything like that, and that helps, because a freshly provisioned node has the same state as all the other nodes in our clusters.

The second kind of reserve is the geo-redundancy reserve. A bit of context is important there. The main Kubernetes clusters for our workload run in two different regions within Germany, in an active-active manner. We have no disaster-recovery data center sitting there doing nothing; we run active-active. What this essentially means is that both data centers have enough spare capacity that if one data center goes down, the other can take over the load. We have not tapped into that reserve yet; that's a plan for the future. But if we want to do it, it has some impact: we need better, faster automation, because when a data center breaks down it's a matter of minutes, and we want that spare capacity available for geo-redundancy; it's not an option to wait half a day until all the servers have spun up. What's also important is management buy-in. We need to convince C-level management, because there is a potential cost saving versus a potential risk, obviously. What helps there is creating transparency: transparency about the risk, and transparency with the kind of KPI numbers David showed about the cost savings. So
regarding the risk, what I also think is important is having all your operations teams in the company do regular emergency drills for such a worst-case scenario. That creates confidence, in the operations teams and in management.

Okay, this graph: as I said, we are a mail provider, so we have a typical daily curve. This is two days of traffic within our Kubernetes clusters, those 600,000 requests per second. During the night there is obviously much less email traffic, because people are not reading their mail as often. What that means is that it would be cool if we could take some batch load that happens during the day, machine learning jobs in our data processing divisions, for example, and do that batch processing not during the day on top of the usual user workload, but during the night, to flatten the curve a bit. That would be cool. Another thing: let's assume we really have low usage during the night. Wouldn't it be cool to just shut down servers that are not needed then? For example, shut down a few hundred machines at 11pm and power them up at 7am, when people wake up and read their emails. That would be really cool, but it means you have to be very flexible in your hardware and server management.

All right, another thing we looked into is HPA, horizontal pod autoscaling. A few years ago I thought this was only a thing for cloud providers, for public cloud users, because on bare metal the hardware is there anyway. But when the workload scales down automatically because usage is low, that would really enable us, in the end, to shut down servers as a consequence. We are also looking into the VPA recommender; that's what the screenshot shows. VPA is the vertical pod autoscaler, and it can automatically adjust the resource requests of containers based on current usage, for example over the last 24 hours. We don't want to do that; we are a bit more conservative. We want to use the recommender, and this would help our
users, our tenants, the teams using Kubernetes, to see: okay, I requested, for example, 8 CPUs for my application, but it only needs two. They get an incentive to help us save energy by reducing the actual demand, so that we can maybe remove servers.

All right, this one is interesting; it actually surprised me a lot. Some team members did some investigation of our servers. We have these one-rack-unit servers, and the fans usually spin very fast. We turned them down a bit, to spin more slowly, and this saved us about 15 watts per server. Okay, 15 watts is not much, but with 1,000 nodes it's actually about 10 megawatt hours per month, which is quite a bit. This is just an example that the small things matter too, and it was a real quick win.

All right, great. So we saw that in practice, too, measures of different levels of complexity can quickly save a couple of megawatt hours per month, or maybe a couple of hundred. As far as I understand, we are saving another 40 megawatt hours just by switching off the scale-out reserve. So it adds up; it is really worth putting in the effort.

And to close our session, I just want to take a quick look at where I want to go with evolving this scheme of KPIs, measurement, and optimization. As I mentioned, we can make the KPIs as abstract as we want and look at applications, requests, products, and product components. If we manage to map onto those, we have a link between operations, the business, and development. Just imagine, and this goes a bit beyond the topic of this talk and is very generic, of course, but imagine a developer could save 20 or even 50 percent of CPU cycles just by optimizing their algorithms and implementation. That would be 20 or 50 or whatever percent of your infrastructure; that would be huge. So talking to developers and giving them the measurements to optimize their own stuff is definitely a way to go. Creating that transparency will most likely
open huge opportunities. And I took the opportunity today to talk to a couple of members of the TAG Environmental Sustainability, so I'm definitely looking forward to getting connected with them to figure out the best way forward. That leaves us, as planned, quite some room for questions. There are two microphones, I've learned, so if you speak into a microphone, we can hear you, and the remote participants can hear you just as well.

I was wondering, you had a slide that showed a certain number of kilowatts per service. I'm curious how you calculated or measured power usage per service.

Well, thank you very much for that question; that's a pretty good one. Probably most people running bare-metal servers are familiar with the fact that most servers have a baseboard management controller, and many servers internally measure their own power consumption. That might not be too precise, but it's a relatively good measure: if the server consumes more power, that number goes up. We tap into that and read it every 30 seconds or so, so it's relatively fine-grained, and we get the total power consumption.

Okay, a follow-up question: only one service per server?

No, we run the usual mixture of Kubernetes workloads on the servers. So mapping from server-based power consumption to application-based power consumption will require some additional measurements and some assumptions, essentially dividing the total power consumption across the various applications. We have the numbers available in Prometheus, like CPU consumption per application, per pod, per whatever, so we can use that as a proxy. But I've also learned that Project Kepler digs into the actual CPU cycle counting and such using eBPF, so leveraging that is an opportunity as well. We are certainly going to evaluate the easiest, quickest, and most reliable way to do it.
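The proportional split described in that answer can be sketched in a few lines. This is only an illustration of the stated assumption (per-pod CPU usage as a proxy for power share); the function, pod names, and numbers are invented:

```python
# Sketch: split a node's BMC-reported power across pods by CPU share.
# This encodes the simplifying assumption from the talk: per-pod CPU
# usage (e.g. scraped from Prometheus) acts as a proxy for power share.

def attribute_node_power(node_power_watts: float,
                         pod_cpu_seconds: dict[str, float]) -> dict[str, float]:
    """Divide measured node power proportionally to each pod's CPU usage
    over the same measurement interval."""
    total_cpu = sum(pod_cpu_seconds.values())
    if total_cpu == 0:
        return {pod: 0.0 for pod in pod_cpu_seconds}
    return {pod: node_power_watts * cpu / total_cpu
            for pod, cpu in pod_cpu_seconds.items()}

# Example: a 300 W node where one pod used twice the CPU of the other.
shares = attribute_node_power(300.0, {"mail-api": 20.0, "spam-filter": 10.0})
print(shares)  # {'mail-api': 200.0, 'spam-filter': 100.0}
```

An obvious refinement would be to subtract the node's idle power first and attribute only the dynamic part; tools like Kepler go further still with eBPF-based cycle accounting, as the answer notes.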
Okay, we have a question here.

Yes, I'm curious, what do you use for provisioning the servers and for cluster management?

Very good question, thank you. Basically we use Flatcar Container Linux as the operating system, and, I mean, we can have a more detailed talk about all the details, but basically it's a combination of the CoreOS Container Linux mechanisms, like Matchbox, and our own company asset management, which knows which servers and which CPUs and so on we have. Then there are automated pipelines, run by GitLab CI, that configure the servers.

Do you use something like Ironic, OpenStack Ironic?

No. That's also a good question. We really run directly on bare metal; we have no virtualization layer.

Ironic can provision bare-metal servers, so...

No, no, we have a basically homegrown solution based on a bunch of scripts and the tooling that Container Linux provides.

Okay, it would be very interesting to talk more. Thank you.

You're welcome.

Maybe a very basic question here, but your servers are very big, like 800 gigabytes of memory each, and there's a limit to the number of pods per node that you can put on Kubernetes. How do you handle that? Are the pods very large, meaning they all consume a lot of memory? Or, if you have a lot of microservices, how can you concentrate thousands of pods on one bare-metal server?

Well, some of our workloads really consume a lot of CPU cores. Regarding the number of pods, my colleagues can answer that better, but I think we had a constraint on IP addresses from Calico, which we use as our SDN. I'm not sure, is the default 250? We raised that, so I think we can run about 500 pods per machine.

So I have two questions. The first is about your power usage effectiveness: do you have some numbers for that? And the other: have you tried other CPU architectures, not only Intel, switching to ARM or something like that? That could also be promising: the
same or more workload while consuming less energy.

Let me answer the second question first. We didn't look into ARM, but we have a mixture of Intel and AMD CPUs. The newest generation of servers we use is based on AMD Epyc CPUs, and we measured that they are more efficient, per core, and you could break that down per application as well. So yes, they are more efficient. What was the first question again? Sorry.

Okay, well, as Marco mentioned, the data centers are not fully under our control; they are operated by our colleagues at IONOS, so I probably can't answer the questions related to them. We are in contact, but they run the data centers, and I don't have the numbers off the top of my head. Of course they are also optimizing the data center setup as well as they can. For example, I've had discussions about the thermal setup: raising the data center temperature reduces the cooling requirements, things like that. That's being done, but our work starts where nobody needs to touch a screwdriver anymore.

One silly question: I know you have lots of spam, and it's still exploding. How much energy do you spend on filtering out the spam?

So, you saw the spikes in the daily-variation graph I showed; that's because spammers seem to use cron, hourly cron jobs. I can't really say what this means exactly in kilowatts or megawatts or whatever, but I would say it's a lot. We are not that far along with the KPIs David showed; we are in a phase where we spend a lot of time thinking about measurement, but as of today we don't have that metric available. I cannot say that this spam-filtering application causes this amount of energy. We hope we can, maybe, by the next KubeCon next year.

Yes, that's certainly the goal. As a rough number, it's a low-single-digit number of megawatt hours per month to filter out the spam, but maybe next year we can give somewhat more precise numbers.

Okay, the next question will be from that side.

Thank you very much. You touched upon some great subjects
here, and I'm really happy to discuss them further with you in the TAG Environmental Sustainability. I just want to pick up one of these subjects, the fan speed you reduced: did you also measure how much that increased the temperature of the chips, and how that affected the lifespan of the devices?

Very good question, thank you. Yes, we did measure that. I don't have the concrete numbers on top of my head right now, and okay, my colleagues don't either. We have a Grafana dashboard; I'll have a look later if you want. But it definitely didn't worsen the lifespan of the chips or the servers. We can compare a bit: our data center provider increased the room temperature in the data center, I don't know the exact numbers, but by a degree centigrade or so, and it didn't affect anything much either. Basically, you don't run your data center at 16 degrees centigrade anymore today; 23 or so is fine.

That's really interesting, because I just read an article about that, and I'd love to hear more about it in the TAG, maybe.

So the short answer: we didn't see a correlation with server lifetime because of the cooling changes.

Thank you very much. I just wanted to piggyback off an earlier question regarding pod density. I wondered if you've had any issues with storage I/O performance when running lots of pods at once on those big nodes.

Short answer: no, not yet. Long answer: most of our microservices are actually stateless, so they essentially get fed their data by network streams and reply by network streams, and the servers are equipped with 10 gigabit per node, the newer nodes with more CPU cores with 25 gigabit and an option to double that, so we are safe on that side for now. We have a couple of applications that access remote storage; that isn't supposed to be super fast anyway, and they seem to be happy with what they get. As a side remark, we will probably never put the really big storage-requirement
applications, like the e-mail store or the cloud file storage, on Kubernetes; that doesn't make sense in the multi-tens-of-petabytes range. We run those on bare metal by themselves, and that's the right way to go anyway.

Okay, we are probably very close to the end; maybe one question left. We'll stick around the entrance area afterwards, so you can have a chat with us if you like.

So the question is on the energy consumption of the network: is this factored in? Do you have any estimates, and how can you measure it, given that it's partly external to the actual machines?

That's such an excellent question; I had exactly the same discussion with David a couple of days ago. The short answer is that, unfortunately, we don't have these numbers, because network operations, like the data centers, is not within our direct influence; it sits with the company that provides the data center, and we just don't have the technology right now to combine those kinds of numbers. But my personal goal would be to get there and be able to do that as well, because it only makes sense if we have the full picture, from the hardware and network up to the application and the efficiency of the algorithms. You need everything.
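As a footnote, the fan-speed saving quoted earlier (about 15 watts per server across roughly 1,000 nodes, "about 10 megawatt hours per month") is easy to verify with back-of-the-envelope arithmetic; this sketch simply assumes a 720-hour (30-day) month:

```python
# Sanity check of the fan-speed saving quoted in the talk.
watts_saved_per_server = 15
servers = 1_000
hours_per_month = 720  # assumed 30-day month

kw_saved = watts_saved_per_server * servers / 1_000   # 15 kW fleet-wide
mwh_per_month = kw_saved * hours_per_month / 1_000
print(mwh_per_month)  # 10.8 MWh per month, i.e. "about 10" as stated
```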