All right, and thank you, everybody, for joining and for taking the time. I hope you're all having a great QCon so far. Today I will talk about energy and clouds, an emerging field that is very new and very exciting. I hope you all get something out of it; maybe you can take some of the learnings back to your companies and your projects. We will see. You don't need to be an expert, and you don't need to know much about resource consumption in the cloud: this talk is meant as a proper introduction to what it is all about. We will explore the layers one by one, because I think it makes a lot of sense to look at it that way, and at the end we will reach the cloud layer and figure out what our requirements in the cloud are when it comes to resources and energy. My name is Leo. I'm one of the chairs of the CNCF Technical Advisory Group for Environmental Sustainability. You may have seen us: we have a booth, and there was also a maintainer talk earlier today. If you'd like to chat with us, the booth is still open tomorrow, so come stop by; it would be great to talk to you. So, first things first, the obvious question: why should I care about resources? Why should I care about energy? I think this is a very common pattern in software: we are the software people, and we don't really care about the hardware side of things, right? Data centers, resources, all of this: we chose to build software precisely so that we don't have to be involved too much with hardware. That was also my mindset when I started building software. I chose this path because I like software engineering, and I'm not a hardware guy.
I'm a software guy. But over time, I think this attitude, being so disconnected from the hardware and the resources, will at some point bite us a little. It's going fine right now, but we should be a bit more aware of the resources we consume, especially when we build innovative technologies like blockchain or AI, which are very good in a lot of respects but terrible in terms of resource consumption. It would be great to keep inventing those tools, just in a more resource-efficient way. In general, to move this space forward: we will not get rid of software. We will build more software, and that software keeps increasing in complexity and in resource usage, and eventually this will collide with the resources available, right? If we build more and more software and each piece consumes more resources, it would be better for that growth to plateau at some point. So I would say software engineers in the future should be aware of the resource consumption of the software they develop. The only reason we have hardware is to run software, so there is a direct dependency between the software we build and how we utilize the hardware. There are also more reasons to care about energy, for example the growth of renewable energy. Renewable energy mostly depends on the weather, right? Whether the sun is shining, whether the wind is blowing. Weather is a chaotic system, so we cannot really know how much energy we will have in a given month. Maybe we will have a lot of excess energy, and it would be good to make some use of it, right?
So maybe we can stop some processes, or start them. Just by knowing the resource consumption of our software, we can run it hardest when more resources are available to us and stop it when there aren't enough. Some industries require a steady power load, but we can simply turn off software and servers, right? The hardware does not explode if we don't put it under power for a minute. So we have the possibility of being dynamic in how we react to the energy grid. I found this in a study. The study is a couple of years old, so the metrics have probably changed a little, but I think it illustrates the invisibility factor very well: we don't really have a feeling for the resources we consume. For example, Google Docs in this case consumes a lot more energy than LibreOffice, and there are good reasons for this: Google Docs has features such as syncing with the cloud, backups, and spell checking that an editor like gedit just doesn't have. But the point is that we, as users and as software engineers in general, don't really have a feeling for resource consumption. If you use Google Docs, you don't know that you may be consuming double the energy you would with a different program, right? You can apply this to any other service you develop, but I think this example is very illustrative. And this is obviously not a new thing, right? Bloated software: this is a quote from a paper written about 30 years ago, which said that 25 years earlier, so more than 55 years ago from today, text editors with similar features were built in just 8,000 bytes, while by the time the paper was written they were already about a hundred times as big. And this has kept increasing over the years, right?
So we use more resources. Obviously, we also solve more complex problems, and you can definitely make the argument that more complex problems require more resources, right? But in the end we need to keep an eye on resources; otherwise it may get a little out of hand. So the question is: what is our job as cloud native engineers? We are not hardware people; we are cloud engineers. How does this affect us, and how can we deal with this emerging field? I like to look at this problem in terms of abstraction layers. Just to recap: we are basically right at the top, right? Below us there are a lot of dependencies, abstraction layers, and APIs we need to dig through, and one of the challenges with energy and resource consumption is that all of this information sits at the very bottom. We do not have it already exposed as an API at the node level; we need to dig through a lot of layers to reach the source where the metrics are collected. So now we will move from level to level and try to understand the obligations and requirements of each one, and at the end we will reach the cloud side of things and see where we end up. At the start we have the hardware level. Nothing too exciting: we all know CPUs, RAM, and the other hardware components in a system, and there are a lot of different vendors that make those devices. Over the last ten years, or even earlier, most of these vendors have exposed APIs that move information about energy consumption up the stack, so that the operating system and user applications can know how much energy we consume. So there are APIs, they are quite good, and we can use them.
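One concrete example of such a vendor interface is Intel's RAPL, which Linux exposes through the powercap sysfs tree as a cumulative energy counter in microjoules. The sketch below is a minimal illustration, assuming a Linux host with the `intel_rapl` driver loaded and read access to `/sys/class/powercap/intel-rapl:0` (on many kernels this requires root); the domain path and the helper names are illustrative, not part of any tool mentioned in the talk. Because the counter wraps around at `max_energy_range_uj`, two readings have to be subtracted with wraparound in mind.

```python
from pathlib import Path
from typing import Tuple

# Linux powercap location of the first RAPL package domain.
# Assumption: the intel_rapl driver is loaded; the exact path differs per machine.
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def read_counter(domain: Path = RAPL_DOMAIN) -> Tuple[int, int]:
    """Return (energy_uj, max_energy_range_uj) for a RAPL domain."""
    energy = int((domain / "energy_uj").read_text())
    max_range = int((domain / "max_energy_range_uj").read_text())
    return energy, max_range


def energy_delta_uj(before: int, after: int, max_range: int) -> int:
    """Energy consumed between two readings, handling counter wraparound."""
    if after >= before:
        return after - before
    # The counter passed max_energy_range_uj and restarted from zero.
    return (max_range - before) + after


def average_power_watts(before: int, after: int, max_range: int, seconds: float) -> float:
    """Convert a microjoule delta over an interval into average watts."""
    return energy_delta_uj(before, after, max_range) / 1e6 / seconds


if __name__ == "__main__":
    import time

    try:
        e0, max_range = read_counter()
        time.sleep(1.0)
        e1, _ = read_counter()
        print(f"package power: {average_power_watts(e0, e1, max_range, 1.0):.1f} W")
    except OSError as exc:  # missing driver, no RAPL support, or no permission
        print(f"RAPL powercap interface not readable here: {exc}")
```

Keeping the arithmetic in pure functions separate from the file reads makes the wraparound logic testable on machines without RAPL.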
So that's all very good news. But there's also one catch, and I think it's very interesting to point out: all those APIs are based on assumptions. Unless you have a dedicated hardware device, like a card you plug into your PC or a meter on the power outlet that gives you exact numbers, everything on the software side is based on approximations. For the most part these are pretty good, so we don't need to worry about them too much. But if we look, for example, at one of those APIs from NVIDIA, for some resources there is quite a big error margin: sometimes an error of five percent. So if you measure energy with their tool, you may record five percent more or less energy than is actually being consumed. If you think about this at a larger scale, say big machine learning models running not on one card but on hundreds or thousands of cards, five percent is quite a lot. But those tools and APIs are getting better, and the good thing is that we are in the cloud, so we don't need to worry about this too much. That's more a job for NVIDIA and Intel at the very low level, integrated into their hardware. We can use those APIs, and for the most part they are quite good; this is maybe one exception. At the next level we have the operating system. Just to recap, the hardware level was very much about moving information up, not about acting on the data, right? The operating system is different, because it consumes this information and actually does something with it, and we will see this pattern again at the next levels. The operating system does either static or dynamic power management. If I close my laptop, it will go into sleep mode, and, for example, the LCD display will turn off.
It will save energy. So there are some things the operating system does to mediate between the hardware's capabilities and the applications' requirements. There's also thermal power management, to ensure the hardware doesn't burn out, and so on. The operating system tries to make the best of the resources given the requirements we set on the software side: if we deploy more applications, it will try to give each of those processes a fair share of CPU time, right? The next level is applications, and that's slowly getting into the territory where we develop software ourselves, right? Any system we run, including in the cloud, ends up here at some point: the Kubernetes control plane is also running on some node, right? So now it's getting more interesting. You can draw a line around applications like Scaphandre, which I will show on the next slide, which are about collecting all this information, refining it, and maybe matching it with power grid data to estimate how much CO2 or other emissions you produce. So it's about refining all this data and integrating with different APIs depending on which operating system you are on, and there are a bunch of applications that do this. One of them is Scaphandre, a very nice tool; I like it a lot, and there's a link if you'd like to check it out. Basically, you can deploy Scaphandre on Linux or Windows, and it checks which hardware resources you have, so you don't need to care about connecting to those low-level APIs yourself. All of that is taken care of: you just run it and you get metrics. That's the level we have at the user level. And then the next one is, right, the cloud.
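The step of matching energy readings with power grid data boils down to a unit conversion: joules to kilowatt-hours, multiplied by the grid's carbon intensity. The sketch below assumes the carbon intensity (in grams of CO2-equivalent per kWh) is supplied from outside; in practice it varies by region and by hour and would come from a grid-data provider. The function names are illustrative and are not Scaphandre's API.

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 3.6 MJ


def joules_to_kwh(joules: float) -> float:
    """Convert an energy reading in joules to kilowatt-hours."""
    return joules / JOULES_PER_KWH


def emissions_grams(joules: float, grid_intensity_g_per_kwh: float) -> float:
    """Estimate CO2-equivalent emissions for an energy reading.

    grid_intensity_g_per_kwh is an assumed input; real values change over
    time and by region.
    """
    return joules_to_kwh(joules) * grid_intensity_g_per_kwh


# Example: a process drawing an average of 25 W for one hour on a 400 g/kWh grid.
energy = 25 * 3600  # 90,000 J
print(f"{joules_to_kwh(energy):.4f} kWh, {emissions_grams(energy, 400):.1f} gCO2e")
# prints: 0.0250 kWh, 10.0 gCO2e
```

The same energy figure can produce very different emissions depending on when and where it is consumed, which is what makes the scheduling ideas later in the talk possible.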
So the question is: what are our obligations in the cloud? Say we have taken care of all of this: we surface the information from the hardware level, the operating system manages the processes, and we transform all this information. What is the requirement for clouds? I think for the cloud we again have the same things as at the user level: we want to collect information from all the nodes we have, we want to refine it, and we want to map it, maybe in terms of how much energy each part consumes. But we also take action: we want to schedule resources, maybe when there's more energy available, or in different regions that have a better CO2 footprint, things like this. We scale down, we scale up. So we are also taking action at the cloud level. On a single machine that job is usually done by the operating system, and as we all know, Kubernetes can be considered the operating system for the cloud. So we need to take action at the cloud level, because it is an abstraction layer where an individual machine simply cannot act: we need a higher-level point of view to actually know the utilization. So what is the current state of the art in the cloud? One example is Kepler, which is a very nice tool. It's a CNCF sandbox project; it was donated, I think, last year, and it gives much more refined information. It's native to Kubernetes, so you can deploy it and get information at the process level, the pod level, and the node level. So it basically ticks off the first checkbox we talked about: moving this information up and refining it, so that I can ask, okay, I have this pod, how much energy does it consume?
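The "take action" part can be illustrated with a toy carbon-aware placement decision: given an hourly carbon-intensity forecast per region, pick the region and start hour whose window has the lowest average intensity for a job of a given length. This is a minimal sketch under stated assumptions: the region names and forecast numbers (gCO2e/kWh) are hypothetical, and a real scheduler would also weigh latency, data residency, capacity, and cost.

```python
from typing import Dict, List, Tuple


def best_window(forecast: List[float], hours: int) -> Tuple[int, float]:
    """Return (start_index, mean_intensity) of the lowest-carbon window."""
    if not 0 < hours <= len(forecast):
        raise ValueError("job length must fit inside the forecast")
    best_start, best_mean = 0, float("inf")
    # Slide a window of `hours` entries over the forecast.
    for start in range(len(forecast) - hours + 1):
        mean = sum(forecast[start:start + hours]) / hours
        if mean < best_mean:
            best_start, best_mean = start, mean
    return best_start, best_mean


def pick_region_and_start(forecasts: Dict[str, List[float]], hours: int) -> Tuple[str, int, float]:
    """Compare each region's best window and return the overall minimum."""
    candidates = [(region, *best_window(f, hours)) for region, f in forecasts.items()]
    return min(candidates, key=lambda c: c[2])


# Hypothetical hourly forecasts in gCO2e/kWh.
forecasts = {
    "region-a": [450, 420, 300, 280, 390],
    "region-b": [200, 220, 500, 510, 480],
}
print(pick_region_and_start(forecasts, 2))  # ('region-b', 0, 210.0)
```

A brute-force scan is fine at this scale; the point is only that once energy and grid data are exposed, placement becomes an ordinary optimization over them.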
But think about it: can we tell how much energy an application we deploy consumes? If we deploy something like Kepler, we can; we have a number we can work with. But what about all the other capabilities of Kubernetes, for example scheduling, self-healing, rollouts, and rollbacks? How do they affect energy consumption and resource usage? We cannot really tell, because we don't have any metrics for this. We know this pod consumes X amount of energy, but how do all these capabilities, load balancing and so on, affect resource consumption? I would argue this is maybe the next iteration, or maturity stage, of cloud native environmental sustainability and resource consumption: that we also look at metrics that are more familiar to us as platform engineers, so that we can actually draw conclusions or drive actions. We know, okay, maybe this pod is breaking a lot, and because we need to self-heal it all the time, we waste X amount of energy, so maybe it's worth investing some time to fix it. Things like this. So how does self-healing affect energy consumption? You would need a very deep analysis to get this information; it's not easy to get right now, and the same goes for other metrics. Currently we collect metrics from all the pods and all the nodes, but those metrics, I mean, they tell a story, just not one from which we can drive a lot of action. If we want to schedule across different regions, for example, it would be good to know how scheduling affects energy consumption. So if we want to improve this, we need different kinds of metrics.
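As a rough illustration of what a "self-healing overhead" metric could look like, the sketch below multiplies a restart count by an assumed startup duration and an assumed extra power draw during startup. This is a deliberately crude, hypothetical model: the `RestartProfile` fields and the numbers are illustrative inputs, not values exposed by Kepler or Kubernetes today, and it ignores costs such as image pulls or scheduling work on the control plane.

```python
from dataclasses import dataclass


@dataclass
class RestartProfile:
    restarts: int               # e.g. a container's restart count over some period
    startup_seconds: float      # assumed time until the pod is ready again
    startup_power_watts: float  # assumed average extra power while starting


def self_healing_overhead_joules(p: RestartProfile) -> float:
    """Energy attributed to restarts under a simple linear model."""
    return p.restarts * p.startup_seconds * p.startup_power_watts


# Hypothetical crash-looping pod: 48 restarts, 30 s each, 20 W extra while starting.
crashy = RestartProfile(restarts=48, startup_seconds=30, startup_power_watts=20)
print(f"{self_healing_overhead_joules(crashy) / 3600:.1f} Wh spent on restarts")
# prints: 8.0 Wh spent on restarts
```

Even a coarse number like this turns "this pod breaks a lot" into something that can be weighed against the engineering time needed to fix it.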
We just talked about this: there are a lot of Kubernetes capabilities, or container orchestrator capabilities in general, that we can map to this logic. Here are some examples, though this is just spitballing, something I wrote down yesterday or so. We need a proper investigation of what these metrics could look like, and maybe to enhance a project like Kepler, or a different project, to get them, so that we can work with something that feels natural to us. Because if we just see energy in, I don't know, kilowatt-hours or whatever, we don't really know whether it's good or bad, or how it changes from release to release of our software. So we need to tell a better story, and maybe some of those metrics, and more, can help us achieve this goal. And now we come to the eBPF part, which is very short. Basically, none of this is developed yet: we don't have those metrics right now, and if you deploy Kepler or any other tool, you won't get them. But eBPF can be one way to get them. What we can do with eBPF is not collect more energy usage data; we don't need that, because we already have the APIs and quite a good idea of how much energy we consume. What we can do with eBPF is map that energy to those capabilities: we can split up the total energy and attribute it to, for example, how much traffic we get from certain services, or how far away a certain service is.
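The idea of splitting a measured energy total by traffic can be sketched as a proportional attribution. The sketch assumes per-service byte counts are already available, for example from an eBPF probe; both the numbers and the service names are hypothetical, and the proportional split is itself an assumption, since energy is not strictly linear in bytes transferred.

```python
from typing import Dict


def attribute_by_traffic(total_joules: float, bytes_by_service: Dict[str, int]) -> Dict[str, float]:
    """Split a measured energy total across services by their share of traffic."""
    total_bytes = sum(bytes_by_service.values())
    if total_bytes == 0:
        # No traffic observed: nothing can be attributed.
        return {svc: 0.0 for svc in bytes_by_service}
    return {svc: total_joules * b / total_bytes for svc, b in bytes_by_service.items()}


# Hypothetical per-service byte counts, e.g. as an eBPF probe might collect them.
traffic = {"checkout": 600_000, "search": 300_000, "recommendations": 100_000}
print(attribute_by_traffic(5_000.0, traffic))
# prints: {'checkout': 3000.0, 'search': 1500.0, 'recommendations': 500.0}
```

The energy total itself still comes from the existing hardware APIs; eBPF only supplies the weights used to divide it.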
Maybe we can do a DNS lookup or something like that and try to estimate, network-wise, how much a service increases the energy consumption, which is something we need to care about. So with eBPF we don't measure the energy, but we unlock more transparency into the capabilities of the cloud. If you'd like to learn more about this, there's a blog post I wrote. It's split into three parts and gives a lot more detail, so if you're interested in all those levels, there's a lot more information and a lot more sources there; I think it's a great place to start. And if you're interested in exploring this field, maybe even starting a project around exploring those APIs: as I said at the beginning, we have a Technical Advisory Group for Environmental Sustainability, and I always emphasize the technical part. We are not an activist group; we are a technical advisory group. So if you'd like to explore something like this, stop by the booth or join one of our community meetings, and maybe we can explore this field together and drive some standards. That's the end of my talk. Thank you. All right, I think we have time for questions, if there are any.

Thank you. It's a pretty nice area, energy consumption. What I want to know is: did you exclude IPMI DCMI on purpose?

Yes. I mentioned it in the blog post, but I have not talked about it here, to get to the cloud faster and keep the talk a bit more focused.

But as you just said, RAPL is a model, and it's processor-specific; it's not the real power consumption. IPMI DCMI does give the real power consumption, because every vendor has to implement it at the BMC level, and it gives a real value for the energy consumed by your node, at the node level, which is pretty cool.
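For the node-level readings from the BMC mentioned in this question, `ipmitool dcmi power reading` is the usual command-line entry point. The sketch below shells out to it and extracts the instantaneous reading. Assumptions: `ipmitool` is installed, the BMC supports DCMI, and the caller has the needed privileges; the sample output is illustrative only, since the exact layout can vary between ipmitool versions and BMC vendors.

```python
import re
import subprocess
from typing import Optional

# Matches a line such as "Instantaneous power reading:   220 Watts".
_READING = re.compile(r"Instantaneous power reading:\s*(\d+)\s*Watts")


def parse_instantaneous_watts(output: str) -> Optional[int]:
    """Extract the instantaneous node power from ipmitool DCMI output."""
    match = _READING.search(output)
    return int(match.group(1)) if match else None


def read_node_power() -> Optional[int]:
    """Ask the local BMC for the current node power draw, or None on failure."""
    try:
        result = subprocess.run(
            ["ipmitool", "dcmi", "power", "reading"],
            capture_output=True, text=True, check=True, timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return None  # ipmitool missing, no permission, or no DCMI support
    return parse_instantaneous_watts(result.stdout)


# Illustrative output; real output may contain more or differently spaced lines.
SAMPLE = """
    Instantaneous power reading:                   220 Watts
    Minimum during sampling period:                 50 Watts
    Maximum during sampling period:                476 Watts
    Average power reading over sample period:      219 Watts
"""
print(parse_instantaneous_watts(SAMPLE))  # prints: 220
```

Unlike RAPL, this is a whole-node figure, so it is useful as a ground truth to compare software estimates against rather than for per-process attribution.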
Yeah, exactly: we know the energy consumption of nodes and processes. The information we get from those APIs, RAPL from Intel and so on, does not give us anything at the process level, and it cannot, because processes are something that exists at the operating system level. So what Kepler does, for example, is use eBPF to break those numbers down into smaller ones, mapping them to how much CPU utilization each process has. So this is great. It's basically the foundation we need to drive further action, right? We need some sort of metric. But now, and that's what I'm arguing for, it's about transforming this information to tell a story and to understand how we can change the systems we build. Does it make sense to integrate this service? Do we need to relocate the service to another location? Do we even use all the capabilities Kubernetes has, or could we use something smaller, like k3s?

Thank you.

Thank you for your talk. What you should also say is that we are at the limits of our data centers at the moment. We don't get power lines: we need to plan 600-megawatt data centers in urban regions, and it's virtually impossible to get a power line there. All the American clouds that build 600 megawatts per data center face heavy conflicts locally because of water consumption and other environmental concerns. On the other side, the developers don't care. This is from personal experience: we created an application on an OpenShift cluster, and they came up with 10 HBase systems.
So everything was distributed, with huge databases. Then we tried to measure the load of this thing, and afterwards I told them: yes, HBase is a little bit exaggerated; you could simply run this on a Raspberry Pi. And this is an observation I've made: there is something like server machismo. The people who own the most resources in a team are the alpha members, and this is really driving the entire development in the wrong direction. So you should definitely plan your power consumption before your application scales. And effectively, if you hear at this conference that your GPUs run at 20% load, you are wasting a factor of five in energy, and this goes into the millions.

Yeah, thank you. Exactly: when we deploy something to the cloud, we take energy for granted. It's something the cloud provider needs to deal with; the cloud provider has to make sure there are enough resources, enough hardware and servers, and enough power connected to them. But exactly, this mindset will break at some point, infrastructure-wise. So it's good to future-proof yourself by thinking about resource consumption and being more mindful of it, and this starts with bringing up the metrics, understanding the resource patterns, and integrating a sustainability mindset into every single part of software design at the beginning, but also into maintenance and so on.

Hi. Is it already possible to relate these metrics to tracing, for instance, so that you get insight into specific parts of a request, from beginning to end and back, to see where you lose the most energy, instead of only looking at a specific microservice?

There are some tools I've seen where you can analyze:
okay, how much resource consumption does this for loop produce, for example, or things like this. In terms of tracing, I don't think most observability tools integrate something like this yet. It's something we need to do at some point, so that we better understand where the energy is going.

Okay, thanks.

Thanks for your talk. Do we have time for one more question? Yeah? All right. From what I understood, also from the two people who asked before me, we should think about energy consumption already at development time, rather than only analyzing it when it runs in the cloud. Do you know of any tools or any way we can shift this left, as with so many things, into the development lifecycle? For example, at testing time or in the CI/CD pipeline, where we could run an analysis that at least gives us a plus or a minus: this will increase or decrease your energy consumption. Do you know if there's anything like that out there?

It depends on how you approach your project, right? If you are at the planning stage, there's no software to test yet, so I'm not sure what you can do at that stage, though there probably are things. But as soon as you have something to work with, as soon as you have features, you can measure from commit to commit and see how it changes; this depends on how you cut your commits. With a tool like Scaphandre, for example, you can just deploy it and expose the energy data not only as a Grafana dashboard but also as a JSON file, a format you can work with. You can run it in your pipeline, or run it as a service and expose those metrics, and get feedback based on the commits. So that's possible.

All right. Do we have time for one more question? I think so. Yes. Thank you.
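The commit-to-commit comparison described in this answer can be reduced to a small CI gate: compare the energy measured for a candidate commit against a baseline and fail if it grew by more than a noise tolerance. This is a sketch under assumptions: it reads a hypothetical summary file of the form `{"total_joules": 1234.5}` that you would produce yourself, for example by post-processing an exporter's JSON output; it does not reproduce Scaphandre's actual JSON schema, and the 5% default tolerance is an arbitrary illustrative choice.

```python
import json
from pathlib import Path


def load_total_joules(path: Path) -> float:
    """Read a hypothetical measurement summary like {"total_joules": 1234.5}."""
    return float(json.loads(path.read_text())["total_joules"])


def energy_change(baseline_j: float, candidate_j: float) -> float:
    """Relative change of the candidate commit versus the baseline (0.1 == +10%)."""
    if baseline_j <= 0:
        raise ValueError("baseline energy must be positive")
    return (candidate_j - baseline_j) / baseline_j


def gate(baseline_j: float, candidate_j: float, tolerance: float = 0.05) -> bool:
    """Pass unless energy grew by more than `tolerance` (a noise allowance)."""
    return energy_change(baseline_j, candidate_j) <= tolerance


# Example: the candidate commit used 1320 J for the benchmark, the baseline 1200 J.
print(f"{energy_change(1200.0, 1320.0):+.1%}", gate(1200.0, 1320.0))
# prints: +10.0% False
```

Because measurements on shared CI runners are noisy, the tolerance matters more than the exact arithmetic; repeating the benchmark and comparing medians would make such a gate more trustworthy.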
Thank you for the talk. Earlier we were talking about how to estimate energy consumption before something is deployed, essentially, and I was wondering how good an approximation it is to evaluate energy consumption through resource consumption, such as CPU and memory. Does it correlate well, or is it just not the same thing?

The further you abstract, the less sure you are about the real resources, right? One of the slides, about software measurement for hardware devices, already showed an error margin of sometimes 5%. If you move the abstraction layer even further up, you get further away from those metrics. This is also what Kepler deals with, for example: in the cloud we usually have virtualization layers, and if you don't have access to the bare-metal machine, you cannot talk to those APIs, right? So you need some sort of approximation, some best guesses, and those are based on eBPF, as I said before, and they are pretty good. I don't know if there's a study on this, the same kind as the one I showed before about the error margins of nvidia-smi. It would be very interesting to see what the error margins of Kepler are when you don't use bare metal and rely on higher-level approximations. I assume they're a little worse. But yeah.

Thank you. All right, I think we can call this a talk. I hope you all have a great QCon, and see you around. Thank you.