Okay, it's time to start. First of all, a short introduction: I want to give all the credit to Meng and her colleague; this is their presentation. They are from IBM China and unfortunately could not come because of visa issues. I am a colleague working at IBM in Tokyo, Japan; I collaborated a little on this work, and I am the backup speaker, so I will present their work here. They also recorded a demo that I will play later.

The presentation is about real-world experience deploying a framework they developed to help companies become more sustainable in what they call a "full-stack" way: applying optimizations from the hardware infrastructure up to the application. That is the definition of "full stack" they use in this presentation.

Meng is the CTO of IBM China, and she manages the IBM Systems Labs. The second presenter is a technical solution architect who implemented and works on the framework, and most of the demo I will play shows him demonstrating how everything works. I am Marcelo Amaral, from IBM Research in Tokyo.

Now some quick motivation. IBM has run surveys with CTOs and CEOs of companies around the world. In 2021, 31% of the CEOs said sustainability was something important to tackle within the next three years; in the new 2023 survey, 51% of the CEOs say sustainability is something important to address in their companies. This is especially because of recent government demands, particularly in Europe, for companies to become more sustainable, especially for AI workloads. There is a new requirement in Europe for companies that are working with AI workloads.
They need to report their energy consumption, so that is one motivation for sustainability becoming important. The other perspective is how much energy is being spent on IT. Some analyses predict that data centers worldwide consume between 200 and 250 terawatt-hours, which is similar to what the entire country of Australia consumes. So it is very big, and something important to analyze and take care of.

Given that, IBM China, together with the Chinese government (the Ministry of Industry and Information Technology), developed guidelines for how companies, especially in China, can become more sustainable. They came up with this idea of "full stack", which means applying things from the infrastructure up to optimizations in the application. And it is not only about reducing energy consumption: sustainability, in a broader definition, means sustainability of the software as well.
So they have these guidelines; let me see if I have a pointer here. The guidelines cover several dimensions: whatever is proposed as sustainable also needs to be secure, needs to meet compliance requirements, and needs stability and reliability of the software and infrastructure. And in the end, of course, the technology needs to become green and low-carbon.

There are different phases. First, a company needs to become compliant with the guidelines, then apply optimizations to the infrastructure and applications, then transform the whole thing to be more sustainable. In the best scenario, the company can eventually lead the direction for sustainability.

For the survey they created, they interviewed more than 100 companies, trying to define the best metrics for measuring sustainability. This is not homogenized; it is still an open discussion. I think this might also be something for the open source community in the future: maybe creating an open standard for sustainability would be interesting. I understand there are some agencies already working on that, but still, every company is making its own definitions of things. So this was one of the attempts to talk with different companies and bring together some key metrics and components, which they are using in their framework.
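One metric that does have a broadly accepted definition, and which comes up later in the talk, is power usage effectiveness (PUE): the total facility energy divided by the energy consumed by the IT equipment alone. A minimal sketch (the numbers are made up for illustration):

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power usage effectiveness: total facility energy / IT equipment energy.

    A perfect data center would score 1.0; everything above that is
    overhead (cooling, power distribution, lighting, ...).
    """
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh

# Hypothetical example: 500 kWh drawn from the grid, 400 kWh reaching IT gear.
print(pue(500.0, 400.0))  # -> 1.25
```

The same ratio can be computed over power (watts) or energy (kWh) as long as both terms use the same unit and time window.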
What they propose here for this work, at a very high level, includes power usage effectiveness (PUE), which most companies are using for data centers, together with application performance metrics, because of course we need to be more sustainable while still meeting the performance requirements of applications.

Now an overview of the implementation of the framework, which goes from what they call level zero up to level four: again, from the infrastructure perspective up to what we can improve in the application. It starts with monitoring: we need to collect metrics and data from the infrastructure and the application, and enable observability on that data, making it easier for users and administrators to look at it. Then we process this data so we can manage things better and make the data center more efficient. We can do consolidation, and (this part is not in the demo, which shows other things) we can change power knobs, for example varying the CPU frequency and GPU frequency; we can save energy with such techniques.

Just to give a concrete example of what we can do to save energy: if you look at the power curve for energy consumption, when you consolidate more workloads onto a server, the total power consumption ends up a bit lower. So it is good practice, while of course respecting the performance SLAs, to consolidate applications onto the same server so that the server becomes more energy efficient in the end.

So: enable observability, do the analysis and optimization, and also create automation, like the scheduling and other power-knob changes I mentioned, to improve the energy efficiency of the cluster.

Here is a highlighted example of the demo they are going to show in this presentation. They created this full solution, with all the guidelines and all the systems connected to each other (I will talk a little more about that), and they are presenting it to clients at their facility in IBM China. This diagram is a little overwhelming, there is a lot in it, but it shows the software architecture of the solution. They have sensors connected to the data center, collecting temperature and power consumption metrics from the data center perspective. This is interesting because in a regular data center we can get energy consumption from the node, for example CPU energy consumption, but at the infrastructure level it is not easy to get. For example: what is the energy of the rack? What is the energy of the cooling system?
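Returning for a moment to the consolidation example above: the argument can be made concrete with a toy linear power model. The idle and maximum wattages here are invented for illustration; the point is that because a server draws substantial power even when idle, two half-loaded machines cost more than one fully loaded one.

```python
def server_power(util: float, idle_w: float = 100.0, max_w: float = 200.0) -> float:
    """Toy linear power model: idle power plus a utilization-proportional part.

    util is CPU utilization in [0, 1]; idle_w and max_w are hypothetical.
    """
    return idle_w + (max_w - idle_w) * util

# Two servers at 50% load each...
spread_out = 2 * server_power(0.5)   # 2 * 150 W = 300 W
# ...versus the same total work consolidated onto one server at 100% load.
consolidated = server_power(1.0)     # 200 W
print(spread_out, consolidated)
```

Real power curves are not perfectly linear, and consolidation must still respect the SLAs mentioned above, but the idle-power term is why packing workloads tends to win.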
This is normally not exposed in data centers, and they are doing it with special sensors collecting this information, and then exposing it, first to the cloud infrastructure side, and also to the software engineers, so that with such information they can improve the sustainability of the applications.

Here is another overview of the same thing. It relates to what I said: they are collecting information from the infrastructure, and they are collecting information from the application itself. They process this data using some open source tools to build the solution. The idea is that it is a product, but one that depends on open source, and with that they also propose improvements back to open source. For example, we will talk later about another project, the Kepler project, which is the one I am most involved with; I will give a presentation with more details about Kepler in two days, but I will explain it briefly later.

About their observability: the idea, again, is that there are different levels of data, exposed differently for the software engineers, the CEOs, and the CTOs, since they are interested in different levels of metrics. For the administrator, we have detailed graphs of energy consumption and CO2 emissions from the data center, the infrastructure, and the application level; then something more general for the CxOs; and views for the software engineers and reliability people who want this information per application. Finally, there is the energy consumption of the different hardware, where they can find hotspots: maybe something is consuming more energy because of some failure, so they can try to find the problems and manage them.

Okay, this is the Kepler project, the one I am most involved with. Kepler was initially created for x86, for Intel CPUs, and this collaboration with IBM China brings it to IBM Z, the mainframes that banks typically use, to estimate the power consumption at the process and container level in cloud environments with Kubernetes. It is possible to run Kepler standalone for processes, but the main use case is on top of Kubernetes, calculating the energy consumption of containers in the cloud. This collaboration leveraged Kepler to collect power metrics from IBM Z machines and then break down that machine power consumption to the containers. I have a presentation in two days about Kepler with much more detail on how everything works.

Okay, I will show the demo now.

[Recorded demo] Let us give you a quick tour first for a closer look at our data center. The first things you will notice are the primary cooling systems, including the magnetic chiller and the cooling tower for data center cooling. We also have a backup air-cooled chiller and other equipment to ensure efficient cooling operations. Inside the data center server room, various network cable trays are placed to serve different purposes: these house the networks for high-speed connections and the fiber-optic cables for storage area networks. The layout of the hot-aisle containment system greatly addresses the cooling challenge of high-performance computing equipment. This is the backbone of the data center, responsible for routing, switching, and securing all network data; it plays a vital role in ensuring seamless connectivity and safeguarding network security.

Now we will share our experience and insights on how to manage and operate the data center using open source software. First, let us introduce Dashy, our go-to navigator for the daily tools we use. It not only helps engineers categorize different tools based on their needs, but also allows monitoring and visualization of the corresponding metrics. To monitor various types of data center infrastructure and equipment, we utilize the SNMP exporter to collect operating metrics across different dimensions, such as environmental, power, and cooling devices. These metrics are then aggregated in Prometheus and Zabbix. In case problems happen, tools like Uptime Kuma and ChatOps come into play, centralizing alert notifications through PagerDuty. By integrating with ITSM platforms and enterprise asset management tools like IBM Control Desk, engineers are promptly notified based on severity and alarm rules, allowing for quick issue resolution. Real-time operational metrics from IoT devices are also fed into IBM Maximo, combined with advanced analytics and machine learning capabilities; this enables proactive fault detection and prevents application and system failures. There are also tools that simplify container deployment and management.
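As an aside, the alerting flow just described — metrics come in, rules classify them by severity, and the notification channel is chosen accordingly — can be sketched roughly. All sensor names, thresholds, and channel names below are invented; in the real setup this logic lives in Prometheus/Zabbix alert rules and PagerDuty routing.

```python
from dataclasses import dataclass

@dataclass
class Reading:
    sensor: str    # e.g. a hypothetical "cooling_tower_inlet_temp"
    value: float   # e.g. degrees Celsius

def severity(r: Reading, warn: float = 30.0, crit: float = 40.0) -> str:
    """Classify a reading against (hypothetical) warning/critical thresholds."""
    if r.value >= crit:
        return "critical"
    if r.value >= warn:
        return "warning"
    return "ok"

def route(sev: str) -> str:
    """Pick a notification channel based on severity, PagerDuty-style."""
    return {"critical": "page-oncall", "warning": "chatops-channel"}.get(sev, "none")

r = Reading("cooling_tower_inlet_temp", 42.5)
print(severity(r), route(severity(r)))  # critical page-oncall
```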
We often rely on Portainer for that, and Proxmox Virtual Environment is our trusted solution for managing x86 server virtualization environments. Ansible playbooks automate routine operational tasks such as software installation, service configuration, and problem resolution. TrueNAS helps us in managing and maintaining the storage devices, volumes, and file systems within the data center; through various storage protocols like NFS, SMB, and iSCSI, data can be easily shared and accessed. To handle the vast amount of daily logs generated in the data center, we rely on open source tools like ELK and Grafana Loki to efficiently investigate errors and warnings in the logs. Additionally, we use Grafana Tempo to trace and record the request chains between different services, enabling comprehensive monitoring and management of cross-system request flows.

We aggregate different kinds of metrics and logs from the IT infrastructure using software like Prometheus and Zabbix. Based on specific requirements, Grafana allows us to create custom dashboards to analyze the KPIs and monitor the overall business operations status. If we receive an alert from our business applications, we immediately use Instana to diagnose the problem and locate the root cause. Usually we drill down from the entry point of the application service: in the application observability dashboard, we can find the abnormal service with an alarm, conduct in-depth transaction call-chain analysis, and analyze the error logs to fix the problem. We also use the application observability dashboard for our website monitoring: by analyzing actual browser request times and page loading times, it allows detailed insight into the web browsing experience of the users, and with deep visibility into application call paths, we can easily locate the reasons for slow access and understand how to continuously improve efficiency.

We open the real-time sustainability indicators, including energy consumption, carbon emissions, and SLAs, for all layers of the entire IT operation system. We can easily check the energy efficiency data, which we also call the power usage effectiveness (PUE) of the current data center, along with the data from last year. This is calculated based on the energy consumed by IT equipment and facilities, as well as the total energy consumption of the data center. The data on the current page is displayed and calculated according to the time frame we selected, and it is constantly refreshed and changing, for example the real-time carbon emission data and how it changes during the period. The pie chart on the right clearly shows the energy consumption ratio of each component of the current data center: the blue part is the IT systems, the yellow part is the chiller and cooling tower, the green part is the indoor cooling system, and the red part is the backup cooling and circulating water pumps. If we want to learn more about the changes in the energy efficiency of each part, we can see more details for the different platforms and business applications at the bottom of the page.

This greatly helps data center operation and maintenance people understand the current status and the reasons for changes in the carbon emission and energy consumption data, so as to help us make plans to improve utilization efficiency and reduce unnecessary waste of resources. We can also understand real-time energy usage through energy flow analysis, to identify possible energy efficiency improvements. AI-based data analysis shows us the current hotspot hardware for energy usage and automatically suggests that the job scheduler suspend scheduling tasks to that hardware. Spectrum LSF provides a variety of intelligent scheduling strategies to automatically match tiered electricity prices. For example, during the time of day when electricity prices are high, the throughput of the cluster is reduced and some low-priority jobs are suspended. During the
times when electricity prices are low, the throughput of the cluster is increased and queued jobs are released for execution. In this example, the electricity price is lowest during the period from 11pm to 7am. We set the maximum number of runnable jobs in the cluster to 96, and set it to a relatively low number in the periods with higher electricity cost. The current time is 1:30am; according to the intelligent scheduling strategy of LSF, the cluster is running at full load, that is, 96 jobs are being executed. When the time is 1:20pm, the number of jobs the cluster can run is 48. Now let us take a look at the graph of the overall operation of the cluster over the past day: we can see that, according to the intelligent scheduling strategy, the throughput of the cluster matches the tiered electricity price curve, so the goal of reducing the energy consumption of the cluster is achieved.

We also leveraged the resource management tool to find suggestions for energy saving, reliability, and performance. Turbonomic provides optimization suggestions, including delete, resize, move, suspend, provision, and reconfigure, with no impact on the business applications. This can be applied to different kinds of resources: servers, storage, virtual machines, application components, and databases. For example, we can shut down an idle physical server to improve the sustainability indicators. Now let us check back on the sustainability monitoring dashboard: the energy efficiency ratio has improved, and both the energy consumption and the carbon emissions have decreased. We also use Envizi to capture and analyze the electricity and emissions data to check the performance of our work. As you can see, there is detailed data on the emissions reductions, with more detailed data such as PUE, total electricity, and IT electricity, so we can have more insight into our business workload and take more effective actions.

Sometimes we need experts to assist the data center technicians in resolving problems remotely. Maximo Assist helps reduce the time required to diagnose and repair an equipment problem. For example, with the support of augmented reality and AI-powered guidance through a knowledge base of equipment data, the on-site technicians can easily get assistance from the Assist mobile app and successfully diagnose and repair the malfunctioning cooling tower, which is a very critical facility for data center cooling.

In this demo, we are going to show, on IBM LinuxONE, how we can do finer-grained power monitoring. First, we deployed our benchmark application, Robot Shop, into the OCP (OpenShift Container Platform) or Kubernetes cluster. On this dashboard, the first part is about application performance: we sent over 300 concurrent requests per second into the environment, and we got around a 320 millisecond response time; in the last 15 minutes, we sent over 270,000 requests and saw over 2,000 failures. We can also show the detailed response time and throughput for each endpoint. In the following sections, we show the total carbon emission, which is calculated using carbon emission factors: based on the different parts of the world, these give the coefficient for how much carbon emission corresponds to the electricity consumed. Here we use the United States numbers, distributed across coal, petroleum, and natural gas, so we can understand the projected carbon emission per day.

[Back to the live talk] I will stop here, so we have one minute for questions, if someone wants to ask something. ... Right, that is a good question. There is actually a section here showing that; let me see... yes, it is here.
Oh, here it is. So it is the Kepler project that is actually monitoring the energy consumption. Basically, energy consumption is proportional to resource utilization, so if a monitoring tool has high resource utilization, it impacts performance and consumes energy proportionally. We developed the Kepler project to keep its resource utilization as minimal as possible. Everything depends on scalability and the size of the cluster, because it can get more intensive; in this demo there are only a few nodes, it is not like we are testing on 500 nodes and measuring the energy consumption of everything. But we can see here that, compared to the other workloads, the energy consumption of Kepler is much lower, almost minimal. It is a good point, though; we should do that analysis at larger scale to see what the impact would be. Yes, it needs to be accounted for. I am more involved with Kepler itself, not the other parts, but I think that should also be measured and reported, of course.

Sorry, I forgot to repeat the question: someone asked what the energy consumption is of the tool that measures the energy consumption of the system, and what its overhead is. I said that Kepler has low overhead, because its resource utilization is low, so its energy consumption is also low. Someone else asked whether it is applicable to OpenShift: it is for Kubernetes environments, so it can be Kubernetes or OpenShift. And yes, reducing the energy consumption also reduces the CO2 emissions, but the perspective here is improving the energy efficiency of data centers and applications. Of course, the power source can differ: there is oil-based energy, but there can also be a more sustainable power source, and as they showed, scheduling applications at different times of the day can make use of different power sources. We have this perspective.

Yeah, sure. Oh, someone asked if it is integrated with OpenTelemetry. Right now we are just exporting Prometheus metrics, but the idea is to also connect to OpenTelemetry.

Okay, I am going to close then; I think I am on time. Thank you very much for the attention, and if you have any other questions, please try to reach me or the presenters. I have a more detailed presentation about Kepler; if you have time, please join. Thank you.
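As an addendum, the attribution principle described in the Q&A — node power split across containers in proportion to their resource utilization — can be sketched in a few lines. This is only the general idea, not Kepler's actual model (which works per power component and can use trained estimators on platforms without direct measurements); all numbers are hypothetical.

```python
def attribute_power(node_power_w: float, cpu_usage: dict[str, float]) -> dict[str, float]:
    """Split a measured node power draw across containers proportionally to
    their CPU usage. cpu_usage maps container name -> CPU time in any
    consistent unit; only the ratios matter."""
    total = sum(cpu_usage.values())
    if total == 0:
        return {name: 0.0 for name in cpu_usage}
    return {name: node_power_w * use / total for name, use in cpu_usage.items()}

# Hypothetical CPU-time samples for three containers on a 200 W node.
usage = {"web": 300.0, "db": 150.0, "batch": 50.0}
print(attribute_power(200.0, usage))
# -> {'web': 120.0, 'db': 60.0, 'batch': 20.0}
```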