Okay, I guess we can start. Hello everyone, welcome to this SustainabilityCon session. I'm Chen, Chen Wang. The talk was mainly prepared by Fan Jingmeng and Hua Yif from IBM China, but unfortunately they cannot attend this session due to visa issues, so I'm here to help them deliver it. If you have questions, feel free to reach out to them directly. By the way, I also work in sustainability, so I'm interested in sustainable computing as well. Fan Jing is the CTO of the IBM China Systems Lab, and she is now leading full-stack sustainability optimization and technical innovation in IBM China. Hua Yif is the senior technical solution architect for full-stack sustainable optimization in computing.

So, sustainability is a rapidly growing area of focus for many organizations. The IBM Institute for Business Value conducted a survey asking CEOs about the business challenges that kept them up at night, and 51% of them named sustainability as their greatest challenge for the next three years; that number has risen from 32% since 2021. So in order to make an impact and reach the goals they have set for themselves, companies will have to rely on new technologies and transformations.

If we look at data center energy consumption, it's around 200 to 250 terawatt-hours and rising, which is roughly the same as the total electricity usage of Australia. So the solution seems pretty obvious: companies have to find ways, using technology, to reduce their energy consumption, make their infrastructure more efficient, and make their data centers greener. So how can you improve the efficiency of your infrastructure, reduce your energy consumption, and have a direct impact on your sustainability goals? Here we will share some lessons and practices from IBM, built by leveraging all kinds of open source tools.

Okay, so before we start talking about the technical details and the solutions, we want to answer: what is sustainable computing? IBM and CAICT, the China Academy of Information and Communications Technology, a think tank under the Ministry of Industry and Information Technology, together published a sustainable computing blue paper early this year. It was mainly led by Fan Jing, and you can see her picture here. In this blue paper, IBM and CAICT define sustainable computing as a new computing model that takes environmental friendliness as its objective, with IT software and hardware, supporting facilities, and business applications as its key elements, and that aims to be green, efficient, reliable, and secure as the key measurable aspects of sustainable computing.

In the blue paper they codify the key concepts into a three-dimensional cube model. The front side shows the different stacks of the system, the technical components within sustainable computing, such as sustainable business applications, the sustainable cloud computing platform, and sustainable computing infrastructure. The top side shows the four-dimensional measurement system of sustainable computing: you need to achieve security and compliance, you need to achieve stability and reliability, you need to have efficient operations, and you want to achieve the greenness and low-carbon objectives.
The right side shows how organizations implement the road map along the maturity of sustainable computing, in four stages: first you need to achieve compliance, then optimize energy efficiency, then transform, and then the organization can choose to lead the whole industry. The blue paper is available in Chinese from the link shown here, and they are also working on an English version, which is still in progress.

Okay, so in addition to the concept and definition of sustainable computing, the blue paper also gives some details about the technical architecture and the key components you probably want to include in your organization's sustainability objectives. We start from the top: we believe sustainable computing should align well with the strategy and sustainable development goals of your organization. Then we should have sustainable business applications to support that strategy and those goals; these can include an ESG management system and line-of-business management systems, covering core business systems, business operations, supply chain, carbon emission management, green energy transformation, and so on. Further, we should have a sustainable cloud computing technology platform to deploy those business applications; it can include hybrid cloud, the AI platform, IoT, blockchain, and big data platforms. Besides that, full-stack security, compliance, and operations management platforms should also be considered part of the sustainability goals. Finally, we need a sustainable computing infrastructure to support this platform, which includes the sustainable IT infrastructure, the environmental infrastructure, and key enabling technologies. So this is the whole stack of the architecture in terms of sustainable computing.

The next aspect is how these stacks can be implemented. To accelerate the implementation of sustainable computing,
the blue paper gives a four-stage method. It gives an organization the option to select the right goal to meet its overall sustainable development goals. The four stages are comply, optimize, transform, and lead. In "comply", we want to ensure organizations meet the mandatory compliance requirements, for example the PUE objective, resource utilization, and the service level agreements of their business applications. "Optimize" means taking sustainability as a goal to optimize the organization's operations, improving internal business operation efficiency and the supply chain to reduce carbon emissions. At the "transform" stage, you want to leverage this opportunity to build new capabilities and transform the organization's business toward its sustainability goals. The organization can then choose to "lead" the whole industry, driving the whole value chain and industry to take advantage of sustainability in the market and lead the competition. It is not necessary to implement these in order; organizations can decide which stage they want to achieve and define the strategy, road map, and actions to get there.

To validate this concept, they developed an integrated solution, a full-stack sustainability optimization platform. It is fully data driven: they collect real-time streaming data from the full stack, from layer zero, which is the building and assets, up to layer four, which includes all the business applications. They then apply observability systems, many of them from open source, to collect the monitoring data, transaction tracing data, IoT data, energy consumption data, and so on. They consume that data to analyze the energy flow and identify energy consumption hotspots, resource utilization where more improvement and optimization is possible, potential failures and anomalies, and even non-compliant security issues. Finally, they apply automation to act on the identified optimization opportunities, for example scheduling workloads in a certain way to save power, remediating identified issues, and maintaining equipment with AI and even augmented reality support, to make all the optimization easier.

They deployed the solution in the IBM China Systems Center for daily operations; you can see the whole overview of their Systems Center here. They have improved the SLA and saved about 15% of cost by applying the solution in their daily stand-up meetings, where it helps them identify optimization opportunities, and it is open to customers for demos, testing, POCs, pilots, and co-creation.

They also prepared a very nice demo. This is the whole platform stack: it includes a lot of open-source software and commercial applications to create this unique and powerful solution that caters to specific needs, and you can enjoy the benefits of community-driven development, cost savings, customizability, scalability, and comprehensive support thanks to those open-source tools.

So next I'm going to play their demo video. Could you help with the blue screen? Oh, cool.
I need to hold it. Okay.

For data center cooling, you also have backup air-cooled chillers and other equipment to ensure efficient cooling operations. Inside the data center, various network cable trays are placed to serve different purposes; these house the network cables for high-speed connections and the fiber cables for the storage area networks. The layout of the hot-air containment system greatly addresses the cooling challenges of high-performance computing equipment. This is the backbone of the data center, responsible for routing, switching, and securing all network data; it plays a vital role in ensuring seamless connectivity and safeguarding network security.

Now we will share our experience and insights on how to manage and operate the data center using open-source software. First, let's introduce Dashy, our go-to navigator for the daily tools we use. Dashy not only helps engineers categorize different tools based on their needs, but also allows for monitoring and visualization of the corresponding metrics. To monitor the various types of data center infrastructure and equipment, we utilize the SNMP exporter to collect operating metrics across different dimensions such as environmental, power, and cooling devices. These metrics are then aggregated in Prometheus and Zabbix. When problems happen, tools like Uptime Kuma and ChatOps come into play, centralizing alert notifications through PagerDuty. By integrating with ITSM platforms and enterprise asset management tools like IBM Control Desk, engineers are promptly notified based on severity and alarm rules, allowing for quick issue resolution. Real-time operational metrics from IoT devices are also fed into IBM Maximo; combined with advanced analytics and machine learning capabilities, this enables proactive fault detection and prevents application and system failures.

To simplify container deployment and management, we often rely on tools like Portainer. Proxmox Virtual Environment is our trusted solution for managing x86 server virtualization environments. Ansible playbooks automate routine operational tasks such as software installation, service configuration, and problem resolution. TrueNAS helps us manage and maintain the storage devices, volumes, and file systems within the data center; through various storage protocols like NFS, SMB, and iSCSI, data can be easily shared and accessed.

To handle the vast amount of daily logs generated in the data center, we rely on open-source tools like ELK and Grafana Loki to efficiently investigate errors and warnings in the logs. Additionally, we use Grafana Tempo to trace and record the request chains between different services, enabling comprehensive monitoring and management of cross-system request flows. We aggregate different kinds of metrics and logs from the IT infrastructure using software like Prometheus and Zabbix. Based on specific requirements, Grafana allows us to create custom dashboards to analyze the KPIs and monitor the overall business operations status.

If we receive an alert from our business applications, we immediately use Instana to diagnose the problem and locate the root cause. Usually, we drill down to diagnose the problem from the entry point of the application service. In the application observability dashboard, we can find the abnormal service with an alarm, conduct in-depth transaction call chain analysis, and analyze the error log to fix the problem.
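To make the SNMP-to-Prometheus pipeline just described a bit more concrete, here is a minimal sketch of how the collected power readings could be pulled back out of Prometheus and totaled per equipment category. This is only an illustration: the Prometheus URL, the metric name snmp_power_watts, and the category label are assumptions, and the real names depend on how the snmp_exporter modules and relabeling rules are actually configured.

```python
# Minimal sketch (assumed names): total SNMP-collected power readings per
# equipment category via the Prometheus HTTP API.
import requests

PROM_URL = "http://prometheus.example.com:9090"  # assumed endpoint

def query(promql: str) -> list:
    """Run an instant query against the Prometheus HTTP API."""
    resp = requests.get(f"{PROM_URL}/api/v1/query",
                        params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

def power_by_category() -> dict:
    """Sum instantaneous power draw (watts) per equipment category.
    The metric and label names here are hypothetical examples."""
    results = query("sum by (category) (snmp_power_watts)")
    return {r["metric"].get("category", "unknown"): float(r["value"][1])
            for r in results}

if __name__ == "__main__":
    for category, watts in sorted(power_by_category().items()):
        print(f"{category:>15}: {watts / 1000:.1f} kW")
```

A Grafana panel on the dashboards mentioned here would run essentially the same kind of PromQL query behind the scenes.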
We also use the application observability dashboard for our website monitoring. By analyzing actual browser request times and page loading times, it gives detailed insight into the web browsing experience of the users and deep visibility into application call paths, so we can easily locate the reasons for slow access and understand how to continuously improve efficiency.

We open the real-time sustainable development indicators, including energy consumption, carbon emissions, and SLAs, for all layers of the entire IT operation system. We can easily check the energy efficiency data, also called power usage effectiveness (PUE), of the current data center, along with last year's data. This is calculated from the energy consumed by IT equipment and facilities as well as the total energy consumption of the data center. The data on the current page is displayed and calculated according to the time frame we selected, and it is constantly refreshed, for example the real-time carbon emissions of the data center and how they change during the period. The pie chart on the right clearly shows the share of energy consumption of each component of the current data center: for example, the blue part is the IT systems, the yellow part is the chiller and cooling tower, the green part is the indoor cooling system, and the red part is the backup cooling and the circulating water pumps. If we want to learn more about the changes in the energy efficiency of each part, we can see more details related to different platforms and business applications at the bottom of the page. This greatly helps data center operation and maintenance people understand the current status and the reasons for changes in carbon emission and energy consumption data, so as to help us make plans to improve utilization efficiency and reduce unnecessary waste of resources. We can also understand real-time energy usage through energy flow analysis to identify possible energy efficiency improvements.

AI-based data analysis shows us the current hotspot hardware for energy usage and automatically suggests that the job scheduler suspend scheduling tasks onto that hardware. IBM Spectrum LSF provides a variety of intelligent scheduling strategies to automatically match tiered electricity prices. For example, during the times of day when electricity prices are high, the throughput of the cluster is reduced and some normal- or low-priority jobs are stopped; during the times when electricity prices are low, the throughput of the cluster is increased and more jobs are scheduled for execution. In this example, the electricity price is lowest during the period from 11 pm to 7 am, so we set the maximum number of runnable jobs in the cluster to 96, and set it to a relatively low number in the periods with higher electricity cost. The current time is 1:30 am; according to the intelligent scheduling strategy of LSF, the cluster is running at full load, that is, 96 jobs are being executed. When the time is 1:20 pm, the number of jobs the cluster can run is 48. Now let's take a look at the graph of the overall operation of the cluster over the past day. We can see that, according to the intelligent scheduling strategy, the throughput of the cluster matches the tiered electricity price curve, so the goal of reducing the energy consumption of the cluster is achieved.
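The time-window logic behind that tiered-price strategy is easy to sketch. In the real setup the policy is enforced by IBM Spectrum LSF; the Python sketch below only mirrors the example from the demo (96 runnable jobs in the cheap 11 pm to 7 am window, a lower cap in the expensive hours), and the window boundaries and caps are illustrative assumptions, not real tariff data.

```python
# Minimal sketch of a tiered-electricity-price job cap, mirroring the demo:
# the cluster may run 96 jobs in the cheap 23:00-07:00 window and fewer
# otherwise. Windows and caps are made-up illustrations, not real tariffs.
from datetime import datetime, time

TARIFF_WINDOWS = [
    (time(23, 0), time(7, 0), 96),   # off-peak: run the cluster at full load
    (time(7, 0), time(23, 0), 48),   # higher-price hours: throttle throughput
]

def max_runnable_jobs(now: datetime) -> int:
    """Return the job cap for the tariff window containing `now`."""
    t = now.time()
    for start, end, cap in TARIFF_WINDOWS:
        if start <= end:
            if start <= t < end:
                return cap
        elif t >= start or t < end:   # window wraps past midnight
            return cap
    return 48  # conservative default

print(max_runnable_jobs(datetime(2023, 9, 20, 1, 30)))   # 96 in the cheap window
print(max_runnable_jobs(datetime(2023, 9, 20, 13, 30)))  # 48 in the expensive window
```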
We also leveraged the application resource management tool to find suggestions for energy saving, reliability, and performance. Turbonomic provides optimization suggestions, including delete, resize, move, suspend, provision, and reconfigure, with no impact on the business applications. This can be applied to different kinds of resources: servers, storage, virtual machines, application components, and databases. For example, we can shut down idle physical servers to improve the sustainability indicators. Now let's check back on the sustainability monitoring dashboard: the energy efficiency ratio has improved, and both the energy consumption and the carbon emissions have decreased.

We also use Envizi to capture and analyze the electricity and emission data to check the performance of our work. As you can see, there is detailed data on emission reductions. With more detailed data such as PUE, total electricity, and IT electricity synced into Envizi, we can have more insight into our business workloads and take more effective actions.

Sometimes we need experts to assist the data center technicians in resolving problems remotely. Maximo Assist helps reduce the time required to diagnose and repair equipment problems. For example, with the support of augmented reality and AI-powered guidance through a knowledge base of equipment data, the on-site technicians can easily get assistance from the Assist mobile app and successfully diagnose the problem and repair a malfunctioning cooling tower, which is a very critical facility for data center cooling.

From the demo we can see the different key components and optimizations that the full-stack platform provides, so let's now go through those components one by one.

First, the monitoring part: a real-time monitoring data pipeline for the whole stack, monitoring both IT and non-IT infrastructure. We actually deployed over 900 sensors to collect power consumption and environmental metrics like temperature, humidity, and air quality, and we monitor the whole software stack on bare metal, virtual machines, containers, and at the application level. We use both commercial and open source software in this data pipeline, and we especially rely on a lot of open source tools such as Prometheus and Zabbix, along with others like Instana, PRTG, Omniu, and so on. Together they make the whole solution flexibly customizable for different objectives and environments.

For each persona, there is a dedicated dashboard to visualize the metrics they are interested in. For example, data center general managers can see the real-time PUE, yearly average PUE, total power consumption, carbon emissions, and the power distribution among equipment types. The chief sustainability officer can have an overall reporting dashboard of the key computing metrics, including total power consumption, IT power consumption, and PUE. SREs can see more fine-grained power consumption by row or rack, or even at the virtual machine and container level. And analysts can analyze how the power flow is distributed within the whole system; they can build power models for each node and identify the hotspots of power consumption for further analysis of potential optimization opportunities.
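The real-time PUE on those dashboards is a simple ratio: the total energy drawn by the whole facility divided by the energy consumed by the IT equipment alone, exactly as described in the demo. A minimal sketch, with the numbers invented purely to illustrate the arithmetic:

```python
# PUE (power usage effectiveness) = total facility energy / IT equipment energy.
# A value of 1.0 would mean every watt goes to the IT gear itself.
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh

# Example with invented numbers: 1300 kWh drawn by the whole data center in an
# interval, of which 1000 kWh went to servers, storage, and network gear.
print(round(pue(1300.0, 1000.0), 2))  # 1.3
```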
(That key works but the forward key doesn't.) Okay, so for business-related management, application performance management is one of the key systems to ensure the availability and reliability of applications. In our heterogeneous environment we get one-stop observability of the whole environment, and we can visualize the mainframe, LinuxONE, x86, and public cloud in the same dashboard. Besides that, it allows our SREs to drill down into the details of the services with the tracing data to get real-time service interactions. It also lets us analyze services and individual calls, and even drill down to the code. So it gives SREs a very flexible way to diagnose problems and locate their root causes, or to find potential opportunities for energy optimization.

Then application resource management tools such as Turbonomic can help analyze the monitoring data and recommend optimization actions for continuous improvement of application reliability and resource utilization. The analysis results give you actionable insights about application failure avoidance, whether there are idle or over-provisioned resources, and how you can optimize those to reduce your total energy consumption. Users can also review the recommended actions and execute them with just one click.

About consolidating workloads to improve sustainability: they achieved a 75% power saving along with application performance improvement. High-density workload consolidation can improve the efficiency of power utilization a lot, while also guaranteeing the performance and reliability of the workloads. This is one example: they consolidated workloads from a cluster of x86 servers onto a LinuxONE server to achieve over 75% energy savings and a resource utilization improvement, and meanwhile the application performance and reliability improved as well.

Besides that, we can measure fine-grained power and energy consumption. For example, I presented the Kepler project at Open Source Summit last year as well; it gives you a detailed measurement of how each container or microservice is consuming energy. At the same time, you can leverage tools like Prometheus and Instana to understand the performance, so you understand the trade-off between application performance and power consumption. This year they also enabled the Kepler project on the LinuxONE platform, which allows application developers to get detailed microservice-level power consumption; in this way you can easily find the bottlenecks and improve your power consumption.

Intelligent, energy-aware, high-performance workload scheduling is also useful. They have intelligent workload scheduling strategies that aim to schedule the workloads based on different objectives, for example a lower power price, the best performance, or maximum resource utilization. Such scheduling strategies are available both within a cluster and across clusters.

For asset management, just as you saw in the demo, IoT data provides the real-time status of the equipment, and AI and AR techniques help analyze this data to provide the health status and life cycle of production assets. For any detected anomalies, the system can diagnose the problem based on the knowledge graph constructed across the different assets, and the on-site engineer can then maintain the equipment with technologies such as AI- and AR-based collaboration with remote engineers.

The environmental sustainability metrics can also automatically flow into the ESG reporting system, giving the chief sustainability officer the latest performance data in real time, at a granularity as fine as 15 minutes, so they can analyze the data and take actions to achieve their ESG goals immediately.
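To give a flavor of the Kepler-based microservice power view mentioned above, here is a minimal sketch that ranks containers by energy use from Kepler metrics scraped into Prometheus. The Prometheus endpoint and the metric and label names (kepler_container_joules_total, container_name) are assumptions; check what the Kepler version you deploy actually exposes.

```python
# Minimal sketch (assumed metric/label names): rank containers by energy used
# over the last hour, from Kepler metrics stored in Prometheus.
import requests

PROM_URL = "http://prometheus.example.com:9090"  # assumed endpoint

def top_energy_containers(n: int = 5) -> list:
    promql = (f"topk({n}, sum by (container_name) "
              f"(increase(kepler_container_joules_total[1h])))")
    resp = requests.get(f"{PROM_URL}/api/v1/query",
                        params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return [(r["metric"].get("container_name", "unknown"), float(r["value"][1]))
            for r in resp.json()["data"]["result"]]

if __name__ == "__main__":
    for name, joules in top_energy_containers():
        print(f"{name:>30}: {joules / 3600:.2f} Wh over the last hour")
```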
So this is really big teamwork, integrating a lot of commercial software and open source tools. This presentation is based on a big team's great work, and we appreciate all the contributors to this full-stack solution; their pictures are shown here. We welcome more discussions in the open source communities, and hopefully our practices can be useful to your own organization as well. Thank you. If you have more questions, please reach out to Fanjin and Yehua directly; those are their emails. Thank you. Thank you.

Okay, thank you for staying. Yeah, I myself mainly work on the container and Kubernetes platform, so I was part of the Kepler project and presented it last year. For the full-stack solutions, Fanjin and Yehua are the main persons to reach out to. Thank you so much.

You did very well, thank you for your presentation, much appreciated. Yeah, that was very impressive, seeing a 75% reduction. You said that they used LinuxONE for the hardware. Is this a specific operating system, or is this also a different architecture? Because from the slide it said from x86 to LinuxONE, and I'm like, oh, is that RISC? Is that something else that I have not understood?

Yeah, thank you for the question, that's a very good question. The architecture is different; this is a particular IBM-developed architecture.

Oh, okay.

A system architecture. I'm not in that team, so if you want to know more details about it, you can reach out to them directly.

Thank you. Very good. I think that was my main question from all this. It was very interesting to see all the open source technology that was being used, so Proxmox and TrueNAS, and as you mentioned Grafana and some others. So yeah, that was really good. I was also very impressed by Maximo Assist, that was very interesting.

Yeah, I think Maximo is also used internally in IBM. And I think the lesson we learned through this is really that by leveraging all sorts of open source tools we can achieve something more complete, and we can even provide solutions that are more customizable, with a lot of flexibility, especially for example for the data sources, right? You can integrate all different types of exporters into Prometheus, and the different tools give you different capabilities, such as tracking the energy flows, or tracking the request flows of an application using tracing in Instana, et cetera.

When you were limiting jobs, so you went from 96 to 24 during a high-cost period of time, that was very much in the application world, right? Not in the hypervisor world at all. It would be nice, or interesting, to see a way to try and reduce the workload. Like, I know Kubernetes could probably get tied into that, for example, right? I saw Portainer was being used.
I'm not sure what type of controls are available there, but yeah, it would be really interesting to see that as well: you've got a huge swarm based on the demand, but then, oh, it's a period of time where we need to reduce costs, so we're going to actually claw that back.

Yeah, in that example we are actually leveraging Turbonomic. Turbonomic specifically identifies the idle resources and whether you are significantly over-provisioning resources at different layers, including containers, virtual machines, et cetera. It will give you recommendations about whether you want to resize the VMs or resize the containers, so you use a smaller amount of resources and reduce the waste of resources.

That's really interesting. One last question: Envizi and Turbonomic, were those proprietary cloud applications, or are they something that you can host on-prem, and are they open source?

Turbonomic and Envizi are not open source.

Are they cloud-based, or is it something that you still...

Those are cloud-based solutions, so you can easily use them.

Cheers.

Oh boy, these chairs are tight, so I'm just going to reach. Thank you so much, it's great. I'm also in the CNCF Environmental Sustainability TAG, and it's great to see your talk today. I really like the personas that you showed in the Grafana dashboards. I think, yeah, so much of this is around the people and the process, and then the technology that follows. So I was wondering what data points you would associate with, for example, SREs: what kind of data points in terms of energy consumption would you find most useful to track and to lead to these optimizations?

Yeah, that's a very good question. Sure, as I introduced in the presentation, a range of different personas care about different types of data. The CSOs, the chief sustainability officers, are concerned about the whole power consumption of the data center and whether there are significant changes over the last 24 hours or over the last month. SREs are more concerned about performance issues, whether the applications have problems or are still consuming a lot of energy, and they really dive into the details of where the energy consumption bottlenecks or hotspots are and where the performance bottlenecks are. They need to see all the communications and dependencies between different microservices in a large-scale application, so data like tracing data, performance data, and microservice-level energy consumption data are pretty useful to them. And then for a particular product, our asset managers and engineers are concerned about a particular device, how much it is consuming and whether it functions well, so the IoT data and the device-level energy consumption data are useful to them.