But it is very well known in the HPC community, because it has already been involved in many co-design processes, and many of the supercomputers around the world have been co-designed in terms of architecture, from the earlier machines up to, more recently, the CINECA systems. For the CINECA computer we were able to co-design the architecture, and we announced that in Munich.

Thank you, Carlo. Well, thanks for staying this long. I will talk about my experience with co-designing systems like SuperMUC, the experience I had with IBM and Lenovo, working with some customers on that topic, and I will focus mostly on the power problem. [Some microphone trouble.] OK, it works now. Thank you.

So what we see right now is that the chips are becoming hotter and hotter. Maybe ARM will change that; we will see when they have SVE instructions. But for a while we were seeing chips at, I would say, 120 or 150 watts, and in the past 12 months we see chips which are becoming hotter and hotter. Of course there were already GPUs at 300 watts, but even the Xeon processors are now, with Skylake, up to 205 watts, and we believe there will be other chips which will be even hotter. That means it will be difficult to cool these very hot servers without sacrificing density: fans can do something, but at some point, if you want density and hot chips, you need something other than air cooling. And that is the work we have been doing around water cooling.

So how do we work on energy efficiency? First, of course, you can use processors with higher flops per watt. That has been most of the discussion here already, so I will not talk about it. Then you can use different types of cooling, and I will present the work we have done since 2010-2011 on water cooling. The third topic I will cover is software, and there will be some overlap with what was presented by Luca Benini and Pietro from CINECA. I have little time anyway, so I will go very quickly; if you want more information, send me an email.

So first, direct water cooling. We started that in Germany in 2012 with SuperMUC Phase 1. By the middle or end of this year we will have installed, sorry, not 4,000 but 24,000 nodes. And it has spread from Germany to a lot of different countries in the world: China, Singapore, India, and so on. That is a great success.

So how did we do that? As I said earlier, we started with SuperMUC Phase 1 in 2011-2012, about 9,000 Sandy Bridge nodes. We installed our first generation, and we were IBM System x at that time, our first generation of hot water cooling. You will understand a little later why hot water cooling and not chilled water cooling: so first water cooling, and then hot water cooling. And also our first generation of the energy aware scheduler, which I will talk about later. With that, of course, we achieved a PUE of 1.1.
And we were able to save about 37% on electricity. Germany has a very expensive kilowatt-hour, so that saved about 10 million euros over the lifetime of SuperMUC. Then we installed the second generation of the water cooling system, essentially the same water cooling technology, and we are now at the third generation.

So we started with iDataPlex, IBM System x. Then we introduced NeXtScale, which was a different form factor. When we introduced water cooling on iDataPlex it was not planned; it was an addition to the system which we did because of LRZ. We had prototypes, but no plan at that time to productise it as a product, and we did it for LRZ. Then we did the second generation on NeXtScale, and we installed the first water-cooled NeXtScale systems again for LRZ, for SuperMUC Phase 2. And finally, I think in March this year, we will introduce the third generation of water cooling, which is in the same form factor as NeXtScale.

So what have we been doing? When you install water cooling, there are a few questions. The first is: does it leak? The answer is no. The technology uses copper tubes, it does not leak, and we have a lot of experience with it. The second question is: does it pay off? That is a TCO calculation: on one hand you need to save energy, and I will show that, and on the other hand it needs to be as cheap as possible. For example, in the first generation the direct water-cooled product was about 10% more expensive than an air-cooled product. With the latest generation we are at about 5% to 6%, so from a TCO perspective it becomes usable even in countries which do not have a high electricity price. That is one part.

The second part is improving the cooling technology. We are now at 90% heat to water: we water cool everything on the board except the power supply. There was a lot of discussion with LRZ about doing a water-cooled power supply, which we could do, we have a prototype of it, and that would have increased the heat to water to 95%, so 5% more. In the end we did not do it, because we were able to demonstrate that it is not worth it from a TCO perspective: the 5% you gain, given the very high price of a water-cooled power supply, which is not an industrial product, versus the price you would pay for the additional chilled water you need to cool those 5%. And if you start using adsorption chillers, then it is not worth it at all, and that is what we demonstrated. Finally, you need to be flexible and scalable, and that is what we have done with the manifold.

Very often we talk about PUE, and everybody knows what PUE is: it is a measure of cooling efficiency, the total facility power divided by the IT power. The ideal value is 1, and in many centres that do not do water cooling it is usually 1.4 or, more often, 1.6. ITUE is about the effectiveness of the node itself: how efficient the power supply is, how efficient the fans are, and so on. And finally, a thing which nobody talks about, or not enough, is the energy reuse effectiveness.
You produce a lot of heat with such a system, and I will show numbers; do you make any reuse of that heat? Usually not. So this metric measures how much reuse you do: ERE is the facility power minus the reused energy, divided by the IT power. The ideal value is 0, and if you do nothing, which means the reused energy is 0, then ERE equals PUE. So the goal is to reuse the waste heat as efficiently as possible. Of course you can use the heat to warm a building or heat a swimming pool, this kind of thing, but that is not very professional or very industrialised.

So again, this was work done with LRZ in the 2014 timeframe with a company called SorTech, which is now called Fahrenheit. They have produced a new generation of adsorption chillers, and with a new material called zeolite they are able to produce chilled water from hot water that is only 60 degrees C. With that you get a COP of 0.6, which means that with one megawatt of hot water at 60 degrees you can produce 600 kilowatts of chilled water. That was done as an experiment for the CooLMUC-2 system, and here we see the measurements: the power consumption, the heat to hot or warm water, and the COP produced, and that is how we measured this COP of 0.6. We need hot water at about 60 degrees, and that is why we have increased the temperature from 45 to 50 degrees in our latest generation, to be able to use adsorption chillers efficiently, and we will see that with the SuperMUC-NG system, the next generation, which I will talk about later. At some point, tell me how much time I have left.

Anyway, here is a simple example of energy cost. Suppose you have a one-megawatt system. First, when you do water cooling, you decrease the node power consumption by 10%; I will not go into the details, but it is 5% from the fans and 5% from the processor itself. So you gain 10% and the one megawatt is reduced to 900 kilowatts. Then, besides the water-cooled system, we have, say, 100 kilowatts of other equipment, storage and network, which is cooled with rear-door heat exchangers or CRAC units. And we have the PUE values: 1.6 for air, 1.4 for chilled water, 1.06 for warm water; those are, again, real numbers coming from LRZ. At the end you can compute the total power based on these different PUEs and the total IT power. I will not go into all the details, but what is important is that if you fully reuse the heat, which means you take, in this example, the 900 kilowatts of hot water and use it to cool not only the devices of your supercomputer but also the other devices in the data centre, which is what I call full heat reuse, then you are able to reduce the total power consumption by 52%, which is amazing. Of course, very often you will not do full heat reuse, because the data centre is not capable of it, and you will do partial heat reuse, which is what we will do at SuperMUC-NG. But if you do full heat reuse, then the energy cost is multiplied by ERE divided by PUE, and with an ERE of 0.3 and a PUE of 1.1 you find these savings of 52%.
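As a rough illustration of how these metrics fit together, here is a minimal sketch, not the actual LRZ calculation; the numbers are just example values from the talk (a 1 MW IT load, hot water at about 60 degrees C, a chiller COP of about 0.6), and the function names are my own:

```python
# Illustrative only: the data-centre energy metrics discussed above,
# with example figures from the talk plugged in.

def pue(facility_kw: float, it_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT power (ideal value 1.0)."""
    return facility_kw / it_kw

def ere(facility_kw: float, reused_kw: float, it_kw: float) -> float:
    """Energy Reuse Effectiveness: (facility power - reused power) / IT power
    (ideal value 0.0). With no reuse, ERE equals PUE."""
    return (facility_kw - reused_kw) / it_kw

def chilled_water_from_waste_heat(hot_water_kw: float, cop: float = 0.6) -> float:
    """Adsorption-chiller output: chilled-water capacity produced from ~60 C hot water."""
    return hot_water_kw * cop

it_kw = 1000.0                # 1 MW of IT equipment
facility_kw = it_kw * 1.06    # warm-water-cooled site, PUE around 1.06

print(pue(facility_kw, it_kw))               # ~1.06
print(ere(facility_kw, 0.0, it_kw))          # no reuse -> ERE == PUE
print(ere(facility_kw, 600.0, it_kw))        # reusing 600 kW of heat drops ERE to ~0.46
print(chilled_water_from_waste_heat(900.0))  # 900 kW hot water -> ~540 kW chilled water
```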
So, in a nutshell, with SuperMUC Phase 1 and 2 we were able to save 25% on the cooling, because we were doing free cooling all year long at 45 degrees C, and 10% on the processor and node, plus a few more percent from the software I will talk about later.

Then with SuperMUC-NG, which we are starting to install now, it is about 6,000 Skylake nodes at 205 W. In fact, because of water cooling, we may run the processors at a higher power than 205 W: water cooling keeps the processor temperature down, so we think we could run them at 240 W. We expect it will be in the TOP500 in November with about 20 petaflops, which is amazing for a non-accelerated system. We will be doing waste heat reuse with adsorption chillers installed by SorTech Fahrenheit, with the savings I explained before, and finally we will introduce the next generation of the energy aware scheduler, which is now called the energy aware runtime. With all that, we will be at about 50% savings, depending on whether you do partial or full heat reuse. And by the way, we committed to the power consumption of the system, which means we, Lenovo, pay the electricity bill; if we are wrong in these numbers, we pay for it. We did the same for SuperMUC, and I believe we will do what we said.

So now a little bit about software, and that is where I will overlap a little with Luca Benini; we have a very different approach, but basically it is the same story. It is about monitoring performance and power in real time, and taking actions on the applications while they run, or on the nodes where the applications run.

We started this work, again at LRZ, with SuperMUC, where we developed an extension of what was at that time LoadLeveler. We were IBM, I was at IBM, and we developed an extension to LoadLeveler called the energy aware scheduler. There were, in a way, three phases. First there was a learning phase, we did not call it a learning phase at the time, where we gathered information based on kernels, micro-kernels selected from the NAS benchmarks, and collected basic information on each node, which we stored in a database. Then a user submitted a job for the first time, and, at that time, six years ago, we have made some progress since and I will explain that, the job had to be run with a tag, an identifier of the job, and while the application was running we collected a lot of information, from the application this time. So in the learning phase it is hardware information, and in this phase it is application information. We then merged them, and when the job was resubmitted another time with the same tag, we computed, before the job was launched, the optimal frequency to meet some performance or power targets. So it was a two-step process and it was rather static. Anyway, it worked, it has been running on SuperMUC for a few years, and there was even a case study done with LRZ, presented at ISC 2014. Of course, inside this thing there is a power and performance model, and the study showed that this model was quite accurate: about 5% error on average.
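To make the static scheme a bit more concrete, here is a minimal sketch of the idea under my own assumptions: the per-frequency table, the memory_bound_fraction field and the toy prediction are invented for illustration and are not the LoadLeveler extension's actual model, which, as mentioned, had about 5% average error:

```python
# Illustrative sketch of the static, tag-based scheme described above.
# All data and the trivial prediction are hypothetical.

# Learning phase: per-node, per-frequency data gathered once with micro-kernels.
node_profile = {
    2.7e9: {"power_w": 280.0, "rel_perf": 1.00},
    2.3e9: {"power_w": 230.0, "rel_perf": 0.93},
    2.0e9: {"power_w": 200.0, "rel_perf": 0.85},
}

# First tagged run: application-level information collected while the job runs.
job_profile = {"tag": "app_run_1", "memory_bound_fraction": 0.6}

def pick_frequency(node_profile, job_profile, max_perf_loss=0.05):
    """Before a resubmission with the same tag, pick the lowest frequency whose
    predicted performance loss stays within the allowed degradation."""
    for freq in sorted(node_profile):  # lowest frequency first
        data = node_profile[freq]
        # Toy prediction: memory-bound phases lose less performance at low frequency.
        predicted_loss = (1.0 - data["rel_perf"]) * (1.0 - job_profile["memory_bound_fraction"])
        if predicted_loss <= max_perf_loss:
            return freq
    return max(node_profile)  # fall back to the highest frequency

print(pick_frequency(node_profile, job_profile))  # -> 2.3 GHz in this toy example
```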
And with that scheduler, without any performance degradation, they were able to gain 5% to 10% in power savings. So that was at that time.

Then, when we became Lenovo, we restarted that work and started to think about a new generation, which we called the energy aware runtime. It is in this co-design session because, on top of targeting large systems like SuperMUC-NG, of course when we started the work we did not know we would win SuperMUC-NG, but we were very much thinking about it. Anyway, we started this work with BSC, because BSC has a lot of skills in the area of performance and application monitoring.

So what did we do together? In a way we took the basis of the energy aware scheduler, but we added a few things, and one of them is dynamicity. With LoadLeveler we needed a tag, and only once the job had run completely did we calculate an average frequency for the whole run. Here, and this is mostly what BSC added, we have what we call automatic outer-loop detection. The job does not need to have been run once before; you just submit the job, and the library is linked with your application. We still have the learning phase as before, no change. But then, when the application is launched, we detect, so we need an MPI application because we use the MPI calls to detect the periodicity of the calls, and based on that periodicity we determine the loop structure of the code. What we are after is the outer loop, because our first goal, at least, is to be able to tune the frequency at the outer-loop level, and we are able to determine the outer loop after a few iterations. When we do that, we collect about the same information we already collected with the energy aware scheduler: CPI, cycles per instruction; GB/s, the memory bandwidth, reads plus writes; and then power and time. With that we use a performance and power model: the application is running at some frequency, whatever it is, and with this model we project the time and the power at any other frequency.

Then, based on an energy policy, or policies, which are defined when the job is launched, either you want to minimise energy to solution or you want to minimise time to solution. Minimising time to solution means you try to increase the frequency, but increasing the frequency does not always improve performance, so you impose a threshold of performance efficiency. For example, you say: I am fine to increase the frequency as long as the performance efficiency is at least 60%. Based on the calculations done here, it will select the highest frequency that meets that goal. The other policy is the opposite: you want to minimise energy to solution, so you try to find a frequency which is lower than nominal but improves energy, without going beyond a maximum performance degradation. You say: I am fine to save energy, but I do not want my performance degradation to be more than 5%. And again, it will select that frequency automatically. So, in a way, we collect this information at initialisation, we collect it at the first loop, it is totally transparent to the user, and then at the next iterations it sets the frequency and keeps watching what happens.
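Here is a minimal sketch of how two such policies could pick a frequency from the projections; the projection table, the policy names and the thresholds are made-up illustration values, and in the real runtime this is driven by the model fed with the CPI, GB/s, power and time measured at the detected outer loop:

```python
# Illustrative sketch of a "minimise time" and a "minimise energy" policy.
# Projected (time, power) per frequency for the current outer loop; invented numbers.
projection = {
    2.7e9: {"time_s": 90.0, "power_w": 270.0},
    2.3e9: {"time_s": 100.0, "power_w": 230.0},   # nominal frequency
    2.0e9: {"time_s": 112.0, "power_w": 200.0},
}
NOMINAL = 2.3e9

def min_time(projection, min_perf_efficiency=0.6):
    """Raise the frequency only while the extra speed is worth it: the speedup per
    unit of frequency increase must stay above the efficiency threshold."""
    chosen = NOMINAL
    for freq in sorted(f for f in projection if f > NOMINAL):
        speedup = projection[NOMINAL]["time_s"] / projection[freq]["time_s"]
        freq_ratio = freq / NOMINAL
        if (speedup - 1.0) / (freq_ratio - 1.0) >= min_perf_efficiency:
            chosen = freq
    return chosen

def min_energy(projection, max_perf_degradation=0.05):
    """Pick the frequency with the lowest projected energy, as long as the projected
    slowdown versus nominal stays within the allowed degradation."""
    chosen = NOMINAL
    best_energy = projection[NOMINAL]["time_s"] * projection[NOMINAL]["power_w"]
    for freq, p in projection.items():
        slowdown = p["time_s"] / projection[NOMINAL]["time_s"] - 1.0
        energy = p["time_s"] * p["power_w"]
        if slowdown <= max_perf_degradation and energy < best_energy:
            chosen, best_energy = freq, energy
    return chosen

print(min_time(projection))    # 2.7 GHz here: the speedup efficiency is above 60%
print(min_energy(projection))  # stays at nominal: 2.0 GHz would slow down by more than 5%
```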
Then, every so many iterations, we again collect what we call the application signature, to see if there has been any change in the application behaviour, in the application profile. If so, we reapply these policies to find the new optimal frequency. So it is totally transparent.

Then, doing this work, and this was not planned at the beginning, we introduced a third policy which is no policy at all: just collect the information, because we believe this information comes for free in a way. By the way, it is a distributed architecture, so we have no central daemon; every correction is made at the node level, so if, for example, the nodes are at different temperatures, you can even manage that. And the overhead is very small, less than 1%, because we collect very little information. Anyway, this information can be useful, and we can add other things; we are adding FLOPS, of course. We believe that even people who are not interested in the frequency control can still be interested in the information we collect, so what we are doing is building a GUI to monitor power and performance in real time or post mortem. That will be a side product, in a way.

So this work has been done with BSC, and LRZ will be one of the customers, maybe the first one, to use it on their system. As a side effect, we will add this feature to something else we are introducing, the Lenovo Intelligent Computing Orchestrator, LiCO, which is a stack for AI and HPC. This stack is a collection of OpenHPC modules with a few additions, and among these additions we have the energy aware runtime which we developed with BSC.

So in conclusion, we believe that co-designing with end users, data centres, and research labs is very useful, and I think the example of what we did with both LRZ and BSC is a good proof of that work. Thank you.