...energy efficiency and the evolution of architectures for energy efficiency, which is how we started collaborating with him. We have to understand what we can do at the start, on the energy-sector side, in order to exploit what in the future could allow us to save energy using a modern CQM. So, many thanks for the introduction, and let me get started with the presentation. It is going to be very much hardware-centric, not so much on the applications.

This is the outline. I will go very fast on the first part, because I guess I am preaching to the converted, and then I will go into more detail on the work we have done on power and thermal management and monitoring. I will close with some conclusions on future energy-efficient hardware.

Essentially, the problem is that projected exascale machines have unreasonable power budgets. This is improving with technology, with better architectures, and so on, but it is still quite far from the exascale target: we need almost an order of magnitude more energy efficiency from the platform. We also have to consider that cooling is in the picture; cooling is increasingly problematic and adds to the overall power budget. So we need to worry about average power over quite long time constants, and this brings in the problem of dynamic power management at the level of the system, the node, and the architecture.

That is not the only problem; we also have a maximum-power problem. With scaled technologies there is more and more heterogeneity, also in terms of fabrication variability on the chips. On the same die, for example, we can have very significant thermal gradients created by the workload, but also by the variability of the chips themselves. The same is true at a larger spatial scale, on the HPC system.
If your cooling is not perfect and your machines are not equally loaded, you can have quite significant thermal gradients across the system, and this brings the need to power-cap your machines, or your chips, locally to meet the maximum thermal budgets. We call this issue dynamic thermal management.

Just to clarify: dynamic power management and dynamic thermal management are multi-scale problems. They are present on the chip, on the compute node, on the rack, and on the HPC cluster, and they also interact with the cooling. This is what we have been working on in my ERC project: trying to develop a strategy that goes across the various layers, a multi-scale thermal and power management.

From the point of view of the workload, there is the way we architect our machines: a higher-level scheduling model, where the various jobs are dispatched onto the machines, and a programming model. Today I am mostly going to consider MPI-based programming models, which are quite pervasive, with their patterns of communication in terms of synchronization, broadcast and gather, scatter and reduce. Having an interaction between power and thermal management on one side and the programming and scheduling models on the other is essential to get the best results. So it is not only a hardware problem; it is really a vertical hardware-software problem.

Looking at what we have done in this area, first of all we needed to look at the knobs you can play with for power and thermal management. Essentially, the knobs are in terms of frequency control. Here I am using Intel CPU terminology, but most of the hardware used in high-performance computing has similar control knobs.
The first is dynamic voltage and frequency scaling, where, core by core, you can decide on the trade-off between performance and power by playing with voltage and frequency; you have a set of operating points that in Intel terminology are called P-states. Then there is the management of idleness: if you have long periods of idle time, you can use the C-states, where you shut off part of the resources with various levels of aggressiveness. The larger the saving, the longer the time to go back to work, and so the longer the penalty for duty-cycling in and out of the deeper sleep C-states.

Most of this is currently supported through a hardware abstraction that for Intel is called RAPL (Running Average Power Limit), where essentially you declare what you need from the machine, for example a power cap, and the machine will use these states to try to meet your power target.

We did quite a lot of work on understanding how this type of controller works. In hardware, they work as a sort of reactive control loop: you have a limit, induced by thermal constraints or average power, and the controller tries to settle around that limit using feedback control. You have the usual problems of overshooting when you hit the limit and then having to pull back, you have settling problems, and the stability-versus-accuracy trade-off is quite serious. So what you see is not really ideal: these controllers do not react instantaneously to what you need. We therefore did a lot of work on moving from this reactive control to more predictive control, and prediction can be done at various time scales.
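As a concrete illustration of the power-cap abstraction, here is a minimal Python sketch of reading and writing a package power limit through the Linux powercap sysfs interface exposed by the intel_rapl driver. The sysfs path and zone numbering vary by machine, the helper names are my own, and writing requires root; this is a sketch of the interface, not the controller discussed in the talk.

```python
from pathlib import Path

# Typical location of the package-0 RAPL zone on recent Linux kernels;
# zone numbering and available constraints differ across machines.
RAPL_LIMIT = Path("/sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw")

def read_power_limit_w(limit_file: Path) -> float:
    """RAPL limits are exposed in microwatts; convert to watts."""
    return int(limit_file.read_text().strip()) / 1e6

def set_power_cap_w(limit_file: Path, watts: float) -> None:
    """Write a package power cap in microwatts (needs root on a real system)."""
    limit_file.write_text(str(int(watts * 1e6)))
```

The hardware then uses its P- and C-state machinery to keep the running average power under the written limit.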
At the hardware level, you essentially replace these threshold-based feedback controllers with what we call model predictive control. Model predictive control is based on building a model of your hardware, for example an approximate compact thermal model of how the temperature distribution of the chip changes under different workloads and different baseline temperature conditions. You build this model with, for example, a lumped RC-based approach, and based on it you can get a longer-horizon prediction of what is coming: how the workload will change and how the dynamics of your thermal system will evolve the temperatures.

To give you an idea, on the chip this thermal prediction horizon is on the order of milliseconds to a few seconds. The approach also scales up to the rack and to the full data center. At the rack scale, the time constants of your model give you a prediction window on the order of several tens of seconds to minutes, and at the full data-center scale even several minutes. This gives you a lot of headroom to plan ahead what you are going to do with the machine, instead of just reacting to thermal and power emergencies.

For example, at the full data-center scale you can use a predictive model over the duration of an entire job. In this case you need some sort of average-power prediction for your jobs, and you can then use the models you build to direct the scheduling of the jobs on the machine: you schedule the jobs to account for requirements in timeliness and so on, but also for the power cap.
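To make the RC-based predictive idea concrete, here is a minimal sketch of a one-node lumped RC thermal model integrated with explicit Euler, plus a crude MPC-style search for the largest power cap whose predicted temperature stays below a limit over the horizon. All parameter values (thermal resistance, capacitance, ambient temperature) are invented for illustration; a real controller would identify them from measurements and use a spatial model with many RC nodes.

```python
def predict_temps(t0_c, power_w, r_kpw=0.5, c_jpk=2.0, t_amb_c=40.0, dt_s=0.01):
    """One-node lumped RC thermal model, explicit Euler step:
    C * dT/dt = P - (T - T_amb) / R."""
    temps = [t0_c]
    for p in power_w:
        t = temps[-1]
        temps.append(t + (dt_s / c_jpk) * (p - (t - t_amb_c) / r_kpw))
    return temps

def max_safe_power(t0_c, t_max_c, horizon_steps, candidates_w, **model_kw):
    """MPC-style choice: largest candidate power whose predicted
    trajectory never exceeds t_max_c over the prediction horizon."""
    for p in sorted(candidates_w, reverse=True):
        if max(predict_temps(t0_c, [p] * horizon_steps, **model_kw)) <= t_max_c:
            return p
    return min(candidates_w)
```

The point of the model is exactly the headroom mentioned above: instead of reacting after a threshold trips, the controller can see that a given cap would eventually violate the limit and pick a lower one in advance.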
This type of predictive power modeling provides better utilization of your machine and also lets you manage power capping, both for thermal reasons and for power saving. Overall, this past work exposes the usual trade-off between hardware and software: software can decide on more, but it is slow; hardware is more focused and cannot decide on much, because it does not have much visibility into the future, but it is much faster. To support all this you need low-overhead, accurate monitoring; scalable data collection, analytics, and decision-making; and application-level awareness.

Let me cover a little of what we have done recently on these requirements, starting with low-overhead, accurate monitoring. Currently, the profiling you get of your workload and of its thermal and power behavior has sampling times on the order of one second. To improve the accuracy of what you see of the machine, you would like to go to millisecond-level profiling, so that you can track power and thermal emergencies and so on. If you want to boost your sampling rate by several orders of magnitude to track what is going on at all levels of the machine, you have to deal with much more aggressive specifications, both in sampling rate and in the amount of data produced that you need to handle.

You can be even more aggressive. This, for example, is looking in the frequency domain at what happens to your hardware: a real-time Fourier analysis of a power supply. It is extremely interesting, because by doing frequency analysis of your power supply you can, for example, detect the characteristics of the applications, and you can also detect, in a very predictive and very robust way, which nodes are not performing according to expectations.
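The frequency-domain analysis described above can be sketched in a few lines: take the spectrum of a sampled power trace and find its strongest non-DC line. Shifts of that peak between runs, or between nodes running the same job, hint at workload changes or at nodes behaving differently. This is an illustrative sketch, not the actual on-board implementation.

```python
import numpy as np

def dominant_frequency(power_trace, fs):
    """Return the strongest non-DC spectral line (in Hz) of a power trace
    sampled at fs samples/s. Subtracting the mean removes the DC bin."""
    spectrum = np.abs(np.fft.rfft(np.asarray(power_trace) - np.mean(power_trace)))
    freqs = np.fft.rfftfreq(len(power_trace), d=1.0 / fs)
    return freqs[np.argmax(spectrum)]
```

For example, a node whose power draw oscillates with the application's iteration period will show that period as a clear spectral peak, and a node that lags its peers shows the peak at a slightly different frequency.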
You can look at peaks moving to see changes in the workload and also changes in your hardware, such as slow deterioration and things like that. But this requires a very high sampling frequency, and it has to be done at the scale of the entire data center.

To support this high-frequency, high-precision monitoring, we developed what we call the Dwarf in a Giant solution. Essentially: provide an embedded power monitor on the nodes of the machine, collect the data, and feed it to a database that aggregates, processes, learns, and analyzes. Of course, at the high frequencies we discussed before, a huge amount of data has to be filtered here, so you need a very scalable infrastructure. At the limit this is not sustainable, so in many cases what you want is to partition your data analytics: not everything on the central database that is monitoring the machine, but also spreading part of the computation onto the embedded platforms on the nodes that are actually doing the monitoring.

What you add to each node is a small, tens-of-dollars embedded board that is capable of acquiring the data but also of processing it. This is the deployment at scale on the D.A.V.I.D.E. machine that is now in production. The board is a very small BeagleBone Black that costs a few tens of dollars and is deployed node by node. It has a dedicated current sensor and monitors the overall node power consumption, and it has sufficient on-board resources to do edge computing and also some basic forms of learning on the power models of the machine. It is quite platform independent, because it is added onto the motherboard; we have demonstrated it on Intel, IBM, and ARM platforms. It has sub-watt precision and a very high sampling rate.
It samples up to several tens of kilosamples per second, which is several orders of magnitude better than the state of the art, where even the most advanced machines reach one-millisecond sampling, that is, one kilohertz. What is also interesting is that this is a multi-core platform, with of course much simpler cores than the ones in the main machine, and it also has a real-time processor. This real-time processor can be used to compute, with very high timing precision, things like the Fourier transform, or averaging, or filtering of your power profile. This also offloads the ARM processor, which is then only in charge of communication with the rest of the system.

To give you an idea of what the Dwarf in a Giant solution provides: you get monitoring up to five kilohertz, you can do edge analysis, including fast Fourier transforms at the edge, and you can do simpler analyses such as averaging or peak detection. You can see that, with respect to the fastest measurement systems available today in standard machines, this is several orders of magnitude faster.

One important point, if you project this onto a multi-node setup, is the time synchronization between the measurements you get. With time synchronization based on NTP or on PTP, we can get time-stamped samples out of the system with a time resolution on the order of microseconds. So it is fine-grained enough to see things happening on a large-scale system, like phases of the application across the various nodes, with very high precision.

Now, on the data-collection side, all these sensor nodes, the Dwarf in a Giant nodes, push their data into a big-data-style scalable monitoring infrastructure based on open-source big-data tools.
MQTT is the standard protocol we use for the sensors to communicate values and timestamps to a collection point, the broker. The broker feeds the samples to subscribers; a subscriber is nothing other than a database entity that subscribes to the information channels produced by the various nodes and stores them in a column-oriented database, which can then be queried through a time-series front end to analyze what has been collected by this distributed infrastructure. The typical use is exactly this: you collect the data, store it in a column database, and then perform searches and analytics on the stored data, over a one-, two-, or three-week time window, depending on how much storage you can afford.

The other, more advanced flow is to perform stream analytics directly on the data coming from the brokers. These two flows are not mutually exclusive; they can be applied at the same time. We do this with Spark, with in-memory stream-based computing. In the Spark flow, the data comes in, is taken into a buffer, and is clustered; this is then used to build, for example, a power model or a predictive model, which becomes, if you like, a virtual sensor that again produces data, digested data. This digested data can be subscribed to by a higher-level database that does not need all the raw data, only the data after the filtering done by the stream processor. So we can apply a sort of hierarchical filtering of the data, on the node but also at the global scale.

This infrastructure is deployed on Galileo and is producing data at around one terabyte per week. If we project this to a full-scale machine, an even larger Tier-0 machine, we would have ten terabytes in half a week. So stream analytics and distributed processing are really a necessity with this type of data.
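The broker-and-subscriber pattern above can be illustrated with a minimal in-process stand-in. A real deployment would use an MQTT client library and an actual column store; the topic name and payload fields below are made up for the example.

```python
import json
from collections import defaultdict

class Broker:
    """In-process stand-in for an MQTT broker: nodes publish timestamped
    samples on a topic; every subscriber to that topic receives them."""
    def __init__(self):
        self.subscribers = defaultdict(list)
    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)
    def publish(self, topic, payload):
        for callback in self.subscribers[topic]:
            callback(topic, payload)

class ColumnStore:
    """Toy column-oriented sink: one time-ordered list of samples per topic."""
    def __init__(self):
        self.columns = defaultdict(list)
    def ingest(self, topic, payload):
        sample = json.loads(payload)
        self.columns[topic].append((sample["ts"], sample["watts"]))

broker = Broker()
store = ColumnStore()
broker.subscribe("power/node01", store.ingest)   # hypothetical topic name
broker.publish("power/node01", json.dumps({"ts": 0.0, "watts": 212.5}))
```

A stream processor would sit in the same position as `ColumnStore`: it subscribes to the raw topics, digests the samples, and republishes the reduced data on a higher-level topic, which is the hierarchical filtering mentioned above.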
With this information, we can perform a lot of application-aware energy-to-solution minimization. For example, we have done some work on Quantum ESPRESSO, analyzing the idleness in the application at very fine grain. We did this by adding a similar monitoring infrastructure to the MPI runtime: a very low-overhead instrumentation of MPI that does not require changing the application code, since it is embedded in the MPI calls themselves. You then get information about application time and MPI time at very fine grain with minimal overhead.

With this very accurate monitoring, you can get data like this. What you see here, for example, is a single-node run of Quantum ESPRESSO with ndiag equal to 1: there is a lot of time spent in MPI, because the workload is imbalanced. If you go to ndiag equal to 16, parallelizing the diagonalization, things get much better, but you still have some idle time. So we explored how you could exploit this idle time to improve energy efficiency.

The idea is to apply application-aware voltage and frequency scaling: when you are waiting on an MPI synchronization, you slow down your processor, and you wake up and accelerate when you are in an application phase. During the waits you save power. The challenge is that these transitions take time, and if you apply them to very fine-grained idle times, you actually decrease the performance of your application. This is what happens in practice, and it shows why this profiling of application phases is useful: in the case of ndiag equal to 16, you can get 12% power savings and 11% energy savings without any slowdown of your application. The situation is not the same when the run is highly balanced.
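The trade-off between the savings from down-clocking during an MPI wait and the cost of the frequency transitions can be captured in a back-of-the-envelope check like the one below. The power numbers and transition latency in the test values are invented; a real runtime would measure them per platform.

```python
def dvfs_saves_energy(idle_s, p_high_w, p_low_w, transition_s, p_trans_w=None):
    """Down-clocking during an idle (MPI wait) phase pays off only if the
    phase is long enough to amortize the two frequency transitions
    (down at entry, up at exit).  Compares the energy of staying at the
    high state against transitioning down and back."""
    if p_trans_w is None:
        p_trans_w = p_high_w  # pessimistic: full power while switching
    if idle_s <= 2 * transition_s:
        return False          # the wait ends before we are back up to speed
    baseline_j = p_high_w * idle_s
    scaled_j = 2 * transition_s * p_trans_w + (idle_s - 2 * transition_s) * p_low_w
    return scaled_j < baseline_j
```

This is exactly why the fine-grained phase profiling matters: the imbalanced ndiag-equal-to-1 run has long waits where the check passes, while a well-balanced run has waits so short that scaling costs more than it saves.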
You get very minimal savings, and you have to be very careful, because if you do power management at too fine a grain, you actually slow down your overall application. So again, this fine-grained, high-precision monitoring is useful for this type of optimization.

Let me close with a few slides about what we can think about doing next on energy-efficient hardware. So far I talked about improving the utilization of the machine; let us see how we can improve the machine itself. The big message, which has already been given today, is the massive presence, indeed the absolute dominance, of accelerators in the Green500. The reasons why accelerators deliver better energy efficiency from the hardware viewpoint have already been explained: many thousands of simple cores that maximize floating-point throughput per square millimeter; a less legacy-oriented and simpler memory hierarchy, not so strictly bound to coherence and so on, which improves how much of the on-chip memory you can use in a useful way; large memory bandwidth through tightly coupled memories; low operating voltage and relatively moderate operating frequency, which keep the watts per square millimeter under control; and a gradual migration from 2D to 3D in attaching memories to the cores.

So, is there room for differentiation over these ideas with respect to GPGPUs, or are GPGPUs already doing it perfectly? There is some hope. Some interesting chips have been presented, for example the PEZY-SC2, which in November 2016 and 2017 took the top spots in the Green500. There were several performance optimizations which, for lack of time, I am not going to cover, but the key idea is that it combines low-power design, a very simple instruction set with no legacy, and advanced power management.
So it is a machine created from scratch for energy efficiency, not derived through many iterations to become more energy efficient. Is there an opportunity to do something here? The message I have is that indeed, it looks like we are at a point where there is an opportunity. There has been a lot of interest lately in open-source instruction set architectures. RISC-V is an open instruction set architecture, and this is the first time this really happens: an instruction set you can use without paying royalties to the company that developed it. It is supported by a foundation that includes many of the largest hardware vendors. It is a very reasonable, streamlined instruction set architecture that distills many years of research, conceived for efficiency and not to support legacy workloads. It is safe and free to use, so you can develop your own hardware based on it. It already has a wide community, with many companies supporting it, and it is rapidly gaining traction in applications like IoT and big data. Some companies are also starting to look into high-performance implementations of chips based on this instruction set architecture.

What we did in my group, on the hardware side, was to work on open-source hardware implementations of this architecture, with the support you need and a Solderpad license, which essentially is a license free for commercial use. These hardware IPs are already used commercially by several companies in the IoT space, and they are starting to be looked at in the high-performance space. The basic idea is to democratize hardware as we have democratized software; Quantum ESPRESSO is a very good example of the latter. These processors already implement quite advanced technologies.
The latest tape-out we have done is in 22FDX, which is more or less equivalent in performance to the technologies used by Pascal-class GPUs. OK, this concludes my presentation. Thank you for your attention.