Thank you. Thanks for giving me the opportunity to give this presentation at the beginning of the conference. My talk will be about, well, I dropped "the MaX view" from the title because, to tell the truth, this is mostly my own view, my own understanding of how architectures are evolving.

There is a funny thing here. Several years ago I was involved in making projections and suggestions for planning the evolution of our infrastructure, and every time I asked for a factor of 10 improvement in storage or compute capacity, the sponsoring institution would answer: do you really need that much? Then the age of buzzwords began, "big data" and so on, and suddenly the same sponsoring institutions came to me saying: we are going to need petabytes of data, we need exabytes, exaflops of computing. And I said: yes, really? So the game changed, and this still feels a little strange to me.

In MaX, one of the goals is to enable the exascale transition. I don't want to spend too much time on this, just to give you the outline. We have a strategy, and in the first implementation period of MaX the strategy was to open up, brainstorm and evaluate different technologies. We chose a pragmatic approach: first of all, build knowledge about exascale. And I'm quite happy that, in the end, we learned a lot, and we now have a better view and a better understanding of how our codes should evolve to be prepared for what comes next, exascale or whatever the next buzzword will be.

In the meanwhile, on the roadmap to exascale, there was an important announcement quite recently: a change in the roadmap. The processor we had been taking into consideration in our thinking has been dropped and cancelled; Intel removed the evolution of that processor from its architecture roadmap. I have to say, I am really convinced that Knights Landing and Knights Hill were good processors. If you followed the previous generations of that experiment, Knights Ferry, Knights Corner, Knights Landing, you could see a quite rapid take-up of the technology towards a really efficient architecture for applications. But apparently the line was dropped not because it was not good, or because the plan was not aligned with the exascale goals, but simply because Intel was not able to find a market for it. HPC alone is not enough: you need some other market. We will discuss this a little more later in the talk.

At the beginning of the history of MaX, and those of you in MaX may remember our discussions, we said: we don't know what an exascale machine will look like, but if we do the math, an exaflop means 10^18 flops, and floating-point units are not going to get any faster, at least on current architectures, than roughly one gigaflop each. So I am not sure what the architecture will look like, but it will integrate on the order of 10^9 FPUs. The challenge was how these FPUs would be integrated into the system. There were two working hypotheses at the beginning of MaX: either a dense system, with 10^5 FPUs in a server, or a light system, with 10^4 FPUs in a server.
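To make the arithmetic behind the two hypotheses concrete, here is a minimal back-of-the-envelope sketch in Python; the only inputs are the numbers stated above (an exaflop, and roughly one gigaflop per FPU), everything else follows from them.

```python
# Back-of-the-envelope arithmetic for the two working hypotheses stated above.
EXAFLOP = 1e18        # target machine rate, flop/s
FLOP_PER_FPU = 1e9    # ~1 gigaflop/s per floating-point unit (assumed not to grow)

fpus_total = EXAFLOP / FLOP_PER_FPU   # -> 1e9 FPUs in the whole machine

for name, fpus_per_server in (("dense", 1e5), ("light", 1e4)):
    servers = fpus_total / fpus_per_server
    print(f"{name} system: {fpus_per_server:.0e} FPUs/server -> {servers:.0e} servers")
# dense system: 1e+05 FPUs/server -> 1e+04 servers
# light system: 1e+04 FPUs/server -> 1e+05 servers
```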
But as I said, since Intel dropped the effort towards the light, quote unquote, server along the KNL line, we are now almost sure, or at least my own view is that I am 85% sure, that the architecture for exascale will be heterogeneous: in the same system we will have different chips delivering different functionality. Even Intel. Four days ago I was looking at HPCwire and there was an article about Al Gara's view of exascale. Al Gara is a well-known architect, now at Intel, who was formerly at IBM designing the Blue Gene line of computers. He was speaking about exactly this kind of node, so they already take it for granted that you will have a heterogeneous system.

A general consideration: exascale is not only about scalability and flop performance. In an exascale machine there will be 10^9 FPUs, but we now know that 10^5 of them, five orders of magnitude out of nine, will sit inside a single node. So, as I have been saying from the beginning, we probably have to do more inside the node than across nodes, because we know how to scale codes out using message passing, but we probably don't know enough about how to scale codes up inside the node (a small sketch of this distinction follows below). We need to pay much more attention than in the past to the evolution of the node, rather than to the evolution of the overall architecture.

From the market point of view, and this is my own analysis, others may have a different opinion or a different view, what is also happening is that the value is moving up the stack. At the bottom we have the hardware infrastructure; then we have these pillars that are verticalizations, that is, specializations of the architecture towards a given application; and on top we have the applications. As you move up along this line you are generating more and more value. And you probably know better than me that a lot of success and a lot of value has been generated by verticalizing architectures towards artificial intelligence, or cloud services, or, in the past, accelerated computing. This means the market is pushing towards specialization of the architecture, because that is where the value is. In fact Intel, which had somehow found itself stuck in the low-value part of the market, is now in effect two companies in one: one that still deals with the traditional business, and another called the Intel Data Center Group, which last year, for the first time in many years, managed to grow its revenue. There was an announcement that the last quarter of 2017 was very good for Intel, and this was because the Data Center Group was able to move out of the infrastructure layer and start generating value higher up. I think this will have an important impact on the architectures we are going to integrate. And in fact, again from the Al Gara interview, you find the same view: the same exascale architecture will probably have to cover different vertical needs, HPC, AI, data analytics and, hopefully, some form of configurability, though I am not so sure about that last part.
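Coming back to the scale-out versus scale-up remark above, here is a minimal sketch of the two levels, assuming mpi4py and NumPy are available; the problem and the names are purely illustrative, not taken from any MaX code.

```python
# Illustrative sketch: message passing across nodes ("scale out") combined with
# threaded, data-parallel work inside each node ("scale up").
from mpi4py import MPI      # across nodes: one rank per node (or per socket)
import numpy as np          # inside a node: threaded BLAS/FFT behind NumPy

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Scale out: each rank owns a slice of a fictitious global problem.
n_global = 1 << 20
n_local = n_global // size
local = np.random.rand(n_local, 64)

# Scale up: the per-node work runs on all the cores of the node through the
# threaded library that NumPy links against.
partial = (local @ local.mean(axis=0)).sum()

# Scale out again: combine per-node partial results across the whole machine.
total = comm.allreduce(partial, op=MPI.SUM)
if rank == 0:
    print("global result:", total)
```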
Ideally it would be possible to easily configure one system to deliver the different verticalizations. What I am pretty sure about is that if I were to build a pre-exascale supercomputer right now, I would probably need to integrate GPUs that already contain specialized cores, tensor cores, which are there regardless of whether you planned to have them or not. So tomorrow there will be the chance, and the challenge, to exploit this kind of vertical, specialized hardware.

I can go quickly here, since we have already talked about the size of the next exascale system. My guess, again taking exascale simply as whatever comes next, is that there is a trend away from general-purpose architectures and towards some specialization in the hardware, because the market recognizes it and extracts value from it. And since there is value, the designers of the architecture will try to maximize their return, so there is, and there will be, a strong push towards specialized cores, specialized memory modules such as high-bandwidth memory, and specialized non-volatile memory to be integrated into future storage servers. The big challenge for us is to leverage this new hardware, which will probably be designed not for HPC but for generating value in other domains, or for a single application or a few applications. For HPC, it is up to us to cope with it and to maximize our own return of value.

In order to do that, and there will be a talk about performance modelling later in this conference, we need to get better at modelling performance and at understanding how to refactor our codes to fit architectures with specialized hardware. Just to make an example, look at the best-known heterogeneous system: the combination of a host processor and an accelerator, say a GPU. We started from codes that ran as a sequence of parts. There are latency parts, where what counts for performance is the speed at which operations are executed on a single thread. Then there are parts that are essentially throughput-bound, where you can break the computation into many smaller pieces and exploit many execution units at once. Then there is communication, and so on. So the code is a sequence of these components. What we have learned, and are still learning, to do is to refactor the applications so as to break them up and identify latency code and throughput code, and then to map that latency code and throughput code onto the architecture, overlapping the two so that both parts of the machine are kept busy and fully exploited, without paying latency penalties. When you look at the trace of one of our codes, you can identify these parts fairly well; you can see quite clearly what the code is doing. So it becomes mostly a problem of how to restructure the code so that each component can be placed where it executes best.
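As a toy illustration of this overlapping of latency and throughput parts, here is a hypothetical Python sketch: a throughput-bound kernel is handed to a worker thread that stands in for the accelerator, the host runs a latency-bound part in the meantime, and the two synchronize only at the end of each step. The functions and the work are invented for the example; they are not the actual refactored code.

```python
# Toy sketch of overlapping a latency-bound host part with a throughput-bound
# "accelerator" part (here just a worker thread), synchronizing at step end.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def latency_part(step):
    # serial, single-thread work: bookkeeping, small setups, control flow
    return sum(i * i for i in range(50_000)) + step

def throughput_part(data):
    # data-parallel work that would be offloaded to a GPU / many-core device
    return float(np.abs(np.fft.fftn(data)).sum())

data = np.random.rand(64, 64, 64)
with ThreadPoolExecutor(max_workers=1) as accelerator:
    for step in range(4):
        pending = accelerator.submit(throughput_part, data)  # "offload" the kernel
        host_value = latency_part(step)                      # host keeps working
        print(step, host_value, pending.result())            # synchronize at the end
```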
There is another problem: one size does not fit all. As I said, we have 10^9 FPUs to leverage, and the best algorithm for one FPU is probably different from the best algorithm for 10^9 FPUs. We discovered this simply by working on the codes: to make an example, the best FFT algorithm for scalability is not the best algorithm to execute on a single FPU. So in the code there will be many modules, and this is what we will see next: many components that do similar things, or even the same things, but adapted to or written for specialized hardware. The challenge then is how to choose, possibly at run time, or at compile time, the best algorithm for the given architecture, in order to deliver the maximum value to the user and not get stuck in terms of performance.

We still don't know exactly what the target architecture will be, but we are starting to learn. As I said, there will probably be GPUs, many-core processors, FPGAs, other specialized cores; in short, heterogeneity. In the past we liked homogeneous architectures, but now we know for sure that heterogeneity is here to stay, and we have to care about it. Heterogeneity is here to stay, so we need to separate concerns, as I said before. Our idea is to break applications up into components at different levels. There is an ongoing discussion, but I think we should make it possible to break applications up and reassemble them, probably to some extent also at run time, so that they can cope with and adapt to the hardware and the system onto which the code is ported or executed. There is a full stack of possibilities, from kernel libraries at a very low level up to very high-level components. We are working to better define these components and modules, and to make it possible for them to be adapted to the underlying hardware much faster, probably with the help of specialists of that hardware, rather than keeping monolithic codes.

Just to say that during the first implementation period of MaX, as far as Quantum ESPRESSO is concerned, I was particularly involved in breaking up its core modules. We now have at least three, and soon four or five, core libraries. So the code is becoming a collection of libraries, and those libraries can be interchanged with other code. This gives whoever wants to rewrite or adapt the code to a new architecture the opportunity to concentrate on a specific library, and to do their best work on a limited set of lines of code, instead of facing the challenge of a monolithic application with half a million lines.
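To give a flavour of what choosing the best algorithm at run time for the given architecture could look like, here is a hypothetical sketch of a registry of interchangeable FFT variants behind a common interface; the variant names and the CuPy fallback are assumptions made for the example, not the actual MaX or Quantum ESPRESSO machinery.

```python
# Hypothetical sketch: several interchangeable variants of the same kernel,
# with the best available one picked at run time for the hardware that is found.
import importlib.util
import numpy as np

def fft_cpu(x):
    return np.fft.fftn(x)                     # portable baseline for a plain CPU

def fft_gpu(x):
    import cupy as cp                         # GPU path, used only if CuPy exists
    return cp.asnumpy(cp.fft.fftn(cp.asarray(x)))

# Registry of variants: (name, implementation, availability test).
VARIANTS = [
    ("gpu", fft_gpu, lambda: importlib.util.find_spec("cupy") is not None),
    ("cpu", fft_cpu, lambda: True),           # always-available fallback
]

def pick_fft():
    for name, fn, available in VARIANTS:
        if available():
            return name, fn
    raise RuntimeError("no FFT variant available")

name, fft = pick_fft()
result = fft(np.random.rand(32, 32, 32))
print(f"selected the '{name}' variant, |FFT| sum = {np.abs(result).sum():.3e}")
```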
Just to conclude: planning for exascale. Already now we have applications that can exploit a fairly large part of our current Tier-0 systems in Europe: we can scale out and exploit a partition of a system on the order of several petaflops. So what is the path we are imagining towards exascale? There will probably be some breakthrough along the way, but from a high-level point of view we would try to stay inside a continuous kind of model. Or rather, with separation of concerns, the idea is to leave the breakthroughs at the level of the components, and at the level of the application and of the user to offer a sort of continuous path towards exascale. Today we are able to run on several-petaflop partitions. The idea is that on the pre-exascale systems, where we are imagining a factor of three to five improvement in socket performance with respect to today, the same user will be able to exploit a partition of the system on the order of tens of petaflops. And at exascale, when that architecture is there, beyond the hero runs that demonstrate the full scalability or the full exploitation of the system, the same user that today runs on a significant partition of the order of several petaflops will, every day, need to run on partitions on the order of a hundred petaflops. So our challenge is to give users the perspective of exploiting the same fraction of the infrastructure, with separation of concerns keeping the breakthroughs managed at the level of the components, so that users do not have to cope with them themselves, as they probably had to in the past.
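As a closing back-of-the-envelope check of the same-fraction argument, here is a small sketch; the machine sizes and the 10% everyday user share are assumed orders of magnitude chosen only to reproduce the several / tens / around a hundred petaflops figures mentioned above, not measured data.

```python
# Back-of-the-envelope: keeping a user on the same ~fraction of the machine
# across generations (all sizes are assumed orders of magnitude, not data).
USER_FRACTION = 0.10                      # assumed everyday share of the system

machines_pflops = {
    "today (Tier-0)":      50,            # assumed order of magnitude
    "pre-exascale (3-5x)": 250,
    "exascale":            1000,          # 1 exaflop/s
}

for name, total in machines_pflops.items():
    print(f"{name:>20}: everyday partition ~ {USER_FRACTION * total:,.0f} PFlop/s")
# -> ~5, ~25 and ~100 PFlop/s: several, tens, and on the order of a hundred petaflops
```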