...predict the future, but you have to be able to look back on a rich past, and so I want to introduce the first speaker with just one anecdote. When I started my PhD in 1997, the speaker was the winner of the Gordon Bell Prize, which is a prestigious award in HPC, and the application he was presenting got the prize in part for its performance, which at that time was measured in flops per dollar. Today we measure it perhaps in flops per watt; the best machine today is 17 gigaflops per watt, and that's an interesting number, because at that time the application of Professor Sterling ran at 18 gigaflops per million dollars. That just illustrates how far this field has come: in these 20 years we have made essentially a million-fold improvement in capability. But we know that we are at the end of this evolution, or at least it is going more slowly, and that will be the topic of the talks: stagnation at the nanoscale barrier, what we can do to improve, and how we might reboot HPC to make progress. So, thank you.

Thank you very much for that kind introduction. I think we have all the electronics working here, the little chip that's implanted in the back of my head. Well, unfortunately the chair just gave my talk, so I'm going to touch on the high points. I'm told that I have very few minutes, so I'm going to minimize the introductory material; nobody here needs an introduction to high performance computing or its important applications in computational physics, of which materials is an important part.

We are, without exaggeration, at a pivotal point in the interdisciplinary domain of high performance computing, and I'd like to take these minutes to frankly disagree with everybody in the field except those who work for me, and to suggest to you that at this turning point we really do have to consider the class of systems, the enabling technologies, and the methodologies going forward as being very different from those which we have benefited from so successfully over the last at least two dozen years; one could argue more than that. In fact, while it may be a factor of a million since 1997, I'd have to do that in my head (you were talking about my Gordon Bell Prize, right? Yeah, okay, I thought so. You have to forgive me, I've also chaired the prize, so it all gets to be a blur), but the number is really much more staggering: in one lifetime, and I'm referring to myself, one of my favorite topics, in one lifetime the field of high performance computing has conservatively, conservatively, delivered a performance gain of a factor of over 10 trillion. And I say conservatively only because we don't actually have benchmarks for the original machines, and frankly, the original machines of the first and second generation we tended to measure not in flops but in ops.

Now there is a standard trend; the biology people refer to it as punctuated equilibrium, and this is an important concept, albeit a metaphorical one, for our own field. There are wide eras of evolution in which the changes to computing are truly incremental: the same techniques, except more of them; the same technologies, but better; and better tools as well. But we continue to do the same things, the same codes, with minor modifications perhaps, continuing to work on the systems. And then we hit that point of singularity, that pivot point, where the enabling technology changes, there are new opportunities in computer architecture, and unfortunately that requires changes in the underlying science and applications that are performed on those systems.
Here's my, you know, one chart that says it all, and these are a number of the points that I consider, at least looking back at the history, to be the points where the technology shifted. Now there are long tails, and they overlap each other in many cases. When I was a starting graduate student at MIT, and that would be in '77, Seymour Cray had just delivered the Cray-1 computer, approximately a hundred-megaflops machine, slightly more in peak, slightly less as delivered; that's in the middle there. Through multiple generations computing has changed, and the small role that I played was in the area of commodity clusters, hardly a breathtaking alternative, but a key one in terms of performance to cost, and that accelerated the usage of computing and caused changes in the development of the associated technologies as well.

Now, where do I disagree with my community right now? Over the last two dozen years we have had this sweet spot, this almost Pax MPI, in which one system after another could be programmed with the same modalities, the same idioms, as the systems got incrementally larger and larger. But as I will show you, it is now at a stretch point. And while we are here to discuss, among other things, the evolution to the achievement of exascale, exascale is a lovely word: it has no meaning. It could be exaflops, if you do nothing but run Linpack all day; if you're interested in big data and machine learning and so forth, then you're really talking about exabytes of information; and one could go on. But there has been an expectation that going on in terms of technology would be a repetition of the past 20-plus years. And yes, people such as myself get a little bit over-enthusiastic about the possibilities, and get on a stage like this one and rant about how everything has to change. I'm not doing that. I'm not doing that because a lot of people get angry about that, but if I may just quietly whisper to you: everything is about to change.

Here's where we are: the Sunway TaihuLight. I'm not going to discuss the politics of high performance computing, which is world-straddling; I have never seen a period in my career when a focus on a singular point of performance has been so internationally embraced, by China, Japan, the US, Europe, and I know I'm missing one, my apologies. Here, the TaihuLight machine, 10 million cores, has achieved a sustained Rmax performance of 93 petaflops. Petaflops. That's a big number, 93 petaflops, and a peak performance of over 125 petaflops. Clicking over here: if you look at the red, I hope it's red, yes, the red line to your right, all of those dots, and I guess I may be behind by half a year, forgive me, those are the TaihuLight; before that was another Chinese machine, and I'm going to get its name wrong so I won't even try, but it is dramatic that the quest for exaflops is truly international.

Now I'm showing you this picture not so much to look at a single dot but rather to look at the trends. You see these dashed lines: historically the increase, at least as measured by this, going back as far as 1993, so that's about 25 years, has been exponential, and not as a term of hyperbole but literally exponential, somewhere around 1.7x per year, again as measured by HPL, the highly parallel Linpack benchmark. But in the last few years the slope of that line has shrunk, has narrowed, and these dotted lines that you see, for the lowest line, which is the 500th machine, and the highest line, which is the sum of all 500 machines, are clearly showing a roll-off. And this roll-off, if I had time, I could show exists in the technologies, in clock rates, in power consumption, in the performance per individual core, and I could go on.
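To make the scale of that historical trend concrete, a small illustrative sketch follows; the 1.7x-per-year figure is taken from the talk, while the slower 1.3x rate is an arbitrary assumption used only to show what a sustained roll-off in the annual factor would cost over the same period.

```python
# Illustrative only: compound growth of Top500-style performance,
# assuming a constant multiplicative improvement per year.
def compound_gain(factor_per_year: float, years: int) -> float:
    """Total performance gain after `years` of growth at `factor_per_year`."""
    return factor_per_year ** years

historical = compound_gain(1.7, 25)   # ~1.7x/year, roughly the 1993-2018 trend
slowed     = compound_gain(1.3, 25)   # hypothetical rolled-off rate (assumption)

print(f"1.7x/year over 25 years: ~{historical:.2e}x total gain")   # roughly 5.8e+05
print(f"1.3x/year over 25 years: ~{slowed:.2e}x total gain")       # roughly 7.1e+02
```

Nearly six orders of magnitude over 25 years versus fewer than three: that is the difference the roll-off in those dotted lines would imply if it persisted.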
And here you see the extrapolation of the fastest machine moving away from the original slope. It had been assumed that we knew when we would have exaflops: it was going to be in 2019, I think it was October 4th, about 2:30 in the afternoon, Eastern time. That's no longer the case. You know, I work with many teams associated with the US national agencies, including the Department of Energy, the Defense Department, the National Science Foundation and so on, and we all anticipated that, but it became clear that the problem was harder than we thought. And so some brave man, who got fired for this, projected that the government would hit it in 2024; that's when we would do exaflops, and we would do it right, and we would have done that. But our colleagues in China mentioned that, oh, by the way, they would have it in 2020. These days there's not much I can say that's positive about my government, okay, there's almost nothing I can say that's positive about my government, but one thing is that nothing motivates it in science and technology more than a threat from another government. And so instead of doing it right in 2023 or 2024, we find ourselves moving with alacrity, that's the politest way to put it, to do it in 2021 or 2022, still admitting defeat, as if this were a horse race or the Olympics, but trying to expedite it.

Now, before I go on with this, I have to point out two other things. First, the number 500 machine is just below a petaflops. Just below a petaflops. I wrote a book, Enabling Technologies for Petaflops Computing, and at that time we were four orders of magnitude away from petaflops; the times have changed. But the second thing I want to point out is that the slide I'm showing you, with no discredit to Erich Strohmaier of Lawrence Berkeley National Lab, who provided it to me, is a lie, or at least it's implicitly a lie, and this is very important to the discussion in this conference, because the implication is that the orange and the red lines are bounding a blur of 500 machines spread out across that gap. Is that true? This analysis, done by my colleague Dr. Maciej Brodowicz, shows the actual breakdown of the distribution of the architectures.
Now, I call this a three-world model; you can probably figure out why, and if not I'll talk to you later. If you look, it breaks down to a rather horizontal line that I refer to, I think appropriately, as the mainstream; on the other side, an almost purely vertical line, which we call flagship computing; and then we don't have a name for the middle thing, but it provides for greater accuracy. Almost all the machines are on that horizontal line, in the mainstream, and the fastest of those machines is around one petaflops. But what is it we think we're doing when we're talking about exascale? Do we only care about those fewer than ten machines in the leadership class? And by the way, this is slightly out of date; it goes up somewhat further than this. The fact is that the peak of high performance computing is essentially non-existent, except, and I have to say this, for the purposes of credit and stature, spending money, and a few important scientific problems. So in the era of achieving exascale, we as a community of users of high performance computing are barely approaching petascale, and yet we're not only thinking about how to drag our technologies to exascale, we recognize that hitting a point, and that being the last point, is futile. So already we're thinking about how we take that design point, or a future design point, and extend it further, much further, and that's what the last few minutes of my presentation are about.

I oversimplify when I say that people are looking beyond exascale, beyond Moore's law. Moore's law is an extraordinary experience; we use it as a metaphor for anything that's exponential, or anything we wish were exponential. The technology feature size has now approached the nanoscale: five nanometers, seven nanometers, ten nanometers in industrial laboratories, and somewhere between 12, 14 and 16 in commercial products. So assuming, and I know people in the materials community will disagree, that we can't get smaller than atomic granularity, and now you're all going to tell me why we can, we can store a bit on a hydrogen atom, I'm sure, we are looking at possibilities, and there really are some. Probably the most exciting is a true paradigm shift, in every sense of the term, and that is quantum computing. I'm not going to talk about quantum computing today; you could easily spend two lectures on that, and I think there are several very interesting talks about quantum computing here. But I will leave you with this: I believe in it, I just don't believe in it now. Yes, if you're Amazon or Facebook or Microsoft or Google, you can afford to build a machine that's not a quantum computer but call it that, really a quantum annealer, and when you put in all the overheads of connecting it up and making it work, it still doesn't actually deliver faster, in the sense of time to solution, than a conventional machine. But it will. This is good research, and sometime, I would guess in a dozen years, though I have no reason to believe my own prediction, we will have quantum computers, and for some applications they will provide polynomial scaling, and for some applications they will provide exponential scaling. In fact, there will be those true high-stature problems in which you will be able to execute in a year an application that on a conventional computer could not be performed in the lifetime of the universe, even if you had started 13.82 billion years ago.
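To illustrate why an exponential speedup, where it applies, changes the picture so completely, here is a toy calculation; the cost models, the problem size n, and the machine rate below are illustrative assumptions rather than figures from the talk, apart from the 13.82-billion-year age of the universe.

```python
# Toy contrast of an exponentially scaling classical cost with a
# polynomially scaling quantum cost (both cost models are assumptions).
UNIVERSE_AGE_YEARS = 13.82e9
SECONDS_PER_YEAR = 3.156e7
OPS_PER_SECOND = 1e18          # assume an exaflops-class conventional machine

n = 150                        # hypothetical problem size
classical_ops = 2.0 ** n       # assumed classical cost: exponential in n
quantum_steps = float(n) ** 3  # assumed quantum cost: polynomial in n

classical_years = classical_ops / OPS_PER_SECOND / SECONDS_PER_YEAR
print(f"classical: ~{classical_years:.1e} years, "
      f"or ~{classical_years / UNIVERSE_AGE_YEARS:.1e} lifetimes of the universe")
print(f"quantum:   ~{quantum_steps:.1e} elementary steps by comparison")
```

Even wildly generous assumptions about the conventional machine do not rescue the exponential case; that is the sense in which a year-long quantum run could replace a computation that would never finish on classical hardware.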
There is a fascinating, I have no idea what that was, there is a fascinating, almost obsessive, attraction to the metaphor of the brain. The brain, which in case you haven't measured recently, you all have one: your cranial cavity is about 1450 cubic centimeters, you have about 89 billion, that's with a B, neurons, and you run it at about 20 watts. That excites people. Your whole body is like what we used to call a hundred-watt light bulb; now with LEDs, of course, we're told that a hundred-watt light bulb burns six watts, so the analogy doesn't work very well. But that capability, doing things we don't know how to do today, is very exciting. It's very exciting, and it's probably wrong. The idea of building something reminiscent of the neuron, and managing to have a degree, that is to say an output, of something like 10,000 different connections, and running it, you know, not in saturation but somewhere in the analog domain: we will probably be able to do that, and for certain interesting associative problems we will probably be able to use it. But it's unlikely that something like that is going to get up early this morning, quickly put together some slides, and then get in front of an audience of a hundred people and describe what it's thinking about. That is not likely to happen.

I myself have worked with superconducting supercomputing. You know, you really hunger; you spend 30 years in a career and you say, did you do anything? Okay, so in San Jose, I don't remember the year, I think it was also '97, at a small research booth for the California Institute of Technology, Caltech, I had a number of displays, and one of them was a large cold tube of liquid helium and some logic gates. That one display ran the fastest clock rate that had ever, up until that time, been run at a Supercomputing conference, and that has ever since been run at a Supercomputing conference, and that's over 10 years now: 247 gigahertz. Now, unfortunately, I had trouble with the fire marshal in San Jose, California. One of his people saw what he thought was smoke coming out of this, and the fire marshal, all dressed up in uniform, came to me and said, you cannot have this in the state of California. He referred to me as "dude." He said, you cannot have this because it's a fire hazard. I said, how is it a fire hazard? He said, well, there's smoke coming out of that, and the first law of being a fire marshal is: where there is smoke, there is fire. I said, well, there are two problems here. First, this is four kelvins, and nothing is going to ignite at four kelvins. And second, this is liquid helium, a noble gas, and it's not going to ignite with anything. They told me to turn it off.

Processor and memory: we have been following the same adages of computer structure since von Neumann first wrote the paper and forgot to put Eckert's and Mauchly's names on it. We refer to it as the von Neumann architecture, not the Eckert-Mauchly architecture, right? And we've been following that. Among its many implications is the separation of the memory, which in those days, or at least in the 1950s, was made out of magnetic cores, from the logic, which was made out of vacuum tubes. Brain-inspired computing has many aspects which are very interesting, and I make light of it only because, in the limit, I don't think that is the way we're going to build intelligent computers. But brain simulation, which is critical both to the medical aspects and to our understanding of the brain, is important, and there may be some algorithmic ideas that we derive from this technology.
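As a rough back-of-the-envelope on the figures just quoted, 89 billion neurons, something like 10,000 connections each, and about 20 watts, here is a minimal sketch; it simply multiplies the talk's own numbers together.

```python
# Back-of-the-envelope using the figures quoted in the talk.
neurons = 89e9          # ~89 billion neurons
fanout = 10_000         # ~10,000 connections (outputs) per neuron, as cited
power_watts = 20.0      # ~20 W for the whole brain

connections = neurons * fanout
print(f"total connections:    ~{connections:.1e}")                 # ~8.9e+14
print(f"connections per watt: ~{connections / power_watts:.1e}")   # ~4.5e+13
```

Roughly 10^15 connections sustained on a couple of tens of watts is the kind of number that drives the excitement, whatever one thinks of the analogy.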
So we work with colleagues in Geneva, at EPFL, on the simulation of neuronal structures. And my chair has just told me I have four minutes left; this is a direct empirical example of wishful thinking.

Now, what is the general law of computing, or of high-performance computing? I am now presenting it to you. It consists in the challenges of starvation, of latency, of overhead, and of contention, and any system that you care to look at over the last 60-plus years of computing will ultimately be sensitive to them: the performance will be a product of, and the problems will be an imposition of, these four activities. There is, however, now in exascale computing a fifth problem, which is asynchrony, and that is the uncertainty in the amount of time it takes to achieve an action, in particular an action at a distance within such a machine. This brings us to the fact that we are at a point in time where we can no longer stretch the von Neumann architecture model in order to achieve the next one to four orders of magnitude in performance gain, if in fact we can achieve it at all. We should not be optimizing for the utilization of floating point units; that was once a very important, perhaps the important, metric of success. FPUs are very small, the rest of the architecture is very large; that is really a stupid thing to be doing. By stupid I mean dumb, you get me, right? That's the wrong thing. Okay. The second issue, of course, is the separation of logic and memory; we even call it the von Neumann bottleneck. That was important when the technologies for the two were very different; now they are the same, modulo some differences in manufacturing processes. We want parallel execution, yet we issue sequential instructions and we have to pretend that our access to memory is sequentially consistent. How does this make sense? Oh, and by the way, you don't even notice it, but you've got registers in there; those registers are completely different from the main memory system, have a completely different namespace, and cause their own critical problems. You don't need that.

So what should you do? Well, I don't have time to draw it out, but if I did, this would be the fun part. I named it, so don't pay too much attention to me, but it is the merger of the three important primitive micro-functions, and those are of course information storage, information transformation, and information transfer, or communication. If we can build a single piece of logic that does all of these at the same time, and put it in a tiny, tiny little space, much smaller than an ARM processor, of which we're going to hear some interesting things later, then we can build something like this. Oh, how excited you are: it's a triangle. Yes, but this triangle has lots of triangles in it, and each triangle has more triangles in it; as they say, it's turtles all the way down. These triangles, these fontons, are not even logically as sophisticated individually as a core, but they are in fact, in the aggregate, much more efficient in the use of die area and energy, and in the issues of starvation, latency, overhead and, yes, contention. We choose this particular form of tessellation not because there aren't others, but because the ratio of the communication to the other functions is the smallest possible, and this proves to be an important engineering optimization.
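As a toy illustration of that turtles-all-the-way-down structure, here is a minimal sketch; the assumption that each triangle subdivides into four smaller triangles per level is an illustrative one, since the talk only says that each triangle contains more triangles.

```python
# Toy sketch of a recursively tiled triangular fabric.
# Assumption: each triangle splits into 4 smaller triangles at every level.
def cells_at_depth(depth: int, split: int = 4) -> int:
    """Number of elementary cells after `depth` levels of subdivision."""
    return split ** depth

for d in (0, 5, 10, 15):
    print(f"depth {d:2d}: {cells_at_depth(d):,} cells")
# Depth 15 already gives ~1.07 billion cells, the order of the
# "millions, maybe a billion" of tiny elements mentioned later in the talk.
```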
If you take these chips and you put them together on a die, and then put that on an MCM, and then you stack the dies even within the MCM, and then you put those on boards, and those boards in cases, and, well, you get the picture, and you build a large structure, you can get exascale. Now, if I did it the way we all do it, using racks, and we've done the numbers, it would take about 900 square feet, around 90 square meters, and that is if you organize it in a cylinder, which turns out to be the wrong way to do it, but it looks nice. If you do a different form of packaging, a dense form of packaging, then you in fact can have exascale within four square meters of footprint. That's four square meters. I know I'm drifting back and forth between meters and feet here; I spent 14 years at NASA, the space organization in the U.S., and there was a bad moment in their history when they got feet and meters mixed up too, and a spacecraft blew up over Mars. That wasn't the success criterion.

Here are the computer I first mentioned, which is the TaihuLight, and the total architecture that we're talking about, the two different design points; the Simultac is the entire aggregate. Now, I just want to point to two lines here, which I have to find, there it is: peak performance, almost 10 exaflops, versus about a hundred petaflops in the TaihuLight, and I mean no disrespect to the Chinese machine. If you go to the bottom line, that hundred-times improvement comes in 25 square meters versus 605, and you get a factor of a hundred benefit. I'd love to talk more about it, but my chair is getting antsy, and so I come to my last slide, and I point out that, yes, I'm honest enough to admit there are several potential challenges to this. One is the notion of memory density: it is not DRAM-dense, although we have an aggregate, external DRAM put on top of our stacks. The clock rates we still don't know; in our analyses we're operating at around 120 megahertz, and we go up to less than 500 megahertz, hold that number in your head. Feasibility: it's good. It is ordinarily absurd for someone in academia to tell you how they're going to build the fastest computer in the world, but we're not; we're going to build the tiniest computer in the world, and we're just going to build millions of them, maybe a billion of them, and package them.

And so that looks like the last slide, but, and I wasn't permitted to put up a slide for this, my last comment: exaflops is not the goal. Exaflops is meant to be merely a continuation point for scientific computing, which goes way, way further than that. We've done the numbers, we have the spreadsheets, we've computed the energies. I know this meeting is about exaflops, but I'd like to close my talk by talking about zettaflops. Zettaflops, using this architecture model, is entirely doable by 2027, and only because we want to take a little bit longer than we need. We will produce exaflops with the technologies that we anticipate within the next three to five years, with the same architecture, and therefore using the same software for this architecture, at a much lower cost, on the order of an order of magnitude lower than if we were to stretch conventional technologies today, and in a tiny, tiny footprint compared to conventional ones. So, yep, I'm done, that's it. If I had more time, which clearly I don't, I would go on and talk to you about the feasibility of yottaflops, but I don't. Thank you all very much.