Thank you for your patience, and I apologize for the delay. What I want to talk about is performance and productivity in general purpose, high performance computing. Just to show that this question about performance is not trivial, I want to start with what is, on one side, one of the most successful, or at least one of the most sold, large scientific computing systems today. Many centers, for example the one we run, but also HLRS in Germany, centers in Sweden, and a number of weather services, are all using this XE system from the company Cray. And the interesting thing is that the development of this very popular system was funded by a DARPA program called HPCS, where the P in the acronym stands for High Productivity Computing. So there is this dichotomy between performance and productivity that I would like to address. The relationship between the two is not trivial, and it is relatively complicated, because the program actually tried to fund two things: new programming languages that make it easy to program computers, and at the same time the development of an architecture that matches these programming environments. It turns out the hardware was very successful, and a lot of computer centers are buying this system. But who here has heard about Chapel and X10, the two programming languages? You have just heard about them, and you are not using them. You do not have to be embarrassed; that is the general level of knowledge and popularity. So the question you can then ask is: do we as scientists care more about performance, since the hardware seems to go everywhere and we seem to be using it, while high productivity we seem not to care about, because the work that has been done in that direction is not getting any traction? These are really the questions I would like to reflect on in the hour and a quarter or so that I have left.

I would like to start by thinking about what exactly we mean by performance. In popular terms, performance is measured in the number of floating point operations a machine can do per second. Or, today, it is also common to talk about energy efficiency, because everything has to be green; there is even a list of the greenest, most performant computers in the world, the Green500 list, where you measure the number of floating point operations per second and per watt, so essentially floating point operations per joule. These are the numbers people track. And probably known to more than two people in this room is the Top500 list, the website that has been tracking the 500 fastest supercomputers since the early 1990s. What you see there is exponential growth in performance: the fastest machine, the sum over the 500 fastest machines, and the slowest machine on the list. An iPhone today sits about here, so it is roughly comparable to a supercomputer of the early 90s.
The most impressive thing about this development is when you put in the numbers. There are actual application codes, the ones that typically win the Gordon Bell prizes, that track the fastest machines. The first time a teraflop was sustained was in the late 90s, and the first time a petaflop was sustained was in 2008. When you work out the factor, you see that sustained performance has been increasing at a rate of about a factor of 1000 every 10 years, and this has been going on for decades. That is something we in computing can be very proud of: such a large increase in performance over such a long period of time.

The same kind of exponential growth is tracked by another site, from ECMWF, the European Centre for Medium-Range Weather Forecasts in Reading. Here again you see an exponential increase, going back even earlier, to the late 70s. The point, which again is good news for weather and climate simulation, is that there is a steady increase. The difference from the previous plot is that this one grows by a factor of 100 every 10 years, which is still very impressive. But these are the same machines that appear on the Top500 list, and here they only deliver a factor of 100. So, being a physicist, you ask: does this mean the efficiency of the codes is decreasing by a factor of 10 every 10 years? That would not be good. Well, the answer is that it all depends on what you measure, as we will find out.

What I would like to do next is look at exactly what we mean by this simple number, calculations per second, and at the metric we are using. The Top500 list uses the HPL metric, the High Performance Linpack benchmark, which is basically a solver for a dense linear system, so dense linear algebra. In dense linear algebra we have a quantity called the arithmetic density, or arithmetic intensity: the ratio between calculations and data movement, that is, the number of floating point operations divided by the number of loads and stores. In the case of dense linear algebra this ratio increases as you make the problem bigger; it grows linearly with the size of the problem. The way to picture this is that if you multiply two matrices, the complexity is n cubed, so n cubed operations, while you only have n squared elements. N cubed divided by n squared tells you that the density of the computation scales linearly with the size of the problem, as the little sketch below makes concrete.
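Just to make the n cubed over n squared argument concrete, here is a minimal sketch, with purely illustrative numbers, of how the arithmetic intensity of a dense matrix multiply grows linearly with the matrix dimension:

```python
# Arithmetic intensity of a dense matrix-matrix multiply C = A @ B.
# flops ~ 2 n^3 (multiply-add pairs), data touched ~ 3 n^2 elements (A, B, C),
# so the ratio grows like 2n/3 -- linearly in the problem size.

def matmul_intensity(n: int) -> float:
    flops = 2 * n**3
    elements = 3 * n**2
    return flops / elements

for n in (100, 1_000, 10_000):
    print(f"n = {n:6d}   flops per element touched = {matmul_intensity(n):8.1f}")
```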
So if you have a problem that is really dominated by floating point operations, it is reasonable to normalize the time to solution by the number of floating point operations, so time to solution divided by floating point operations. If you want to minimize this normalized number, you have to maximize its inverse, and that is why you want to maximize floating point operations per second, if you are solving dense linear algebra problems. You can make the same argument with energy. So you see that the metric the Top500 list is measuring is actually reasonable and useful. What is wrong, then, with the climate codes, that they seem to become so much less efficient? Well, we have to do two things on a computer, and this is profoundly important even though it sounds simple: one thing is to compute, but we also have to feed the part that computes with data, so we also have to move data.

This is why the ratio between compute and data movement in an algorithm is a key number to look at. The reason it matters is captured in the so-called roofline model, where you plot the attainable performance of a processor against this arithmetic density. You can see that there is a range, when the density is high enough, where any of these processors — Intel Xeon, Xeon Phi, NVIDIA GPUs, all kinds of processors available today — saturate at their peak. But once your arithmetic intensity drops below a threshold, the attainable performance decreases very quickly, because at that point performance is dominated by the rate at which you can move data around, the memory bandwidth. What this picture tells you is that, depending on the algorithm you run, you land in very different places: DGEMM, matrix multiply, dense linear algebra, gives you very high performance, while climate codes are typically grid-based partial differential equation solvers, so stencils, where the arithmetic density is low and the attainable performance is an order of magnitude smaller. So depending on your algorithm, the performance of your machine changes, and that is a key thing to keep in mind when you develop codes and algorithms for these systems. So while floating point operations per second is easy to memorize and to reason about, it is not a good metric. A minimal version of the roofline argument is sketched below.
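To make the roofline picture explicit, here is a toy version; the peak and bandwidth numbers are made up for illustration and do not refer to any particular processor:

```python
# Roofline model in one line: attainable performance is capped either by the
# processor's peak floating point rate or by memory bandwidth times the
# arithmetic intensity of the kernel, whichever is smaller.

PEAK_GFLOPS = 1000.0     # hypothetical peak compute rate, GFLOP/s
BANDWIDTH_GBS = 100.0    # hypothetical memory bandwidth, GB/s

def attainable_gflops(intensity_flops_per_byte: float) -> float:
    return min(PEAK_GFLOPS, BANDWIDTH_GBS * intensity_flops_per_byte)

print(attainable_gflops(0.25))   # stencil-like kernel: bandwidth bound
print(attainable_gflops(50.0))   # GEMM-like kernel: compute bound
```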
So what are the metrics we should really worry about when we do computing, and use when we talk about whether performance is high? I think the canonical metrics are still time and energy, because they show up in every problem. Most of you, as scientists, will worry about time: we want to design an algorithm, or a machine, that solves the computation in such a way that the solution comes back while we still remember the question we asked, and, for graduate students, such that the PhD can be done in three years and not in ten or thirty. So the constraint on time to solution is not that we have to strictly minimize it, but that we have to keep it low enough to get the answer back in a reasonable time. That is what we call an operational constraint, and I will come to an example where this is really illustrated. Energy to solution, on the other hand, is proportional to cost. I know that when you use a computer you may not care about this, but the provost or the president of your university has to pay the power bill somehow, and we do care about it; if you run a large computer center, your computers use megawatts, and a megawatt costs about a million a year, so you are interested in reducing the power bill. So we should always try to minimize energy to solution.

Before I get into condensed matter physics, I want to quickly run through an example where all of this is illustrated very simply. The examples in condensed matter physics are a bit more difficult — I know they are also more interesting to you — but I will try to relate what we learn here about energy, productivity and time to solution to the materials science examples later. The simple example that most people can follow is weather forecasting, because here the constraints are easy to see: you do not want the weather forecast for tomorrow a week from now, so you have very stringent constraints.

We run, at our center, the weather forecasting system for the Swiss meteorological service, MeteoSwiss. The way it runs is that it takes in data from a global simulation at the European Centre for Medium-Range Weather Forecasts that I mentioned before, uses these as initial and boundary conditions for a simulation over Europe at a resolution of seven kilometers, and then runs a high resolution simulation over the Alpine region, because without high resolution you do not predict the weather over the Alps very well. This is then used for the daily TV weather forecast, for air traffic control and so on. The machine we use for this is a Cray XE6 procured in 2012, a three-cabinet system: one cabinet is used for production, the other two for research, and if the first one fails, the other two immediately take over the operational computation.

Now, in terms of development, MeteoSwiss would like to go from a single two-kilometer simulation to one kilometer, so they want to double the resolution. If they double the resolution, they have to increase the compute power by about a factor of 10. In addition, they want to run an ensemble, because the atmosphere is a chaotic nonlinear system, and studying a single trajectory of a chaotic system is not meaningful; you need an ensemble of trajectories to make probabilistic predictions. There again, the number of ensemble members is a factor that multiplies the required performance. And they have to improve their data assimilation. Overall, they want to achieve a factor of 40 increase in performance relative to the machine we built in 2012, and this within three years. In three years there is a factor of two, maybe two to three, that you can gain from improved processors; the rest you have to gain by making the machine larger, which is possible because this is fortunately a highly parallel problem.

So this is the size of the machine today — this is our machine room in Lugano, and this is the footprint of the weather forecasting system. Taking into account the factor of two from faster processors, we would need a certain size of machine; using the Cray XE packaging density again, this is the area of 40 Cray XE cabinets, and I would need of the order of 30 of these to solve the problem. That is the size the machine would have to grow to if we simply moved the code over to a bigger machine with faster processors to run the bigger problem. Obviously the footprint is roughly proportional to the procurement cost and also to the energy usage, because per unit area these systems draw something like 90 kilowatts of power. So this would mean a very strong increase in cost, and we have to take a different approach.

In the meantime, a research group at ETH Zurich has been investing in taking the code, rewriting it, and moving it to GPUs. What I am showing here is simply the result of this atmospheric simulation running at different resolutions; it shows why it is important to run at these high resolutions in order to predict precipitation.
You can see differences between the resolutions, but we are not talking about climate science here; the main point is that we have developed software that can run on a totally different architecture. This will be the recurring theme of my lecture today, also on the condensed matter physics side: investing in algorithms and software development is probably one of the most important things we can do in order to move forward. So we had this climate code running on a supercomputer that is loaded with GPUs, with 5000 hybrid nodes combining CPUs and GPUs — we will come back to this machine later with the materials applications. We took the code that the meteorological service was using, and that is what we could build on.

I want to make some brief comments about this code. We started from an existing, rather large Fortran code, of the order of 200,000 to half a million lines, fairly monolithic, but very professionally written in terms of performance. These codes typically have two parts. There is a dynamics part that solves the Euler equations to move the air around, together with an equation of state to decide whether you have clouds, ice, snow, rain or sunshine. And then there is a part that handles everything you have to parameterize — phenomena that you cannot resolve and cannot describe from first principles — which is usually called the physics: radiation, for example, and so on. We rewrote the part of the code that the climate scientists would not normally touch, and introduced a very structured design that separates the concerns into different libraries. In particular there is a stencil library, where we managed to separate the interface that the scientists use to implement the solvers from the back ends that deal with the architectural details, whether you run on an x86 processor or on a GPU; there is also a back end for Xeon Phi processors. It is this separation that makes it manageable to run the same code base on different architectures.

It turns out that with this investment in software, rather than building a machine that is many times larger and costs much more money, the machine we actually just built and recently announced is about the same size as the last one, and it is delivering the product for MeteoSwiss now. Here I am summarizing where this factor of 40 in performance increase really came from, because what I said before was the planning speculation; now we have the two machines and can make real measurements and comparisons. This compares running the big one-kilometer problem on the new system relative to the one we had in 2012. From the increase in processor performance we get a factor of 2.8, so a bit more than a factor of two. From improved system utilization, which has more to do with the scheduling of the jobs, we get another factor of 2.8; together that is already roughly a factor of 8. Then, from rewriting the code alone, without even going to any new architecture, we get a factor of 1.7 — and this was already well-performing code, yet rewriting it still gives you almost a factor of two.
This, by the way, is something we observe in many places — probably even in the codes written in my group you would see the same: once you have written something, it usually helps to rearrange things, and you get more performance out. The important thing is that the rewritten code then allows us to move to the new architecture seamlessly, and that gives us another factor of 2.3. Before I put everything together, one thing I want to look at quickly is what we get from Moore's law, from processors simply continuing to improve. I noticed that earlier this week you had Matthias Troyer here arguing that Moore's law is coming to an end, and I totally agree with that point; I will come back to it. So this factor of about 2.8 is what we get right now from just waiting and doing nothing, and it will probably decrease as we get towards the end of the decade. Software refactoring gives us almost a factor of 4: it lets us run faster and lets us move to new architectures. And this will be the alternative — quantum computing is still maybe a bit further out, and Moore's law will end before we use really big quantum computers to solve many, many problems — so in the meantime, and this is the main point I want to make with this lecture, the ability to move to different architectures is what will let us continue to improve performance. Going back to the MeteoSwiss problem: if you do all these multiplications, you see that you only need a 30% increase in the number of processors, and that gives you the overall factor of 40 in performance. And then there is a nice bonus: because we moved to GPUs, the architecture is much more energy efficient, so instead of a 30% increase in energy consumption we get a factor of 3 reduction. That is what makes people very happy in the end. A quick back-of-the-envelope check of how these factors combine is sketched below.
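Just as a sanity check, here is the arithmetic with the factors as I quoted them from the slide (the numbers are the ones from the talk, the labels are mine):

```python
# Combining the reported improvement factors for the MeteoSwiss system.
processor   = 2.8   # newer processors (roughly Moore's law over the period)
utilization = 2.8   # better system utilization / job scheduling
refactoring = 1.7   # rewriting the code on the same architecture
gpu_port    = 2.3   # running the rewritten code on GPUs
more_nodes  = 1.3   # about 30% more processors

total = processor * utilization * refactoring * gpu_port * more_nodes
print(round(total))   # ~ 40
```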
So that was the extended introduction about what we mean by performance. Performance is not really a single simple number; it is about time to solution and energy to solution, and in time to solution there is of course a factor of productivity that comes in. What I would like to do now is talk about two examples from condensed matter physics, one more in the area of many-body theory and the other more in the area of electronic structure theory. I want to show that the same simple principles apply: we have to think hard about what we are actually trying to optimize, and we have to invest mostly in algorithms and software refactoring; those are the crucial ingredients. And I will argue at the end that we basically have no choice but to bite the bullet and invest in this whole endeavor. You could say you are happy with the performance of your codes today, but the problem is that a given architecture is not going to improve that much anymore in the future; we have to be able to move our problems to different architectures, as we saw with the climate example, if we want to continue to solve bigger problems.

So — for some reason the build-up of this slide is a bit strange, but never mind — the first problem I will talk about is models of high temperature superconductivity. I assume you have heard a lot about these kinds of models and methods. The challenge is that we are dealing with a macroscopic effect described by a microscopic model, so we have this disparity of scales, and we have a problem whose complexity scales exponentially. Getting to macroscopic scales when your problem scales exponentially is not really good news. You have heard earlier this week, and probably in this workshop, that there is a way to deal with this challenge, called dynamical mean field theory; in the case of superconductivity we use a cluster version of it called the dynamical cluster approximation. The idea is that you solve the many-body problem on a cluster, embedded in an effective medium, and the cluster should be large enough to cover the important correlation effects you need to treat to study the physics, while the medium takes care of the thermodynamic limit, the long length scales. That is the basic idea of the theory I will be using in this first part. I just want to mention that this is work done in collaboration with a former graduate student — I even wrote it on the slide — who is now at IBM and no longer in Zurich, and with Thomas Maier, a collaborator at Oak Ridge National Laboratory.

The problem we are trying to solve is the Hubbard model, which I am sure you have also heard of. When you have questions, please interrupt me; I am making some assumptions, and the main message of this talk is not in the details of what we are solving, so even if you do not follow every bit of the equations you will still get something out of it, I hope. But I still want to set the stage. We are trying to solve the Hubbard model, and we are trying to calculate the self-energy. If you have the self-energy, you can compute the Green's functions and wave functions you want, so you have solved the problem. Computing it directly is of course very difficult, and the basic idea of the DCA algorithm is to coarse-grain the self-energy: you expand the function in terms of step functions. You take the Brillouin zone, cut it up into patches, expand the self-energy in step functions on these patches, and then try to determine this finite set of expansion parameters. In reciprocal space, each of these ways of splitting the Brillouin zone corresponds to a cluster in real space; that is why the method is called the dynamical cluster approximation. And just like DMFT, it is an iterative method: you start from a lattice self-energy, the thing you are trying to determine, map it onto a cluster by expanding in these patch functions, solve the cluster problem in real space with a quantum Monte Carlo algorithm, and then map the result back into the lattice Green's function. You iterate this to self-consistency, and most of the computation is spent in the quantum Monte Carlo cluster solver; I will say a few things about that. Schematically, the iteration looks like the sketch below.
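Here is a schematic of the self-consistency loop, structure only: the three step functions are trivial numerical stand-ins I made up so that the loop actually runs, and in the real algorithm the third step is of course the quantum Monte Carlo cluster solver, not a one-line function.

```python
# Schematic DCA-style self-consistency loop with toy stand-in steps.

def coarse_grain(sigma):                      # stand-in: patch-average lattice G
    return 0.5 * sigma + 1.0

def cluster_excluded_propagator(g, sigma):    # stand-in: build G0 of the cluster
    return g - 0.3 * sigma

def cluster_solver(g0):                       # stand-in: QMC cluster solver
    return 0.8 * g0 + 0.1

def dca_loop(sigma=0.0, tol=1e-10, max_iter=100):
    for it in range(max_iter):
        g = coarse_grain(sigma)                         # 1. coarse-grain
        g0 = cluster_excluded_propagator(g, sigma)      # 2. cluster problem
        sigma_new = cluster_solver(g0)                  # 3. solve cluster
        if abs(sigma_new - sigma) < tol:                # 4. convergence check
            return sigma_new, it
        sigma = sigma_new
    return sigma, max_iter

print(dca_loop())   # converges to the fixed point of this toy iteration
```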
The main message about this method — which was developed by Mark Jarrell, and Thomas Maier, my collaborator, had a lot to do with its application to superconductivity — is that it can qualitatively describe the phase diagram of the cuprates as we believe we understand it. You get a superconducting phase, you can get the pseudogap phase, antiferromagnetism at low doping, and so on; everything looks reasonable from that point of view. There is just one problem with the method: there is a strong dependence on the shape of the cluster, that is, on the choice we make for the shape of these coarse-graining patches in the Brillouin zone. For example, in this 2005 paper you can see the T_c for the superconducting transition computed at different cluster sizes. It is somehow converging, but it fluctuates, and for a given cluster size you can get two very different results for T_c. And when you look at the coarse-grained self-energy, it looks very different between two different choices of patches. People have been trying very hard to get rid of this cluster-shape dependence, and this is where the contributions from Peter come in: he extended the algorithm to make the self-energy continuous. The way he does this is not just a simple interpolation but a two-step process. On one side he interpolates, and he makes the function he is trying to smooth simpler by subtracting a part whose shape is roughly known. Then, in a second step, he deconvolutes: he solves the inverse problem of extracting the lattice self-energy from the coarse-grained one. The result is that the self-energy indeed becomes smooth. For this lecture I will not go through the details; there is a reference here where everything is described. A toy illustration of the kind of inverse problem involved in that second step is sketched below.
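This is only meant to convey the flavor of the deconvolution step, not the actual DCA+ algorithm: recovering a smooth function from its patch averages is an ill-posed inverse problem, which one can regularize, for instance, by penalizing rough solutions.

```python
# Toy inverse problem: recover a smooth curve from coarse patch averages
# using Tikhonov-regularized least squares. Purely illustrative.
import numpy as np

n_fine, n_patches = 200, 10
k = np.linspace(0.0, np.pi, n_fine)
sigma_true = np.sin(k) + 0.3 * np.cos(3 * k)      # stand-in "smooth self-energy"

# Averaging operator: each row averages one patch of the fine grid.
A = np.zeros((n_patches, n_fine))
width = n_fine // n_patches
for p in range(n_patches):
    A[p, p * width:(p + 1) * width] = 1.0 / width

patch_averages = A @ sigma_true                   # what the coarse-grained data gives

# Regularized least squares favouring smooth solutions.
D = np.diff(np.eye(n_fine), axis=0)               # first-difference operator
lam = 1e-2
sigma_recovered = np.linalg.solve(A.T @ A + lam * D.T @ D, A.T @ patch_averages)
```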
The main point I want to take from all of this is the performance implications of this algorithmic improvement. The physics is improved in the sense that, as you can see here, with the original DCA we have this strong dependence of the self-energy and the T_c results on the shape of the coarse-graining patches, while with the new DCA+ algorithm we have a smooth self-energy that no longer depends much on the shape of the cluster. One of the performance implications is that the sign problem in the Monte Carlo part of the simulation seems to improve: the average sign does not drop as quickly as the cluster size increases. The reason we cannot go to large clusters in these simulations is that in the doped Hubbard model we have the fermion sign problem, which at some point turns the N-cubed scaling of the algorithm into an exponential scaling and stops us. And it turns out, as we try to show here, that when you compare how the sign develops with cluster size for DCA — the solid curves — against the DCA+ results, the drop is shifted significantly to lower temperatures. That allows us to run larger clusters all the way through the critical temperature. Where is this improvement coming from? Since we do not rigorously understand the sign problem, we can only speculate. It is believed that by making the self-energy smooth we remove correlations that are artificially introduced, and that this is what improves the situation. The key point is that it allows us to go to much lower temperatures and much larger clusters, and this will show in the results I present at the end.

Then there is a second part of the improvement, in the cluster solver, where we use a continuous-time auxiliary-field quantum Monte Carlo algorithm. Numerically, its core ends up doing something very similar to what we used to do in the Hirsch-Fye algorithm: the heart of the algorithm boils down to vector outer products, where two vectors are multiplied in an outer product to update a matrix. One of the innovations one can make — the reference is given here — is a blocking idea, where you group several of these vector outer products together so that the outer products turn into matrix-matrix multiplies. What this buys you is an increase in the arithmetic density of the computation: a single outer product has very low density, the blocked version has a much higher one. In the roofline model I showed at the beginning of the lecture, you move from the low-performance region of the processor to the high-performance region, and this shows up in the results as roughly an order of magnitude improvement in time to solution. The idea is sketched below.
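Here is a small sketch of the blocking idea in isolation. In the real algorithm the updates are not independent — each Monte Carlo step depends on the previous one, so the bookkeeping of delayed updates is more subtle — but the linear algebra point is simply that a batch of rank-1 updates can be applied as one rank-k matrix multiply:

```python
# k rank-1 updates applied one by one (BLAS level 2, bandwidth bound) versus
# one rank-k update applied as a GEMM (BLAS level 3, compute bound).
import numpy as np

n, k = 1000, 32
G = np.random.rand(n, n)
us = [np.random.rand(n) for _ in range(k)]
vs = [np.random.rand(n) for _ in range(k)]

# Naive: k separate outer-product updates.
G_naive = G.copy()
for u, v in zip(us, vs):
    G_naive += np.outer(u, v)

# Blocked: collect the vectors and do a single matrix-matrix multiply.
U = np.column_stack(us)          # n x k
V = np.column_stack(vs)          # n x k
G_blocked = G + U @ V.T

print(np.allclose(G_naive, G_blocked))   # True: same result, far less data traffic
```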
Another algorithmic improvement in this continuous-time auxiliary-field Monte Carlo is the random sampling in imaginary time: you move away from a fixed time grid. This gives a big improvement in the accuracy of the calculation, but because you now sample at arbitrary times, the Fourier transforms back and forth between frequency and time become non-equidistant. The naive way this was done was to map everything onto a very fine grid and do a very large FFT; it turns out you can use non-equidistant FFTs and again significantly improve the time to solution.

So up to now I have shown an improvement in the method, and then various algorithmic improvements in how the problem is mapped onto processors so that the architecture is used properly — minimizing data movement, reducing the size of the problem that has to be solved. The last step is again an implementation step: writing the code with a parallel strategy that can use different architectures. Say we have a computer with two multi-core CPUs per node. Each line here shows, in time, the computation of an individual core. The Monte Carlo calculation can very easily be split over different nodes and over different cores — independent Markov chains — and then you have the green part, the Monte Carlo sampling, and the red part, the measurements, with every core running independently. That is how you get very effective use of the system when you have two CPUs per node. Now, on some of these modern high performance machines you have a CPU and a GPU on a node, and there we split the problem differently: we take the costly green part, the Monte Carlo solver, put it on the GPU, and schedule it so that it just keeps producing Markov chain configurations and sends the results over to the CPU, which asynchronously does the measurements. With this we get good utilization of a hybrid system: one part is optimized to solve dense linear algebra problems, and the rest is strong enough to handle the measurements. This again leads to a significant improvement in time to solution — that is what I am showing here — and, as we will see later, in energy to solution as well.

The final step is to map the whole problem onto a massively parallel system. Before, I was talking about a single node with either two CPUs or a CPU and a GPU; now we have to put the whole thing onto a large machine — in this case Titan at Oak Ridge National Laboratory, which has around 18,000 nodes. What we show here is the scaling up to 18,600 nodes and the parallel efficiency. Depending on the size of the problem, we have to do larger and larger numbers of measurements, because we are still fighting the sign problem even though it is much better, and with large numbers of measurements the parallel efficiency stays relatively high, above 90%, all the way to 18,000 nodes. Finally, putting everything together, we get a certain time to solution on the CPU system and on the GPU system, and that is what you see here: the time running on a CPU-only system versus the hybrid system. That is just to show that it pays to make this investment and deal with the architecture in the proper way. In this case time and energy are roughly proportional, so the savings in time to solution are also savings in energy to solution.

Now I need to accelerate a bit, because I started a bit late. With a much more efficient algorithm — for all the reasons I discussed: the improved sign problem, the algorithmic implementation that reduces data movement and increases arithmetic density, and the use of more efficient architectures and very large computers — we are in a position to actually run the problem and solve this Hubbard model. Again, this is documented in a paper, and you will have the slides if you are interested. I will go quickly: we can test the solver on the attractive-U Hubbard model, where there is no sign problem, so we can go to very large clusters, and where we can compare with determinantal quantum Monte Carlo results published by the group of Richard Scalettar. So for the first time we can make a rigorous comparison between the solver that is designed for the repulsive model, applied here to the attractive-U model, and a method we know is asymptotically exact. As you can see, the results agree. Then we can apply the machinery to the high-T_c problem, the 2D Hubbard model, in this case for U/t equal to four. This shows the difference with what we had previously with DCA — these are the red dots; let me put everything back on the slide. The red dots are what we can do with DCA, the old code.
After all the improvements I have been talking about, we can push the problem to very large clusters, to the point where you can see T_c increasing as the cluster becomes larger and larger, and at some point the cluster is large enough that we see the asymptotic behavior of T_c that we expect from the Kosterlitz-Thouless scaling of the problem. What you see here is something like a length scale beyond which the proper asymptotic behavior sets in. We can now do this very accurately in the simulation, see the Kosterlitz-Thouless scaling, and determine a T_c. We can even push to larger U, where the sign problem becomes more problematic again. Again, there is the difference between the old implementation — the red curve, the old DCA algorithm — and the DCA+ algorithm: with the old one you do not get anywhere, but with all these improvements you get to the point where, even at the larger U/t equal to seven, you actually get T_c to converge. You see the same phenomenon again: there is a length scale beyond which the cluster is large enough and you see the proper scaling of the problem.

In summary, when I reflect on all the changes that have happened, we had improvements in the method and we had improvements in the mapping to hardware. Initially, the first results I showed were DCA with Hirsch-Fye quantum Monte Carlo. Then came delayed updates, which improved the arithmetic density, so a better mapping to scalar processors. Then submatrix updates, which again improved the effective complexity. Then Emanuel Gull and the group of Matthias Troyer developed the CT-AUX algorithm, a tremendous improvement in the accuracy of the quantum Monte Carlo. Then we developed — this is the paper I cited in the talk — the submatrix-update version of it, and on top of that the mapping to hybrid CPU-GPU computation. Then we improved the method again with DCA+, which uses the same solvers, so everything here just carries over, and finally the implementation at scale. All of this together allows us to solve the problem. So this is really an example where I try to make the same point as with the climate problem — which is a bit simpler to understand, though not easier to implement; those models are much more complicated than the many-body Schrödinger equation — namely that the same type of investment in algorithms and implementation, along with improvements in the architecture, going from CPUs to GPUs, gives you the improvements in performance that we want to see.

In the last part I would like to turn to electronic structure and how it is used in materials design. As materials scientists we often have to make the case for what we do, because we are solving complicated problems that most people do not understand. So this is just to argue that, while weather simulations are important, humanity arguably cares about materials development even more: we have even named the ages of civilization after the materials that were developed and that drove that development. Now, one of the modern trends we see in materials science is that we actually use simulation to design materials.
One of the strong advocates of this in Switzerland is Nicola Marzari at EPFL, who is leading a big initiative that we have here. The argument is that when you develop materials, the developments I showed before — Stone Age, Iron Age and so on — were serendipitous discoveries: you found that you could use something for a certain task, as a tool or as a weapon. Then, around the time of Edison, people started to do systematic searches, but they were experimental searches. Apparently Edison tested 3000 materials for the filament of the famous light bulb, and you can imagine that is a lot of work. It is even worse for the Haber-Bosch ammonia synthesis: apparently it is said that Mittasch tested of the order of 20,000 compounds. This is a very important process, used in industry around the world, and you save a lot of energy if you improve the catalyst, but it gives you the scale of labor that had to be invested. In this context there are many other examples where simulations are used to help the development of materials; one is the development of tunneling magnetoresistance, which I am going to skip in the interest of time. What we use simulations for today is representative of a workflow where we seamlessly go back and forth between defining the goals and design parameters of the material, finding candidate compounds and prescreening them with simulations, and then doing verification with measurements on a certain number of compounds. If the measurements agree with what the simulations predicted, we can take the next step; if not, we go back, and so on.

So the main message — and this comes from Gerd Ceder in the U.S. and from Marzari, who used to be a colleague of Gerd's at MIT but is now at EPFL — is this idea of using simulations for materials design. The key point is that you are no longer interested in one heroic, big simulation as before; you actually want to do thousands or tens of thousands of simulations, because we know there are of the order of 100,000 to 150,000 known inorganic compounds, and if we want to study a certain property we have to study it in, maybe not all of them, but at least a few thousand or tens of thousands. That is the scale at which we have to repeat the simulations. So what I would like to do in the remainder of the talk is reflect on what is possible today in terms of ab initio electronic structure simulation. If we are on Titan, the machine I have been talking about, and we take, say, VASP and a very basic simulation, we can do an ab initio calculation in roughly 10 minutes on a good workstation, which corresponds to one node of Titan. Assuming we can use the GPUs, we can then do of the order of 18,000 structures in 10 minutes, and even the machine we have in Switzerland allows us to do of the order of 5,000 structures in 10 minutes. But this is for very simple, basic simulations.
So — I seem to have mixed up my slides here — what I would like to do now is approach the problem from the other side. What if I do not use a simple pseudopotential code on a small problem, but an all-electron, full-potential code on the largest problem I would reasonably want to do? So I use LAPW on large problems. You can ask what the largest problem is that you would want to do in, say, a total energy calculation, and of course in principle there is no limit. But fortunately there is the near-sightedness of electronic matter, which says that there is a certain scale up to which you need to solve the problem ab initio, and beyond it you can use other techniques; that scale is somewhere around 1000 atoms. This challenge was formulated by Claudia Draxl: solve this problem, not with a simple pseudopotential code but with the accurate exciting code, and do it not once but many times, designing the machine and the code so that it can be repeated, in order to address the design problem. So I want to discuss how we solve this problem and what the footprint of a calculation at this scale is today. It is obviously going to be bigger than the VASP-type calculation I discussed before.

The problem we solve is basically the Kohn-Sham equations, using a spectral approach where we end up with a generalized eigenvalue problem — I am sure most of you have seen talks about this several times during this school. The specifics of LAPW are that you use a basis that consists of spherical harmonic expansions around the nuclei of the atoms and a plane wave expansion in between the atoms, and you match the two at the boundaries, which gives you the conditions for these expansion coefficients. That defines your basis, and once you have the basis you get an overlap matrix and a Hamiltonian matrix that you plug into your eigensolver. Now, it is very natural to say that the eigensolver is the computationally hard problem: when you put these basis functions into the computation, these are the equations you end up having to solve, and the eigensolver scales as N cubed — if I go from, say, 10 atoms to 1000 atoms, the computational cost grows by a factor of 100 cubed. So the natural thing is to attack this problem at scale. But it turns out this is not the only place with N-cubed complexity: when you look at these equations, there is another matrix multiply here which is also N cubed. It is much simpler than the eigensolver — multiplying matrices is trivial — but it has the same complexity; the prefactor is lower, so you do not see it in small problems, but in large problems you will. The main message is that we have to focus on both parts of the code. And the bad news is that in a typical LAPW code these computations are distributed over thousands of lines of code, and no applied mathematician has written a nice library for them. The two N-cubed pieces are easy to see on a toy example, sketched below.
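Here is a minimal sketch of the two O(N³) steps just described — the matrix multiplies that set up the problem and the generalized eigenvalue problem H c = ε S c — on toy random matrices; a real LAPW code of course builds H and S from the actual basis functions:

```python
# Toy illustration of the two O(n^3) pieces: the setup (matrix multiplies)
# and the generalized eigenproblem itself.
import numpy as np
from scipy.linalg import eigh

n = 500                                    # stand-in for ~10^5 basis functions
B = np.random.rand(n, n)
H = (B + B.T) / 2                          # symmetric "Hamiltonian"
S = B @ B.T + n * np.eye(n)                # symmetric positive definite "overlap"

# Setup-like step: a plain GEMM is also O(n^3), just with a smaller prefactor.
M = B @ B.T

# Generalized eigenvalue problem H c = eps S c, O(n^3) with a larger prefactor.
eps, C = eigh(H, S)
```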
For the eigensolver you have LAPACK and ScaLAPACK, so there you actually have solvers; for the other part you do not. So if you plug your nice library in here, this part will be fast and the other part will be slow in the end, and that is why this thousand-atom problem has not been solved trivially yet. Let me quickly give you a feeling for how these problems are solved. The thousand-atom problem is roughly a 100,000 by 100,000 matrix in the LAPW basis. This matrix is split up in a block-cyclic distribution: you cut the matrix into blocks and distribute the blocks over different processes — blue, red, yellow and green here mean different MPI ranks on different processes. That is how ScaLAPACK does it, and the algorithms are optimized for this distribution. Since the eigensolver is the most expensive part, it dictates how you decompose the problem over a parallel machine. But then, in these other parts, when you do the multiplications over plane waves and orbitals, the inner sum of the matrix multiplication runs over the orbitals — over this column here — and when you do the matrix multiply you are effectively gathering data from all the processes in this reduction. That means you have to move a lot of data in order to do the computation, so even though you are still multiplying matrices, because the data is distributed over many nodes you are effectively reducing the arithmetic density over the entire machine, and you become bound by the performance of the network.

The remedy is to keep this distribution for the solver, but for the sake of these multiplications increase the storage on every node slightly and move the data around once, so that for the given row you multiply, every node has not just one block but all the data it needs, and then you do the multiplies locally. The result is that you pay a small price in memory, but you reduce the communication during the computation, you are no longer bound by the bandwidth of the network, and the computation goes faster. This is a common trick when you solve distributed matrix problems; here we are just applying it inside a rather complex LAPW electronic structure code.

Then we still have to deal with the solver, because we want to run it on a distributed matrix — you can ask why not simply use ScaLAPACK, and I will show you results on that — and also because we want to use GPUs again. The solver works like this: we have a generalized eigenvalue problem, we use a Cholesky decomposition to transform it into a standard eigenvalue problem, then we solve the standard problem by tridiagonalizing the matrix and solving the tridiagonal problem, which is then simple — there are a number of standard algorithms you can use for that — and finally we do a back transformation to get the eigenvectors of the generalized problem. This is the standard algorithm as it is used in ScaLAPACK; it is written out on a toy problem below.
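A minimal NumPy/SciPy version of that reduction path, just to make the steps explicit — a production code does this with ScaLAPACK, ELPA or MAGMA on distributed matrices, not like this:

```python
# H c = eps S c  ->  S = L L^T  ->  A y = eps y  with  A = L^-1 H L^-T,  c = L^-T y
import numpy as np
from scipy.linalg import cholesky, eigh, solve_triangular

n = 400
B = np.random.rand(n, n)
H = (B + B.T) / 2                                   # toy symmetric Hamiltonian
S = B @ B.T + n * np.eye(n)                         # toy s.p.d. overlap matrix

L = cholesky(S, lower=True)                         # S = L L^T
Linv_H = solve_triangular(L, H, lower=True)         # L^-1 H
A = solve_triangular(L, Linv_H.T, lower=True)       # L^-1 H L^-T
A = (A + A.T) / 2                                   # symmetrize against round-off

eps, Y = eigh(A)                                    # standard eigenproblem
C = solve_triangular(L.T, Y, lower=False)           # back transform: c = L^-T y

eps_ref, _ = eigh(H, S)                             # direct generalized solve
print(np.allclose(eps, eps_ref))                    # True
```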
The caveat of this algorithm is that this step here turns out to be a low arithmetic density formulation: it is a level-2 BLAS operation, so it is again bound by memory bandwidth rather than by compute. In the roofline model from before — this axis is the logarithm of the performance — instead of being over here at high performance, you are somewhere over here. And there is again a trick to avoid this problem: make the computation more complicated and actually do more arithmetic, but in a way that raises the arithmetic density and moves less data, so that you end up winning in time to solution. The trick is to do a reduction to a banded matrix first and then the tridiagonalization from the banded matrix. The price you pay is twofold: this step is a bit more complicated, and you now have to do two back transformations. But because the first reduction is now a dense matrix problem — you go from low arithmetic density back to high — you win so much in time to solution that in the end you can afford the additional price of the back transformations.

There are two implementations of this. One is a multi-core implementation in the ELPA library, developed in Garching in Germany. If any of you is doing electronic structure or these types of problems on a normal distributed CPU system: do not just use ScaLAPACK, use this library, ELPA; it works very, very well. And we have a similar implementation of essentially the same algorithms in the MAGMA library, which currently runs very well on this Cray system, and I will now come to the results.

What I am testing on is lithium-intercalated cobalt oxide with about 1500 atoms, so roughly the thousand-atom scale; this problem has about 115,000 basis functions, so that is the size of the matrix. We are running on our Cray XC30 in Lugano, which has Intel Xeon Sandy Bridge processors in combination with NVIDIA GPUs, and we compare running the algorithm on the CPUs only, with the ELPA library, against running it on the hybrid GPU system. The point is not really to put the two in competition; what I really want to understand is how many resources it takes to solve this problem, which I said is the biggest we would reasonably want to solve in materials design. This is the result here. This is the MPI grid we are using, the number of MPI ranks per socket and OpenMP threads per rank; in some cases we put two MPI ranks per processor, in others one. You could try to put more MPI ranks — this is the hybrid model between OpenMP and MPI, and in all cases I have seen it pays to put more MPI ranks on a multi-core processor — but the price you pay is memory, and normally you do not have enough memory, so you have to compromise and use fewer MPI ranks with more OpenMP threads per socket. What you see here, comparing ScaLAPACK and ELPA, is that the time to solution in the solver is dramatically different, roughly a factor of four, and the overall time to solution as well. So what you really see at play here is the change of algorithm in the solver. And you also see — where do I have it — the new solver, am I comparing this?
Yes — I am not showing the difference between the two ways of doing the data distribution here; the main point I want to make is the difference from changing the solver. Making the computation more complicated, but mapping it better onto the architecture, gives this factor of four improvement in time to solution, and then you get a further improvement from using the hybrid solver. The overall message is that on somewhere around 400 nodes we get a time to solution of roughly 15 to 20 minutes for one iteration. You need of the order of 10 iterations to converge one solution, so 15 minutes per iteration means about three hours. I realize that for large problems you may need more iterations, but on the other hand you do not have to solve the full eigenvalue problem every time; there are tricks to cut this down. So roughly speaking, to solve one problem you need around 400 hybrid nodes today, and on a machine with 5000 nodes, if you want to turn around 5000 compounds, you need somewhere around two to three weeks. That is today. If we wait a few years we will get machines that are a factor of 10 to 100 faster — we will again have to invest in software development, there is no free lunch — but then these 16 days will come down to a few hours for these very accurate, very large calculations at the scale of 5,000 to 10,000 of them. That gives you a rough feeling for where things are. I am including this slide, which you will get with the rest: it discusses running the eigensolver with different distributions of MPI ranks and threads — this one is one rank with eight threads, this one is eight ranks with one thread each — and you can clearly see the difference in time to solution and in energy to solution, and the GPU part, the hybrid version, is faster in all cases.

Let me jump ahead. Similar to the climate code before, after all these investments in how to handle distributed linear algebra, both in the solver and in the setup of the Hamiltonian, we end up with a domain-specific library that contains the tools we need to build electronic structure codes. This library is then embedded in codes like exciting or Elk, and we are currently working on Quantum ESPRESSO. The whole point, again, is to separate the code that most scientists use from the back ends — the libraries that solve the linear algebra problems, the I/O problems — because it is these back ends that we then map onto the different architectures. There are again some papers, and I want to highlight the collaborators in this work, particularly Azzam Haidar, Raffaele Solcà and Anton Kozhevnikov, who have done most of it. So what I really want to stress, having repeated the same argument several times now, is this need to structure software — in climate, in materials science, in chemistry, in astrophysics — in such a way that you get a separation between the architecture-specific parts and the software that we use as scientists when we build the models. The sketch below shows the shape of that separation in a very stripped-down form.
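To make the separation-of-concerns idea tangible, here is a deliberately stripped-down sketch — not any of the actual libraries mentioned in the talk — of scientist-facing code written once against an abstract back end, with the architecture-specific details hidden behind a common interface:

```python
# Toy separation of concerns: the "science" code talks to an abstract backend;
# each backend owns the architecture-specific implementation.
import numpy as np
from scipy.linalg import eigh

class CpuBackend:
    """Plain NumPy/SciPy implementation, standing in for an x86 back end."""
    def gemm(self, a, b):
        return a @ b
    def generalized_eigh(self, h, s):
        return eigh(h, s)

# A GPU back end would expose the same two methods, implemented for example
# with CUDA libraries; the science code below would not change at all.

def density_matrix(h, s, n_occ, backend):
    """Scientist-facing code: no architecture details appear here."""
    _, c = backend.generalized_eigh(h, s)
    c_occ = c[:, :n_occ]
    return backend.gemm(c_occ, c_occ.T)

n = 200
b = np.random.rand(n, n)
h = (b + b.T) / 2
s = b @ b.T + n * np.eye(n)
print(density_matrix(h, s, n_occ=10, backend=CpuBackend()).shape)
```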
It took me almost one and a half hours, a bit more than an hour, to give you these examples, and it looks very convoluted — and it really is convoluted, so unfortunately it is not easy, but it is really necessary. What I would like to discuss in the last couple of minutes is the dilemma we have today and how this type of work has to be organized from now on, and I want to do this with the examples we just saw. Take the materials science, electronic structure problem: we have a physical model, we have a theory, in this case density functional theory, then we have the mapping of this mathematical problem onto the machines that I have been discussing extensively, the code we have to write, the compiler, and the machine we run on. This is a linear way of thinking about how we develop things, and it is in fact the way we have thought about writing codes and solving problems since at least the time when I was a graduate student, and probably long before that. The reason it is a convenient way of thinking is that we can draw a line and say: this part is solved by computer engineers, this part is the one we have to deal with, and life is relatively simple. You have seen that on distributed machines life can be a bit more complicated, because ideally we would want a compiler that does all of this data distribution automatically, but we have not seen that appear. The same structure applies to molecular dynamics in biology, or to the quantum many-body part I discussed first: the model changes, the method changes, the algorithms change, but the structure is the same. Even in the climate case it is the same: you have a model, the Euler equations, a discretization, stencils, an implementation, and you run it on a computer.

Now, why am I making this point, to summarize all the suffering we went through in the three examples? The challenge today is that we do not have just one machine down here. We have machines with Xeon processors; we have machines with Xeon Phi processors — there are not so many now, but in a few years there will be really performant Xeon Phi processors, you will see many of those machines, and even Intel will admit that this is not the same processor, it is a different architecture, and you have to invest a lot to change your codes to run on it; then you have GPUs, or APUs, and in the future you will have ARM processors as well. So are you going to change everything every time you change the architecture? The fact that the architecture will keep changing is simply the reality; it is a consequence of the coming end of Moore's law that we will see more and more architectures emerging and persisting, and in fact we already see the precursor of this in the distinction between the three kinds of processors I just discussed. What this means is that the old way of separating the concerns at that line is no longer reasonable, and we have to rethink the whole workflow. The way we have to think about it is that on one side we still have physical models and the mathematics — the math is not going away, it will improve, but it is never going away, and neither are the algorithms needed to discretize — and ideally we want a system where we can iterate between these three very conveniently, without having to rewrite too much code.
But then, once we have decided on the algorithms and thought this through, we have to implement and compile, and — this is the part that is new today — there is a second, and then a third, up to an n-th architecture to deal with. The key point, and I hope to have made this clear in the lecture, is that depending on the architecture we need different implementations and sometimes different algorithms, and this is where the need for labor comes in. Given this picture of how the labor is distributed and how it feeds back, it is not reasonable to say that we scientists write the codes here, many, many different codes for many architectures. What you want is a separation of concerns somewhere at this level: most users just use tools that are developed by others, and these tools are developed by interdisciplinary teams. There is still a second line, because we are using commodity architectures today, but the main point is that we would like to use a productive environment, something like a Python-based system, at this level, and we would like to have tools that implement these algorithms on the various architectures underneath.

So what is the difference between this picture and the previous one, the way things were in the past? In the past we had things like ScaLAPACK, generic libraries that are used in many, many places in science. In this picture we will need libraries that are domain specific: libraries for the domain of climate and earth science simulations, or libraries for electronic structure simulations. And to develop these libraries you need teams made up of, hopefully, some people from this audience, together with applied mathematicians and computer scientists, in order to create a platform that is productive for the remaining scientists who just use it to solve their problems. What I hope to do with this lecture is to motivate those of you who are interested in the more technical side of implementation to think, for the future of your career, about joining these kinds of efforts. I believe that, in the same way that we build large experiments like the LHC, where a lot of people are involved in designing the actual experiment and building the machines, this would be the equivalent in our various domains. Given that I started a bit late, I think I should stop here so that we still have enough of a break left. Thank you, and I do not know if there are questions — I am here for the rest of the morning and the early afternoon, so I can also discuss questions later.