Imagine a node with a clock of roughly 1.4 GHz, 64 cores per node, a peak of about 3 teraflops and, for memory, more precisely, high-bandwidth memory. Most of you have recognized that I'm talking about KNL, and during this session the questions I might be asked to address are: can Quantum ESPRESSO exploit this architecture and, more importantly, for workloads that fit this kind of architecture, how many gigabytes of this new high-bandwidth memory would really be exploited by Quantum ESPRESSO? To answer this kind of question a very useful tool is, of course, a performance model, and this is the activity I want to present: building a performance model for Quantum ESPRESSO in the context of hardware-software co-design.

Performance modeling is a strategy for describing how a code performs, typically in terms of flops and memory traffic, for identifying its hotspots and bottlenecks, and for understanding how these change as the hardware evolves. It also allows extrapolation, for example to estimate how a larger simulation would behave, or what happens when the relevant hardware parameters are varied.

So we set out to build a model that could answer, at least in part, these questions for the PW code, and in the first stage we targeted a single modern HPC node. We worked on a Broadwell node and on a Xeon Phi (KNL) node, which apparently will no longer be followed by a new architecture, but it is still interesting to check what the model gives also on this processor.

Let's go a bit into the details. The total execution time of a single MPI process may be approximated as the sum of a few components: one connected to the time taken by MPI communication, another being the input/output time, and finally the rest, the remaining part of the code. All these components are influenced by a certain number of parameters, some coming from the hardware (the network, the disk, the processor) and some from the input.
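To make this splitting concrete, here is a minimal sketch in Python; the component names and the numbers are hypothetical placeholders, not figures from the talk.

```python
# Minimal sketch of the per-process time decomposition described above.
# The component names and the example values are hypothetical placeholders.

def total_time(t_mpi, t_io, t_rest):
    """Approximate the execution time of a single MPI process as the sum
    of MPI communication, input/output, and the remaining part of the code."""
    return t_mpi + t_io + t_rest

# Example: times (in seconds) as they might be measured on a reference run.
print(total_time(t_mpi=120.0, t_io=15.0, t_rest=310.0))  # 445.0
```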
In building such a model there are essentially two approaches. The first one relies on information collected on the application on a reference machine, using exactly this splitting of the execution time together with the relevant hardware parameters. The time taken for the execution of the application on a new architecture can then be estimated by taking this splitting of the execution time and multiplying each component by a factor describing the relevant change in the hardware details: for example, the MPI part may improve if the network bandwidth increases with respect to the reference network bandwidth, and so on. Now, the good thing about this approach is that there are only a few parameters to take into account, and they can be modeled with some experiments by varying the values of the hardware details. The drawback of this approach is that the predictive power of these models is limited: usually they don't provide good absolute time predictions as the hardware evolves, and of course there is the need to repeat the whole analysis when there are major changes in the code.

The second approach is instead to divide the application into kernels and estimate the time taken by each kernel from the relevant hardware and input variables for that kernel, and then add up all the times, so compute the time taken by all the kernels taken into account. Of course there will be something left out, for a series of reasons: it's too complicated to estimate, or it's not worth estimating this remaining time. The good thing about this approach is that in general it produces much better absolute time predictions, but of course there's the drawback that you have to analyze each kernel and go through the details of the code execution.

As I said, we really wanted to provide a system and a model to obtain absolute time information on the execution of PW, so we went for this second approach. The first step that we took is of course code profiling. We did this for both PW in Quantum ESPRESSO and for YAMBO, while the model is at the moment only available for PW, but I will show some details anyway.

So, PW: most of these results have already been presented by Massimiliano Fatica at the beginning of this session. What I'm showing here is the best execution time on a KNL node, and what you observe is that the best time to solution is obtained with a pure MPI implementation, and that MPI communication takes quite some part of the whole time to solution, slightly less than 30% of the time. Most of the communications are collective communications, more or less the communications involved in the parallel FFT and in the parallel diagonalization. When you consider instead the blue part, the time is mostly spent in three kernels. As I said, I'm more or less repeating what has already been presented, so I don't want to spend too much time on this: they are general matrix multiplication, diagonalization, and FFT. So this tells me that if I want to understand and model the performance of PW, I really have to take into account collective communication and these three kernels, and this is indeed what we have done. But before showing the details, I want to present a very different case.
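A minimal sketch of the first, extrapolation-style approach described above; the reference times and hardware figures below are made-up placeholders that only illustrate the idea of rescaling each component by the relevant hardware ratio.

```python
# Hypothetical sketch of the first approach: rescale each measured time
# component by the ratio of the hardware parameter that dominates it.
# All reference times and hardware figures below are made-up placeholders.

def extrapolate(t_ref, hw_ref, hw_new):
    """Estimate per-component times on a new machine by scaling the
    reference times with the relevant hardware ratios (a larger figure,
    e.g. more network bandwidth, means a shorter time)."""
    return {name: t * hw_ref[name] / hw_new[name] for name, t in t_ref.items()}

t_ref  = {"mpi": 120.0, "io": 15.0, "rest": 310.0}   # seconds on the reference node
hw_ref = {"mpi": 12.5,  "io": 2.0,  "rest": 1.0}     # e.g. net GB/s, disk GB/s, Tflop/s
hw_new = {"mpi": 25.0,  "io": 4.0,  "rest": 3.0}     # same figures for the target node
t_new = extrapolate(t_ref, hw_ref, hw_new)
print(t_new, "total:", sum(t_new.values()))
```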
This is the case of YAMBO, where instead, for a single-node execution, the MPI time is almost absent, and what there is consists almost entirely of MPI barriers. So this tells me that in this case synchronization is the issue. I mean, not really an issue, because it's just 10% of the time, but in this case I would focus on synchronization and load balance. And what this analysis of memory bandwidth usage is showing is that the code is somehow memory-bandwidth bound. This is slightly different from what we observed in the case of Quantum ESPRESSO, where a sizable part of the execution is compute bound.

So, as I said earlier, these are the kernels that we focused on. The parallel FFT, and by this I mean the 1D FFT kernels, the MPI all-to-all communication, and the memory accesses that happen when there is a scatter between the processors and when some variables are zeroed out. Then the matrix multiplications, and the diagonalization kernel; at this stage it is only the serial diagonalization. And finally, unbalance is considered only in the distribution of k-points.

Step two is, as I said earlier, counting the operations. To do this, I went through the code and just collected the number of flops that are done as a function of the input parameters. I will just very briefly mention that, of course, you may use hardware counters to do this, but there are also projects going towards the analysis of the source code and providing an estimate of the floating-point operations directly from the source code, and this could be an option to keep in mind for the future.

The second part is how to describe the machine, the hardware parameters. This is by far the most complex part, and we soon realized that we couldn't just take a list of values from the vendors, like CPU frequency, cache size and so on. To convince you of this, I will briefly show an example. This is a benchmark of memory bandwidth on a Broadwell core. It looks like a very nice picture, with drops where the L1, L2 and L3 caches fill up. And this is the same picture as a function of the number of cores, where, as expected, at a certain point you reach saturation. I don't want to go into the details; I'm just showing that if you repeat this on a KNL, the picture is slightly different, but you still understand what is going on, while with a growing number of cores things get a bit more messy, probably because I didn't really go into the details of the benchmarking, which is not what I wanted to do. I wanted to highlight that the bandwidth here, as a function of the number of processors, keeps increasing because of caching, and that's indeed what is expected. But still, we cannot just take the numbers from the vendor; it's much more complex. And the same holds true for the software part: if we want an accurate prediction, accurate meaning of the order of 10%, well, Quantum ESPRESSO operates in this range of FFT dimensions, and you immediately see that the performance of the FFT really depends on the implementation and on the size of the data. So it became clear that we needed an additional layer between the performances and the hardware, and this is indeed what we have done.
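To give an idea of what counting the operations as a function of the input parameters looks like, here is a minimal sketch using the standard analytic flop counts for a complex matrix multiplication and a 1D complex FFT; the function names and the example sizes are hypothetical, not taken from the talk.

```python
import math

# Standard analytic flop counts as a function of the problem dimensions.
# These are the usual textbook estimates, used here only for illustration.

def zgemm_flops(m, n, k):
    """Complex matrix multiply C(m,n) += A(m,k) * B(k,n):
    roughly 8 real flops per complex multiply-add."""
    return 8 * m * n * k

def fft1d_flops(n):
    """1D complex FFT of length n: roughly 5 * n * log2(n) real flops."""
    return 5 * n * math.log2(n)

# Hypothetical example sizes: one ZGEMM and a batch of 1D FFTs.
print(f"ZGEMM 2000 x 2000 x 500  : {zgemm_flops(2000, 2000, 500):.3e} flops")
print(f"10000 FFTs of length 128 : {10000 * fft1d_flops(128):.3e} flops")
```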
The model as of today takes into account the number of operations by just counting the dimensions of the matrix multiplications, of the FFTs and of the diagonalizations, reading the PW input files and the pseudopotentials, of course, while the details of the hardware are abstracted and obtained with micro-benchmarks, which in some cases are easy to connect to the details of the hardware (the GEMM, for example) and in some other cases are not so straightforward. But this still gives us the flexibility to work on these micro-benchmarks, to do the projection of the results on the micro-benchmarks, and still keep the accuracy on the absolute total time estimate when needed.

OK, so I'm showing a few results here. This is a bulk material, 64 atoms, 14 k-points, running on an increasing number of MPI processes, on the left side on the Broadwell architecture, on the right side on the KNL architecture. The blue bar is the true execution time; the orange bar is the estimate provided by the model. If you remember, I said earlier that we only consider a certain number of kernels, so this seemingly perfect match is actually not one, because we have left out whatever falls outside the kernels that I'm considering. So I have here an estimate of what the rest might be and how it contributes: this 15% comes from an estimate of what was not taken into account by the kernels that I'm considering, and of course there is the implicit assumption that the remaining part of the code behaves as the kernels that I'm considering. So what you should actually compare is the blue bar with the yellow bar, and, well, the accuracy is of the order of 10%.

And I did the same with a slightly different material. The first one was a bulk, three-dimensional material; this is a two-dimensional material, slightly larger, and still the agreement is not bad. Of course, there are situations where the unbalance is not described so accurately, but the nice thing I wanted to show is that, for providing information in a co-design activity, this model already gives important information, like, for example, the comparison between Broadwell and KNL. What you see here, on the left and on the right, is the ratio between the time-to-solution obtained on Broadwell and on KNL, running with a certain number of MPI processes on the Broadwell and with twice as many MPI processes on the KNL. So here I'm running up to 32 MPI processes on the Broadwell, and this data point is for 64 MPI processes on the KNL. What you observe here is that going from 4 to 32 you cross 1, and the model predicts this crossing at a slightly higher value, but still you observe the relative advantage of Broadwell versus KNL for a given number of processes.

So, the conclusion is that the job was actually rather straightforward: we just had to identify the relevant kernels, and the tricky part was to dig into the code and construct the right subroutine call tree. And I believe that another important piece of information is that this approach, which at the beginning we thought would be quite long and difficult, proved to be rather straightforward. I mean, these preliminary results (preliminary, by no means complete) were obtained in about three weeks, so it didn't take much, and we can really expect that following the code evolution won't take that much time either.
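As a rough illustration of how the kernel estimates and the correction for the remaining code fit together, here is a minimal sketch; the kernel names, the times and the 15% share are placeholders, and the way the share is applied simply encodes the stated assumption that the rest of the code behaves like the modelled kernels.

```python
# Hypothetical sketch of the kernel-sum estimate plus a remainder correction.
# Assumption from the talk: the un-modelled part of the code behaves like the
# modelled kernels, so it can be accounted for as a fixed share of the total.

def model_estimate(kernel_times, rest_fraction=0.15):
    """Sum the per-kernel estimates (the 'orange bar') and inflate them,
    assuming the kernels cover (1 - rest_fraction) of the total time
    (giving the 'yellow bar' to compare with the measured 'blue bar')."""
    t_kernels = sum(kernel_times.values())
    return t_kernels / (1.0 - rest_fraction)

# Placeholder per-kernel estimates, in seconds.
kernel_times = {"gemm": 180.0, "fft": 95.0, "diag": 60.0, "collectives": 110.0}
print(f"kernel sum     : {sum(kernel_times.values()):.0f} s")
print(f"with remainder : {model_estimate(kernel_times):.0f} s")
```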
And of course these results have been useful, and they have already been shared in co-design activities, but let me finish with future perspectives, because we have many ideas on how to deploy this system in different use cases. Of course we would like to expand the model to the other parallelization levels, to task groups, and to a better description of the unbalance; these are part of the model improvements. Beyond that, we are in contact with Intel to adjust and provide information, starting from work on the micro-benchmarks, in order to obtain a description of the time-to-solution on future architectures. And another idea that we hope will take shape is the training of the model at the AiiDA level, so with high-throughput calculations, and possibly the adoption of the model in AiiDA, providing mechanisms for auto-tuning of the parallel performance and an estimate of the time-to-solution for a given workflow. I think that with this I can thank you, and thanks to MaX and CINECA for this work.