Thank you very much, and thank you for the invitation and for the organization. I am Filippo Mantovani from the Barcelona Supercomputing Center. I was involved in the Mont Blanc project, which I am going to present now, and I am going to give you the updates from the point of view of BSC, one of the research centers involved in the project.

The legacy vision of the Mont Blanc project, for those in the room who don't know about it, is to leverage the fast-growing market of mobile technology for scientific computing, HPC and the data center. We have the usual timeline here, and the participating partners: the project is coordinated by Bull, and BSC is one of the research centers, so we have a good combination of industry, research centers and academia. As I said, there have been three phases since 2011, and across all three phases there have been three research lines going on more or less in parallel. The first one is experimenting with real hardware: we started with Android development kits in 2012 and arrived at a production-ready system that I am going to present very soon. The second research line is pushing system software: we started with basic system software just to operate our clusters, and we arrived at running complex HPC codes, as I will show you later. And in this process of deploying platforms you of course learn from your limitations and errors, so we always had a third line, the study of next-generation architectures, going on in the background.

Okay, so let's move on and see the shift in the hardware platform from the beginning to today. You have probably already seen this slide; I have presented it many times. This is the first Mont Blanc prototype: the project started in 2012 and the prototype was deployed in 2015, and it is purely mobile technology. This is a node card with a credit-card form factor; it carries a mobile SoC, a Samsung Exynos 5 Dual, so it is just Arm-based mobile technology in an HPC envelope. By the way, this was presented last year at Supercomputing; there is a paper with the evaluation of this machine. And we ended up here: this is what we have today and what you can see at the Mont Blanc booth, the Bull Sequana platform, including 48 nodes with dual-socket Cavium ThunderX2. As I said, at the booth we have a board if you want to see it, and also if you want to know more details about the specs. So there has been a huge shift from mobile, or low-end, technologies to high-end technology. This is the trend within the project, and I think also within the market of Arm in HPC. So let's now look at the system software, where I see a similar trend.
We started building our own system software, recompiling and hacking to have all the pieces you need to operate a cluster, and what we have today, as Chris already presented, is products: things we can deploy on our clusters in the form of products. We have different flavors of the OS, we have the Arm Performance Libraries, the Arm compiler, the Allinea tools, and everything is nicely packaged in OpenHPC. So I see here the same shift, from hand-made system software held together with hacks towards something productized, if you want. And of course the same applies to the HPC codes: we started with simple benchmarks, and now, as Chris was saying, you can run pretty much any of your HPC codes on this platform without huge pain. So here too there has been this shift, and I think Mont Blanc contributed a little bit to this software ecosystem.

Okay, so let's look at the study of next-generation architectures. As I said before, in doing all these hardware and software combinations you learn from your errors, or at least from your limitations. When we were developing our first prototype with mobile technology, the idea was to gather data on the current prototype and then try to change things: in this plot, for example, it was CoMD, changing the technology of the cores we were using and the network technology, because the network in that prototype was pretty problematic. So that was an early attempt at simulating a next-generation machine. What we have today is what we call the multi-level simulation approach, MUSA, which I will show you in more detail later. Basically, the idea is that you trace your application on whichever HPC platform you have, and then you replay the trace in a simulator to get a feeling for how your application would behave under different architectural parameters, in a different hardware configuration. This is a more global approach to next-generation architectures than the one we had at the beginning, so here too I see a shift towards something more comprehensive.

Very good. So let me now show you, as I said at the beginning, the BSC point of view, the Barcelona Supercomputing Center point of view. I am going to talk a little bit about the evaluation of hardware solutions and software solutions. Here I want to mention that we have a poster in the poster session, which I think is tomorrow, and I am going to show data complementary to that poster, so I invite you to check it out as well. Then the use cases: I am going to show you a little bit of Alya and HPCG, with two different goals. In Alya we tested the runtime features that we want to push into OpenMP, and in HPCG we tried to improve the parallelization of the benchmark and also started a study targeting the Scalable Vector Extension, SVE, that Chris introduced before. As the last point, I want to give you a few more details about the MUSA simulation infrastructure.
Also here, there was a paper last year in which we presented the methodology, and I just want to give you an update on this topic.

Let's start. The first evaluation I am going to present is of the Arm Performance Libraries. We tested an HPC code that makes use of the arithmetic and FFT libraries: we selected Quantum ESPRESSO, we used GCC 7.1.0, and we tested on AMD Seattle; those results you can find in the poster. What I am going to present here is the updated version of those results on the Cavium ThunderX2. In reality we are evaluating the platform and the performance libraries at the same time. In this plot I am showing the different functions of the Quantum ESPRESSO code, and basically there are two big families: one is calbec, which is mostly arithmetic, linear algebra functions, while the rest are more or less FFT-based. On the y-axis you have execution time, so lower is better, normalized to the Arm Performance Libraries. The key message is that whenever you have arithmetic or linear algebra functions, like in calbec, then at least in our experience, using the Arm Performance Libraries you outperform OpenBLAS, in this case by 30%, which is a great result. On the other side, the FFT still needs some work, so you get a bit less performance there, but overall the message is that with the Arm Performance Libraries you get performance very close to OpenBLAS, and if you couple the Arm Performance Libraries with FFTW you get even better performance. So the takeaway message is: the Arm Performance Libraries are great as long as you use the arithmetic functions, and you can couple them with FFTW; we are also working with Chris and his group on improving the FFT part.

Let's see the evaluation of the Arm HPC compiler. Here we took the latest version of the HPC compiler, version 18.0, compared it with the 1.4 version from August, and ran the Polybench benchmark suite on Cavium ThunderX2. It is a pretty big benchmark suite, so you have to know what you are running. You get the relative comparison between the new version and the old version; this is execution time, so once more lower is better. There are especially these two benchmarks, which are very small, where you see a lot of improvement, and they are the kernels that matter most for HPC workloads, because they are matrix multiply and matrix-vector transpose. The takeaway message is that with this new version we see a lot of improvement on kernels of this kind, sketched below.
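To give a feeling for the kernels involved, here is a minimal sketch of a gemm-style loop nest in the spirit of the Polybench kernels; the names and the size N are mine, not the suite's own code, and it is exactly the kind of loop where the auto-vectorizer makes the difference we measured:

    /* Minimal gemm-style kernel in the spirit of the Polybench suite.
       Names and the size N are illustrative, not the benchmark's code. */
    #define N 1024

    /* C = alpha*A*B + beta*C; the inner j-loops are unit-stride and are
       natural candidates for compiler auto-vectorization. */
    void gemm(double alpha, double beta,
              double A[N][N], double B[N][N], double C[N][N])
    {
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++)
                C[i][j] *= beta;
            for (int k = 0; k < N; k++)
                for (int j = 0; j < N; j++)
                    C[i][j] += alpha * A[i][k] * B[k][j];
        }
    }

Loop nests like this one, and the matrix-vector ones, are where the new compiler version gained the most.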
We are around 30 percent better performance for this kind of benchmark. Another thing we looked at is the level of auto-vectorization you can get with this compiler, and here we get even better results, because there are two situations where there was no vectorization with the previous version, and now auto-vectorization improves performance a lot; in the gemm case too you get pretty good performance. Overall, across all the benchmarks, auto-vectorization improves things by about eight percent. So, a very good job, at least from our point of view, because everything related to HPC is improving version by version; you can check the details in the poster as well. This is also very good news regarding SIMD, because we have to survive with SIMD while we shift towards SVE, and for those of us who have to use these machines this is good news.

Okay, so now I start with the evaluation of the use cases. The first one, as I said before, is the High Performance Conjugate Gradient, HPCG. The problem is that if you take the reference code of HPCG, you basically get no scalability, because the OpenMP parallelization is pretty poor. We started this work recently, after Eric commented on our, let's say, not-so-big attention to vectorization. So we decided to improve the OpenMP parallelization of HPCG, we studied the auto-vectorization you currently get with the reference HPCG implementation, with a view to leveraging SVE, and we analyzed the other performance limitations of the benchmark, especially cache effects.

Concerning the first point, improving the OpenMP parallelization: on the left I have the scalability you get with the reference code, basically zero, and the scalability you get with our code after touching the implementation a bit, especially the Gauss-Seidel part. You can see that not only do we get better scalability, but using the Arm HPC compiler we get even better performance figures. And this is just another way of seeing it: every time you see white, your CPU is idle, and every time you see some color, your CPU is doing something. You can see that in our version, the one on the right, there is more activity.

Okay, concerning the second point, studying the current auto-vectorization with a view to leveraging SVE: we did an evaluation on the Cavium ThunderX2 and on an Intel Xeon, simply counting the SIMD instructions of the ComputeSYMGS region, and you get what you would expect: on the Intel you have much more auto-vectorization, which is a clear sign that we have to start leveraging SVE. That is why we moved on in this study: we compiled our code so that we got a binary with SVE instructions in it, gave that binary to the Arm Instruction Emulator, and looked at how many vector instructions come out of it. The result is this one: on the y-axis I am plotting the increment of vector instructions, and on the x-axis the different vector sizes; the workflow is roughly the one sketched below.
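To make the experiment concrete, here is a rough sketch of the kind of kernel and workflow involved; the kernel is mine, and the command lines in the comment are indicative only, since the exact flags can differ between tool versions:

    /* Triad-style kernel: a typical target for SVE auto-vectorization. */
    void triad(long n, double a, const double *x, const double *y, double *z)
    {
        for (long i = 0; i < n; i++)
            z[i] = a * x[i] + y[i];
    }

    /* Indicative workflow (flags from memory; check your tool versions):
         armclang -O3 -march=armv8-a+sve -o app triad.c main.c
         armie -msve-vector-bits=512 -- ./app
       Repeating the run at 128, 256, 512, ... bits and counting the
       emulated vector instructions gives the scaling curve in the plot:
       wider vectors, fewer dynamic vector instructions. */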
There are two pieces of good news here. The first is that you get SVE instructions, so auto-vectorization, a lot of instructions, automatically out of the compiler. The second is that this scales: if you use bigger and bigger vector sizes, you get fewer and fewer instructions. You would expect that, but it is not guaranteed, and this is the first time we have tried to quantify it, so we are pretty excited to see how this will behave. Of course, these are just instruction counts; there is no notion of performance here, because the Arm Instruction Emulator so far does not give us any performance information, but we are working on it.

The last point about HPCG is the memory access evaluation. Basically we applied a sort of coloring technique, and whenever you do that you are harming your data locality. Here you see the miss ratios on the L1 and L2 caches. This is just a measurement, and it tells us that we did a little bit of work but still have a lot of work to do. So what we are going to do is optimize the data access pattern in memory; but what will be even more important is to test this code on a simulator in which we will have scatter/gather loads, from which our implementation could take a lot of advantage. Very good, so this was the HPCG use case.

I am moving on now to the Alya use case. Alya is a finite element code developed within BSC. Here is a trace: there are several phases in this code, and I am not going into the details, but you can clearly see this load imbalance, and the load imbalance is harming performance. So we tried to understand what is going on there: there are atomic operations that are harming the performance. So we implemented two alternatives to the baseline: the baseline is the non-coloring version, which uses atomics; the second version uses a coloring technique; and the third uses commutative multidependencies. I already presented this at the Arm Research Summit, but I have an update here, because we have now run it on Arm, and in the next slide I am going to show you the results.

Let me just clarify what commutative multidependencies are: basically, we tell the runtime that it can update blue or red, but never blue and red at the same time. You are saying: you can do blue or red in the order you prefer, but you cannot operate on blue and red at the same time, as in the sketch after this paragraph. So this is the evaluation we did: quantify the effect of the commutative multidependencies, and also of dynamic load balancing, another technique developed within BSC and tested with our runtime. As for the method, we used the assembly phase of the Alya code, and we ran it on MareNostrum 3, the previous generation of the BSC supercomputer, and on Cavium ThunderX2... sorry, ThunderX 1: this one is in-house, so we ran it on ThunderX 1.
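Here is a minimal sketch of the idea, in OmpSs-style C; the clause spelling and the array-section syntax follow OmpSs and may differ between runtime versions, and update_block is a hypothetical per-block assembly kernel:

    /* Hypothetical per-block assembly kernel. */
    void update_block(double *blk, int n);

    /* Each task works on one block of the right-hand side. The commutative
       dependence tells the runtime: tasks touching the same block exclude
       each other, but may run in any order -- "blue or red, in whatever
       order, but never blue and red at the same time". No atomics, and no
       coloring pass that would hurt data locality. */
    void assemble(int nblocks, double *block[], int bsize)
    {
        for (int b = 0; b < nblocks; b++) {
            #pragma omp task commutative(block[b][0;bsize])
            update_block(block[b], bsize);
        }
        #pragma omp taskwait
    }

Compared with coloring, the runtime serializes only real conflicts instead of forcing a global schedule of color phases; a semantic along these lines later entered OpenMP as the mutexinoutset dependence type.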
What I already presented at the Arm Research Summit is this plot: in blue you have the non-coloring version, in red the coloring, and in orange OmpSs with multidependencies. We are plotting time here, so lower is better. The takeaway message is that whenever you use coloring or OmpSs you always get better performance, whichever configuration of MPI processes and threads you use. And what is important is that if you combine this with DLB, the dynamic load balancing, you get even better performance. So the takeaway here is that we are working on techniques that are not only of benefit to Arm; we are pushing these techniques because we believe there are things we can do that can benefit any next-generation HPC machine. And I really wanted to show this plot even though it is work in progress: we are still getting data, and you can see some columns are missing because there are people deploying and running these test cases right now.

So let's move on to the last topic of my presentation, the multi-level simulation approach. As I said before, it is a simulation infrastructure split into levels. The first level is trace generation: you have your application running on an HPC machine, you instrument it, and you gather OpenMP runtime events, MPI calls, and dynamic instruction information, for example the number of accesses to memory, and you collect all of this in a trace. Then you move to the second level, the network simulation: you replay your trace and find out where your MPI calls are. And then you move to the third level, in which you simulate just the compute part on the single node. Combining these levels, we are able to simulate multi-level parameters: architectural, micro-architectural and main memory features. In my head this is like changing the setup of your Formula One car: every setup is different and gives you hopefully some benefit, maybe some disadvantage. We can do this with several architectural parameters and with several MPI configurations, that is, numbers of MPI processes. Of course we have the problem that simulation time is huge, so we have different techniques for trading accuracy against speed.

This is just an update: as I said, the approach was presented in a paper last year in which we validated the methodology, presented five applications and showed performance projections up to 16K MPI ranks. The status update is that we added parameter sets, so we expanded our research space; we support power consumption modeling; we support several systems of the Top500; we extended the set of applications and the trace database; and we included support for DynamoRIO in order to gather micro-architectural data on our platform as well. A conceptual sketch of the replay idea follows.
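As a caricature of the replay idea only (this is not MUSA's actual trace format or API; the real infrastructure builds on the instrumentation tools and simulators mentioned above), one can picture the node-level step as replaying timed compute bursts between communication events and rescaling them under different architectural assumptions:

    /* Conceptual illustration only: not MUSA's real data structures. */
    typedef struct {
        double duration_us;  /* compute burst measured on the traced machine */
        long   instructions; /* dynamic instruction count from the trace     */
        long   mem_accesses; /* memory accesses, for a first-order model     */
    } burst_t;

    /* Replay one burst on a hypothetical target: scale compute time by a
       relative throughput factor and add a per-access memory penalty. */
    double replay_burst(const burst_t *b, double ipc_ratio,
                        double mem_penalty_us)
    {
        return b->duration_us / ipc_ratio + b->mem_accesses * mem_penalty_us;
    }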
Of course, I have my compulsory final slide, related to the Student Cluster Competition. It has nothing to do with the rest, but I insist on giving this message: it is not only business, not only pushing one technology or another or installing huge clusters; it is also a matter of education. I have been the advisor of student cluster teams since 2015, and we have always participated with Arm-based architectures and Arm-based clusters. I think it is a great example of collaboration between industry and academia, giving these students the habit of reasoning not only about performance but also about energy; the collaboration has always been between BSC, E4, Cavium and Arm. And I have one last message: I have submitted a proposal for a team, but I still don't have a cluster, so if there are people in the room with an Arm-based cluster to offer, please contact me.

I leave you here the coordinates to visit our booths: we are in the exhibition hall with the Mont Blanc booth, where we have the Mont Blanc Sequana platform board, and then on the exhibition floor there are Bull, BSC and several other Mont Blanc partners, and I will be here all week. I don't know if there is time for a question; I see Roxana here. So, thank you very much.