of today, and it is a pleasure for me to introduce Filippo Spiga, who actually started working with me many years ago, during his master's degree and then his PhD. He then moved around the world, from Ireland to the UK, to Cambridge, and he has recently joined ARM. He will talk to us about ARM's plans and about the co-design interaction ARM would like to have with the community in order to plan for the next generation.

Thank you for the invitation. I'm very familiar with this community; I recognize many faces. I was not involved in MaX in my previous job, but I was aware of all the good work coming out of that project. So today I want to talk about what ARM is doing in the context of high-performance computing and, since this session is about co-design, I will stress what ARM is doing, or trying to do, with its partners in this space. The objective of this talk is really to convince you that ARM is doing something extremely important here: it is very active, it is present, and it is somehow a force of change in this environment. Before starting: while I was discussing with my colleagues what to present and which concepts to introduce, I naively assumed that everybody already understands what ARM is and how ARM operates. So, by a show of hands, since you have all had coffee: who knows what ARM is and does? Okay, this is great, so I can skip a couple of marketing slides, but I will still briefly explain what ARM does and how it operates in the context of high-performance computing, in particular the business model, because there is sometimes a bit of confusion between what ARM can give to the community directly and what the partners ARM works with can give to the community.
Because this session is about co-design, and I have been following all the talks today, I want to start with a definition. I googled this and found a paper about what co-design means in the view of the DOE in the US, and I want to point out the part that is probably most relevant to what ARM is and how it operates, which is the second part. I am going to read it, to make sure I stress the concept I want to convey: the co-design strategy is "based on developing partnerships with computer vendors and application scientists and engaging them in a highly collaborative and iterative design process well before a given system is available for commercial use." ARM's business model is to produce intellectual property for building chips, for building CPUs. We have been the market leader in the mobile market; we have a huge share, 85 to 95 percent depending on how you count it, but the reality is that we do not build the silicon. We are a design company from that point of view: we understand how CPUs and memory systems work, we put together these building blocks, and we give them to a partner, someone who implements them in a real CPU, a specific design they can eventually sell. So the business model is essentially to engage with these partners, understand what they want, what they need, what type of workloads they have, and then work with them to create the best design for their use; they then create the product that is sold to someone. To take an example, Intel or NVIDIA produce CPUs and GPUs and sell them to you, and that is perfectly fine. ARM is different: it does not build the physical hardware directly, or at most a few prototypes, but designs it with a partner, who then reaches you.
Nevertheless, we want to be exposed to the final user, because the essence of co-design is to come in early, during the design of these elements, before they are even taped out, before they are even realized. There are different business models behind this, but I work for ARM Research, and I am more interested in the collaborative and partnership part than in the licensing. In the HPC arena, ARM is relatively new. The Mont-Blanc project, which was mentioned earlier, has been the open door for ARM into the HPC world, but what ARM brings into the HPC market can be summarized in three elements. Because of our heritage in the mobile space, ARM designs are extremely power efficient, and over the years we also created the big.LITTLE type of architecture, where there are cores that are throughput optimized and cores that are latency optimized, which can be mixed together in order to tackle specific workloads. To take an example from the mobile space: when you browse on your phone you are probably using a fast core, but when you play a game, when there is some graphics or other compute behind the scenes, you are using a lot of the small cores together, and these are all packed into the same SoC.
What ARM also brings, because of its business model, is choice. Based on the same foundation, which is essentially the ARM architecture, there are multiple vendors producing silicon, and you can buy different solutions from these vendors according to your needs; each may make different choices, being more optimized for cloud, for high-performance computing, or for normal server use. There is also a degree of customization: we do not give a partner a solution that is fixed, we give them building blocks, and they may decide to put these blocks together in different ways and produce a different outcome, or even change some of them and add some of their own capability inside. So there is a high level of flexibility, which of course creates a little bit of uncertainty, because you do not know exactly what you will get, but it also gives you a lot of choice in what you can achieve, if the co-design process is done, I would say, properly. Our strategy in HPC is essentially to see the first ARM-enabled supercomputer, and we want to get there with this idea of co-design and partnership, and also by enabling the ecosystem. We started with several projects, in the US DOE, in Japan, and in Europe, but we also started to look at the software ecosystem, at what was missing in our toolchain and software stack in order to enable high-performance computing applications. We also want to work with you, the people actually writing the applications, because in the end you are the ones who are going to use this technology; so we want to work with the application owners and users, or with the ISVs in the case of commercial applications.

As for ARM's co-design journey, I do not know exactly when it started; I joined the company three weeks ago, so I do not know what happened before that, and since it is a company a lot of things are not entirely public, which is perfectly okay. But I can talk about a few initiatives that are public knowledge, so you can find plenty of material online. In Europe, since this is a European event, Mont-Blanc has been the project that opened the door. It started with mobile parts, very aggressively, more or less at the same time as the first ARMv8 64-bit specification came out, and it has evolved to the point where it is now looking at ARM SoCs that are fully 64-bit, I would say mainstream and ready for production, that is the Cavium ThunderX2, around which the project has built its prototype. They have been building their programming model, they have been looking at libraries and at ecosystem applications, and they have been working with ARM to understand how to put such systems together. In the UK, through one of the recent rounds of e-infrastructure funding, the GW4 consortium in Bristol decided to deploy, instead of a classic system, a system that is fully ARM based, with a specific purpose. Independently of the hardware technology and the vendor, which is now public (it is Cray), the objective of this machine is to compare apples with apples, as far as that is actually possible: by making the system as homogeneous as possible, same vendor, same toolchain, where literally just the blade accommodates a different chip, you can really compare how competitive an ARM-based solution is against an alternative x86-based solution. It will be very interesting to see what this comparison exercise looks like, and, probably through PRACE or even through direct contact, it will be possible to access this platform and play with it. There is something going on outside Europe as well, with different funding routes and probably bigger ambitions: it is public knowledge that there is some involvement in trying to build a
system based on ARM: they already have something in-house, and they want to go further and bigger. There is no specification yet of exactly which ARM implementation they are going to use, but it is going to happen, and they want to target exascale in their own way. The ones who are apparently going to try to reach that scale as quickly as possible are the Japanese. They were using SPARC as a technology, but then they decided, for various reasons, that it was better to rely on an ARM implementation, and so they designed the post-K system as an ARM-enabled machine; they are also dealing with their own interconnect, and so on. The role of ARM in that specific project was to design the Scalable Vector Extension, SVE, and I will briefly mention what it is and how it works later. So, back to co-design: what I want to give you in the next ten minutes is basically a feel for the game, where ARM operates and how it operates, across these levels, starting from the application, because you are application people, down to the architecture and the platform. I will show some results and give some references to what other people have been doing; the slides will be public, so you can follow up and look into the material. But I will not show benchmarks, because even though I have them, I do not want to show them; I want to stay purely on this idea of co-design, that essentially we can work together. In terms of applications, as soon as ARM 64-bit machines started to exist, both public ones and ones still under non-disclosure agreement, ARM started to engage with its partners to compile applications and see whether they work. To do that, we essentially looked at which applications are most commonly used, we tried to build them with the ARM toolchain and with the GCC toolchain, and we set up a GitLab-style repository with descriptions of how to compile codes and how to run them on ARM 64-bit systems, together with some
recipes for compiling the required libraries. To be honest, based on my knowledge, and I have tried some of them myself, there is no particular problem: 99.99 percent of codes work out of the box, because you just compile them and rely on the common libraries; of course we then look at our software stack and make tuning and enhancements where appropriate. Applications are complex; whatever they use, message passing, various libraries, they are code, and we need a set of tools, libraries, and compilers to enable developers to actually do their work. So, I do not remember exactly when, but it was probably last year, ARM acquired Allinea, an HPC company that was building tools for HPC, and we paired that up with the effort we had internally on compilers and libraries, to create a suite for developers to build their applications on current and future ARM 64-bit platforms. Without going into the detail of what the tools do: of course we support a debugger and a profiler, and we support C, C++, and Fortran compilers as well. We are essentially focused on enabling the standards as much as possible, so we have C++14 support, with more planned for the future, and the same for the other languages. Going forward we of course want these compilers to exploit the features that are specific to the ARM 64-bit architecture, and we want to provide performance libraries that are optimized; let me spend a few words on that. The performance libraries we care about, that we all care about, are the classics: BLAS, LAPACK, FFT, whatever it is. Working with partners, some of these libraries will be tuned for specific implementations of the ARM architecture, and this is great: you are going to link against the library and it will work out of the box. And thanks to the Allinea effort and expertise, we also have tools that allow you to identify performance bottlenecks and profiling hotspots, all the things that are developed for parallel scientific
applications. Then there is Fortran: at the very beginning ARM did not have a Fortran frontend for its compiler, and that was a big miss. So we paired up with PGI, we have been working with them, and we integrated the Flang frontend into our compiler, so now we can compile Fortran directly and generate binaries that run on the CPU. So far I have mentioned what we consider the ARM commercial compiler; historically, through Linaro, which is one of the groups we work with on the software ecosystem, a lot of enhancements and improvements are upstreamed to the open-source community, so to GCC, LLVM, OpenBLAS: you can find a lot of libraries that are completely open source, completely free, that already contain optimizations for ARM 64-bit. The reason why at some point we decided to go forward with our own commercial compiler is that if you are selling something you need to provide support; at some point, and I am not blaming anyone, you want to have a reference person you can talk to and say, I have a bug, can you fix it, or, I have a problem, can you support me. This is very important, especially on the commercial side of HPC rather than the academic one. In terms of the future, we will keep adding new things and new optimizations. In terms of hardware, back in 2016 the first SoC that is being adopted by multiple partners was announced, the Cavium ThunderX2; you can Google it and find the specifications. Last year Qualcomm announced another chip, the Centriq, soon to be proven good for HPC, with a lot of 64-bit cores, targeting the hyperscale and HPC markets. These two processors are essentially the ones that will be available this year for anyone who wants to physically play with ARM in servers, in properly deployed servers and systems. Regarding the Cavium ThunderX2, since there are public results, I can say something. What was very important was that, compared to what is on paper, we can actually squeeze a good amount of real performance out of the cores using classic workloads, for example DGEMM. And because we work on our commercial compiler, improving the libraries and the compiler, there was a clear advantage for our compiler, for example in the maximum throughput we can reach on this specific workload; and this is the classic GEMM, because that one is easy. What we did not do ourselves, but the GW4 consortium started to do, because they have this platform, an early test bed before the full system deployment, is to compare apples to apples: they compared Broadwell, Skylake, and ThunderX2 across a set of workloads. These are single-node runs, because, well, that is what is available, and they are almost all mini-apps. Independently of the actual runtimes, it is good to see that, while between Broadwell and Skylake there is of course a bump in performance, in a lot of cases ThunderX2 is very compelling. You can then dig down into the reasons why this happens, whether single-core throughput or memory bandwidth; these are mostly memory-bandwidth-bound benchmarks. But it is a good sign that this thing is real: if you buy it, you can actually use it, and this is great. One of the features that started to make the HPC community even more excited is the Scalable Vector Extension. To make it very short: when you program vector instructions today, in principle down to assembly, you have a fixed vector length, say 256 or 512 bits, and you have to align your data accordingly, or rely on the compiler to handle it. We realized that partners may want to implement vectors in different widths, because they think their customers have different needs, and that is perfectly fine.
But designing an ISA, designing a core, that accommodates every possible vector width is simply too much effort, and it also makes the software side, the compilers and the libraries, more complicated to optimize and build. So we came up with the idea of being vector-length agnostic: the assembly the compiler generates does not have any built-in knowledge of the vector length, but it adapts to the hardware, where a given vendor has decided that the vector length of their chip is going to be, for example, 512 bits. From an application point of view, what does this mean? You can still implement tricks in the software, which is great, and we can still take your workload and analyze how effectively it maps onto SVE by varying the vector length; but in the end, from a code-generation perspective, we can put a lot of the work into the compiler and the libraries, and they will simply work well, or well enough, on multiple implementations, because we do not know which vector length our partners are going to adopt; it is not really under our control. So this concept of vector-length agnosticism is very interesting, and again there is published work where you can look into the details; this has been done in the context of Mont-Blanc. SVE hardware is not available at the moment, so we need to use emulation to understand how workloads actually behave, and here is the idea of co-design: we take some typical small workloads, for example DGEMM and other mini-benchmarks, we vary the vector length of the simulated system in a simulator, and we see how much throughput gain we get in each case. This is good feedback, in terms of co-design, on what you can gain if you make one hardware design decision rather than another. Today, because there is no SVE-enabled hardware yet, what we have at ARM for the people who want to work with us is essentially a compiler plus an emulator.
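To give a feel for what vector-length agnostic means in practice, here is a plain-C sketch; this is my own illustration, not SVE code. The chunk width `vl` stands in for the hardware vector length, is only known at run time, and the tail is handled by shrinking the active part of the last chunk, which is what SVE's predicated loops do in hardware.

```c
#include <stddef.h>

/* Vector-length-agnostic daxpy sketch: y += a*x, processed in chunks
   of `vl` elements.  `vl` models the hardware vector length (e.g. 2,
   4, or 8 doubles on different implementations); the loop never
   hard-codes it, and the final partial chunk is masked off the same
   way an SVE predicate disables the out-of-range lanes. */
void daxpy_vla(size_t n, double a, const double *x, double *y, size_t vl)
{
    for (size_t i = 0; i < n; i += vl) {
        size_t active = (n - i < vl) ? (n - i) : vl; /* "whilelt" predicate */
        for (size_t lane = 0; lane < active; ++lane)
            y[i + lane] += a * x[i + lane];
    }
}
```

The same source works unchanged whether a vendor picks a 128-bit or a 2048-bit vector unit; on real SVE hardware the compiler replaces the inner lane loop with predicated vector instructions.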
These extract a lot of information and heuristics about your code and let you browse the source and the generated assembly to understand where things happen, or where they do not happen, for example where something prevents vectorization. Work very similar to this has been done by colleagues who can talk about it in much more detail; it was presented both at Supercomputing and at HiPEAC last week. Essentially, they took several applications, extracted parts of the workloads, ran our tools, and looked at whether these applications can be mapped onto SVE, once SVE is available, and how much performance improvement can be gained. They used several techniques, I believe both automated and semi-automated, to estimate the performance. To be honest this is all quite low-level compared to general application development, but it is as close to the hardware as possible, and it can then feed into the design of the next generation, so I believe it is very interesting. Just to close: a question people ask me, and were already asking me before I even joined ARM, is, okay, I want to play with ARM, I need an ARM machine; I cannot play with an emulator forever. Machines with ARM 64-bit cores are starting to appear, and more and more will come. Here are three references to machines that already exist and that you can buy off the shelf; you do not have to come to us, and that is the message, we do not sell them, you have to go to one of the vendors, working closely with your centre. And independently of the platform, what is very important is that ARM wants to engage and start to build relationships with the community, because, going back to this co-design idea, we have to feed back into the hardware design.
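On the "what prevents vectorization" point: the classic construct such tools flag is a loop-carried dependency. The two functions below are a hypothetical illustration of the difference, written by me rather than taken from the analyses mentioned: the first cannot be vectorized naively because each iteration reads the previous iteration's result, while the second is independent across iterations and maps directly onto vector, and eventually SVE, instructions.

```c
#include <stddef.h>

/* In-place prefix sum: a[i] depends on a[i-1], a loop-carried
   dependency, so straightforward vectorization of this loop is
   invalid; a vectorization report would mark it as not vectorized. */
void prefix_sum(size_t n, double *a)
{
    for (size_t i = 1; i < n; ++i)
        a[i] += a[i - 1];
}

/* Independent iterations: trivially vectorizable at any vector
   length, which is exactly what an SVE mapping analysis looks for. */
void scale(size_t n, double *a, double s)
{
    for (size_t i = 0; i < n; ++i)
        a[i] *= s;
}
```

Compilers can report this difference directly; for example GCC's `-fopt-info-vec-missed` will tell you why a loop like the first one was left scalar.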
And since we, and I mean not just me but the entire company, are hardware architects, we want to understand better what the final users want, in order to make hardware design choices that will impact our partners and therefore our users. To do that we are running multiple events where people talk about their experience with ARM, for example at the major supercomputing conferences; we have a Google group, and we have some web pages where we collect feedback and describe how to build and run applications, and we are happy to receive contributions as well. Our role is essentially to talk with you, to help if you hit any problem compiling and running, to isolate what is really representative of your applications so that we can reuse it for the next stages, and eventually also to help with testing, making sure there is continuous coverage by our tools and our compilers as your codes evolve. And, yes, that is all; if you have any questions, I am over here. Thank you.