Hi, good morning. It's a great day in San Diego; a little cloudy, but a lot warmer, and a little cooler than Tokyo or Kobe, where I'm from. It's a pleasure to be able to speak to you and introduce some of our efforts regarding what will probably be the biggest Arm machine ever, or the biggest supercomputer ever created in the world. Before I go on, to tell you who we are: I'm the director of one of the 15 research centers within RIKEN. RIKEN, if you don't know it, is the biggest national research lab in Japan. It's a public lab, so it's like Max Planck in Europe, or maybe like one of the DOE labs in the US. We've got thousands of people and researchers. But again, we are one of the 15 research centers in RIKEN; the other centers do things like biology and physics, and we are the center for computation. What we do, in summary, is study computing, and computing at the high end. Our slogan is "science of computing, by computing, and for computing," and those of you who have studied US history probably know where the paraphrase came from. Our subject matter is computing. We also apply computing to other branches of science, and we collaborate with other disciplines of science where that allows us to accelerate computing. That's our whole agenda, okay? So we're a research lab. As for where we're situated: RIKEN headquarters is actually in the suburbs of Tokyo, but our center is located near Kobe, and we all know what Kobe is good for. Well, technically, the cattle are not raised in the city of Kobe; the city of Kobe is urban, so you don't see any cows, although we do have lots of wild boars roaming around from the mountains for some reason. We're located in a big building right at the tip of a very large landfill island. There's another landfill island with the airport, so we're just one station away from the airport.
So if you ever want to visit, let us know, and once we have the new machine, we can take you on a tour. Like I said, the mission of our center is to do research, so we publish papers and do all kinds of stuff; we have hundreds of researchers in the area. But the other responsibility we have is to design, build, and operate the flagship supercomputer in Japan. There are many other supercomputers in Japan, and there's a coalition of supercomputing centers, much like XSEDE, if you know it. Actually SDSC, which is at UCSD, is one of the XSEDE centers, for example. Just like that, we have what we call the HPCI centers in Japan, but the difference is that we are the flagship center: we're tier one, and all the other centers in Japan are tier two. Being the tier-one center, up until very recently we built and operated the K Computer, which was the flagship machine for Japan. And it was the fastest supercomputer in the world by various metrics, not only in the Top500 ranking, which is the most famous, but other rankings too, like HPCG, where you do a conjugate gradient, or things like Graph500, where you do graph analytics. We also won many accolades in applications with the machine, like the Gordon Bell Prize. And in fact, we just retired K; there was a big retirement ceremony in August. It retired as the incumbent number one in graph processing, superseding every other supercomputer or cloud or any other infrastructure in being able to solve the world's largest graph problems, like you do in social networks. So the machine retired with lots of accolades, with coverage by all the TV stations and news; it became a national event. But why did we retire the machine?
Well, to move forward to the next one, which we have been planning for many years. The new machine is Fugaku. Fugaku is the alternative name for Mount Fuji; those of you who can read Chinese or Japanese characters can see that this is Fugaku. So it's a new supercomputer we're building. It's Arm-based, and that's why I'm here. But it was built with two very large objectives in mind. One is to achieve the highest peak in the world: to run these flagship applications with extreme performance, and to solve problems that can only be solved by computers of this magnitude. The other objective, which we did not fulfill with K, is to have a very broad base, very broad adoption. And not just of the machine itself, but a very broad proliferation of the technology we have developed into the Arm ecosystem. So more users; not just more capacity, but a broader set of applications than just traditional simulation-based applications: data science, artificial intelligence, and so forth. And a much broader user base, not just in academia but in industry, and even having these technologies, the actual processors, go into clouds for people to use, to become the server end of IoT applications. Okay, so this broad adoption is the other objective. And if you look at Fuji, it's like that: it has the highest peak in Japan, but as a standalone mountain it also has a very beautiful, broad base. That's our ideal. So for the remainder of the time, if you're reading email on your smartphone or doing stuff like that, this is the one key slide; the rest is just explanation of it. What we have done is build a new processor, the A64FX, with Fujitsu. And it's a new processor.
It has nothing to do with any existing Arm IP, except for the instruction set; it's a brand new processor. It's a standard CPU, but it's been optimized for HPC workloads. Most other processors, including Xeons and so forth, have HPC capabilities, but they're not optimized for HPC. I'm sorry, but this is a fact. They can run HPC workloads well, but the main point of their optimization is, let's say, to sit in the cloud running your notebooks or streaming video. If that's your workload, that's the processor you buy, and that's why Intel sells so many processors into the market. But this processor is entirely HPC optimized. It has extremely high memory bandwidth compared to standard CPUs, over an order of magnitude higher, and memory bandwidth is very important for HPC applications and also for many other apps. It has fairly high flops, about three teraflops peak. It has extremely high embedded network bandwidth: every chip has 400 gigabits worth of bandwidth going in and out through the interconnect. Think of every node in your cloud having 400G networking; this is the equivalent of that. And there is various AI support: just like GPUs, we have FP16 and INT8 support. But it's still a general-purpose CPU. It's not like a GPU; it's not a special-purpose accelerator; it's still a very generic CPU. It doesn't have an embedded GPU, just standard Arm cores implementing the Armv8.2-A instruction set with the SVE extension. And the entire chip can run, for example, Red Hat: when we got the chip back from the fab for the first time, it just booted RHEL out of the box. No modifications, everything just booted right there.
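To make the memory-bandwidth point concrete, here is a small back-of-the-envelope sketch. The ~3 TFLOP/s peak is from the talk; the ~1 TB/s HBM2 figure matches the Volta-class bandwidth mentioned later, while the server-CPU numbers are illustrative assumptions, not disclosed specs.

```python
# Rough machine-balance (bytes-per-flop) comparison.
# Numbers are illustrative: ~1 TB/s HBM2 and ~3 TFLOP/s for A64FX
# (as stated in the talk), ~128 GB/s DDR4 and ~2 TFLOP/s assumed
# for a generic high-end server CPU socket.
def bytes_per_flop(mem_bw_gbs, peak_gflops):
    """Sustainable bytes of memory traffic per floating-point operation."""
    return mem_bw_gbs / peak_gflops

a64fx = bytes_per_flop(1024, 3072)   # HBM2-class bandwidth
server = bytes_per_flop(128, 2000)   # conventional DDR4 socket (assumed)

print(f"A64FX : {a64fx:.3f} B/F")
print(f"server: {server:.3f} B/F")
```

The ratio, not the absolute numbers, is the point: an HBM-fed chip can feed its flops with several times more memory traffic per operation, which is what bandwidth-bound HPC codes need.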
And it can run Windows, with a little bit of driver support, so it can even run Word. So it will probably be the world's fastest Word processor. Now, processor TDPs have been rising: on the server side you see processors at 200 watts or even 220 watts in recent announcements by AMD and so forth, and some of Intel's high-end offerings are hundreds of watts. I can't disclose our power numbers yet, very soon, but this processor is also very, very power efficient, extremely power efficient, much lower than its competing tier. Combined with the higher performance, in terms of power performance, on some of the applications we have seen it's over an order of magnitude faster in application performance on a per-chip basis compared to, let's say, a high-end Xeon. And then we take that processor, which of course has been built to scale, and build a massively parallel machine. We're building a machine with more than 150,000 of these nodes integrated into a single machine. So we get over 100 petabytes per second of aggregate memory bandwidth. In terms of the number of nodes, it exceeds Sequoia, which was a Blue Gene machine, with each processor of course being much more powerful. And we get 60 petabits of aggregate injection bandwidth into this interconnect, which is about 10 times all the cloud vendors' internal data-center traffic aggregated; it's an order of magnitude higher capability. That's the amount of interconnect bandwidth we bring to the system. Then we have petabytes of NVM storage, and an I/O network with up to 10,000 endpoints; we'll deploy a little less than that, but that's the capability.
So it will not be an exaflop machine per se, but in terms of the capability everyone in the HPC community had in mind when they started thinking about what an exascale machine could be, this is the first machine in the world to achieve the level of performance expected at exascale. But of course, we want the technology to proliferate, like I told you. We have been working on this for many years; I think they'll publish the slides, so you can look at the details. In fact, we started working on this machine when K was deployed in 2011, so we've been working on it for almost eight or nine years, a very long time, but we're finally at the very last stage of being able to deploy it. How did we achieve this? Well, it was a very deliberate co-design process involving a lot of the advanced applications that were running on K as well as on the other supercomputers we had. We got the entire HPC application community in Japan involved in doing co-design with us, with our center as well as Fujitsu, so there was a very tight relationship. There's a whole talk about this, and I won't go into detail today, but basically, in this tight collaboration, we picked nine representative applications from nine areas designated as priority issues. There were of course application projects in each of the nine areas; if you're familiar with DOE's co-design centers, it's very similar. We had areas like pharmaceuticals; disaster and environment; energy storage, delivery, and generation. We had materials science and manufacturing, and also more traditional macroscopic-scale manufacturing like turbines, airplanes, and cars. And then we had some basic science.
All of these applications are of high interest to society. So we took nine applications from them, and for these applications, we asked all the scientists: what's your goal in five or ten years' time, what are your computational requirements, what algorithms are you using, how do you stay competitive in the field, and so forth. With everything combined, the goal was set to have some of the applications run more than 100 times faster than on K, and on average something comparable to that, like 30, 40, or 50 times faster than K. That was the co-design target, and as I'll tell you later, we have high expectations of reaching it. So now you know why we decommissioned K: the next machine will be up to 100 times faster, so there's no point in keeping the old machine around anymore. So what are the apps? Well, I can give you a few samples; there are a lot of comprehensive examples. Oops, let me turn off the sound. So this is a multi-scale proteomics simulation. Usually, molecular dynamics simulation is the stuff you see on the right-hand side: you have a single protein molecule and then you simulate, say, folding. But in reality, when you want to do, let's say, drug design, you really need to think of the holistic environment. You need to think of the membranes, the environment inside the cell. You really need to calculate all the interactions of the proteins themselves, which are of course composed of all the atoms. So it turns into a multi-scale proteomics simulation involving thousands of proteins or even more, and you need to be able to solve this entire system. You can't do that without having a very large machine. Another example would be whole-globe atmospheric simulation. And you know, at least in Japan we believe in global warming. Yeah, we do. We think it's an important societal problem.
So of course you need to be able to simulate various scenarios, but for that you ideally need sub-kilometer-scale resolution in your simulation. Otherwise, because of the physics, you can't simulate the clouds directly: if you have a coarser mesh, like five kilometers, then the clouds have to be approximated, and they're not exact. Being at sub-kilometer resolution allows you to do this type of simulation faithfully, and thus you can simulate typhoons and all that directly. But of course, to integrate this over long time scales, 50 years, 100 years, you need really big machines. We achieved high resolution on K, but in order to do real science we need at least two more orders of magnitude of computing power, and we'll achieve that with Fugaku. Then there's health: we can now simulate the heart directly. Cardiovascular disease is one of the biggest killers in the US, but in the very near future we'll have the capability to simulate the entire heart; in fact, we can do it now. This allows you to do various what-if analyses: what if there's an anomaly, what if you do surgery, how do you diagnose these things, how do you do the correct surgery, which drugs or other interventions are most effective. And this is a very exact simulation. You're simulating the cells themselves, you're simulating the blood flow, you're simulating all the electrical pulses; it's a full multi-physics simulation. Again, this is only possible with very large machines. And finally, something that has to do with IoT, though this is IoT at macroscopic scale. We have all these advanced weather radars, like phased-array radars. But what we really want to do, and this has also been advocated recently by some people in the IoT community, is the following.
For example, one of my old friends, Rich Wolski at Santa Barbara, is doing lots of IoT stuff. But given his HPC background, he says you really need HPC: once you have the data, you need to simulate things. What you really do is take the incoming data stream and the simulation and try to couple them together, and that's called data assimilation. To simulate anything like weather, you take the kind of simulation you just saw in climate, and then you need to assimilate it with terabytes of incoming data. And that's not a trivial problem. This is IoT at very, very large scale: not only do you need methods to acquire the data, you also need ways to utilize it. To do really advanced stuff, it's not just analytics; you need to couple that with simulation. Analytics plus simulation, that's really important. So with that in mind, again, we designed this chip. There was a lot of stuff we did right with K, so we inherited that. It's a large machine but very easy to program: basically, every program that ran on K runs on this machine without modification. We also retained the bytes-per-flop ratio. A lot of the new modern designs compromise memory bandwidth just to get the superficial flops up, but we didn't do that; we retained the bytes per flop. So there's a lot of ease of use along with the hundred-times speedup. However, we also put new innovations into the machine because, like I said, for proliferation we really needed broad adoption. One difficulty with K was that the processor was kind of subpar. Well, it was not bad, but compared to, let's say, Intel offerings of the same era, the processor was not slower, but by and large the same performance; we tested a lot of applications. And that really has no compelling market value, right?
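The coupling of a live data stream with a running simulation described above can be sketched minimally. This is a toy nudging scheme, the simplest form of data assimilation; the model, the observation noise, and the fixed gain are all illustrative stand-ins for what a real weather system would use (operational codes use far more elaborate ensemble Kalman filters).

```python
import numpy as np

# Toy data assimilation by "nudging": a forecast model is relaxed toward
# noisy incoming observations at every step, so the simulated state tracks
# the (unknown) true state. All dynamics and constants are illustrative.

def model_step(x, dt=0.1):
    """Toy dynamics: simple linear decay toward zero."""
    return x - dt * x

def assimilate(forecast, observation, gain=0.3):
    """Nudge the forecast toward the observation with a fixed gain."""
    return forecast + gain * (observation - forecast)

rng = np.random.default_rng(0)
truth, state = 1.0, 0.0            # true signal vs. model state (wrong start)
for _ in range(50):
    truth = model_step(truth)       # "reality" evolves
    state = model_step(state)       # forecast evolves in parallel
    obs = truth + rng.normal(0, 0.01)   # noisy sensor reading arrives
    state = assimilate(state, obs)      # couple data stream with simulation

print(f"truth={truth:.4f}  assimilated={state:.4f}")
```

Even starting from a completely wrong initial state, the assimilated run converges onto the truth; scaling this idea to terabytes of radar data coupled with a global atmospheric model is what demands HPC.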
Because, and by the way, K ran on SPARC; anybody who has used SPARCstations is very familiar with what it was, but it was sort of a dying ecosystem. So if you have a SPARC processor and an x86 processor and they perform by and large the same, there's no compelling reason to buy the SPARC one. You need some compelling reason, which is performance: how much higher performance will get you over the edge to adopt the new infrastructure. It also needs to be very green, because data centers these days are very power limited, and being eco-friendly is very important, not just for cost but for environmental responsibility. It's not just about putting data centers up north where electricity is cheap; you may be using carbon-neutral power sources there, but then other people are just using coal to fill the gap. You really need to build something ecologically friendly as IT infrastructure itself, not move your data center somewhere to make it look green. And of course, it has to be part of the global Arm ecosystem, a global software ecosystem. Arm, for us, was the primary target, because Arm ships more than 20 billion chips each year. So this chip will sit at the pinnacle of the overall Arm ecosystem. What we call Society 5.0 in Japan is a new age of IT, like Neoverse in Arm terminology; it's largely the same thing. It's the combination of simulation with big data, AI, everything, and security, including blockchain. And these processors need to be the power source of that. So that's the chip we developed. And by the way, it's real. In fact, it's working; it's being mass-produced right now in Taiwan. It's a seven-nanometer chip, and it has high-bandwidth memory. There's a SerDes block with lots of lanes for high-bandwidth output into the network, and also PCIe and so forth.
You see the chip here at the center, and you see the HBM stacks right next to it on a silicon interposer. It actually has 52 Arm cores inside. I won't go into the details of the architecture; if you're interested, there are several materials published by Fujitsu and several talks by me and others. But basically, we have what are called CMGs, core memory groups, four of them on a chip, connected by a cache-coherent, high-bandwidth on-chip interconnect. Each of these CMGs has 13 cores, 12 of which are user cores, and the cores within a CMG are connected by a crossbar, so there's extremely high bandwidth between the cores. Then there's a level-two cache, and then there's a memory controller. If you're into architecture, this kind of configuration looks very familiar. It doesn't look like a CPU; I mean, the cores are CPU cores, but the configuration doesn't look like a CPU. In fact, it looks very much like a GPU, if you know processor architectures, and in fact, that's exactly what it is. So on one hand, one way to look at the A64FX is that it's a many-core Arm CPU with all the intrinsic CPU properties. It runs all the standard code; it's many-core, the single-threaded performance is okay, and it gets very high performance due to the high bandwidth and vectorization, but each core is still a CPU core. Architecturally, though, the underpinnings of the processor, the interconnects and everything, plus the vectorization engine, make the processor also behave like a GPU. It has very high data-streaming and processing capability. In fact, it has the same class of memory subsystem as, let's say, Volta: a terabyte per second of memory bandwidth. It has the same sort of very wide vector lanes. And it has other memory and streaming features, like cache sector partitioning, very fast synchronization, and fast scatter/gather support.
And it has other performance-enhancing features for synchronizing all the cores and so forth. Again, these are features you find in GPUs, and you find them also in this chip. Some people ask the question: is there a GPU embedded? No, there's no GPU embedded in the chip. It's CPU cores, but the aggregate underpinnings and the memory subsystem make it behave like a GPU as well as a CPU. That's the beauty of the chip. So what's the real performance? Well, the Himeno benchmark is a very famous CFD (computational fluid dynamics) benchmark. It's a micro-benchmark, so the scores may be a little inflated, but it's a benchmark that everybody runs, so we did too. So this is a Xeon; by the way, this is a Skylake Platinum, and it's two sockets, so it's two chips, okay? So single-chip performance is about half of this. And this is the performance of the A64FX CPU. Compared to a dual-socket Xeon node, it's about four times faster, okay? And even compared to Volta, or the SX-Aurora, which is a dedicated vector chip, it's faster than they are. So on a per-socket basis, this chip is about eight times faster than a Xeon on this CFD workload, and by the way, at lower power. Now, that was a rather synthetic workload, and on a real workload the differences become smaller in some cases. This is WRF. WRF is a very famous weather code in the US. With a real weather code, you have other things like boundary conditions and physics, which are more compute bound, and on compute-bound workloads Xeons actually do, and will do, a lot better. But largely, even so, against the dual socket this single-socket CPU is about 56% faster, or about three times faster on a per-socket basis. And even on a per-socket basis, this CPU is much lower power. So that's the performance.
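A stripped-down sketch of the kind of kernel Himeno measures may help: the real benchmark is a 19-point Jacobi solver for a 3-D pressure Poisson equation, but even this 2-D five-point version (an illustrative simplification, not the actual benchmark code) shows the shape of the workload: each point update reads several neighbors and does only a handful of flops, which is exactly why such kernels are memory-bandwidth bound.

```python
import numpy as np

# Minimal Jacobi relaxation in the spirit of the Himeno benchmark.
# Each sweep streams the whole grid through memory while doing very
# few flops per byte, so performance tracks memory bandwidth.

def jacobi_step(p):
    """One 5-point Jacobi sweep over the interior of a 2-D grid."""
    new = p.copy()
    new[1:-1, 1:-1] = 0.25 * (p[:-2, 1:-1] + p[2:, 1:-1] +
                              p[1:-1, :-2] + p[1:-1, 2:])
    return new

p = np.zeros((64, 64))
p[0, :] = 1.0                  # fixed "hot" boundary condition
for _ in range(500):
    p = jacobi_step(p)

print(f"centre value after 500 sweeps: {p[32, 32]:.4f}")
```

Counting the traffic: one sweep over an n-point grid moves on the order of 2n values (read plus write) for about 4n flops, i.e. several bytes per flop, which is why a chip with a terabyte per second of bandwidth pulls far ahead on this class of code.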
And it's integrated this way. Under our configuration (there are other configurations, like air-cooled ones), we have this liquid-cooled package, about 384 nodes per rack, because of our scale. You see there are no DIMMs, because all the memory is integrated in the chip package. Then we interconnect that with the Tofu network, which has about 40 gigabytes per second of injection bandwidth per node, and the latency is sub-microsecond, about 0.5 microseconds from node to node, which is far lower latency than Ethernet. So we connect that, and, let me skip over this, we integrate it in a hierarchical fashion and get 150K nodes. So right now, here, we've ended K Computer operations; by December we start the installation, and by the first half of next year we will have the whole machine. If you put this together: we did a lot of NRE, of course, we paid a lot of money for it, but overall it paid off, because had we built this out of off-the-shelf chips, the machine would cost three times more, be three times bigger, and use three times more electricity. Because we designed the chip from the ground up for HPC, because we designed it to be power efficient, because we designed it to be integrated, we got our efficiency. It was a lot of years of work, but we got this. Just to put this in perspective, compared to, let's say, my iPhone XS Max, this machine as a whole is about 20 million times faster for HPC workloads. That's comparable to all the cell phones shipped in Japan over a year. So think of all the cell phones in a big country being concentrated in a single room; that's the magnitude of the scale of this machine. And we did reach 100 times in some of the apps; some are like 30 or 40, but we know that we will be reaching 100 times.
But again, the software base is generic Arm, plus SVE for vectorization. Everything you'd expect to run on, let's say, your x86 cluster will run on Arm if it can just be recompiled or adjusted a little bit. And there's been a lot of precedent work proving that the Arm ecosystem is credible for HPC; a lot of people have done this work, say on ThunderX2, and they have shown it to be credible. But now everything will run much faster. We're porting many HPC applications, but eventually we really want this to go into the cloud. First by providing our services as a back end in alliance with the cloud vendors, that's phase one. But in phase two, once we get adoption, we really want these chips to go into the clouds directly, to offer HPC, AI, and data analytics services at high performance, combining the cloud with the edge. So you'd have the same Arm running on the edge, and the highest-performing processor of all time running in the cloud, conjoined. And there is already uptake: it was announced by NSF that Stony Brook is buying an A64FX-based system in the US, a Cray system. Now, Cray has not made any formal announcement, Fujitsu has not made any formal announcement, and we have not made any formal announcement; it's an announcement by NSF, okay? So bear that in mind, but the announcement says there will be an A64FX-based Cray system installed at Stony Brook early next year. In the interest of time, I'll skip most of the AI material, because there will be a talk by my researcher, Alex. Alex, raise your hand, yeah, the tall guy. Alex will give a talk detailing our AI plans, especially in training: how we can utilize the power of the A64FX for AI training, which is very much like using GPUs.
So I won't go into the details of that, since I'm running out of time. But overall, we believe the A64FX will be a credible platform for AI training, simply because each chip already has AI acceleration capabilities like low-precision support, very high flop performance, and high memory bandwidth, and the interconnect is also very fast. Nowadays the real trend is scalable training, because the networks are getting much more complicated. In fact, compared to the past, when we saw the first AlexNet in the early 2010s, some of the compute demands are six orders of magnitude higher. So high-performance parallelization is the only way to do the training in a reasonable amount of time, and we believe the machine is very fit for that. But again, if you're interested, I'll leave the details of what we're doing to Alex. Overall, we have lots of AI machines now, some of which I have built in Japan, but there is very little public AI infrastructure. The reason HPC has been successful over many years is that in every country, in the US, Japan, Europe, and recently China, there are public paths to access these large machines; HPC's success has rested on that since the 1960s. Now with AI, like I said, the real challenge is how to make these complicated deep learning networks perform and train them, and you really need a big machine to do that. But can you get access to a big machine? In HPC, for simulation workloads, yes. But for AI, there is very little. So we've been building mid-sized systems, and some are very successful, like the ABCI system I built last year with 4,000 Voltas. In the US, just about the only public, large-scale AI machine is Oak Ridge's Summit, which actually has 27,000 Voltas. Very, very powerful; there have been lots of great results.
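The scalable-training point above boils down to data parallelism: each node computes a gradient on its own shard of data, and an all-reduce averages the gradients so every node applies the same update. Here is a minimal in-process sketch; the "nodes" are simulated shards and the model is a toy least-squares fit (on Fugaku the averaging step would actually ride on the Tofu interconnect via MPI or similar, which this sketch does not use).

```python
import numpy as np

# Data-parallel training sketch: 8 simulated "nodes" each compute a
# local gradient on their shard; np.mean plays the role of the
# all-reduce that a real cluster would perform over the network.

def local_gradient(w, X, y):
    """Least-squares gradient on one node's shard of data."""
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(400, 2))
y = X @ w_true                               # exact synthetic labels

shards = np.array_split(np.arange(400), 8)   # 8 simulated nodes
w = np.zeros(2)
for _ in range(200):
    grads = [local_gradient(w, X[s], y[s]) for s in shards]
    g = np.mean(grads, axis=0)               # the "all-reduce" step
    w -= 0.1 * g                             # identical update everywhere

print("recovered weights:", np.round(w, 3))
```

Because every node ends each step with the same weights, adding nodes shrinks the per-node data without changing the result; the cost that grows is the all-reduce itself, which is why a fast, low-latency interconnect is the enabling piece for large-scale training.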
But most of the results in the field are dominated by the big players like Google and Amazon, who have comparable infrastructure, but theirs is not open to the public, or it's very expensive when you buy it as cloud services. So this kind of resource is very important. Compared to Oak Ridge, the aggregate of all our machines has only about one-fourth to one-fifth of the performance; but once we launch Fugaku, our capability will be higher than Summit's. So not only in simulation workloads, but we expect to be able to drive the forefront of AI, which of course will be combined with the edge to drive the next generation of applications. So it's not just about simulation, and it's not just about simple analytics. It's really about scaling to tackle next-generation problems that are ever more complicated, in what some people call the Neoverse in Arm terms, or what we call Society 5.0. The large machine we have built, at the very pinnacle of the Arm ecosystem, will be one of the resources to do that, but we also expect to see very wide proliferation, including these chips being embedded directly into the cloud. Okay, with that, I'd like to end my talk, and thank you very much. Thank you so much. Thank you so much for joining us this morning. We really appreciate you coming here and sharing this with us. Really super interesting stuff. I wish we had...