Welcome to another edition of RCE. Again, this is Brock Palen, along with Jeff Squyres from Cisco Systems, one of the developers of Open MPI. Jeff, thanks again for your time.

Hey, Brock. We're verging onto my territory again today: we're talking to some MPI guys and some big parallel computing guys here, so this will be an interesting one.

Yeah. Before we get into that, you can find us online at rce-cast.com. You can find all the old shows there; subscribe by iTunes, RSS, etc. Our guests today are two guys who worked on the K computer. You may have heard of it: the K computer is currently the fastest machine in the world, with an absolutely massive core count. Our two guests are Ko Hotta and Shinji Sumimoto, and I'm going to let them introduce themselves and say exactly what their roles are. Ko and Shinji, thanks for taking time for the show.

It's my pleasure.

So why don't you give us a little bit of what your roles are?

My name is Ko Hotta, and I'm in charge of software development for the K computer and Fujitsu's future supercomputing. In particular, I have experience in compiler development, and also in OpenMP; I'm a director of the OpenMP ARB.

Hi, my name is Shinji Sumimoto. I'm in the same division as Hotta-san, and I'm in charge of development of our MPI and high-performance communication library. I'm also in charge of developing the file system, and I'm a member of the MPI Forum, where Jeff is well known; we have met many times.

I should mention, too, that Shinji and I are on the MPI Forum together, so we see each other every two months or so at these forum meetings, and have for quite a long time now, actually.

So let's go ahead and start with a good general question: what is the K computer?

The name "K computer" is a nickname for the next-generation supercomputer built by RIKEN and Fujitsu.
This is a national project of the Japanese government, under the Science and Technology Basic Plan; "K" itself is the name of the national project, and Fujitsu is in charge of the development. The K computer uses SPARC processors, and more than 80,000 processors are combined in one system — a very huge computer. The network is Fujitsu's original design for the K computer: a 3D torus network is used. Development will be finished in March 2012, but maybe 80 percent of the system is already developed now.

When you say it's 80 percent developed — the hardware is all in place, and you're talking about the special software stack to support such a large system?

Yes. The special software stack means, first, that the operating system is Linux; Linux runs on the K computer. The compiler supports Fortran, C, and C++. The MPI used on that system is matched to the Tofu interconnect. It is based on Open MPI, which we tuned to the K computer, and we had a lot of issues: for example, the system is so huge that we needed to reduce the overhead of the operating system, and we have supporting facilities such as the job scheduler and so on. These are enhanced, tuned versions of the facilities from our previous supercomputer systems. Also, the eight-core SPARC processor is new for this K computer, and our compiler has an automatic parallelization facility for the eight-core processor. So thread parallelization is done by automatic parallelization, and also by OpenMP, and by hybrid programming.
So thread-and-process MPI hybrid programming is achieved by this system. For easy programming, we enhanced our automatic parallelization, because programmers don't want to think about two levels of parallelization — thread-parallel and process-parallel. Because of the automatic parallelization, the programmer does not need to think about thread parallelization. This became possible not only through compiler optimization but also through hardware facilities: the processor has a shared cache and a high-speed hardware barrier that make it work. That is an overview of the K computer.

So who are the target users for the K computer, and what kind of applications will be running on this machine?

The K computer is developed to be fairly general-purpose. We have five kinds of target users: one is life science; the next, energy research; then disaster prevention; supporting manufacturing; and astronomy and astrophysics. Those five areas are the targets of the K computer right now, but it is very general-purpose. Maybe most of the users are in research areas — academic and research laboratories — but we are also expecting that manufacturers and industry people will use the K computer as well.

So the K computer is very, very large by core count, and it has a very specialized network to enable MPI scaling. Do you expect the system to actually run a very large single MPI job, or do you expect it more often to be running several or dozens of smaller jobs simultaneously?

In practice, I don't think a single MPI program will use all 80,000 processors simultaneously in one job very often. It is of course possible, and that is the ultimate target usage of the K computer. But most of the time, several or dozens of programs will run simultaneously on the system. The K computer has a partitioning facility.
The K computer system can be divided into independent partitions for dozens of programs, and each partition is independent in operation.

So was there a reason to build this as a single large machine rather than several smaller clusters?

Yes. Of course, if it were only ever used for smaller programs, several smaller systems might have been fine. But the target areas — life science, energy research, and disaster prevention such as tsunami simulation, life simulation, and so on — need a huge, high-speed system, meaning a 10-petaflops system. Those areas need the full system. In parallel with the K computer's development, application people are developing their programs for it, and in the very near future, after the K computer system is completed, some programs will use the full system at one time. Of course, there is a strategy for how much time the full system is used and how much time separate partitions are used, so we have the opportunity to choose between partitioning and the full system.

So the K computer is an absolutely massive system — I mean, it's the largest publicly acknowledged machine ever built. What sort of challenges did you run into, both building such a physically large system and managing such a physically large system?

Runtime management was a challenge, because the system itself is huge. One issue is reliability, especially with respect to the K computer's Tofu network. In the case of a partitioned job, even if some nodes are damaged and cannot run, the program can still use other nodes, because the network has a redundant structure; the program can run on other, redundant nodes. That was one of the challenges. The other one, as I have just mentioned, is hybrid programming — thread-and-process programming.
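The hybrid thread-plus-process style Hotta describes can be sketched with a minimal, generic MPI-plus-OpenMP program. This is an editor's illustration of the model only, assuming a standard `mpicc`/OpenMP toolchain; it is not code from the K computer's software stack:

```c
/* Minimal hybrid MPI + OpenMP sketch: MPI for process-level parallelism
 * across nodes, OpenMP threads within a node. Generic illustration only. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, nprocs;

    /* FUNNELED: only the main thread will make MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Thread-level parallelism inside the node. On the K computer this is
     * the kind of loop the Fujitsu compiler parallelizes automatically
     * across the eight SPARC64 cores, so the pragma would be unnecessary. */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 1; i <= 1000000; i++)
        local += 1.0 / (double)(i + rank);

    /* Process-level parallelism across nodes. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f across %d ranks, %d threads each\n",
               global, nprocs, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```

Built with something like `mpicc -fopenmp hybrid.c`. The point Hotta makes is that on the K computer the programmer can drop the inner-level annotation entirely and let automatic parallelization handle the threads.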
It was a challenge to enable a very simple programming style, where the programmer can ignore the existence of threads, because threads can be used automatically by the compiler. Those were the software challenges; of course the hardware had a lot of challenges too. I also have to note the detection of failed nodes: almost all failures are automatically detected by the job scheduling system. The job scheduler automatically eliminates failed nodes, builds a 3D torus that works around them, and automatically resubmits the job — the whole system handles such nodes automatically. In such a large system, it is very difficult to understand which nodes have gone wrong, so we designed for this pattern and built it into our system software: the job scheduler, the MPI runtime system, the job dispatch system, and also the user environment. That is a feature of our system.

So Shinji, what kind of challenges did you run into in scaling MPI up to this size?

Okay. As Jeff already knows, Open MPI has to run over something like 60,000 nodes, and there were many challenges. One is memory consumption; the other is collective communication. We added some modifications to the Open MPI platform for those.

Can you describe any of those, in general or in specifics? Clearly, I don't want you to give away any proprietary stuff that you're doing there, but in general, what were the issues that you ran into?

Okay. The current Open MPI stack has three layers: PML, BML, and BTL. The Tofu network support is implemented in a Tofu BTL, but going through the PML and BML adds overhead. For low-latency communication,
we added a Tofu low-latency path that bypasses the PML and BML. In some cases messages still pass through the PML and BML, but the fast path realizes low-latency communication. The other area is collective communication. The current Open MPI stack only provides a trunking method for multiple network interface cards, but Tofu has four network interface cards on the chip, and our network is a six-dimensional torus. We wanted to make effective use of the multiple network interfaces across the multiple dimensions of the network, on top of the multiple links into each card, so we implemented collective communication algorithms for that. Those were our challenges. Also, on a 60,000-node system, the current MPI consumes memory in proportion to the number of ranks, so we eliminated control structures and made sure that per-rank memory consumption is minimized. We have some modifications there as well.

So the Tofu network itself, you said, is a six-dimensional torus. What kind of bandwidth are you seeing between links?

Each link has 5 gigabytes per second per direction, so 10 gigabytes per second bidirectional. Tofu has four network interfaces on each node, so it provides 20 — bidirectionally, 40 — gigabytes per second of injection bandwidth, and the interconnect controller has 10 external links, which provide 100 gigabytes per second through the node. That's very high bandwidth, and the network bytes-per-flop ratio is around 0.25, which is very high compared with other dedicated interconnection networks.

Did you say 100 gigabytes to each node?

Yes.

Wow, that's impressive. How do you feed that? Are you using PCI
technology between the network interface card and RAM, or do you have something that can go much faster than that, to feed all 40 or 100 gigs of network bandwidth that you've got?

Okay. The 100 gigabytes per second is the link bandwidth over the 10 links; the node interface itself is limited to 40 gigabytes per second. But such high bandwidth cannot be realized with current PCI-attached networks — it's very limited. For example, with current-generation InfiniBand, one QDR link is around 3.2 gigabytes per second, and the next generation maybe 5 or 6 gigabytes per second per link. In that case you would need maybe four to six PCI Express x8 connections — to realize this with InfiniBand, maybe four to six network interface cards on one node. That is what it would take with a commodity network. Instead, we realized the network on one chip, the interconnect controller, and we developed the SPARC64 VIIIfx CPU; those two chips, plus eight DIMMs, make one node. It's a very small, compact implementation of a high-bandwidth, low-latency network. That is why we developed a dedicated interconnect.

Okay, and you say eight DIMMs; how much RAM are you talking?

Maybe 16 gigabytes per node, and the memory bandwidth is 64 gigabytes per second.

So — I'm sorry, I was just thinking there. So 16 gigabytes per node; that's 2 gigabytes per core, if I recall you said eight cores?

Yes.

All right, that's kind of standard fare these days. So what are some of the other properties of the Tofu network, being torus-based? Can you emulate traditional Ethernet devices or something like that, to be able to access, say, external file systems? Or do you have to go through a special file system layer?
Okay. Currently, the K computer has two types of file system. One is directly connected to the Tofu network: I/O nodes — SPARC nodes connected to storage by Fibre Channel — are directly connected to the computational nodes. For that file system, we developed a Lustre-based file system and scaled it up: the number of object storage servers is maybe over 2,000 nodes, which the current Lustre specification cannot handle, so we expanded the specification for our network; the file count and the file system size were also expanded. Many things are expanded beyond current Lustre. We also have a global file system for users' data manipulation and programming; that kind of data lives there. When a job is submitted, the data the user specified on the global file system is transferred to the local file system. The global file system connection uses InfiniBand QDR, and there is a gateway to the Tofu network: the I/O node has both InfiniBand networks and the Tofu network, in addition to the network we developed for the local file system. The job dispatcher takes the data specified on the global file system and transfers it to the local file system through InfiniBand — this is staging. After staging, the job starts, and the job mainly uses data on the local file system, so a user job almost always works on dedicated data on the local file system.

Quick question:
earlier in the conversation it was said that there was hardware support for barriers. I assumed that meant within a single node, but now that you're talking about both Tofu plus InfiniBand, I'm wondering: do you have a network for barrier support across the machine, like Blue Gene? Or was that really single-node barrier support?

Oh yes, the Tofu network also supports barrier and reduction in hardware, like Blue Gene. It handles scalars, but it's very fast: around 10 microseconds over almost 10,000 nodes.

Okay, so that's built into the Tofu network — you say there's hardware support for those things?

Yes.

Okay, so back to the file system. Am I getting it right that you're almost doing your parallel I/O on your stage-in and stage-out, because you're reading from the global file system into the local file system, and you're almost using the local file systems like Lustre OSTs? Okay, that's an interesting way of doing it — so you don't have the overhead until you stage in and stage out.

Yes. As our community knows, file I/O is very weak under highly parallel access. Once data is copied to the local file system, that file system is dedicated to the job's use, so the I/O load is divided as well. Also, the staging is overlapped with other jobs: there is waiting time, but almost no loss from the job's time period.

So let me ask something a little more along my particular bench here: what led you to choose Open MPI over the other open-source MPI implementations?

Okay, that's my question.
Almost three years ago, we had a dedicated MPI for our system, but it was a very old implementation based on MPICH1, and extending it — for example, to a new version of the MPI standard — would have meant a terrible amount of code. So we decided it must be based on an open-source MPI. We then measured, implemented against, and compared several MPI platforms: MPICH2, MVAPICH, and Open MPI. The performance results were maybe best for MVAPICH; however, MVAPICH is totally dedicated to InfiniBand. Our mission is to provide a standard MPI for the K computer and also for PC clusters, so we selected Open MPI: when we implement Open MPI for the K computer, it also works on PC clusters. That was our decision point.

Some of the methods you came up with for supporting a scalable broadcast and other scalable collective operations on a torus-type network — can those methods, as implemented in Open MPI, be made generic to other types of torus networks? Or do they rely so much on the specific Tofu hardware?

As I said, the algorithms underneath could be implemented for other 3D torus networks. However, we currently made a special collective implementation for our MPI, because we use a special interface to drive the multiple network interfaces. So it would be possible to implement, but currently it is not easy to port directly to other networks.

So I think you said — either you or somewhere on your team, Shinji, I'm afraid I don't recall — but somebody said on a public mailing list that someday you probably would have some code to put back to the open-source MPI, and I think that's really great.
We really appreciate that. I assume that some of this will be kept as proprietary secret sauce — as it should be, because you have a fantastic hardware platform — but whatever you guys want to put back to open source, we'd be happy to take.

Okay. Maybe it will take some time; maybe we'll start next year. But if Open MPI provides some interface — for example, "there is a network interface; I want you to send this message on the specified network interface" — if that kind of interface is provided, we can implement our collective protocols in standard Open MPI. Currently that part is dedicated, but it can be implemented when the interface is provided.

So are there plans to sell smaller versions of the K computer, or is the K computer exactly a one-off?

Fujitsu has been developing HPC technology for a long time, and the K computer is one implementation of our technology, of course in collaboration with RIKEN. Fujitsu will ship HPC products using the same technology as the K computer. The K computer itself will not be shipped, but we will ship HPC systems using the same technology as K.

So is there anything you can comment on publicly about future products at all, or future directions? I know you might not be able to say much, but I figured I'd ask anyway.

I can just say that we will announce soon, and you will be able to use that system; it's planned for this year, I think.

Good to know. Okay — thank you, Shinji, Ko, thanks a lot for your time. I think we've got everything that we wanted to get out there, so thanks again. Is there a website or anything for information on the K computer?

K computer information is on RIKEN's site; there is some information about the K computer there. I will send the URL for the RIKEN webpage along with my bio.

Okay, great.
We'll put that in the show notes so that it'll be available to everybody. I want to give a special thank you, because Brock and I are on US Eastern time, and Ko and Shinji are both in Japan, so it's nearly midnight for them as we're recording. We want to say a special thank you for staying up so late to do this with us.

Yes, thank you very much.

Thank you very much.