Welcome to another edition of RCE. I'm your host, Brock Palen, and once again I have Jeff Squyres from Cisco Systems and the Open MPI project. Jeff, thanks for putting this one together.

Hey, no problem, Brock. This one's about MPI, something I work with all day, every day, so it's near and dear to my heart.

Yeah, I spend a lot of my time as a sysadmin working with MPI and making it work, and it can sometimes be a crazy beast, but I actually think it makes a lot of people's lives in parallel computing a lot easier. Jeff, could you give us a little background on MPI once we get started here? But first I'd like to introduce our two guests. We have Bill Gropp and Rich Graham, both of whom sit on the MPI Forum with you, Jeff, and are central to the updates to MPI, both the recent ones and those still to come.

Well, hi, I'm Bill Gropp. I'm at the University of Illinois now, and I also have an affiliation with the Institute for Advanced Computing Applications and Technologies, but I was at Argonne for 17 years before I moved here about two years ago. I've been on the MPI Forum since the very beginning, including the meeting in Minneapolis that really kicked off the forum, as well as the meeting that Ken Kennedy called before that, which inspired a group of us to form the MPI Forum. Most recently I've been responsible, as chair, for the effort that produced the MPI 2.2 document.

And my name is Rich Graham. I'm at Oak Ridge National Laboratory, in the Computer Science and Mathematics Division, where I run a group that works on MPI and tools. I've been at Oak Ridge for roughly three years; before that I spent about eight years at Los Alamos National Laboratory doing a variety of things, including a lot of work on MPI. That's actually where Jeff Squyres and I met, when we started the Open MPI project together. In the context of the MPI Forum, this is my first forum to be involved with, and I'm chairing this effort.

Yeah, as Rich mentioned, all three of us have really been working together for quite a long time, and Bill also has quite a large part to do with the MPICH implementation. So this interview is going to be a little strange for me, because I'm actually quite familiar with these guys and work with them quite a bit. I'll just try to keep my mouth shut and ask properly leading questions.

Let's go ahead and start with at least a little bit of background, because every time I talk to end users and system administrators, most recently at Supercomputing '09 in Portland, Oregon, it still amazes me how many people don't really know what MPI is. They know it's something used in parallel computing, and that a lot of applications use it, but they don't really know what it is. So Bill, Rich, I wonder if you could give us the short rundown: what does MPI stand for, what is it, and so on?

Well, MPI is the Message Passing Interface. It is a standard specification for a library that allows processes to communicate with each other. It was developed by a diverse group of vendors, users, and researchers to standardize what was a fairly well understood but very fragmented programming model of essentially passing messages between processes. The standard was a little schizophrenic in that it targeted sort of everybody: there's material in the standard for end users, there's material for tool developers, and there's material for library developers. In fact, one of its strengths has been its support for the development of component software, which has allowed people to build applications from libraries written at other places.
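To make Bill's description concrete, here is a minimal sketch of the kind of program the standard specifies: two processes exchanging a message through the library's point-to-point calls. The payload and tag values are arbitrary illustration, not anything from the discussion.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal MPI point-to-point sketch: run with at least two processes,
       e.g. "mpirun -np 2 ./a.out". */
    int main(int argc, char **argv) {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;  /* arbitrary payload */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }
        MPI_Finalize();
        return 0;
    }

The same source runs unchanged on a laptop, an Ethernet cluster, or a national-lab machine; which wire protocol carries the message is the implementation's business, which is the portability point the guests keep returning to.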
So Rich, I wonder if you could give us a little of the history and the intent behind MPI. I know this is the first forum you've been involved with, and Bill's been involved from the beginning; I got involved around MPI 2.0. Who's the target audience? What's it for? How do you see people using it?

So, as Bill alluded to, the MPI standard first became a standard roughly 13 or 14 years ago. It was really intended to bring to bear a single form of message passing that would be supported across a wide range of hardware vendors, so that application developers, tool writers, and library developers could rely on writing a single implementation and have high confidence it would work on all the platforms they care about. The intent behind the effort, as with any standards effort, was to put together a standard that is good across, and has the potential to work well across, a wide range of hardware and software platforms.

So who exactly is responsible for the MPI standards document, and what has gone on there? We've talked about MPI 2.2, and MPI 3.0 is in the works, and so on, but what exactly is the process?

Well, there's a forum. It's a group of individuals representing organizations that meets regularly: in the original MPI Forum we met every six weeks, and now we're meeting about every two months. The forum is made up of people with expertise and interest in various areas of parallel computing, and it is responsible for producing a document. There is an organized, very deliberative process for voting in sections. In fact, we followed the exact process that was used for High Performance Fortran, and in the original MPI Forum we even used the same hotel in north Dallas, which was a great place to encourage people to stick around and work out the standard. This process of producing written documents, which are then discussed, read at consecutive meetings, and voted on, with one vote per organization, has by and large created a fairly robust document with surprisingly few inconsistencies or flaws. Not that it doesn't have any, but it's turned out to be quite effective. In fact, the MPI-1 document is now about 16 or 17 years old and has still held up as a pretty good document in terms of defining what MPI should be.

You've mentioned what MPI is used for, but I didn't get any sense of its dominance. Is it really popular in scientific computing, or is it not that common? And what would you call its competitors?
So I would say there's essentially no competitor to it in scientific computing at the largest scale and on distributed-memory platforms. For shared-memory platforms the primary competitor is probably OpenMP, and then there are various other systems, produced either by individual vendors or by research groups, that have some smaller, though possibly very passionate, following. There certainly are things that some of these systems are better at than MPI, but if you look at publications in parallel computing, you'll find that if they're using more than a handful of nodes, and anything besides the most embarrassingly parallel computations, they're probably running MPI.

So in terms of the actual standards document: do you have to be a member of the forum, or pay some support fee, or can anybody get their hands on this document to know what the true standard says?

Well, one of the choices we made in the original forum was to make it very easy for people to get a copy of the document. The document is available at the MPI Forum website, www.mpi-forum.org, and you can get the PDF for the whole document there; there's no fee. In terms of being a member of the forum, membership is achieved simply by showing up at the meetings and participating in discussions, and once you've been at two consecutive meetings you can vote. That is, in fact, quite a commitment of time and effort, so it's not that it's free to be a member of the forum; it's just that there is no additional cost beyond that. That's been good for some and a little harder for others, but in general it has made the broadest possible participation feasible. I should mention that participation is international: you don't have to represent a US company or organization to belong to the forum, and in fact we have quite a few international members.

So, for both Bill and Rich: how did you originally get pulled into doing MPI, and what is your employer's interest in having you take such a central role in MPI? Do they have a vested interest? Do they need this to work on their platforms? Why do they want you doing this?

So I got involved in MPI roughly 10 years ago because of my interest in high-performance communication, user-level communication stacks, and the fact that MPI is ubiquitous. Application developers in general don't use communication stacks that they don't have high confidence will be supported across a wide range of platforms. Since I wanted the work I was doing to be used by folks, and not just be a little experiment I played around with myself, it was important that the output be in a format that would be usable, and that is MPI.
Since Oak Ridge does have a large parallel system, Jaguar, which is now number one on the Top 500, it has a vested interest in making sure that applications can run, and run well, on that platform. So ORNL has a vested interest in supporting efforts to improve the standard, and that's really why they're interested in my participating in this process.

Well, my interest dates back much further, of course. I had in fact developed one of the many message-passing portability layers that research groups were forced to come up with because there was no standard, as part of my research into scalable numerical methods for solving nonlinear systems of equations, a sort of classic technical computing problem. My portability layer had been used in an application that was an early winner of a Gordon Bell Prize, and as a result of that experience I was very interested in the potential to actually solve this problem in a broad way, with a single portable layer. One of the things I was most interested in, and remain very interested in, is ensuring that with this portability we don't give up either performance or scalability, because there certainly were other portability layers, but none with the performance that MPI has continued to demonstrate. That's why I got involved in MPI from the very beginning. In fact, at the Minneapolis meeting I committed us, Argonne at the time, to a rolling implementation of MPI, and that implementation, which was based on my portability layer, became MPICH. That made it possible for MPI to be adopted widely by many vendors, because there was an open-source implementation designed to exploit their own low-level communication layers. Since then I've remained very interested in a number of areas of MPI, and the reasons the University of Illinois is happy that I'm participating are both the research that can be conducted into the performance and scalability of parallel programming models, including MPI, and the use of such models in computational science. And that allows me to plug our big machine: the University of Illinois will have the National Science Foundation's largest system, a sustained-petaflop system called Blue Waters, and it will be running MPI applications on well over 200,000 cores at a sustained petaflop. Ensuring that that can in fact happen is one of the things we're very interested in here.

And we all must pay homage to the Top 500; I'm glad to see you both got your little byline in. For those of you who don't know, the Top 500 is a twice-yearly listing of the 500 fastest machines in the world according to the LINPACK benchmark.

So here's another question that I get asked sometimes, even internally within my own company, because we're very Ethernet-centric here at Cisco: why don't people just open sockets? What is better about MPI? What was the whole point of this? Sure, scalability and all these things, but can't you just get scalability with sockets? Why would you do something else?
So first of all, I would say that sockets aren't available on all systems; on some systems, using sockets isn't an option at all. But that's probably not the answer you were necessarily looking for. Beyond that, the issues really revolve around the desire to achieve high performance. When people spend a lot of money on very large systems, they expect to get optimal performance, or as good performance as they can, out of their network, and typically the high-performance networks have much better communication stacks if you use something other than sockets. So MPI lets you have your cake and eat it too: you can make your choice and use a socket layer underneath the covers if you want, but it can also provide support for the high-performance communication stacks that people would like to take advantage of on these systems.

Yeah, let me add a little to that. One of the things is that sockets define fairly rigorous semantics, including an ordered-delivery semantic, which requires a certain extra overhead on some of these networks, so there are performance advantages we can achieve in MPI that are not available with just sockets. Another way to look at it: a very good MPI implementation on a high-performance network can achieve latencies on the order of one or two microseconds for delivering a message, which is hard, although not impossible, to achieve with sockets.

And that's one to two microseconds between two different servers, right?

That's exactly right. The time between processes on a single server for MPI is more like a few hundred nanoseconds, so we're down to the cost of a few memory transactions. The other thing that's important to remember is that MPI contains more than just the point-to-point, two-party messaging primitives. A large part of MPI is the collective operations, which involve collections of processes performing broadcasts, or scatters, or a collective computation like an all-reduce. Again, on the large scalable machines you will find that hardware has been created to make those operations perform much more efficiently than you can get by gluing them together out of pairwise message-passing transactions. MPI provides a clean way to access those sorts of resources, and that's just not present in most other communication models. I should mention that this is a very different model than the sort of multicast model you might think of: it's both reliable and, very importantly for technical computation, it includes these collective computation operations such as all-reduce.

Okay, so you're able to optimize under the covers more, and the MPI implementer or vendor can worry about integrating these things together, so the application developer doesn't have to worry about them?

Exactly.
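For a flavor of the collective operations Bill is describing, here is a sketch of a global sum with MPI_Allreduce. Each process contributes one value and every process receives the total; the implementation, not the application, decides whether that happens over pairwise messages, a tree, or dedicated hardware.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, sum;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        /* Every process contributes its rank; every process gets the sum.
           One call replaces a hand-rolled gather/compute/broadcast. */
        MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        printf("rank %d of %d: global sum of ranks = %d\n", rank, size, sum);
        MPI_Finalize();
        return 0;
    }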
So I have a quick question before we continue. Earlier you said "MPI C-H." I guess you can put this to rest for me: it's "MPI C-H," not "em-pitch"?

Yes. It even says that in, I think, the second edition of Using MPI. But we don't really care; if you want to call it "em-pitch," that's okay.

It's kind of like whether there's a space in the name of Open MPI or not. Some people don't put one in, but I do. And Bill will always call it MPI C-H, while I frequently call it em-pitch; you hear it both ways.

Well, I like to respect the creators' effort: whatever they want to call it, they can call it that. So, moving on to something else: what is MPI bad at? What are cases where you've seen somebody use MPI and thought, oh my gosh, they should have totally used XYZ, this is an awful use of MPI?

That's sort of a loaded question.

Perhaps just a little.

So, one of the things MPI is bad at is handling unpredictable message traffic. What I mean by that is that it can be relatively expensive, both in terms of the resources you potentially use and in terms of performance, if you don't know where communications are coming from. Having said that, MPI does have the ability to do remote memory operations, which sort of gets around that issue, though again at least part of the issue there is performance, or implementation issues.

Yeah, another place where MPI is probably not the right choice is embarrassingly parallel applications, particularly ones in fault-rich environments, which are better done with sockets. So if you want to do SETI@home, I would advise you not to use MPI, even though it would be possible to build one with MPI, and there are actually some cool things you could do; you really would be much better off using sockets for something like that. The other place where MPI is sometimes not the best choice is within an SMP, where there are other programming models like OpenMP. Having said that, to get good performance out of models on SMPs you still need to manage locality, and this is something that MPI gives you, or actually forces you to confront, whereas some other programming models attempt to hide it from you, which unfortunately doesn't match the level of technology we have for dealing with the performance consequences.

Okay, those are all fair answers; I appreciate that.
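Rich mentioned MPI's remote memory operations in passing; for listeners who haven't seen them, here is a minimal sketch of the one-sided interface that has been in the standard since MPI-2. Rank 0 deposits a value directly into rank 1's memory without rank 1 posting a receive; the value 99 is arbitrary illustration.

    #include <mpi.h>

    /* One-sided (RMA) sketch: run with at least two processes. */
    int main(int argc, char **argv) {
        int rank, buf = 0;
        MPI_Win win;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* Expose one integer on each process for remote access. */
        MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);
        MPI_Win_fence(0, win);
        if (rank == 0) {
            int val = 99;  /* arbitrary value to deposit */
            /* Write into rank 1's window; rank 1 does nothing. */
            MPI_Put(&val, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);  /* completes the Put; rank 1's buf is now 99 */
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }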
Let's move on to a slightly different topic. What were the intent and the goals behind the MPI 2.1 and MPI 2.2 specs that were just recently passed? For a little bit of history for the listeners: MPI-2 was passed way back in 1996 or '97, somewhere in that time frame, and there was very little in terms of standards development until just two years ago or so, when we kind of got the band back together, reconvened the forum, and started cleaning things up. Could you tell me, Bill, since you were the chair of 2.2, what was the scope and intent of 2.1 and 2.2?

Well, the scope of 2.1 and 2.2 was really to clean up the standard, and particularly in the case of 2.2, with the potential of adding a small number of routines that were felt to require modest implementation effort and provide significant benefit for users. The biggest thing that happened with 2.1 was to take the MPI 1.1 document, which was a single document, and the MPI-2 document, which included an MPI 1.2 addendum, and merge them into a single document. That was a significant effort, in part because there were parts of, for example, the MPI 1.2 addendum that provided significant clarifications or resolutions of ambiguities in material from the 1.1 document. We would get comments from people who said, "Well, MPI says this," and we would have to point out, "Well, actually, in the 1.2 document it says this," or point to one of the past errata. I should mention that the MPI Forum had been passing errata, again with a fairly deliberate process, so there was an additional set of documents that had to be consulted if you wanted to answer a question like, "Is this valid?" So the 2.1 document was primarily that merge, along with various corrections found as the merge was being put together, and Rolf Rabenseifner deserves a tremendous amount of credit for the effort that went into it, along with a number of other people who helped resolve each of the chapters. The MPI 2.2 effort then built on this and, as I mentioned, looked at adding a few routines that were felt to be needed fairly urgently by some applications. So in MPI 2.2 we ended up with a handful of new routines, primarily routines to let library builders build better collectives and to provide a better way to describe non-Cartesian process topologies in a scalable way, as well as deprecating the C++ interface, which was one of the pieces of MPI-2 that did not seem to catch on with very many users.

I think someone here in this interview might have been a big proponent of that deprecation. No names mentioned, but his name rhymes with Jeff.

Yeah. So Bill, was there any other scope in 2.2? People say to me, well, why should I bother to upgrade my application, or to reprint this monster document? What else was in 2.2?

There were a number of ambiguities that were resolved, so there are some places where an implementation could conceivably have done one of several things, and that's no longer the case in as many places.

Well, the fact that you're struggling to come up with stuff is actually a testament to what you said before, that the spec was pretty tight for 10-plus years. It took us 10 years to come up with enough material to make it worthwhile for everybody to get back together and clarify things.

Yeah, I would really say that for most people, if you're comfortable with your MPI program, then you don't really need to get the 2.2 document. If you've got questions about the MPI standard, or there are things you were unclear about, then I would go first to the 2.2 document, because it does have a lot of these clarifications. Maybe the leading point here is that a correct MPI 2.1, or in fact a correct MPI 2.0, program is still a correct MPI 2.2 program. So if you have a correct program, and it really is correct, as opposed to "it runs with somebody's implementation," which might rely on some interpretation that was a resolution of an ambiguity, then it's still a correct MPI 2.2 program.
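Bill's mention of describing process topologies in a scalable way refers to the distributed graph interface added in 2.2. A hedged sketch: each process declares only its own neighbors (here a simple ring, chosen purely for illustration), so no single process ever has to hold the whole communication graph.

    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Comm ring;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        /* Each rank lists just its left and right neighbors in a ring. */
        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;
        int neighbors[2] = { left, right };
        MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                       2, neighbors, MPI_UNWEIGHTED,
                                       2, neighbors, MPI_UNWEIGHTED,
                                       MPI_INFO_NULL, 0 /* no reorder */,
                                       &ring);
        /* "ring" now carries topology information the implementation can
           use for process placement; free it when done. */
        MPI_Comm_free(&ring);
        MPI_Finalize();
        return 0;
    }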
So, moving on from 2.2, since that document is actually out: I've heard rumblings that MPI-3 is in the works. We're not sitting around for another 10 years; there's actually a 3.0 document coming?

Yeah, the 3.0 effort actually started at roughly the same time we started working on the 2.1 document, though not in as much earnest as now. Since 2.2 is done, basically any effort we put toward the standard now is directed toward 3.0, because everything else is behind us. 3.0 is a lot more ambitious in terms of what potentially, and I have to emphasize potentially, can go into the standard. Issues such as backward compatibility were thrown open for discussion, to try to understand where it makes sense to maintain backward compatibility and where it does not. For 3.0 the door was also open to adding completely new functionality to the standard, and to taking away functionality that is viewed as maybe not necessary, and there I'll mention Jeff's favorite topic, the C++ bindings: get rid of them.

Yeah.

But while the door is wide open, having said that, since this is a standard, we can't just do whatever we feel like. There are several rules that we do follow. First of all, this is still the MPI standard; this isn't the kitchen sink of parallel programming. The intent is still to continue to support primarily application and library developers, although there has been some discussion about what might be done to help support compilers as well. Also, because this is a standards effort, while a wide range of things can go into the standard, to become part of it you have to have at least 50 percent of the voting members agree with you on two different votes, and then in the final vote on the whole document, that this is something that should go in. So there is a fair amount of process in place to make sure that what happens does not happen on a whim, and there are deliberate time delays in the whole process to give people time to think about what goes into the standard.

Having given that huge caveat, there are several items currently under discussion. Quite a few items have been proposed, and a lot of them have died on the vine, but the ones people are currently working on and making active progress on are, first of all, collective communications and topologies. This is the one part of the 3.0 standard that already has something through the first cut of voting, and that's the non-blocking collectives. That effort was, and still is, chaired by Torsten Hoefler from Indiana University. There's a group looking at fault tolerance, trying to understand if and what might be added to the MPI standard to support fault tolerance. There's a group looking at how to improve the Fortran binding support in the standard, to take advantage of some of the new capabilities that Fortran has introduced; this is actually a good example of the Fortran standards committee and the MPI standards committee working together to try to do something together that's better for the users.
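To show what the non-blocking collectives Rich mentioned look like in practice, here is a sketch using MPI_Ibcast. At the time of this conversation the interface had only passed its first round of voting, so treat the exact name and signature as provisional; the idea is that the collective is started, unrelated computation overlaps with the communication, and a wait completes it.

    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, data = 0;
        MPI_Request req;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) data = 7;  /* arbitrary value to broadcast */
        /* Start the broadcast without waiting for it to finish. */
        MPI_Ibcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD, &req);
        /* ... unrelated computation can overlap with the collective ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* data now valid everywhere */
        MPI_Finalize();
        return 0;
    }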
Bill is chairing the section that's looking at what to do about remote memory access. There's a general feeling that the remote memory access support in MPI 2.2 isn't quite what people would like it to be, whatever that happens to mean, and that effort is trying to see what changes might be made there to change, and hopefully improve, the support. There's a very active group looking at what sort of capabilities should be added so that tools can better support MPI implementations. One specific thing that's been looked at is how to better support process acquisition, so that debuggers and other tools have a standard way, one they can be sure a standard-compliant MPI supports, to gain process information about a parallel job. And the last working group, one that's become very active recently, is looking at what MPI should do to play better within the broader community of parallel programming. Bill already mentioned threads; how to interact better with threads, and specifically one of the thread packages being considered is OpenMP. People have also been looking into what makes sense to do in the context of GPUs, just to name a few of the things that fall under this category. So you can see that the list is somewhat extensive. What actually makes it into the standard, hopefully we'll know within the next year and a half or so.

So in terms of a regular user of MPI: if they wanted to give you input, do they have to email you, or get on one of the implementation mailing lists, or are you actually feeling out for input from people in the world?

So I'll address that, because it's one of the things I do on a somewhat regular basis. There are several ways you can contact us and provide input. From my perspective, the best way is for people to join us in the process. As Bill said, there's no fee to join, although there is a meeting fee, because the meetings are self-sustaining and we have to pay for whatever facilities we use to meet; it's not a large fee at all. It does require quite a commitment, but from my perspective it's by far the best way to get involved. There are also mailing lists. There's a comments list, that's comments at... scratch that. Jeff, do you happen to... it's the mpi-comments list?

Yeah, thanks. And there's the wiki.

Yeah, the wiki, and we should also plug the survey.

I have a good comment to make about the survey; I'll do that in a second. But also, we had a public meeting at Supercomputing to elicit feedback from users, and we're planning on having a similar type of meeting at the supercomputing conference in Germany in June. So we try, in as many ways as we can think of and have time for, to actively elicit information, because ultimately the intent is to provide better service, not to implement our favorite algorithm or whatever.

Rich, I'll throw onto that: the birds-of-a-feather session at Supercomputing about the MPI Forum was pretty well attended.
I think so, and there was a good number of people there. We introduced a user-level survey that we're asking people to participate in; you can go to mpi-forum.questionpro.com. We just used a web survey company to do it. So far, since Supercomputing, we've had 216 replies, the most recent of which actually came in while we were recording right here. We have a bunch of questions on that survey where we're really, genuinely asking for feedback from the MPI community: what do you want to see, what do you not want to see? There are some leading questions, some not-so-leading questions, and some freeform areas where we want to hear what people think. Some of the questions are really complex, and on some of them we even said, hey, if you're not familiar with this area, please don't answer. That's not meant to be condescending; it's meant to be a recognition that some of these issues are fantastically complex, so we're looking for good feedback from people who are familiar with their favorite areas of MPI. Oh, and I should also point out that we posted the slides from the birds-of-a-feather session on the meetings.mpi-forum.org website; if you go to the November meeting and look under slides, the BOF slides are listed there.

I'll get all those links from you guys and stick them in the show notes when this show goes up on rce-cast.com, so people can get the links and don't have to try to follow what we're saying right now.

I was going to add that all the MPI Forum mailing lists are moderated and subscriber-only, to avoid spam. So if people do want to send comments to the mpi-comments mailing list, please make sure to sign up for the list first; otherwise your message won't make it out to everyone.

And I'd like to add that I encourage everyone who's interested to participate. At the same time, I really encourage everyone to take advantage of the history of the discussions in the wiki that Jeff alluded to. Some of these issues are quite subtle: they may seem simple on the surface, but there can be some very interesting issues that you may not think of at first blush, and one of the things that's actually fun about participating in the forum is learning about the bigger world around some of these issues. So I would again encourage anyone who's interested to take advantage of the wiki, see the history, learn the background, and participate.

Okay. Bill and Rich, thank you very much for your time; I learned quite a bit. I tend to see MPI from a very high-level user perspective and a sysadmin perspective. I filled out the survey and screwed something up, so I actually have to go back and refill it out myself, and I haven't done it yet. We're going to put all those links up in the show notes on rce-cast.com. Thanks a lot for your time today, and we'll look for that 3.0 document in the future.

Thank you.

Thank you. Thanks for your time, guys.