This is Research Computing and Engineering, episode one: Open MPI, with George Bosilca and Jeff Squyres. Guys, welcome to the show. Thanks. Thanks. Okay, so starting off: most people listening to this are familiar with research computing, but MPI, how would you describe MPI? MPI stands for the Message Passing Interface, and at its heart it really is just that: message passing. You start up a bunch of parallel processes together, and MPI is used to effect the inter-process communication between them. So you're doing send and receive primitives, and various other types of primitives, but at its heart MPI is "send this message from this process to that process," and the other process does the matching receive of it, and so on. There are some very handy things like collective operations as well, so you can do broadcasts and scatters and gathers and reductions and things like that. But at its heart it's really about communicating: moving data, moving bytes, from one process to another. It's rather amusing, actually; my wife laughs at me. She doesn't know why I have a job. She says, "All you're doing is moving bytes. How hard can that possibly be?"
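The send/receive idea described above can be sketched in a few lines of C. This is a minimal illustration, not code from the project; it assumes a working MPI installation (compile with mpicc, launch with mpirun -np 2):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double payload[4] = {1.0, 2.0, 3.0, 4.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Send four doubles to rank 1 with tag 0: no sockets, no addresses. */
        MPI_Send(payload, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double incoming[4];
        /* The matching receive: one discrete, typed message. */
        MPI_Recv(incoming, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 got %g %g %g %g\n",
               incoming[0], incoming[1], incoming[2], incoming[3]);
    }

    MPI_Finalize();
    return 0;
}
```

Compare this with a sockets version: there are no addresses, ports, or connection setup, just ranks, tags, and typed buffers.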
But one of the challenging things about an MPI software implementation is that you really need to do it with very, very high performance. You want to get the minimum latency and the maximum bandwidth, and be very efficient in your memory usage and your I/O resource usage and things like that. All of these things get factored into a very high quality MPI implementation, so that we can deliver a very well performing middleware stack to the user, who just really wants to compute their fast Fourier transforms, or whatever the problem is they're trying to solve in parallel. We just want to be the tool in the middle that works, and works very, very well for them. Probably a lot of the people listening who have never written a parallel program with MPI are wondering: communicating between systems, this is a solved problem, we have the internet, right? Why would we need to redo this again? But MPI actually has a concept of what you're sending. You say, "I'm sending 24 doubles to process two," and it's very simple; there's no opening a socket or anything. And it's also network agnostic, right? There are going to be multiple types of networks over which a computer can communicate. Exactly. I get this question a lot: why don't I just use sockets, why do I need to use this MPI thing? And you highlighted some of these things already. There's no connection management. You don't need to know what the IP address over there is, or what port it's listening on. Who knows, who cares. What if it's not even a TCP-based network that you're on? What if you're on shared memory or InfiniBand or Quadrics or Cray or something like that? You just want to send your data and have the other guy receive it, and how it gets there is irrelevant. You just want it to get there fast, and to be able to send discrete messages.
That's another advantage here. Sockets are streams, right? You have to loop over reading until you get the entire set of data, and then assign structure to it so that you can interpret the message. Whereas with MPI you send a discrete message: I'll send you four doubles and an int, and you're going to receive four doubles and an int. You don't have to loop over polling to get all of the data. And not only is it a discrete message, it's also typed, just like we said: it's four doubles and an int. So you can send a struct, you can send actual data structures down across MPI, and however it gets there doesn't really matter. You're just sending the data and receiving the data, and all the network magic that has to happen in the middle just happens automatically for you. That's one of the points of why MPI exists. And we should probably point out quickly, too, that this is distributed-memory parallelism. Every one of these processes has its own discrete memory space. If your code calls malloc, there's nothing tracking which rank is actually calling malloc; everybody calls malloc, and they all have their own little memory space. So if I have some data as a CPU and I'm trying to give it to another one, I have to explicitly send it, and the other side has to explicitly receive it, correct? That is correct, excellent point. It is all explicit parallelism: you explicitly send and you explicitly receive. And everything that you do, you have N copies of your application running. This is one of the difficult things to wrap your head around for people who are new to parallel programming.
Let's say you launch a 32-way job, or a 64- or 128-way job, or something like that. One of the most common ways to do it (and there are other ways) is that you're really just launching 32 or 64 or 128 copies of the same executable. They're all running independently, but yet they know who they are. So a common paradigm is: I launch all 64 copies of this executable, and the very first thing each one does is figure out, who am I? Oh, I'm number seven out of 64. So I know that my portion of the work is over here: I go to index number seven, and that's my assigned work, and things like that. This is one of the difficult things in wrapping your head around parallel computing: you have all these independent agents running simultaneously, sometimes they synchronize and sometimes they don't, and so on. It's just a new way of thinking for those who are accustomed to programming in serial. Okay, so Open MPI: under what sort of license is it available? Can a commercial application include it and use it? Also, what types of networks does it support? Some MPI libraries support either Ethernet or InfiniBand, and you have to recompile to use a different network type. Is this an issue with Open MPI or not? So you've got two questions in there. Tell you what, I'll answer the license part here, and I'll defer the other part to George. The license that we use in Open MPI is BSD, and I'm not a lawyer.
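"I'm number seven out of 64, so my work is over here" is the classic single-program, multiple-data pattern. A sketch of what that looks like in C (the block-partitioning scheme and the work size N are illustrative, not from the interview):

```c
#include <mpi.h>
#include <stdio.h>

#define N 1024  /* total amount of work; chosen only for illustration */

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* "who am I?"        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* "out of how many?" */

    /* Every copy runs this same code, but each picks its own slice. */
    int chunk = N / size;
    int begin = rank * chunk;
    int end   = (rank == size - 1) ? N : begin + chunk;

    printf("I am %d of %d; my work is indices [%d, %d)\n",
           rank, size, begin, end);

    MPI_Finalize();
    return 0;
}
```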
This is not legal advice, but it is my understanding that that's one of the most permissive licenses out there. There are a lot of people who love the GPL, and that's great; the GPL is good for a lot of different things. But our goal in the Open MPI community is to be as inclusive as possible. We want to include everybody: the users, the researchers, the academics, the vendors, the whole HPC community. In order to do that we had to pick the least frightening license out there, and our research and our lawyers told us that BSD is the one you want, because it will be the most inclusive and people can do whatever they want. Basically (again, my understanding) all they have to do is cite our copyrights. It encourages people to join us because they can literally do whatever they want with the source code, up to and including releasing it under the GPL if they wanted to. Our goal was that anybody can distribute this source code for free; there is no source license. It doesn't prohibit somebody from doing value-add and reselling Open MPI. We wanted to set the license barrier as low as possible, to encourage development and participation from all corners of the HPC community. So do you know of actual software vendors right now with a commercial application that's shipping with Open MPI as a supported distributed-memory parallel system? I'm going to have to give a little bit of a weasel answer there. Well, actually, no: for Sun, Open MPI is their MPI. Open MPI is ClusterTools on the Sun platforms, so they have a whole team of engineers at Sun who work on Open MPI and release it as part of their high-performance computing products. The weasel answer is about ISVs who use Open MPI. I do know of a couple of them.
I don't really want to say their names (this is the weasel part), mainly because I don't actually remember whether they have released their products with Open MPI yet, and if they haven't, I certainly don't want to announce it before they do. But some of the reasons Open MPI is attractive for ISVs are that, honestly, it's free, it's pretty production quality, and it generally just works. It takes a chunk out of the price the ISVs resell their software for: they can actually reduce their price a little rather than having to pay someone else for an MPI license. So from that perspective it can be pretty attractive to ISVs. Okay, so they would probably want multiple network types supported. What sorts of network types does Open MPI support? Do you have to recompile it to enable a different network type, or is it an Ethernet-only library? So, name a network and I'm pretty sure we have support, or if we don't, we will soon. Right now we support about ten different networks. They are mostly targeted toward high-performance computing, so you will find InfiniBand, Myrinet, and so on, and there are some other more exotic networks that we don't have right now but will support in the near future. The other interesting thing is that you don't have to recompile anything, so you don't have to have multiple MPI builds, one per network, and so on. In Open MPI everything is modular; everything is included inside, and it can build in one go. So once you have your MPI library, you can easily change from one network to another by just adding one parameter on the command line, and that's it: Open MPI will do the magic to switch to the right network. If you don't specify, Open MPI tries to find the best network available, the one that will give you the most performance, and we try to use that one. Now, you support a shared-memory type of network, right?
Of course. What if I have two InfiniBand-connected nodes, and each of those nodes has, say, four cores, and I'm going to run across all eight cores in the system? Because I have to use InfiniBand between the two nodes, will the ranks inside one node still use it, or can they actually use multiple network types at a time? They can use multiple networks at a time. Inside the node, let's suppose that you don't specify anything yourself, so you let Open MPI figure out the magic. Internally, within the node, we will use shared memory, because as far as we know this is the best way to exchange messages between cores, I mean between processes running on the same processor. Then for external communication we figure out what the best network is. From the user's perspective it is completely transparent, and you can use as many networks as you want at the same time. So a very cheap way to get a pretty good network, at least from a bandwidth perspective, is to buy three one-gigabit network cards for 50 bucks and put them in your cluster. With Open MPI, completely transparently, we will use all three of them, so when you look at the bandwidth you get, instead of one gigabit you will get three. Oh, so the MPI library will actually bond for us, with no administrator intervention? Absolutely. And to clarify one of the things that George said there: we automatically pick the best network for you, and that may actually be plural.
We pick the best networks, plural, for you. And we also do it on a per-peer-pair basis: when we're looking at the best networks to connect, we're looking from process A to process B, and that may be a different answer than from process A to process C. For example, if A and B are on the same node, we're going to use shared memory; but if A and C are on different nodes, well, let's use InfiniBand because I see an InfiniBand link here. Or, no, I don't have InfiniBand, but I've got three one-gigabit Ethernet cards; let's use all of those and stripe across them automatically for you. It is a really nice thing: you don't have to have kernel-level bonding. We'll do it up at the user level for you, striping large messages across the links and round-robining short messages across them as well. How much control does the administrator have over this? Say a cluster has an extra management network on it. Is there any type of cluster-wide configuration that can be made to prevent the management network from being used? Yes. For those who know a little bit about Open MPI, we have a lot of parameters, and some of them can be used to restrict the usage of some of the networks. So in the case where you have three Ethernet cards, you can name the one that will never be used, or you can name the ones that will always be used. I can say, okay, eth0 is reserved for, let's say, management and NFS, so that's it: I let Open MPI know that it never has to use this network, and everything happens internally. So there's a lot at the administrator level. Are there also user-level tunables that can be adjusted without recompiling? Absolutely. This was actually one of our founding philosophies in Open MPI. The state of the art back when we created Open MPI was: okay, you want to change one of the internal parameters of MPI? Now you've got to recompile or relink or do something like that.
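As a concrete illustration, that kind of restriction is expressed through MCA parameters on the mpirun command line. The interface names below are for a hypothetical cluster; btl_tcp_if_exclude and btl_tcp_if_include are the TCP transport's parameters for this in the 1.x series:

```shell
# Tell the TCP transport never to use the management interface:
mpirun --mca btl_tcp_if_exclude eth0 -np 64 ./my_app

# Or name only the interfaces that SHOULD carry MPI traffic:
mpirun --mca btl_tcp_if_include eth1,eth2,eth3 -np 64 ./my_app
```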
So a guiding philosophy for us, from the very beginning of the code base, has been: instead of using a constant, even a compile-time constant, make it a runtime parameter that can be changed on the mpirun command line, or in a configuration file you supply to Open MPI, or via an environment variable. There are a bunch of different ways you can change these parameters. Every time we have to make a decision (should I use an eager protocol or a rendezvous protocol, or a million others, but eager versus rendezvous is a very easy one to discuss), we made it a runtime parameter, so that power users and system administrators can actually tweak these things. The average user isn't going to care; they're never even going to use these parameters. They're just going to mpirun, and the right magic will happen, usually. But for those power users or administrators who want to set things up for the naive users, there's a lot of control you can exert over how Open MPI runs itself internally. How many different parameters are actually modifiable at the moment? That's a really good question. George and I were preparing for this interview yesterday and we actually tried to count, and I'd say it's upward of 500. Wow. So then, between the 500 different ones and all the possible combinations, there must be many thousands of combinations that can be tweaked at runtime. The permutations are totally insane. And this is actually a perfect opportunity to say that we have a lot of plans. The current release series is 1.3, and I can say that because 1.3 will be released today or Monday, something very soon from now. We have a lot of plans for the 1.4 series, and one of them we consider a very critical usability thing: the idea of being able to tweak the runtime parameters.
It's wonderful; a lot of people use it, and they love that they don't have to recompile to change Open MPI's behavior. But it's also quite challenging. I mean, upward of 500 parameters, that's a lot. How do you even know where to start? So in 1.4 we're going to introduce a kind of rating system for the parameters, from 1 to 5: 5 meaning the casual user will want to use these, and 1 meaning a back-end developer will want to change these. Our goal is to severely limit the number of parameters you see when you ask, "Hey, what parameters can I tweak?" If you're just a casual user, we'll only show you a dozen or two, the most important ones. Let's say I'm a little more than a casual user: okay, here are 40 or 50 you might want to tweak. Or I'm a power user: well, here are about a hundred of them. So you can set what your experience level and your interest level are, and we'll show you that many parameters and let you choose and tweak. We consider that a pretty important usability feature coming up in the 1.4 series. But for some of the more important, obvious ones that users or system administrators might want to tweak: George, do you want to take this one? It's impossible to give a full list; there are really many, and how you use them depends on what exactly you plan to do. There are some for processor affinity, some others for memory affinity. So if you really want control over how the processes are created and how they are bound to a specific resource, like a processor or memory, you can use those. I know people who use them, but I can count them all on one hand. Now, I use them on a regular basis.
We have some other features, or parameters, that allow the user to find out if their application is correct. When the user calls MPI_Finalize we can dump a small status report from the MPI library, so the user can see if they had a resource leak: they didn't destroy some communicators, say, or there were some communications still in progress, like non-blocking sends or receives, which is illegal from the MPI perspective and the user should take care of them. So that's another one. We have some that take care of the networks, like the ones that require registration of memory at the MPI level: mpi_leave_pinned, which allows us to keep the memory pinned so we don't have to re-register it every time. This basically gives us more performance on those networks. And I think I will stop here, because otherwise I will go on for hours. Yeah, there are a couple, though, that might be useful for system administrator types. You might want to set, for example, "don't use the management network." You can actually set that at a global level, so that all your users will implicitly use it: don't use eth0, because that's my management network.
You can also say: make sure that when users mpirun, they're actually using the InfiniBand network or the Quadrics network or whatever. So you can set defaults on a global basis, so that users who don't know or care just mpirun and get these values automatically. It can be good to just set these and ensure that's what happens. George mentioned a couple of the debugging parameters; some of those are my personal favorites. One that I like is a parameter so that, if you call MPI_Abort or you trigger an MPI error for some reason, we'll actually delay for a little while before we kill the parallel job, which gives you the opportunity to attach a debugger and see what's happening in a live process, rather than perhaps a core dump or something like that. I think some of these debugging-level controls are very important, because parallel programming is just pretty hard, and tools that the MPI layer can give you here, particularly interaction with other development tools, are a genuinely useful thing. So if a user on a cluster, or an administrator, wants to see what a current runtime value is set to, how do they find that out, and how do they actually modify it? We have a common tool that allows the user to see everything inside, either everything or by category. We have what we call components in Open MPI: some of them are related to the networks, some others to the runtime, and so on, and you can ask for the specific parameters of one of them. The tool is called ompi_info, and if you want to look for a parameter you can say --param and then the framework and component, or you can simply say "all all", and that will give you the whole list of 500 parameters that you can change.
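A session with that tool might look like the following (output omitted here; the exact component names listed depend on how your copy of Open MPI was built):

```shell
# Show every MCA parameter, across all frameworks and components:
ompi_info --param all all

# Narrow the listing to one framework/component pair, e.g. the TCP transport:
ompi_info --param btl tcp
```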
Yeah, and this is exactly the area where, in the 1.4 series, we're going to apply that "what level am I" idea: oh, I'm a casual user, so instead of showing you all 500, we'll only show you a dozen or two. So there will be some neat improvements in this area. Users can set these values either on the command line, with mpirun options before their executable, or they can set environment variables, or they can put them in a file, which can either be set by the sysadmin in the directory where Open MPI was installed, or set by the user directly in their home area, in the .openmpi directory. There are default files that Open MPI will look for every time an MPI process is started. So there are really many, many ways to set these arguments. Usually, if a user wants to set something that will be permanent on the cluster, they will set it in their home area. Okay, not bad. So if they wanted to have it there all the time, they could make their own little config file and leave all the options in there. Okay, so that actually sounds really flexible compared to some systems I've used before, where there's been that recompile requirement. So if a site, an administrator providing an MPI library, started off with an Ethernet cluster, and then later on a faculty member comes in and they add InfiniBand, this is actually really powerful, because they can have one MPI library that will take advantage of the best network, and then, depending which network they run on, they can modify all these little tweakable parameters. Exactly, and the power there is that they don't need to recompile. You know, "we got another round of funding and we added InfiniBand": you can just re-mpirun and you'll see the performance difference, depending on your application, of course. But if you're a latency- or bandwidth-hungry application, you'll definitely
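For example, a per-user defaults file might look like this. $HOME/.openmpi/mca-params.conf is the per-user location Open MPI searches; the parameter values shown are illustrative examples, not recommendations:

```shell
# $HOME/.openmpi/mca-params.conf -- read every time an MPI process starts.
# One "name = value" pair per line.
btl_tcp_if_exclude = eth0   # keep MPI traffic off the management network
mpi_leave_pinned = 1        # keep registered memory pinned between sends
```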
see a marked improvement moving to 10-gigabit Ethernet or InfiniBand or what have you. Nice, nice. Now, on the cluster I run, this is a pretty common question from users who have run into problems: at a small scale my application runs fine, but when I increase the problem size, the application hangs. I've tried to explain eager and rendezvous to them. Could you guys go into a little more depth on exactly what eager and rendezvous are, and why MPI_Send, while blocking, doesn't necessarily always block? So, eager and rendezvous are terms that we use among ourselves to scare people. Basically, for us it makes a difference in how you want to send the data. Of course, we can do something very simple: every time you send a message, we take whatever was in the message, as big as it is, it doesn't matter, and send it to the remote node. And to be honest, for a good user, a power user, the one who knows exactly what he's doing, I think that is the best approach, because for that kind of user the receive is already posted. Now, we know that not all users do that. So the MPI library tries to help them a little bit (not only ours; most libraries do the same). Instead of sending the whole message, we send a little piece with a marker, with the MPI tag inside, and on the remote node we do the matching. When this matching is done, which means the memory is ready to receive the message, we send the remainder of the message. That's the rendezvous protocol. Now, eager is for everything smaller than the rendezvous threshold: we send it in just one go.
Okay, so that's the difference between eager and rendezvous. And this can really trip people up, because of exactly what you said at the beginning there, Brock. Maybe when I was first running my application I was just sending small messages, and they were sent eagerly; there was no implicit synchronization with the receiver. But then they increase their data size, they increase their problem size, and they do an MPI_Send with a larger message, and underneath the covers Open MPI switches into a rendezvous protocol, and suddenly that send is not going to complete until the receiver actually does a matching receive. So they can see a little bit of unexplained behavior here. But in all fairness to the MPI spec (there is a standard that we adhere to, the MPI standard), it says nothing about eager or rendezvous or anything; these are frankly implementation details. The MPI standard says that MPI_Send may or may not block, and that's an implementation detail. To send and expect that there is a receive buffer there, that is implicit, and it's implementation-defined exactly what's going to happen. So we actually have license from the spec to affect this kind of behavior. And it's a very difficult game to play, because an MPI implementation has to tread a very fine line between resource consumption and performance. Everybody always wants the super low latency and the super high bandwidth and so on, but they also want memory left over to run their real application. They don't want us to consume oodles and oodles of RAM before they're ready to use it, right?
So if you send a one-megabyte message and we send it eagerly, that means the receiver has to consume one megabyte of RAM, because you have not posted a matching receive yet. We have to receive and store that message somewhere until you do post a matching receive, and then copy the message into your target buffer; but we had to consume all that extra memory in the meantime. So a rendezvous protocol is kind of a resource-saving mechanism. That's why we do it; that's what George was talking about when he said we try to give you a little help there. But eager is about getting it sent, getting it off that machine. It's safe to reuse that buffer, so the rank that sent it can keep computing, correct? That's correct. We usually do it for small messages: eager is usually applied to something small enough that, if you send a 2K message, well, it's fine, we'll buffer it on the other side, because it's small enough that it doesn't really matter. And it does give you the best latency as well. So if users find themselves stuck in this kind of situation, and they want to prove to themselves that this is what's causing their problem, they can actually adjust these cutoff points between rendezvous and eager, right? It's just another one of those tunable parameters they can specify on the command line? Yes, for most of the networks this is the case. But the thing is, this will not necessarily correct their application, because the application deadlocked because they don't follow the MPI spec. One of the biggest mistakes users make in MPI is to have both processes do an MPI_Send followed by an MPI_Recv. As long as we are in the eager regime this will work fine, because the send kind of behaves as non-blocking: you send a message, you give it to the network, and then you go on to the receive, and you will receive what
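To experiment with this, the eager/rendezvous crossover point is itself an MCA parameter, one per transport. A hedged example (the parameter names here are the 1.x-era shared-memory and OpenFabrics ones; check ompi_info --param for your build before relying on them):

```shell
# Shrink the eager limit so that (almost) every message takes the
# rendezvous path; useful for flushing out applications that depend
# on eager buffering for correctness.
mpirun --mca btl_sm_eager_limit 0 \
       --mca btl_openib_eager_limit 0 \
       -np 64 ./my_app
```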
is coming from the peer. Now, as soon as you go to large messages, this will not happen, because the send becomes blocking, and there you are: you deadlock in your application. And of course the first thing people do is blame the MPI, saying there is a problem. No: the standard actually specifically outlaws this communication pattern. Yeah, it says if you do this, the MPI implementation is allowed to deadlock, so don't do that. So now they have proof that it's their fault, and not some fault elsewhere in the system. A fun little trick you can do to verify that your application is deadlock-free is to change all of your MPI_Sends to MPI_Ssend, the synchronous send. What that does is force a synchronization with the other side: the send will not complete until the receive has been posted. If you can change your application to use Ssends everywhere instead of regular sends, then you know it's deadlock-free. It's not going to be as performant as it was, because you're doing all that synchronization, but the Ssend pretty much forces a rendezvous protocol, and you can use that to check the correctness of your application. You could just quickly do this with a preprocessor statement. But couldn't we also just set the runtime parameter for the eager/rendezvous limit to zero? Well, that's the Open MPI trick, yes. So that's the little difference between an MPI-standard trick and an implementation-specific trick. I see, I see. Okay, so you're releasing 1.3 very soon, and there were some big changes in that. Are there any major changes you'd like to point out, or do you want to run right in and describe what you want to do in the future?
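The preprocessor version of that trick is a one-liner. This is a sketch: it assumes your code calls MPI_Send directly (not through a wrapper) and that you rebuild everything with the macro in effect, placed after the mpi.h include:

```c
/* Deadlock hunt: force every standard-mode send to be a synchronous send.
 * If the application still runs to completion, it does not depend on
 * eager buffering, so the original MPI_Send version is deadlock-free. */
#define MPI_Send MPI_Ssend
```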
Let's talk about 1.3, because, to be honest, we've been working on 1.3 for a very, very long time, much longer than we intended. But in fairness, when we started all the work on the 1.3 series, we said at the very beginning that this was going to be a feature-driven release, not a schedule-driven release: we're going to get all these features in, and we're going to take as long as we're going to take. Unfortunately, "as long as it's going to take" took about a year and a half. But that's okay; we're going to get it out. The 1.4 series will very definitely be a schedule-driven release. In the 1.3 series we have a lot of very cool stuff. There are a lot of OpenFabrics updates in there: we did a bunch of work to optimize memory registration usage, and we did message coalescing. Actually, a fun little story about message coalescing. Message coalescing is the idea that you're sending oodles and oodles of short messages, and eventually you back up the network: you're sending faster than the network can send, so you're backing up into software-level queues instead of hardware-level queues. The trick you can play there is to collapse some of these short messages together into a single message when you see one sitting in a software queue: oh, that's exactly the same message.
Well, I'll just bump a counter and say oh We'll send two of these instead of one and therefore you can actually save You know some some bytes that are sent across the network when that message actually gets out on the wire That's a neat little optimization It has very little to do with real applications Yeah, when I when I've looked at these things before they were more of a a benchmarking trick It's it's a benchmarking trick and and somebody got some cool papers out of it And so we resisted doing this optimization for a long time Because it frankly it adds complexity down in the you know the deep voodoo You know layers down in the in the bowels of open MPI and we're like well We just don't want to add that extra software engineering to do it But we finally bowed to pressure the sales guys beat up on me and other guys saying you need to have message coalescing so In the 1.3 series we have message coalescing so you can get really nice benchmarks Things out there. We also added support for iWARP some some pretty cool flavors of 10 gigabits Ethernet Nix we added support for XRC, which is Melanox's proprietary protocol for Decreasing some some resource usage. So that's some pretty nice stuff We have full support for MPI 2.1 as well So the MPI forum has gotten back together and and started working on newer versions of the standard and MPI 2.1 was kind of an important milestone that it fixed I don't know somewhere in the order of 50 bugs or so in the spec itself And so we support all of those fixes Couple other pain points that you know, we love open MPI and we think it's great But there certainly are some things that can always stand to be improved one of them that We we definitely got to complaints about from users was what then when an error occurred in an MPI application You you tended to see, you know 500 copies of the same error message You got that same error message from every MPI process in the job and so it just scrolled by you know I could grief. 
I only need to see that once; why do I have pages and pages and pages of output? So we actually added a duplicate filter, so that when all 500 of your ranks die simultaneously, you'll only see that error message once, plus a little counter saying, oh, and I received it 499 more times as well. Some other cool stuff we did: we integrated with Valgrind. Valgrind, the memory-checking debugger on Linux, has an API that you can integrate into your application, or in our case into the middleware, that will do memory checking for you, and we can do some neat stuff, saying, hey, you started a non-blocking send, but then you actually started changing the buffer, and that's bad. Don't do that. Yeah, there's actually been some trouble with that in the past, especially using networks like InfiniBand, right? Trying to use Valgrind would give you a bunch of crazy errors. Yeah, so InfiniBand, and OpenFabrics in general, and a couple of other types of networks as well, are OS-bypass networks, so the memory may be coming from places that Valgrind is unaware of, because the operating system honestly is unaware of it, or at least it comes in a non-conventional kind of way. So we actually worked with Valgrind's API and the OpenFabrics drivers in Open MPI and smoothed all those things out. For areas where Valgrind would traditionally say, oh, this is bad memory, we just use the API to say, no, no, we know a little better; Valgrind, this memory is okay. So all those false positives go away. So that's actually genuinely useful. It sounds like there's a lot of neat stuff that I'm going to get to play with once 1.3 is the stable install. Looking into the future, what is slated for 1.4?
So we continue some of the things that we did in 1.3 that Jeff didn't mention, like support for fault tolerance, different kinds of fault tolerance; we'll improve what we already have and add more support, so that's one. The other one is MPI thread multiple. We were kind of thread-safe before, but nobody had invested enough effort to make sure that everything was fine. In 1.3 we did that, so now we know that we are thread-safe. From a correctness point of view we are where we want to be, but we know that from a performance point of view we lag a little bit behind, so that will be one of the major things in 1.4: MPI thread multiple with the best performance you can get. And we will add more support for fault tolerance. We have some other plans to integrate more with tools, to help users debug their applications, and not only to debug but to understand exactly where the performance penalties are coming from and how to make their applications better. Then of course there are the MCA parameters Jeff was talking about, to show fewer parameters to the user. We also plan to have a connectivity map. Right now the problem with Open MPI is that if you don't specify what kind of network you plan to use, and there are multiple networks there, Open MPI does its stuff internally, but the user never knows, and that's something we'd like to address: to give the user a parameter that says, you know, do whatever you think is best, but dump me the connection map. Yeah, it's actually a surprisingly difficult feature to add, and that's why we haven't added it so far, because it's a distributed decision.
Like we said before, what you use between A and B might be different between B and C, and A and C, and so we have to gather all of this data and print out a nice little matrix to the user. So it's a surprisingly difficult feature to add; that's why it hasn't been added yet. Yeah, and I may actually want to add that the MPI thread multiple stuff is kind of handy. Going into the future, looking towards many-core, thread multiple allows better mixing of MPI processes with a threaded style. You could have one MPI rank per node, and it would run a core's worth of threads per core. So going into the future, where we're getting many, many cores on a system, we may not want to duplicate memory so much with the discrete memory spaces, or we may want to reduce the amount of communication going on and actually use a threading library instead, and that should make that a lot simpler. An important caveat to our thread multiple support: our point-to-point operations are thread-safe, or we're pretty sure they're thread-safe; we'll see what happens when people start running real apps on them. But a lot of the support functions are not thread-safe yet, and that's going to take a lot of time and effort. You can do your MPI sends and receives and tests and waits and things like that, and those should all be thread-safe, but the support functions are going to require some more work from us. Yeah, like the attributes and so on. Yeah, yeah. Okay, so this has been really good. I have a couple of questions for you guys. If you could be dictator for a day of the MPI Forum, what major changes, or even a minor change, something that bugs you, would you want to make to the MPI spec? Hey George, why don't you take this one first? Why me?
Well, what George is alluding to is that both of us are actually on the MPI Forum, right? George represents the University of Tennessee and I represent Cisco, and both of us have multiple proposals up in front of the Forum already, in various different flavors, some big things, some little things, and so on. So we're not dictators for a day, but we do have our own little pet-peeve projects that we want to answer and get in there. For example, I think the biggest proposal I have in front right now is a better ability to layer MPI underneath different languages. You know, it's cool that we have C, C++, and Fortran bindings, but what about the guys who want to do Python? I've actually even heard of people who want to do Ruby, MATLAB, various other languages that they just want to use MPI as the underpinnings for. I think we as a standard can do a better job of supporting that kind of model. George? Well, so Jeff was partially right, because yes, I have my pet project. But it's not what I would do if I were dictator of MPI. Actually, I think I would try to slash out some of the things. I think the problem with MPI is that it gives you so much; I mean, we have a lot of functions, and they are all useful for some people, but when you try to learn something new and you look there, you see, wow, man, there are four hundred or so different functions, and each one of them does something different.
I think it's a little bit scary for people. And on the other side, I think that if we decreased the number of these functions, we as implementers could focus on what really matters: being scalable, giving the best performance out of the cluster. Then other people could write libraries on top of it that give the final user the extensions they're looking for. So it's something that I don't know if I would like to enforce, but I think it might be an interesting idea to look at: a tiny MPI. Not necessarily tiny, but keep everything related to communication and so on inside, and then move everything else into extensions to MPI. So put MPI on The Biggest Loser is what you're saying. Nice pop-culture reference there, Jeff. Okay, actually, Jeff, I'm going to wind up here in a little bit, but there was something you and I had talked about at SC this year that was quite interesting to me, given that the next show after this is going to be Joshua Anderson from the HOOMD project, which is a CUDA project. For those of you who don't know what CUDA is, it's NVIDIA's platform for actually putting scientific applications on a graphics card. And there was some discussion about how we can better write distributed-memory parallel programs that utilize graphics cards, and we were talking about mixing MPI in with that. Now, what kind of ideas do you have for doing that?
Well, there are some interesting ideas going on there, and the very first thing I want to say is that none of this is going to be the answer, but I think that MPI has at least a small role to play in the GPU and hardware acceleration of mathematical operations. There are at least two different ways to do this, and I say this because it came up at Supercomputing: a gentleman from the NVIDIA booth came up to me and said, hey, I actually have some customers writing MPI programs and doing GPU kinds of things with them. I want to be able to MPI send and receive directly from GPU memory, because it's separate from the main memory. We can do some forms of sending and receiving, but we can't do direct forms where I actually RDMA the message directly to and from the GPU memory. Can we figure out how to fix this problem? So that's one thing we need to fix in the network stacks. Sorry, tiny MPI. But the other one: I was talking to this guy from NVIDIA, and we said, well, what about this? MPI has a function called MPI reduce, where MPI will perform a mathematical reduction function for you. So every one of your MPI processes has, say, 2000 double-precision numbers that are part of a vector, and if I want to do a global sum across that vector, this is an intrinsic operation in MPI that does the communication and computation all in one step for you, and then gives you the answer. But this kind of operation is very natural to farm back onto some kind of hardware acceleration, be it a GPU or otherwise. And so one of the things we did when we walked away from Supercomputing this year was say, yeah, that's a very intriguing idea; we should investigate having MPI take advantage of hardware acceleration. Now, in addition to that, so that's a parallel operation, right, the reduction operation, a global sum across all my processes, you
know, our global product, or whatever the operation is. In an upcoming version of the MPI standard there's a proposal from a guy by the name of Torsten Hoefler from Indiana University for a new function that lets you do a serial reduction: just in one process, give me the sum of this vector right here. So I don't have to use the MPI mechanisms for communication, but I do want to use the MPI mechanisms for computation. And this seems like a very natural way to hide some of the GPU complexity behind it. You've got people who are already familiar with MPI; if we can just make MPI reduce a little bit faster, because we're farming it out onto accelerator hardware, or give you this new local reduction function that's not a communication thing but hides some of the CUDA, or OpenCL, which is another programming framework for hardware-acceleration kinds of things, hide that behind there, then they don't have to learn a new API but still get the power of that acceleration. You know, that could be pretty neat. And so we just completed phase one of that in the Open MPI code base: we added the infrastructure to have plug-ins, or components, for CUDA and OpenCL and MMX and SSE and things like that. So now the next step is to actually write some of these plug-ins, so that if you've got a GPU we'll just automatically use it, very much in the same light that George was talking about before. Open MPI will automatically choose the best network, and by the same token we'll say, oh, you have a GPU; well, I'll just load up that plug-in, and any time you call an MPI reduce I'll farm it back there. Actually, that brings up a good point: there's an opportunity for a lot of research here, because those accelerators are hanging off of a bus that may be fast, or may not, and we're seeing this sometimes with the GPUs. So it would actually be kind of nice if
you could do that serial reduce and it would just magically take advantage of 128-bit SSE, or 64-bit SSE, or a GPU if your data is large enough to move across that bus, utilize the massive horsepower and massive memory bandwidth, move it back, and actually be efficient. So actually, I completely glossed over the decision about whether you're going to farm it back to the GPU or not. It might be, well, we're only going to do this when it's more than a hundred doubles or something like that, or the hardware is not even capable of doing double precision, it can only do single precision. So there's actually a decision you have to make first: am I going to farm it back onto the hardware, or am I just going to do it on the main CPU? And this is the power of the Open MPI project. You know, I'm from industry, I'm from Cisco; George is from the University of Tennessee, from academia. We've got the yin and yang of the project: I went and did the infrastructure part, and now I'm handing it off to some of the academics to say, all right, now you guys come up with this. How do we do this? I don't know the right answer. We need to experiment, we need to research and figure out the best way to do it, and the universities and the academics are much better suited to doing that than the vendors. That's one of the things that makes the Open MPI project work so well. There'll be another PhD out there thanks to this idea. Absolutely, and that's part of the reason, like I said back at the beginning, that the Open MPI project really wants to engage everybody. There are a lot of people with smart ideas out there, and whether you're a grad student with just a cool new idea for an algorithm or something like that, we want to talk to you and we want your stuff to be part of Open MPI.
That's why we're a community; that's why we work together. Oh, cool, that's really great. So where can the Open MPI project be found, and how can people get involved? Our central website is open-mpi.org, and if you go there, probably the best way to get involved is to start looking at our mailing lists: join the mailing lists, join the conversation, ask questions, and things like that. And if you actually want to start contributing, there is an intellectual property agreement you have to sign, because we're trying to do this properly, right? We don't want to just take code and hope that we can redistribute it under the BSD license. There is a form that you have to sign, and the form is actually the Apache Software Foundation form; we very definitely benefited from that project's work there. We just changed Apache to Open MPI, and it basically gives us the right to redistribute things under the BSD license. Again, same disclaimer: I'm not a lawyer, this is not legal advice, but for anyone who contributes code back to Open MPI, we do require that that form is signed. Now, I don't want that to sound scary, but it is one of those necessary legal things. We love having more people involved, and it usually starts with just a conversation, or somebody saying, hey, I've got a wacky idea, and we start talking, and fun and interesting things happen from there. Okay, cool. Well, guys, thanks for taking some time out to talk with me. Thank you for being the inaugural show of Research Computing and Engineering, and I hope to see you guys around. I tend to be at SC every year, so I'll see you guys there, too. So thanks again for being on. Thanks for inviting us. Yeah, thanks for inviting us. This was great. We appreciate your time. No problem. Thanks a lot.