Okay, with that out of the way, let me quickly thank Ralph and Jeff again for taking the time to set up this presentation — and not only this presentation, but also the previous two parts. So thank you very much, Ralph and Jeff. The previous two parts are already available on the EasyBuild YouTube channel, and part one is quickly climbing to be the most popular video on our channel right now. It's pretty clear that this is an important topic to cover. And with that, I'll pass the word to Ralph, who will start part three of the presentation.

Thanks, Kenneth, and thanks everybody for attending. Obviously, this is presented by the EasyBuild community, and we want to give them some recognition and thanks for this opportunity to explain this to you. Like Kenneth said, the session is being recorded. We ask that you put your questions in the Q&A panel; Kenneth will moderate those and pass them along to us. We welcome them as we go forward, and anything we can't cover along the way we will take care of at the end.

As a quick overview: several things were covered in part one, and then again in part two — you can see the breakdown there. Today we're going to concentrate on going over the PMIx reference runtime environment, then go into configuration and debugging tips, and then give you some previews of what's coming in the 4.1 and 5.0 series of Open MPI. We talked about doing a recap amongst ourselves, and the answer is there's just too much. So I've provided some links here — they'll be in the slides when we distribute them — pointing you to the part one and part two videos and slides. I apologize for that, but we couldn't come up with a meaningful recap that wouldn't consume a large part of the session.

So with that, let's launch into PRRTE, as we call it. Oh — before I actually do that, let me address some questions that were left over from session two that pertain to PMIx.

We had a question about examples of applications for the asynchronous cross-model coordination activity. I have a link here to a EuroMPI paper from 2018. I couldn't find the link that I wanted to find — there was another paper given at Supercomputing by the same author, Geoffroy Vallée, that also went over it — so I would refer you to those. There's also a portion of the PMIx standard, a chapter, that actually deals with this as well. So you can get it in any of those three places.

Pros and cons of srun versus mpirun: I know Jeff touched on this a little, but I thought I'd answer a little more from my perspective. mpirun obviously offers a lot more options, because it's specific to that MPI implementation. The mpirun you'll find in Open MPI has a larger range of PMIx support in it: the dynamics, the job control, and the monitoring features that I described are all supported there. The con to using mpirun is that historically it's MPI-implementation specific. If you use the mpiexec or mpirun from, say, MPICH, you won't find any of those features in there — it really varies between the MPI implementations. But for Open MPI, it does cover a lot of this, and that gets even more extensive when we go to OMPI v5, because what we're going to talk about today — the PMIx reference runtime environment, PRRTE — is actually embedded in OMPI v5 itself. So it expands the range of what you can do.
The biggest positive for srun is that it works the same regardless of which MPI implementation you're running against it — that's really its biggest benefit. The negative, like I say, is that it doesn't have all the options.

I was also asked about giving a separate talk that goes through the PMIx launch orchestration procedure. I'm happy to do that; I'll talk with Kenneth about scheduling it. It's about an hour-long talk to walk through it all, but it does go through everything about how you orchestrate and interact with storage systems and fabrics as well.

So what is PRRTE? As I said, it's the PMIx Reference RunTime Environment. It's kept up to date with PMIx, in terms of both the master branch of the repo and, obviously, any release branches. So whatever is in the latest version of the PMIx standard is fully supported in this runtime. It's set up so you can use it on a per-user basis: you can just get your allocation and then launch PRRTE in it. It will fill the allocation for you, and basically it looks just like a shim. Everything that's available from PMIx you can then run underneath PRRTE, even if your host environment — Slurm, IBM's Job Step Manager, whatever it is — doesn't support it. One of the places it's used a lot is on Crays, since Cray doesn't support dynamic or multi-application environments very well, if at all. People are using this as a way of getting around that: they get an allocation from their Cray ALPS environment — or Shasta, if you're starting to move there — and then launch PRRTE underneath it so they can run workflows and other such things cleanly.

So it's a persistent DVM — a distributed virtual machine. It launches its daemons on all the allocated nodes at the beginning, and then there's a tool called prun that takes the place of mpirun: you just launch your applications using prun and they run against the DVM. Then, when you're done, there's a pterm command that tears it all down. It actually comes out of Open MPI — the original ORTE runtime was taken out of there — but it forked several years ago and is its own standalone project now. And there are a bunch of people looking at including it in their distributions; actually, I guess that will happen with the next PRRTE release.

So this is how it's used. Like I said, you get your allocation, you execute prte, it reads the allocation and launches daemons everywhere. Then, when you do prun, those daemons fork/exec the procs everywhere, and it completely isolates you from any limitations of the host environment. So it's used as a shim in all the non-full-featured environments: Cray ALPS is a really popular one, Slurm is another where you see it a lot, and there are PBS Pro environments where people are using it. In fact, the Altair folks have said they may very well package PRRTE along with PBS Pro in the future, so that when you start up PBS, it automatically launches PRRTE underneath as a launch support mechanism. You might see that coming out later this year.

It also provides a user-level development environment. The debugger folks who have been developing their integration with PMIx were actually using PRRTE to do that, because they could just start it up and have a PMIx-based environment, per user, without having to run as root.
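To make that flow concrete, a minimal PRRTE session inside an existing allocation (Slurm shown) might look like the sketch below. Treat the exact option spellings as assumptions on my part — they have shifted between PRRTE releases — and check prte --help on your install:

    $ salloc -N 4             # get an allocation from the resource manager
    $ prte --daemonize        # start the DVM; daemons spin up on all allocated nodes
    $ prun -n 8 ./first_app   # run jobs against the persistent DVM...
    $ prun -n 4 ./second_app  # ...as many as you like, back to back
    $ pterm                   # tear the DVM down when you're finished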
One of the more popular uses is with workflow managers. We're seeing PRRTE being integrated by the national labs into their workflow managers, like ADIOS and Swift/T and others, because it gives them fully dynamic operations — multi-application and multi-tenant operations are supported. And it's actually very fast at launching: there have been several papers published on this, showing you can launch your applications up to several hundred times faster than you can with, for example, ALPS, which was one of the systems compared against. And then, like I said, it's going to be the base runtime for OMPI starting with v5.

So it fits like this in Open MPI: it's actually its own separate little project area. It uses the same MCA framework system that all the other OMPI-related projects do, and you'll find there are a bunch of frameworks in there — I'll talk about some of those. Same MCA component architecture — we stole it straight out of Open MPI — and the same build system as Open MPI. The big difference is that there are no embedded libraries in this. It requires libevent and hwloc, and it also requires PMIx — we only support versions 3.1 and above — but none of those are embedded, so they have to be provided externally. There are optional things here too: if you want to run in a PBS environment, for example, we have support for that, plus support for ALPS, LSF, and Grid Engine. We auto-detect all of those, so you don't have to manually specify them unless they're not in standard locations, in which case you need to tell us where they are. We also auto-detect Slurm, and we have Singularity support built in, so you can launch containers without necessarily having to tell us that it's a Singularity container. For those interested, you can go to the Sylabs website and they'll tell you about this integration, and if there are questions about it, I can point you to some things. We also auto-detect zlib — that's libz — which we use for compression of the data for faster launch. That's a valuable thing: if we see it, we will use it.

Some of the key frameworks here: there's the process placement one, called RMAPS; there's the inter-daemon communication system, called OOB for "out of band"; and there's the launch system that launches the daemons and also communicates the launch command to those daemons, called PLM. The entire PRRTE system is basically an event-driven, asynchronous state machine, so that framework is called, obviously, STATE. And down at the bottom here, I provide a link to a set of instructions, starting with how you get the software, that walks you through how to build it and how to run it.
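As a rough sketch of what that build looks like — the repo lives in the openpmix GitHub organization, the install paths here are placeholders, and the linked instructions are authoritative for the exact configure options — it goes something like:

    $ git clone https://github.com/openpmix/prrte.git
    $ cd prrte && ./autogen.pl
    $ ./configure --prefix=$HOME/prrte \
                  --with-pmix=/opt/pmix \
                  --with-libevent=/usr \
                  --with-hwloc=/usr     # external dependencies: nothing is embedded
    $ make -j 8 install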
There's an adaptive command line in this system. PRRTE is designed to support multiple MPIs, and also OSHMEM environments, et cetera, and each of those communities has its own command line: Open MPI has certain command line options, MPICH has others for its Hydra system. To pick those up and make them as comfortable as possible for users, we added what we call the schizo framework. It allows us to create plugins that detect which flavor we're actually being asked to support — MPICH, Open MPI, and various versions of OSHMEM are all supported in there — and then we configure the command line to look for the options that that particular environment natively supports.

That detection is based on the absolute path behind argv[0]. When we see which prun you are executing — what the alias for it is — we go through and look at that absolute path. There's a set of configuration files, .ini files, that we expect to find in our installed prefix's etc area, and those .ini files say: here are the absolute paths of the symlinks to prun, and here is what those absolute paths correspond to. So in an OMPI .ini file, for example, what it's saying is that if the symlink being used resolves to an absolute path for mpirun that looks like the one shown here, then this is actually being run under an Open MPI alias. Same for the next one for mpiexec, or for the OSHMEM version embedded in Open MPI. So if we look at argv[0] for prun — the alias actually being executed — and see that this is the symlink being used, then we configure the command line to look for Open MPI command options. If we see that the symlink was defined in the MPICH .ini file, then we configure the command line to use MPICH command line options. The MPICH folks are looking at perhaps converting over from Hydra to using PRRTE as their base, based on this symlink capability and the ability to reconfigure the command line. So you may see more of this being used in the future by other programming libraries, as a way of having a standardized runtime.

MCA parameters are obviously part of PRRTE, since it's based on the same MCA system. There's a major difference in how these are handled from what you're used to in Open MPI, and you're going to see that reflected in OMPI v5. The reason is that there's actually a two-step system here. First, you start the DVM with prte. MCA parameters that relate to the frameworks the DVM itself uses — OOB, RML, et cetera — really have to be given to prte; if you give them to prun when you actually tell the job to launch, it won't know what to do with them. So you have to be aware of when these things have to be provided. Now, if it's a symlink where somebody types mpirun, then you're okay, because they'll automatically be passed through to PRRTE. But if you're running them separately — if you're doing the DVM mode where you launch prte separately and then use prun to launch your applications — you need to be aware of this.

In most cases with PRRTE, you'll find that the MCA parameters only control the default behavior — for example, the mapping or binding options — and the per-job behavior is controlled by prun command line options; we ignore any MCA parameter on that command line for those values. I know this is a little confusing, and I'm running through it fast because of our time constraints. I can easily come back and give you more information about this, and I'm also putting it on the website to help guide people when they look at PRRTE. But basically, the main ones up here apply to the DVM-level connections and behaviors themselves; they're not something that changes on a per-job basis. And then there are things like RMAPS and hwloc settings that do get changed on a per-job basis.
For those, the MCA parameters only set the default behavior, and then you control it on the prun command line for each job. A lot of ORTE MCA parameters are gone, because they just don't apply to PRRTE, so we don't support them.

There's a standardized way of querying and setting these runtime parameters, and there are lots of ways you can give the MCA parameters. This gets a little confusing, but I'll try my best to get through it — and again, we're going to document this for you. Basically, if you're looking at Open MPI parameters, the double-dash --mca option will work: we will automatically translate it to an OMPI MCA parameter if we see that it pertains to an OMPI framework. We respect the OMPI MCA environment variables, and we still respect the Open MPI default params files, both user level and system level. We follow a similar pattern for PMIx: PMIx parameters are supported in the environment, and there's also a user-level and a system-level file. And we do the exact same thing for PRRTE. So you have three levels of MCA parameters that you can deal with — see the sketch after this section.

There are some differences. So, PRRTE command lines — Jeff gets excited about this, so I'll just emphasize it here: all the PRRTE commands have only a single R in them. The project has two Rs, because it's an acronym, but because we don't like stuttering when we type, there's only one R in the commands. You use prte to start the DVM and prun to launch the jobs. OMPI-prefixed MCA parameters should be given with the "o" in front to indicate that they're OMPI. Again, we will look at a vanilla --mca option and translate it if we see that the framework or parameter being referenced is an OMPI one. Then we have PMIx MCAs and PRRTE MCAs explicitly called out as to what they are. And, like I said, for the generic --mca we do our best to match the parameter name against the frameworks known to each project. Right now we've got them all pretty well nailed down, but it's better if you use the specific ones, so we know. Like I said, we pick up the envars and the params files: PRRTE will automatically pick up the system and user default parameters for both Open MPI and PMIx and forward those, and it does the same for its own. So you don't have to worry about forwarding things — we do it for you.

Some build tips. PRRTE has no public APIs; applications never link against PRRTE. So you don't have to worry about mixing and matching the PMIx, libevent, and hwloc libraries between PRRTE and any of the application libraries — it doesn't matter. I want to emphasize the symlink thing again: we need it so we know which flavor is being executed, so that when somebody types mpirun we know which command line we should be mimicking. So be sure you set up those .ini files. And like I said — this is for Jeff — PRRTE is the project name; it's been around for a while, and all the packages and libraries are labeled with two Rs. Single R is the operational name, because users had trouble stuttering when they were typing and it got confusing. So we just said: okay, PRRTE is the project name, and the operational name — all the tools and MCA parameters — is with a single R.
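Here's the promised sketch of those three levels. The prefixed option spellings (--prtemca, --pmixmca) and the file names are my assumptions based on current PRRTE behavior, so verify them with prte_info on your build:

    # PRRTE-level parameter, given to the DVM itself at startup
    $ prte --prtemca plm_base_verbose 5

    # PMIx-level parameter, explicitly called out
    $ prun --pmixmca ptl_base_verbose 5 -n 2 ./app

    # generic form: PRRTE matches the name against the known projects
    # (btl is an OMPI framework, so this is translated to an OMPI param)
    $ prun --mca btl tcp,self -n 2 ./app

    # default files are also read, e.g. <prefix>/etc/prte-mca-params.conf
    # (system level) plus a dot-file in your home directory (user level)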
There's also a set of tools that comes with it. There's a wrapper compiler, which you really don't need to use; it's there solely to ensure that if you want to build an app to run against PRRTE, all the PMIx, libevent, and hwloc bits are properly taken care of. It's just there for convenience — we use it mostly for testing purposes — but it is there if somebody wants it. Obviously prte is there; that starts up the DVM. There's prte_info, à la ompi_info, which gives you all the build information. There's the daemon for the backend nodes. prun is the launcher, and pterm is what you use to stop the DVM.

Some debugging tips. Obviously this thing is designed to scale — it's been run on clusters with 10,000 nodes and it scales very well — but if you want to simulate scale on a small allocation just for test purposes, there's an MCA parameter you can set in the routed framework. If you set the routed radix to one, it creates a linear chain of daemons: it behaves just like the scalable communication system, except with a scaling factor of one. So you can take an eight-node cluster and make it look like it was an eight-times-64-size cluster. You can also use the RAS base multiplier, which allows you to launch multiple daemons per node. You can't run MPI jobs this way, because shared memory gets confused, but you can use it for testing runtime scalability — to verify that you can launch a /bin/true or hostname application and see it running at scale.

There are PMIx tools to help you with diagnostics; I won't go over those here, but they are there. And then there are the verbosity settings, just as you may be familiar with from Open MPI. As starting points I would suggest: for the launcher, plm_base_verbose, and for the asynchronous state machine, state_base_verbose — set those equal to five and you'll get a lot of detail about what's going on. Those need to be set on the prte command itself; they let you see what the launch command is and what all the output is from the remote daemons. If you're having trouble with the DVM itself, you can set oob_base_verbose and errmgr_base_verbose; those let you see what the communication is and what the errors are. And if the daemons are starting but the procs aren't working, the first thing I always do is set the PMIx server verbosity option to five — again, on the prte command line — and that will tell you what prte is seeing come out of the PMIx library.
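Putting those starting points together, a debugging session might look like this sketch. The parameter names are the ones just mentioned, while the --prtemca prefix spelling is my assumption — double-check both against prte_info:

    # launcher + state machine tracing, set on the prte command itself
    $ prte --prtemca plm_base_verbose 5 --prtemca state_base_verbose 5

    # trouble with the DVM itself: watch the communication and the errors
    $ prte --prtemca oob_base_verbose 5 --prtemca errmgr_base_verbose 5

    # daemons start but procs misbehave: watch the PMIx server interactions
    $ prte --prtemca pmix_server_verbose 5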
So, my concluding remarks. Again, thank you for your attention, and especially Kenneth for setting all this up. Just a reminder: PMIx is the standard — it's a document. OpenPMIx is the library; that's the reference implementation of the PMIx standard, and at the moment it's the only implementation out there. I'm not hearing from anybody that they want to write their own — it's a lot of code and a lot of plumbing — but someday somebody might, so we do keep it separate from the standard. And then PRRTE is the reference runtime environment, and it gives you this full-featured environment. If you want PMIx support, there is a growing movement to provide it: it is there in Slurm, it will be there in Shasta, and, as I mentioned earlier, PBS is moving in that direction. So I think it's becoming pretty clear that it's going to be generally available. But if you need some particular thing out of PMIx, be sure you include it in your RFPs, and at the desired feature levels, so people know what it is you actually want. We are restructuring the standard document to make that easier, so you can be clear: "I want PMIx according to these particular feature levels per the standard." You can also ask your vendor to integrate it to the desired feature level if you already have a system. And if you find you just don't have it, or can't get it as quickly as you'd like, consider using PRRTE as a shim — that's something people have found to be fairly easy to do and fairly useful. So with that, I'm going to turn it over to Jeff.

Okay, here I am. Kenneth — there we go, he gave me the ball. Let me share my screen. All right. Thank you, Ralph. Let's jump into this part, which is overall Open MPI configuration and debugging types of things. This is a roll-up of a bunch of the questions we hear most commonly, and our suggestions for working through them at your site.

We usually give the same first several steps. If you're having problems with Open MPI — particularly launching problems, or you launch and weird things happen right off the bat, so it's not something deep in your MPI program but something regarding launch or the initial startup of the application across one or multiple hosts — here's what we suggest. Start with something simple: a non-MPI program. Fun fact: mpirun — regardless of whether you're on 4.x and below or 5.x and above, i.e., the old ORTE system or the new PRRTE system — can actually launch non-MPI programs. The hostname I have listed there is just the Linux/POSIX hostname utility. So start by trying to run a small non-MPI application and make sure that launches. That covers a whole wide swath of things right there: SSH keys, permissions, all kinds of things like that. It basically just tests whatever the underlying Open MPI runtime system is — again, ORTE in Open MPI 4.x and below, or PRRTE in Open MPI 5.x and above. It doesn't test anything about MPI at all; this is just the basics of the runtime itself.

Once you get that working, try a trivial MPI program: start up the MPI layer and then shut it down, without actually sending any MPI messages — a hello world kind of thing. We have a hello world example program in the Open MPI distribution tarball, in a couple of different languages; I'm showing the C one here. If you go into the examples directory and type make, you will see a hello_c. Just try running that; it prints a somewhat lengthy hello world message — "hello, I am X of Y" — and I think it prints the hostname and the build string and a couple of other things like that. Get that working, and sort through any file system or permissions problems you have, with Open MPI complaining that it can't find plugins or whatever the problems are there. Once that actually works, take the next step of doing actual network-based communication.
So ring_c is another program we have in that same examples directory — this is the C version of the ring program. Run that one. Here I'm showing all of these running on a single node — a single server, whatever your favorite term is for a single machine. This will actually do some MPI sends and receives, so it fires up the MPI layer and then also the networking layer, and it can work you through network selection mechanisms, network stack, API and library selection issues, all kinds of things like that. Each of these tests builds on and activates more of the Open MPI code base as you go along.

Now, once you get beyond one node, add a little complexity and do it on multiple nodes. The syntax I'm using here just specifies three hosts — host1, host2, host3 — and the little ":1" says we have one slot on each of those hosts; hence the -np 3: run three processes, one on each of those hostnames. Do the same thing with hostname, hello_c, and ring_c, again building up that complexity: first just try the launcher across multiple hosts, then fire up and shut down the MPI layer, then actually pass some MPI messages between them. You might also run this through your batch scheduler, in which case you might not need the --host command line parameter to specify where it runs. If you run through Slurm or Torque or whatever your favorite local system is, you might not need this whole --host thing; I just put it on here for reference, for unmanaged environments where you have only SSH, for example, and you have to specify on the command line where to run. As I said, this sequence is really very helpful in smoking out problems, starting at the bottom and working all the way to the top, where you have an actual MPI application passing MPI traffic.
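Put together, the suggested escalation looks like this (hello_c and ring_c are built by typing make in the examples directory of the tarball; host1 through host3 are placeholders):

    $ mpirun -np 2 hostname                                  # runtime only, no MPI
    $ mpirun -np 2 examples/hello_c                          # MPI init/finalize, no traffic
    $ mpirun -np 2 examples/ring_c                           # real sends/receives, one node
    $ mpirun --host host1:1,host2:1,host3:1 -np 3 hostname   # launcher across hosts
    $ mpirun --host host1:1,host2:1,host3:1 -np 3 examples/ring_c   # MPI traffic across hosts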
Now, the next things here are other problems we hear often. The very first one we hear a lot is: please check your PATH and your LD_LIBRARY_PATH. This is particularly for first-time users — people who are just getting used to parallel programming: "wow, my application is running on lots and lots of servers simultaneously." That's something you have to wrap your head around. This is not normal; what we're doing in the HPC community is pretty complicated stuff, and the concept of running on lots of servers simultaneously is a little weird. So they don't think about the fact that their .bashrc, or whatever their shell startup files are, needs to set PATH and LD_LIBRARY_PATH even for remote non-interactive logins. Slurm and other batch or scheduled environments tend to take care of this for you by propagating the environment of the head node onto all the nodes where you're actually running, but that isn't necessarily universally true. So we strongly encourage people to check their PATH and LD_LIBRARY_PATH both locally and remotely, to make sure you're actually getting the Open MPI installation you think you're getting. Are you getting the mpirun you expect? Are you getting the libmpi.so you expect, et cetera? Sometimes you can be surprised: through some kind of loophole, on some remote nodes you're getting a different libmpi altogether, and that's why things are completely falling apart.

Bonus points, too: the Linux command for checking which libraries your program links against is ldd. If you run ldd on your MPI application, it shows you all the libraries it's going to link against, using your current LD_LIBRARY_PATH and other system setup to show you where those libraries are. So it's not just going to say "yes, it links against libmpi.so" — it shows you the absolute path of that libmpi.so. This is another way to verify that you're getting the library you think you're getting at runtime.

Another common thing we've seen in the last couple of versions of macOS — I don't remember exactly when this started, but it was at least a version or three ago — is that the $TMPDIR value you get when you launch a shell on macOS is really long. It's not /tmp, it's /blah-blah-blah, a really long path. This can actually cause problems for Open MPI, because we start in $TMPDIR and make shared memory files, metadata files, all temporary stuff during the execution of your MPI job, and sometimes we exceed the maximum filename size for the file system — and you get fairly amorphous errors back, I will say. So on macOS we actually encourage people to just override it: make a $HOME/tmp, then export TMPDIR=$HOME/tmp, and you should be well underneath those maximum filename sizes. It's a little annoying — macOS did it for reasonable reasons, to be honest, so we can't really complain — but it is something that catches people by surprise. And we see this because people love to develop and test on their laptop, which is a completely reasonable thing to do: they do small runs on their laptop and then take their application over to the big iron, an organizational resource at a university or research organization or company. So be aware of that for macOS runs.

The third thing we see is that we unfairly get blamed — and of course we all want to say it's not our fault at all. We all know every layer of software has bugs; I wish Open MPI didn't have bugs, but we do sometimes. The point here, though, is that if somebody runs an MPI application and it crashes, you might get quite a few error messages out, and the stuff at the bottom of your screen may make it look like it was Open MPI's fault, or PMIx's fault, or PRRTE's or ORTE's fault — something underneath the actual application. That may not be the actual cause of the crash. You're just seeing the bottom of the chain, and the bottom of the chain may not show the first thing that went wrong — the real problem — with everything after it being a consequence of that first problem. The canonical case we tend to see: there was actually a bug in the application that caused a segfault — you overran a buffer, a legitimate programming bug — and one MPI process died. Then another MPI process tried to communicate with it and got an error, because that process was no longer around, so MPI aborted. If you only look at the last couple of lines of output, you'll see: "MPI chose to abort because we couldn't communicate with process 13."
And you don't realize that process 13 actually died due to a legitimate problem in the application itself. So our guidance is always: scroll back up. Find that initial error, and that will actually get you on the road to finding out what your problem is and moving on from there.

Another question we get is: how can I tell which network I'm using? A lot of environments out there have multiple networks — for example, Ethernet and InfiniBand, or Ethernet and Cray-based networking — and you want to make sure you're using the good network, whatever it is. I'm at Cisco, so we're an Ethernet-loving company, but even we have machines with multiple Ethernets, where one is one-gigabit Ethernet and the others are 40-gigabit. You want to make sure you're using the right 40-gig Ethernet, and that you're not also spraying traffic across the gigabit Ethernet, which would just slow everything down. If you look back at part two of the seminar — in the slides, or go back and watch the video — I talked about how you can force a given network to be used. That is the absolute best way to know. Usually Open MPI will pick the best one for you, but there are definitely cases where Open MPI can't know, and it needs a human to say: "I want you to use this network, or this set of networks — don't use the other ones over here." You might have multiple 40-gig networks where one is for storage and one is for MPI; there's no way Open MPI can know that unless an administrator or another human tells it which one to use. So go back and look at part two for that, and use it in conjunction with MPI benchmarks to make sure you're getting the performance you expect. Whether you're using InfiniBand, high-speed Ethernet, or whatever: run the benchmarks and make sure you're getting the right order of magnitude of performance. If your performance is significantly lower, or your latency significantly higher, than you expect, go back and investigate which network you're actually using, pair that with these MCA parameters to force which network is used, and then figure out why that isn't happening automatically.

Side note — I probably should have put this on the slide: if you find you do need to give Open MPI assistance on which network to use, you can always set those MCA parameters in the system-wide parameter file, so your users don't all have to specify several MCA parameters to get the right environment. You, as the administrator of the HPC environment, can put this in the system-wide config file we talked about earlier, and then your users just "mpirun a.out" — they don't know, they don't care, which is the level of service we like to give our users.
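As an example of that kind of forcing — part two has the details, and the right parameter depends on your transport; eth1 is a placeholder — restricting the TCP BTL to one interface looks like this:

    # on the command line, for a single run
    $ mpirun --mca btl_tcp_if_include eth1 -np 2 ./app

    # or for all users, one line in <prefix>/etc/openmpi-mca-params.conf
    btl_tcp_if_include = eth1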
Also, we have an additional feature coming in version 5.0 for this, which I'll talk about in a few slides.

The next one is a not-uncommon question, and it mostly comes from independent software vendors. They say: "All right, we build Open MPI, but then the customer can choose to install our application — which includes Open MPI — into an entirely different directory on the target machine in the customer's environment." We actually do have three environment variables for this, and I included the PRRTE one because that's for Open MPI 5 and above. If you set these three environment variables, that will reorient Open MPI to say: this is where I should expect to find all my plugins, my help files, the metadata, the config files, all those kinds of things. I didn't want to spend a huge amount of time on this, but we do get the question periodically, so I figured I'd put it in here.

Here's a super horrible issue we have seen coming up due to the segregation of what we've done with some of our libraries. We've split out PMIx — and the split itself is not so much the issue — but we've had requests from the Linux distros saying: stop embedding hwloc, stop embedding PMIx, stop embedding libevent; we want to use the system-installed ones and not have an extra copy just for Open MPI. A completely reasonable request, actually. So, starting in Open MPI 4, Open MPI actually prefers external copies of PMIx, hwloc, and libevent if it finds them when you run configure. I talked all about this back in part one, so go back and have a listen, and look through those slides for the details. However, it can get a little confusing, because these libraries are becoming more and more popular in HPC environments. Sometimes the applications themselves also link against these libraries and potentially make their own calls into them. So if an application links against PMIx, you want to make absolutely sure that Open MPI is using the same libpmix.so that the application is, because if they're just a little different, and you have two different copies of libpmix.so in a single Linux process, all manner of bad things will happen, and it really can be effectively random behavior: random segfaults, random crashes, or "this variable was uninitialized" because it was actually initialized in the other copy of the library. This stuff can be really insane to track down. So be aware that this can be an issue whenever the applications themselves are using hwloc, libevent, or PMIx, because Open MPI is using all three of those as well. If your applications use these libraries, be really, really aware of this duplicated-library problem, and make sure the application uses the same libraries that Open MPI does — not just the same version, the same .so itself. That is the best way to guarantee you don't hit this duplicated-library-inside-a-single-process issue. There are a huge number of caveats here, and I'm not going to go into all of them, because every time I go down one of these paths and think I understand how runtime linkers work, I find out I don't know anything about how runtime linkers work. There are a lot of corner cases. Just go simple — don't reason "well, this should work". Use the same libpmix.so, the same libhwloc.so, and so on, and avoid all those corner cases of things you think should work.
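A quick way to sanity-check that is the ldd trick from earlier: compare what your application resolves against with what Open MPI's own library resolves against. A sketch, with placeholder paths:

    # which pmix/hwloc/libevent copies does the application pull in?
    $ ldd ./my_app | grep -E 'libpmix|libhwloc|libevent'

    # and which ones does Open MPI's libmpi pull in?
    $ ldd /opt/openmpi/lib/libmpi.so | grep -E 'libpmix|libhwloc|libevent'

    # the absolute paths printed by the two commands should match exactly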
All right — how do you get help from the community? First off — and I didn't even put this on here — you should be talking to your vendor if you're having a problem, particularly with their network or their environment; they're probably your first step. But if you're just a random person out there who downloaded Open MPI and ran it on your laptop or in your environment — great, we actually have tons of resources for you. Here is the link to the help page. It is probably a bit intimidatingly long, but for good reason: we really need you to supply as much detail as possible. Open MPI is used in a huge variety of environments across the world. We don't know how you're running your job, how it's configured, what equipment you have, or how the local environment is set up. Don't assume we know what you know, because we are not sitting right next to you; we know nothing about your local environment, even if it's well known to all of your users. So please describe what your program is supposed to do, describe what it's actually doing, and, if at all possible, give us a small reproducer program. Please don't point us at "I'm using this giant well-known application and this is happening" — a small example is so much more helpful, because we may not be familiar with the giant popular application that is very useful in your environment; we may not know anything about it.

Another thing to do, which I mentioned on a previous slide: check your baseline MPI performance in your environment. Run some MPI benchmarks — the Ohio State (OSU) ones are a decent set. They will tell you what your baseline performance is; compare that to what you're seeing in your application if you're seeing a performance problem. If your baseline performance is horrible, then it's probably not an application problem — it's probably something in your environment. Those kinds of things.

All right, so those are my tips on debugging and configuration. Let's talk a little roadmap-y stuff, starting with the 4.1 series, which is actually coming up soon-ish — I put "expected approximately August"; we'll see how that goes. So what's coming in 4.1? Well, there's a bunch of general performance improvements, which is just ongoing work — "hey, we can tweak a little thing here and make a little thing better there". Not really worth mentioning in detail, but there are always miscellaneous performance improvements. Some tangible things: there are some libfabric — otherwise known as Open Fabrics Interfaces (OFI) — improvements. The OFI MTL now supports multi-NIC environments, and there are also some one-sided performance improvements. I'm not going to go into the detail of how that works, but basically, in libfabric environments, our friends at HPE and Amazon and Cray and elsewhere all worked on these things to support multi-NIC environments and to increase performance for MPI one-sided operations — MPI_Put, MPI_Get, and so on. Our friends down at the University of Houston did a lot of improvements to OMPIO. OMPIO, which I mentioned back in part one, is our whole set of plumbing for MPI parallel I/O. They added support for — oops, I'm sorry, Lustre should not be there; I will remove that before the slides are published — they added support for IME and GPFS. Lustre support is coming in version 5.0; sorry, that did not make it back into 4.1.0. And since 4.1 is a minor release in the 4.x line, it is backwards compatible with the 4.0.x series, including ABI compatibility. So if you compiled something against 4.0.x, you should be able to just mpirun with the 4.1.x series and not have to recompile or relink.
Now, there is a big, notable new thing here — something that is perhaps a bit overdue. We've been working on so many other things that we really probably did not give enough attention to our collective performance. Over the years, our collective performance kind of stagnated, and some of our competitors, frankly, had better collective performance than we did. So we did two rounds of improvements in the 4.1 series.

First, general algorithm tuning and selection improvements. What that means is: let's say you call MPI_Bcast. There are a number of very well-known algorithms that have existed in the literature for two decades for all the different ways you can do a broadcast efficiently, but we still have to choose which one is used at runtime. How many peers are there? How many hosts do they span? How big is the message you're broadcasting? These things affect which broadcast algorithm you should choose. We basically tuned up all of that selection logic and improved it a bit, and that gives us a modicum of improvement, just baseline, off the top.

In addition to that, we actually have two new collective modules. This is all new code, based on years and years of research from our friends at the University of Tennessee. These modules are new, and I wouldn't call them battle-hardened yet, or robust enough for the entire world to use. They are available in 4.1.0, but they are not the default: you must select them and manually enable them yourself. That being said, they show significant performance improvements compared to our prior generation of code, and we really need some real-world testing. So let me tell you a little more about what these two modules are.

The ADAPT module: the whole idea of this one is to tolerate scheduling noise — processes that are descheduled, which is not common but happens, or processes that are just late joining a collective. The ADAPT algorithms are built on an event-driven kind of framework, rather than "I am in the algorithm and doing nothing else", so they relax a lot of unnecessary synchronizations. The graphs on the right are from a bunch of tests done by our friends at the University of Tennessee on Cori, a big machine at Lawrence Berkeley National Lab in the United States, across 1024 cores. You can see four groups of output: the top ones are with MPI_Bcast, the bottom ones with MPI_Reduce, and there are three colors in each. In the first color, no noise was injected — just a straight broadcast or reduce. The red one is where they introduced zero to 10 milliseconds of noise, about 5% noise, and the green one is zero to 20 milliseconds, about 10% noise. And you can see how things work out: Intel MPI is on the left, Cray MPI is in the middle — this is a Cray machine, by the way — the current generation of Open MPI is the third one, and the fourth one on the right is using ADAPT. Pleasingly, Open MPI with ADAPT is the lowest of the four. That's awesome; we are very pleased with that.

The next one is the HAN collectives, which stands for hierarchical-aware networking. And yes, if you have seen parts one and two, we have a lot of Star Wars references — this is a Star Wars reference, there's just no way around it. What the HAN collectives do is support two-level hierarchies: intra-node and inter-node.
It basically reshapes the collective to minimize how much data you send off-node. HAN doesn't actually implement the collectives itself: it's the topology-aware piece, and it uses that topology awareness to select which collective algorithms are used on-node and off-node, doing a separation between the two. This picture is kind of a joke, because it's not the young, dashing Han — it's the old, grizzled Han, who has learned over many years what all the right things are to do. It's not the years, it's the mileage.

Here's some performance. This is on Stampede2 at TACC, at the University of Texas in the US. These are actually the same graph, but small messages are broken out on the top and large messages on the bottom. The x-axis is the message size in bytes — at the top we stop at 128K, and on the bottom we continue from 128K and larger — and the y-axis is time, so of course lower is better. You can see a clear win here. We're showing Intel MPI and MVAPICH against the current generation of Open MPI, and then Open MPI with HAN, and in all cases HAN is the lowest red line on there, which is super pleasing. Most importantly, it gets Open MPI off the top line in the small-message range — the most embarrassing slot, where the whole prior generation of Open MPI sat. So, great — I told you we have these things; how do you use them?

In Open MPI 4.1 you can enable them in one of two ways: you can set the priorities of these two modules to 100, or you can include them in the coll MCA parameter itself — han, adapt, then tuned, sm, basic — as shown in the sketch below. You can put these in site-wide files if you want, but I wouldn't encourage that yet. I would encourage you to have your users try this — maybe give them an alias or something that makes it easy to test, depending on the skill level and comfort of your users. But please have them test it: we could really, really, really use some real-world application testing of this stuff. Now, if you really want to get into the nuts and bolts, you don't have to use ADAPT and HAN together — you can enable them separately or together. We actually have more performance results that I didn't include here for timing reasons — we only have so much time to present these slides — but if you enable both of them together, you get even better results, obviously. In any case, you can examine each of them individually if you like.
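Concretely, the two enabling styles look like the sketch below. coll_adapt_priority and coll_han_priority are my spelling of the priority knobs, so confirm them with ompi_info on a 4.1 build:

    # option 1: raise the priority of the new modules to 100
    $ mpirun --mca coll_adapt_priority 100 --mca coll_han_priority 100 -np 16 ./app

    # option 2: name them explicitly in the coll selection list
    $ mpirun --mca coll han,adapt,tuned,sm,basic -np 16 ./app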
All right, so those are the high points of 4.1, which is a nice, solid, incremental build on the 4.0 series — a bunch of backwards-compatible things. So let's talk about the 5.0 series. Honestly, frankly, we thought that was going to happen at the beginning of the year, but many things did not happen: COVID happened, and some of the development took longer than we expected — all kinds of things. We still plan, and hope, to release in 2020; we'll see how that goes. Maybe Supercomputing being 100% virtual this year will actually take a little less time from all of us. But what we did end up doing, since 5.0 got delayed, was move a bunch of the backwards-compatible pieces back to 4.1 — that's why 4.1 has a whole bunch of nice new stuff in it. As we get into the next slides, it'll look like: "OK, I can see how you would have included that in 5.0 — but since it was backwards compatible, you pulled it back to 4.1 so those cool features get out to users earlier."

In 5.0, just like with 4.1, there are a million minor improvements as well that we're not even going to talk about here. However — and this is very, very important — we are breaking backwards compatibility with the 4.x series. The ABI is broken. And, as Ralph talked about in his slides in this part three, the mpirun command line arguments are a little different; there's a bunch of stuff that is different. The broad strokes of the mpirun command line parameters are the same — --host, -np, --mca — but a bunch of MCA parameters are gone (like Ralph said, all the ORTE ones), some others have changed names, things like that. If you have scripts you use for launching your jobs, you may need to re-examine them for 5.0. So this is a big heads-up. I know it's a big deal, and I know it's going to cause some disruption, but it was really needed on our part to refresh the back end.

This is also going to require new debuggers and new tools. There's a backend technology called MPIR that tools and debuggers used to attach to running MPI jobs. That is no longer supported. TotalView and DDT are releasing updated support to handle the new PMIx-based mechanisms. PMIx is our way forward, and there's now native support built into PMIx to handle all these kinds of things. There is a shim, if you absolutely cannot update your tools — here's the URL, which you can get from the slides when we publish them later — which will ease your transition from MPIR, but really we need to encourage you, and your vendors, to move up to PMIx. Frankly, we announced this a couple of years ago, and we thought it would only take one year to get rid of MPIR. It's taken a couple of years, but honestly the community is now in a much better place: there are viable alternatives, those alternatives are actually fairly stable and mature, and they're getting rolled into the various vendor products out there. So by the time 5.0 comes out, there will be viable alternatives — actually, there already are many — and there will continue to be more evolution along that line.

Ralph, I'd actually like to have you talk about this slide, because it's all about PMIx and PRRTE. If you're still on the call, could you give a quick run through it? Yeah, sure. So, some of the biggest changes coming — and this top one is probably the biggest one this group will want to be aware of: we've dropped all support for PMI-1 and PMI-2. The old way of linking against the Slurm or Cray PMI-1 and PMI-2 libraries is no longer going to work with OMPI v5; we only support PMIx. We've gone that way because both of those environments support PMIx now, so there's no reason for us to keep supporting multiple things anymore — and we also really wanted to expand our use of PMIx to take advantage of all the new features PMIx offers. So that's the one big thing. I mentioned earlier that ORTE is being replaced by PRRTE. One of the changes there is that, other than the MPI Forum's standard single-dash options that we have to support, we don't support single-dash multi-character options anymore.
So you'll have to use double-dash. We are nice about it: we do warn you and then move on, for now, but eventually we will simply drop it. Yeah, let me jump in right there — I will take the blame for this one, because way back in the beginning of Open MPI, I wrote the command line parsing code, and we accepted things like single-dash -mca, single-dash -report-bindings, all those kinds of things. In hindsight, that was probably a poor choice, because even back then it was not GNU- or POSIX-compliant, or common. So, for those of you who have gotten accustomed to the single-dash options: sorry, we're fixing that mistake now.

There's also the adaptive command line, which, as I mentioned, requires a little additional setup, and I also talked about the MCA parameters. The big change, other than dropping PMI-1 and PMI-2 support, is that PMIx is now a first-class citizen internally. You won't see much of that from the outside, but internally a lot of things got cleaned up and streamlined: everything just calls PMIx and we're done. What that means, though, is that you can actually configure OMPI to build just the MPI layer, with no runtime support at all, and it will work in direct-launch environments — srun or aprun, it will just work, provided they support PMIx, of course. You don't need to build the runtime at all if you don't want to use it. The PMIx symbols are all exposed, so anybody who links against Open MPI has immediate access to all the PMIx calls. The other thing is the info keys: the MPI_Info key names that are non-standardized — the ones that MPI does not require — are all now PMIx attributes instead of arbitrary names. We did that for a particular reason. One of the complaints we get a lot in the MPI implementation world is: "if I ever want to use MPI info keys, and I want to move my program from Open MPI to MPICH, say, all the info key names change, and my program has to be completely littered with 'if it's MPICH, use these names; if it's Open MPI, use these names'." We're trying to get away from that, so we're just using PMIx attributes. And since that's a standard, hopefully we'll get other people to use the same thing, and that way your program will be completely portable between MPI implementations, at least at compile time. So those are the big changes I can report on.
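As a side note on that direct-launch point: under a PMIx-enabled Slurm, for example, launching with no Open MPI runtime involved looks roughly like this (the plugin name varies by Slurm version — pmix, pmix_v3, and so on — so check what your system reports):

    # ask Slurm which MPI/PMIx plugins it was built with
    $ srun --mpi=list

    # direct-launch the job through Slurm's own launcher, no mpirun at all
    $ srun --mpi=pmix -N 2 -n 8 ./app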
So, back to Jeff here. Awesome — thank you, Ralph. A couple more features are coming in 5.0. One is user-level fault mitigation — you may have heard the acronym ULFM — which is also research that came out of the University of Tennessee. It's a programming model you can use inside the MPI API to create a fault-resistant application: if you lose a process, you can actually still continue, things like that. We also added support for AVX instructions for MPI op operations, which accelerates the compute side of your MPI reductions. Remember, with MPI_Reduce there's both a compute side and a communication side, and very little work had been done to accelerate the compute side; we finally did that, to take advantage of the AVX instructions available on Intel and AMD architectures. Also, with our friends from a couple of different national labs, we added support for the user-level threads packages Qthreads and Argobots. So that's pretty cool stuff.

Hopefully, by the time 5.0 comes out, ADAPT and HAN — the collective modules I talked about a few minutes ago — will be nice and battle-hardened and production-ready, and we'll be able to make them the default. A big "hopefully" there, but this is why we need your testing, to shake out the bugs in your environments.

The openib BTL is gone. For years and years, we supported InfiniBand, iWARP, and RoCE through the openib BTL. In the 4.x series, the UCX PML started replacing it, and openib's journey comes to an end at 5.0: it will not be included at all. The UCX PML is what will be used for all InfiniBand support. Also — something I've been asking for for a long time — the vader BTL has been renamed to sm. The vader BTL is for shared memory communication, and that name was extremely user-hostile; it gave you no indication of what the module was for. So it's now going to be sm in 5.0, although there is still a vader alias. If you have scripts that still refer to the name vader, please update them for Open MPI 5.0, but that by itself will not break you, because vader has been a very pervasive name for the last several years — we won't just give it up in one version.

There will also be at least some elements of MPI-4. The MPI-4.0 document is due by the end of 2020, and, amusingly enough, COVID has made the MPI Forum more efficient — they're actually making tremendous progress on the MPI-4.0 document that's due by the end of this year. I can't really tell you exactly which features will be included in 5.0 or not; the Open MPI developer community has a meeting next Monday, and that is one of the things we're going to discuss. A bunch of the Open MPI community members are on the Forum itself, and a lot of them have prototypes that just need a bit of hardening before they come into Open MPI, so those things will come in over time. I'm guessing Open MPI 5 will not be 100% MPI-4 compliant, but that's okay — all that stuff will come in over time.

One thing that has been asked for quite a bit — I referred to it earlier in the presentation — is the connectivity map: please show me which networks are being used at runtime. We do have a solution that is actually already in our master branch. I think we're going to end up updating it a little, so what I'm showing on this slide is still subject to change. There are a couple of MCA parameters you can use; the one I listed here is the hook comm_method "enable" parameter, which is a heck of a mouthful in itself. If you turn it on, it shows this map. I'm showing here a four-process job across two servers, mpi02 and mpi04, and you can see that sm — shared memory, not vader! — is used for the on-node communication, and usNIC, which is the Cisco networking, is used across nodes. So it'll look something like this, and there are more features for dealing with it, but this is a common thing people have been asking us about for years, so something along these lines will be included in 5.0.
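For reference, turning that on with current master looks something like the sketch below. I'm assuming the parameter is the hook framework's hook_comm_method_enable_mpi_init; as noted, the name is subject to change before 5.0, so treat it as provisional:

    # print the communication-method map during MPI_Init
    $ mpirun --mca hook_comm_method_enable_mpi_init 1 -np 4 ./app

    # expected shape of the output: on-node peers report sm (shared memory),
    # off-node peers report the network module in use (e.g., usnic)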
We do have questions. Yes, we do have a couple of questions, so let me scroll back. First, let me maybe start with a question I had, which relates to the prun that Ralph was talking about. You clarified one of the questions we had in part two, which was about mpirun versus srun, and now prun is being added, which may lead to even more confusion. So I'm wondering if you can clarify how prun relates to mpirun, and maybe also to srun. Where does it fit in that whole story?

First off, if you're just running Open MPI, mpirun will work just like it always has; you don't have to worry about any of these nuts and bolts. prun is used mainly when you're running PRRTE as a separate DVM, a distributed virtual machine, and you want to run multiple jobs underneath it. prun is your starter for doing that. It effectively acts like srun does for Slurm, the only difference being that prun has the adaptive command line, so the command line looks exactly like Open MPI's mpirun if you're running Open MPI jobs, and exactly like the MPICH command line if you're running MPICH jobs. It's designed to be flexible in that way, but it does require configuration to tell us what the alias is linked to, so we know whether you're running an Open MPI job or an MPICH job; we need something that tells us that. There is also a command line option on prun that will do that explicitly, so you can just say "prun --personality mpich" and we'll know this is an MPICH command line and parse it accordingly. Does that help answer your question, Kenneth?

Yeah, I think it does. It seems the long-term intention is that prun will actually replace srun and there won't be any custom commands like this anymore. Is that correct?

I wouldn't go that far. I think the various resource managers are always going to want their own. I think you'll find more people using prun for workflow management and things like that, where you want that persistent DVM in place to get around the limitations you find in Slurm or the Shasta or ALPS environments: you want a consistent environment to run a lot of jobs quickly. I think you'll see prun becoming more prevalent there.

Okay, good. I think that clarifies things a bit. We did have one "dumb" question; the person himself called it a dumb question. How do I know whether the results I get from the OSU benchmarks are reasonable, or whether my environment is already screwed up? Are there any reference timings or the like to compare against?

That's a good question. Let me preface this by saying there are HPC-class networks and there are non-HPC-class networks; I want to stay away from individual vendors here. I'll just use one-gigabit Ethernet. Say you have a cluster with one-gigabit Ethernet and no acceleration at all: just a plain vanilla couple of machines with a commodity gigabit switch, no special NICs, no acceleration stack whatsoever, plain vanilla Linux TCP sockets. On modern-ish machines, your MPI latency should probably be in the tens of microseconds, and by tens of microseconds I mean anywhere from 20 to 80. That's basically no acceleration, commodity Ethernet going through a switch, these kinds of things: half-round-trip ping-pong latency should be anywhere from 20 to 80 microseconds.
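Those numbers come from simple point-to-point runs; if you want to reproduce them, the OSU micro-benchmarks are the usual tool (binary names per the OSU suite; the figures in the comments just restate the ballpark numbers above):

```shell
# Half-round-trip ping-pong latency between two ranks on different nodes.
# Plain gigabit Ethernet over TCP: very roughly 20-80 us.
# HPC-class fabrics (InfiniBand, usNIC, Omni-Path): low single-digit us.
mpirun -n 2 --map-by node ./osu_latency

# Bandwidth sweep: watch how quickly the curve ramps toward wire speed
# as the message size grows.
mpirun -n 2 --map-by node ./osu_bw
```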
I would call that a non-HPC-class environment or non-HPC-class network, which is still perfectly fine for a lot of applications, so don't take that as a derogatory term; I don't mean it that way. What I mean by an HPC-class network is one where acceleration is possible, where you can do better than the generic commodity Ethernet kind of stuff: InfiniBand, usNIC, Omni-Path, or any of the other accelerated networks. On those networks, your MPI latency should generally be in the low single digits of microseconds. One microsecond, or sub-one, is really good; two is pretty great as well; anything under 10 is pretty good. That can be very helpful for MPI applications that frequently send tons of short messages between themselves, which is why latency is one of the big deals in MPI and HPC-class networking. So latency is the first number to look at.

The second one is your bandwidth. Almost all networks, even the quote-unquote non-HPC-class ones, will get up to maximum bandwidth. Whether you have one gig or 10 gig or 40 gig, by the time the message gets large enough, you'll see the graph rise higher and higher: oh, by the time I'm sending a one-megabyte message, I'm actually getting 30, 35, 38-plus gigabits worth of performance, that type of thing, or whatever the top end of your networking speed is. You should see a nice steep ramp up to that. That being said, if you have 100-gig Ethernet or 100-gig InfiniBand or the like, it just takes a while to ramp up to full speed, HPC-class network or not, because you simply cannot hit wire speed with a 16-byte message; it just does not happen. You have to have a large enough message to hit max bandwidth. But in general, the difference between an HPC-class and a non-HPC-class network is the slope of how fast you ramp up to maximum bandwidth. A nice steep slope, reaching maximum bandwidth at lower message sizes, is what you want to see, versus "I have to hit 8- or 16-meg messages before I even start approaching maximum bandwidth." So latency is a great indicator, and bandwidth is a good secondary indicator.

OK. Yeah. I think that's very helpful.

Hey, Kenneth. Yeah, sure. Can I take you back to your question about prun for a minute? OK. I just wanted to clarify something, because my answer wasn't complete. If the host environment, Slurm or Shasta or PBS or whatever, supports PMIx spawn integration, then instead of using their launcher, like srun, you could use prun to launch your jobs. So there is a path by which you could eventually be using prun as your standard launcher in all these environments. (There's a sketch of that DVM workflow below.)

OK. OK. Good. Maybe related to the OSU question: I think there's a release candidate now for OMPI 4.1. What would be a recommended way for people to test that in their environment? I assume getting it installed is the first step, but then what? Do you just run hello-world stuff? Do you run benchmarks? Do you go to real-world applications? And how do you give that feedback back to the Open MPI community?

Yeah, great question. Open MPI 4.1 should be a relatively low-risk upgrade. We only have a release candidate out as yet, and that release candidate doesn't include ADAPT and HAN; hopefully the next release candidate will include them.
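Before going further, here is the persistent-DVM workflow Ralph referred to, sketched with the PRRTE tool names (options per recent PRRTE; check prte --help and prun --help on your version, since details vary):

```shell
# Inside your batch allocation: start the persistent DVM on all allocated
# nodes and put it in the background.
prte --daemonize

# Launch jobs against the running DVM. The adaptive command line parses
# Open MPI-style options by default...
prun -n 8 ./ompi_app

# ...or tell it explicitly to parse an MPICH-style command line:
prun --personality mpich -n 8 ./mpich_app

# When you're done, tear the DVM down.
pterm
```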
But anyway, going from 4.0.x to 4.1.x should be a pretty small jump. Backwards compatibility should be preserved: all the command lines and all the scripts you've been using forever should keep working just fine. And yes, it would be awesome if you could download the release candidate and try to build it in your environment. I would install it next to your existing Open MPI install; don't replace it yet, because it is just a release candidate, and like I said, two of the big things are not even in that release candidate tarball yet. So install it next to it, set up your environment, via modules or your shell startup files, whatever you want to do, and change your PATH and LD_LIBRARY_PATH to point to the new one. You should be able to just mpirun your existing 4.0.x-compiled applications: if you change your LD_LIBRARY_PATH to point to the new libmpi, you should be good to go. Then test all the usual things: make sure we didn't do something silly and break mpirun of hello world and ring, and then move on, same as with troubleshooting, to real MPI applications. Those should all perform more or less the same, and hopefully, when HAN and ADAPT come along, you can enable those, and if your application uses MPI collectives, you should see some nice performance improvements. (A sketch of this side-by-side setup appears below.)

There's a small follow-up question on the OSU benchmarks as well: is there a way to plot the results you get from the OSU benchmarks for quick interpretation? Is that something included in the suite, or separate tooling?

I don't think it's included. I haven't looked at the very latest version, because honestly I have an old version and they haven't changed all that much over time. I think it just emits stuff to standard out, and you can do the parsing and plotting yourself using your favorite tool. I don't think they have any handy Python scripts or the like to plot it for you.

Now, they have... Yeah, I was just going to say: I was just looking at their website because I saw that question, and I posted the link, because at the bottom they tell you how to make the plot using a gnuplot script, and they give you the script.

Awesome. Okay, excellent. Thanks a lot, Ralph. A follow-up question on ADAPT, the new component coming up in 4.1. Gaspard was wondering: what's the magic behind ADAPT that enables it to increase performance, maybe even without noise present? What's behind that component?

The way it was described to me by Dr. George Bosilca from the University of Tennessee is that they changed the framework of how the collectives are implemented. Instead of a blocking algorithm that says, okay, I'm going to do a tree-based broadcast, so I'm going to send to my children in the graph, and if a child is not ready I might end up blocking, waiting for a child that was late to the communication, they do more pipelining, which is something we didn't do a whole lot of in our prior generation, paired with event-based programming. We already use libevent elsewhere in Open MPI, and I believe they're using libevent in ADAPT, such that they submit all the work in graph form to libevent, and libevent fires it when it happens. I believe that is true; don't quote me, I'm not the guy who wrote that work. So that is my understanding.
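The side-by-side test setup suggested a moment ago, sketched out (paths and version numbers illustrative):

```shell
# Build the release candidate into its own prefix, next to the existing MPI.
./configure --prefix=$HOME/openmpi-4.1.0rc1
make -j 8 && make install

# Point your environment at the new install (or do this in a module file).
export PATH=$HOME/openmpi-4.1.0rc1/bin:$PATH
export LD_LIBRARY_PATH=$HOME/openmpi-4.1.0rc1/lib:$LD_LIBRARY_PATH

# Existing 4.0.x-compiled binaries should run unchanged; start with the
# basics, then move on to real applications.
mpirun -n 4 ./hello_world
mpirun -n 4 ./ring
```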
Hopefully that understanding is correct. Okay, then another question. Will the shim that's there to ease the transition from MPIR allow, for example, DDT to work against OMPI 5.0 until Arm Forge gets around to updating?

Arm Forge has already updated DDT to work with PMIx. As I said in the chat window, they're coordinating their release schedule with when OMPI 5.0 comes out.

Okay, good. Let's see what else. I think there was a question about the C API changing much in OMPI 5.0, but I think that's a misunderstanding: it's actually the ABI that's changing. The API is not changing; the ABI is. The C API is standardized by the MPI specification itself, and there haven't really been changes in that. MPI-4 is adding a whole bunch of stuff to the C and other APIs; just to clarify this whole line of things, MPI-4 is adding a whole bunch of APIs, not really changing anything.

Yeah, I don't think they're really changing anything. But for Open MPI 5, we are breaking the ABI, so it'll be a different .so number. If you linked against the libmpi shared library from Open MPI 4 and earlier, it will not automatically relink at runtime against the libmpi from Open MPI 5 and beyond, because that has a different .so number. We changed some data structures around. It's not so much the C API that changed; it's more internal stuff that users don't care about, like sizes of structures, that really requires us to bump the .so number and make an ABI break. (There's a quick illustration below.) Honestly, this is stuff that's been waiting for three or four years; we've been putting off this ABI change for a long time, and now we really need to make it. So I'm sorry, but we just really need to do it.

Okay, there was a follow-up question there, which Ralph has already answered: is ABI stability guaranteed across minor releases? I understand the answer to that is yes: you are really careful about keeping ABI compatibility as long as it's the same minor version. Okay.

Then maybe a wrap-up question, and a small, naughty follow-up question from me. Kasper is wondering, and he realizes it's an annoying question, about any more concrete expectation on the timeline for OMPI 5.0. You mentioned hopefully before the end of the year; is that the best you can give?

Yeah. First off, completely valid question. Second off, I will say no, and let me tell you why I am not going to be more firm on that. The Open MPI community is great, actually; it has exceeded all of our expectations. I was one of the founding people; I think Ralph, you joined within three months of Open MPI being founded, what are we at, 16, 17 years ago now? Yeah. So Ralph is effectively one of the founders too. We never dreamed that Open MPI would be this successful. We never dreamed it would grow beyond the four initial organizations that were doing this. It has become a giant community with a lot of vendors and a lot of money behind it, and it is fantastic; it makes me incredibly proud to see what the community has been able to achieve over the years. That being said, it does have its challenges. One of them is that, as a community, we are not fiscally responsible to each other: I don't pay Ralph, Ralph doesn't pay me, I don't pay the University of Tennessee, and so on. So synchronizing us all to have all the features done at the right time can be difficult, and this is true of every open source community out there, right?
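To make the ABI point concrete, a quick illustration (the library version numbers here are illustrative, not a promise about the final 5.0 soname):

```shell
# A binary linked against Open MPI 4.x records that libmpi version:
ldd ./my_app | grep libmpi
#   libmpi.so.40 => /opt/openmpi-4.1/lib/libmpi.so.40

# An Open MPI 5.x install ships a libmpi with a different .so number, so
# the dynamic loader will not silently substitute it; you must recompile
# or relink the application against the new library.
ls /opt/openmpi-5.0/lib/libmpi.so.*
```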
Wrangling all the community members to actually have the features done on time can be difficult. This has traditionally been a problem in the Open MPI community, and it is unfortunately no different for Open MPI 5.0. So it's both a curse and a blessing that we have this incredibly rich, diverse community spread across many different types of organizations, which bring different viewpoints, goals, and requirements to us. The end result is that we actually get something pretty great, because it represents a pretty wide swath of viewpoints. But the challenge is that everybody has their own internal timelines, deadlines, and other projects they have to work on, and that can make forecasting really difficult. So I'm sorry, that is just where we are.

Yeah, well, I had already answered on the chat saying October, November, as we were hoping for, but...

That is aspirational. That is what we're talking about internally, and so that's what we're aiming for. I really hope we're able to deliver that for you by Supercomputing. Let's see what happens.

Yeah, I confess, I am one of the long poles in the tent this time around; Jeff is being kind, not pointing the finger at me. I'm not the only one, but I'm certainly one of them. And if Jeff wants to pay me, I will happily make that date firm.

Yeah, I guess realistically, if it's not November, then December is a lot more difficult for making a big release, so then we're looking at 2021.

We tend to avoid doing things like that, because if you make a release right before a holiday, there's nobody to answer the phone if something goes wrong. We've done that a couple of times: we were ready in December, but we just held on to it till January, because there are so many holidays and everybody disappears.

Yeah, it makes sense. Okay, so I think we're out of questions from attendees. I did have one final question for you, sort of a follow-up to this. It's pretty clear now what the targets are for Open MPI 5, but I was wondering if there's something you already have in mind for the next major release. Is there anything you're not happy with in the current Open MPI that will require breaking API compatibility, or are there any major features you have in mind?

Well, that's important. We actually do have a wiki page for 6.0 already. The biggest feature for Open MPI 6.0 that I'm aware of is that I'm retiring before it happens. So you probably aren't going to see a lot of changes to the runtime anymore when that happens.

And we will be sorry to see you leave, sir. But I will say we do have a wiki page for Open MPI 6, which is very sparsely populated at the moment. The biggest things on there are deprecating and removing old stuff; that's why I was chuckling when you said that. For example, the MPI-1 APIs that were removed from the MPI standard in, I think, 2014: they're still carried in Open MPI, although they're not in mpi.h by default anymore. You have to specifically turn them on. That was us trying to ratchet up the pain on people, saying, okay, you have to do an extra thing to get the MPI-1 APIs. We will be talking about that again for Open MPI 6: can we actually, finally get rid of these APIs that were deprecated in 1996 and finally removed from the standard in 2014? They've been gone for six years now.
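The "extra thing" referred to here is a configure-time switch; in the 4.x series it looks like this (flag per the Open MPI 4.0 release notes):

```shell
# Re-expose the removed MPI-1 functions (MPI_Address, MPI_Type_struct, ...)
# long enough to port an old application off of them:
./configure --enable-mpi1-compatibility ...
make -j 8 && make install
```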
But I just had a question yesterday from someone who's actually on this call today, saying, hey, I have an application that's failing to compile because of a symbol that's not there. And the answer is: oh yeah, use this flag and your thing will compile. And then please go talk to that application developer and get them to upgrade, because it's very straightforward to stop using those things. Anyway, that's my little soapbox; that's what I remember offhand from the 6.0 list.

I guess it's similar for the C++ API as well, which is already deprecated. Is that something that's going to finally be removed?

The C++ API is actually gone; we removed that one completely. Almost nobody was using it. These old MPI-1 APIs are actually still used, whereas for the C++ API there was only a handful of apps that used it, total, other than homework assignments, so getting rid of it was a lot easier. Although I did get a question just yesterday from somebody who ran into an app that still needed it.

So it's finally removed in Open MPI 5, then? Because I think it's still there in 4 and you can still enable it.

It is gone, gone, gone.

I think we're ready to wrap up here; there are no more additional questions popping up. So let me thank you both again for taking the time to do this. I think it was very useful for both the EasyBuild community and the HPC community at large. So thank you very much for taking the time.

Thank you all. We really appreciate the opportunity, like I said earlier. And thank you for your attention. Yes. Thank you. Thank you, Kenneth. Happy to. Bye-bye. Thank you all.