Okay, the recording has started, so let me quickly introduce our speakers again. We're welcoming Jeff Squyres and Ralph Castain. Jeff Squyres is a computer scientist with a PhD from the University of Notre Dame; he has been working at Cisco since 2006, and he's one of the main developers of the Open MPI project. Ralph Castain has a PhD in nuclear physics from Purdue University, has been working at Intel since 2013, and is also the founder and lead developer of the PMIx project. Both Ralph and Jeff are among the top contributors to the Open MPI project: together they have about 30% of all commits in the project, which is quite impressive, and they have quite a few pull requests merged as well, so they are very well suited to talk about this topic. With that, I'll give Jeff rights to present, unless that already works; it's already in place. So I think we're good to get started, Jeff.

Cool, thanks Kenneth. Greetings everybody, whatever time zone you're in. I'm Jeff; the other floating head on there is Ralph. Say hello, Ralph. "Hello, Ralph." Thank you. This is part two: a couple of weeks ago we gave part one, and the archived video and slides are certainly available from that. I want to say thank you to Kenneth and the EasyBuild community for giving us this invite and this opportunity to present all this information. There's a lot of good stuff in here. When Ralph and I came up with all this content, we took to heart a lot of the questions that we typically get, and that's what we put into this presentation. We had so much material that we first thought it would be one session, then figured maybe we should split it into two sessions, and then, yeah, maybe we should probably split it into three. So today is part two of three, and a big thank you to the EasyBuild community.

A couple of logistics real quick. As Kenneth noted, this session is being recorded, so if you don't want to be recorded, please leave; the material will be available later. There is a Q&A panel for you, and Kenneth is going to marshal all the questions. Some of them he might just answer in the Q&A panel, where everybody can see the answer, or he might either interrupt us or ask us the questions at a good breaking point. So go ahead and type your questions into that Q&A panel and we'll get to them.

So here's our overall overview for the whole series. We covered three bits last time in part one: the background, what exactly PMIx is, and how to build Open MPI. We're going to cover a bunch more stuff today. Here's a super quick recap of what we covered. Some terminology that's important: in Open MPI you'll hear us talk about projects, frameworks, and components. You'll see that listed on the left-hand side over there, and that's really just a division of how we lay out the code in Open MPI. This picture is not a comprehensive list of all of our projects, frameworks, and components, but it shows a bunch of the common ones that you'll hear: OMPI, SHMEM, OPAL, PML, BTL, MTL, things like that. We'll talk about those in a little more detail today, but for the MPI types of things, think of a framework as a collection of plugins, and "component" is just a fancy word for plugin.

I guess I'll pick up here. I covered some background on what PMIx is. As I said back then, it derived out of the old PMI-1 and PMI-2 interfaces that came out of MPICH; it was really focused more on exascale kinds of systems and on workflow orchestration. It started in late 2014.
As of today it has obviously grown quite a bit and is now supported by a wide range of libraries and tools, and by pretty much all the resource-manager vendors out there. Its role really is to act as the go-between between the application process and the resource manager, basically passing requests up to the resource manager and the responses back for various things. We're going to go into a little more depth today about exactly what that role is and what kinds of requests are going back and forth.

Okay, and in the last part of part one we talked about how to build Open MPI. The super short version is that you download the tarball, you untar it, you go into the directory, and you run configure, make, make install. There is a little bit of magic in there in that you can specify a bunch of parameters on the configure command line. Most of them typically have to do with which network communication stack you want to use, but there are a variety of other things as well. Go back and have a look at the slides and the video; we talk about all of these things, and PMIx, in more detail there. So that's the super quick recap. What I'm going to do now is stop presenting, and Ralph is going to take over. Kenneth, I think you have to give presenter permissions to Ralph, and he'll take over for the first part here. Hang on a second, I just accidentally got off that page. You should have presenting rights now, Ralph. Oh, there we go, there we are. Okay, good. Sorry, logistics always get in the way no matter how hard you practice in advance.

So let's talk some more about PMIx. We're going to get to PRRTE, I think, in the third session, and exactly how that ties into Open MPI. But let's focus a little more on PMIx, because especially when you get to Open MPI version 5, which is hopefully coming out later this fall, you're going to find that PMIx plays a pretty major role. It's already kind of the backbone at the moment, but it's going to become even more prominent in version 5.

So where is it used? Well, it has actually grown quite a bit. When we originally wrote it, in late 2014 or early 2015, Open MPI was pretty much the only user, and we supported one resource manager at that time, which was Slurm. Things have obviously grown quite a bit, so you can see here a list of all the various libraries: pretty much all the MPIs and all the OpenSHMEM libraries now support it, and there's some support now in PGAS as well. From the resource-manager side, as I said in the last session, pretty much everybody is now supporting it. I'm not entirely sure of the status of Univa Grid Engine, I haven't talked to those guys in a while, but everybody has pretty much picked it up, and you can see that in the upcoming Cray Shasta environment, their launcher, called PALS, now supports it. We're also working right now on Kubernetes support, and I'll talk a little more about that later.

But in addition to the adoption rate, we're seeing a lot of new use cases show up. The debugger folks have already got their products integrated with it and are now ready to release. And we're seeing things like Spark and TensorFlow picking it up and using it, because it allows them to do some things in terms of marshalling processes together for collective operations on the fly.
There's a bunch of stuff in the MPI world where people are doing advanced MPI-related things, and then there's the ability to log information. This isn't so much an application writing data out; it's more the ability, for example, for an application to log into the job record that the resource manager is maintaining: "I got this far in my application at this point in time," just to help with debugging if things go wrong. So there's a logging ability that's kind of intriguing to people. And then I'll talk more about the container support, but pretty much all the containers now have some level of support for PMIx.

Last time I mentioned that we started off looking at the scalable-launch problem, and there were a number of things we did to try to make launches scale better. But once you had that in place, it became intriguing to look at what else we could do with this, because we had created it as a kind of generic application-to-resource-manager interaction mechanism. As you can well imagine, as soon as you opened that up and said, well, we're going to use this to make things go faster at launch, people said, well, I think I could use that for something else as well.

So once the connection was made, people started looking at, for example, the ability for an application to generate an asynchronous event and direct where it goes: I want to notify the other processes in my job that I'm ready to checkpoint, or that something has happened and you all need to take some kind of action. We also wanted the ability to pass events in, so that if the system sees a node going down, or that a node is overheating and is going to need to be taken down, the system can generate an asynchronous event to the applications and let them decide what they want to do.

Once we had event notification working, people came to us and said, well, we're running hybrid applications where we have, say, OpenMP and MPI both active in a process, and we have the problem that there's no way for the two to know that the other one is there, so they compete for resources within the process. They took the event notification system and used it a little differently than we had originally envisioned: they now use it within a process, for the different models to alert each other that they're present, but also to be able to say things like "I'm going into a compute-intensive section of this process; I need you to stop communicating for a while so that I can have all the cores to myself."
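To make that event-notification mechanism concrete, here is a minimal, hedged sketch of the two PMIx calls involved: registering a handler and generating an event. The particular status code and the decision to scope the event to our own namespace are illustrative assumptions, not something from the talk, and error handling is omitted.

    /* Hedged sketch: PMIx event notification. */
    #include <pmix.h>

    /* Handler invoked when a registered event arrives. */
    static void evhandler(size_t reg_id, pmix_status_t status,
                          const pmix_proc_t *source,
                          pmix_info_t info[], size_t ninfo,
                          pmix_info_t results[], size_t nresults,
                          pmix_event_notification_cbfunc_fn_t cbfunc,
                          void *cbdata)
    {
        /* ...react to the event, e.g. prepare to checkpoint... */
        if (NULL != cbfunc) {
            /* Tell PMIx we're done and let any other handlers run. */
            cbfunc(PMIX_EVENT_ACTION_COMPLETE, NULL, 0, NULL, NULL, cbdata);
        }
    }

    void register_and_notify(void)
    {
        pmix_status_t code = PMIX_ERR_NODE_DOWN;   /* assumed code of interest */

        /* Register to be called back when this event is delivered to us. */
        PMIx_Register_event_handler(&code, 1, NULL, 0, evhandler, NULL, NULL);

        /* Generate such an event ourselves, scoped to our own job. */
        PMIx_Notify_event(code, NULL, PMIX_RANGE_NAMESPACE, NULL, 0, NULL, NULL);
    }

The same pair of calls is what the hybrid OpenMP/MPI folks repurposed for within-process signaling between programming models.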
We also then started talking with the tools folks, because they had the problem that if they wrote a tool that works with MPI for debugging purposes, they weren't able to use it for any other programming model without completely rewriting its mechanisms for interacting with that model, and that made it very difficult for them to expand their usage model. If you were talking, for example, about a data analytics tool, say Spark or Hadoop or whatever it is, the parallel debugging tools that were available for MPI would actually be rather useful in that case, but you couldn't use them because they didn't know how to talk to your job. So we went ahead and worked with them and defined a completely generic connection and interaction system for tools to work with different programming models, different launchers, different environments, all based on a set of abstracted interfaces. They can now take a tool that talks to an MPI job and use it pretty much anywhere, for SHMEM jobs, for Hadoop or Spark jobs, or even for TensorFlow jobs, and actually step through them the way you would with a parallel debugger today.

Then other people came and said, well, we have other ideas of things we could do with it. They started looking at, for example, being able to change allocations: if I can interact with the resource manager, and the resource manager can connect to the scheduler and pass the request along, then I'd like to be able to dynamically add or remove nodes from my allocation, which had some obvious impacts on different programming models. So we added the ability to do that. While we were doing that, the concept of being able to loan a node came up, so we provided the ability, for example, for an application to say: I'm not going to need these nodes for the next 10 or 20 minutes, so I'm going to loan them back to the resource manager, to the scheduler, but the loan carries a caveat that when I need them, I can get them back on a preemptive basis. The cloud folks have started to find that useful. Hand in hand with that, they then wanted the ability for an application to say "I'm willing to be preempted"; you might get a better charge rate, for example, if you make your application a candidate for preemption. And then we defined a handshake, so that when the resource manager wants to preempt you, it can let you know; you then have a chance to do something like checkpoint your application, let the resource manager know you're ready, and be preempted at that point in time. So there's a bunch of stuff in there that's not just allocating nodes, but about how you go about managing the allocation and deallocation of nodes.

Another group came along and said, hey, we want to be able to dynamically assemble groups. From an MPI standpoint: you want to be able to dynamically create a communicator and then dynamically tear it apart when you're done, rather than creating it at the beginning of time and having to kill the whole application to tear it apart. This was of interest to people like the data analytics groups, where there are periods in the computation where you actually want to use something like MPI, but you don't want the whole job to be an MPI job, because processes may be coming and going, and MPI doesn't handle that very well. So we created a groups capability that allows you to asynchronously assemble and destruct process groups (there's a short sketch of those calls just after this discussion).

The storage folks came and talked to us about being able to indicate that you want to pre-cache files, or to asynchronously move storage around. Say I have a data analytics job and I know that when I finish this particular calculation I'm going to need a particular data set; I want to be able to tell the storage system: hey, get it out of deep storage and bring it up to the surface, or cache it into a local data store, so that when I do need it I can get at it quickly. Or I may want to specify a storage strategy: this is data I really need to keep safe, so stripe it in a certain pattern across your storage systems. We've just started this one: right now we're adding the ability to query what's available from the storage system, capacities, bandwidths, what's been allocated to you, what the total system looks like, and I have a working group that's starting to specify some of these other APIs and directives.

And finally, we were asked to start looking at power management, mostly from a strategy standpoint. On one side, the application can request a change in strategy: "I'm going into a more power-intensive section of the code, so I need you to change the strategy you're using." On the other side, from a resource-manager standpoint, because there are multiple power-management libraries out there, we provide an abstraction layer where the resource manager can say "I need to set this power-management strategy" without having to write five different sets of code for the five different libraries somebody might want to use.

So as you can see, it has broadened a great deal over time, but the interaction architecture remains the same. We don't want the application directly doing any of these things. We're trying to keep the application as the orchestrator: in charge of making the requests and knowing what it wants done, but not making the connection directly to, say, the fabric manager or the storage system, because if you have a million processes and a million procs connecting to those subsystems, it's just overwhelming; you can't do it at any kind of scale. So instead we maintain this abstraction: the PMIx client serves the application; there may be multiple programming models inside that application coordinating through that PMIx client; but the client maintains a single connection up to the server, which is typically hosted in the RM daemon. All orchestration requests go to that server, and all responses come back from it. The resource manager can then use the PMIx APIs itself as an abstraction to talk to each of these subsystems: rather than writing its own code to support five or six different storage or file systems, it calls the PMIx APIs, and PMIx has the plugins specific to each of those file systems to execute the operation. And then, like I said, tool support connects into the server to allow tools to interact, through PMIx, with the application, or even with the rest of the system-management stack, again through those abstractions.
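Here is the promised sketch of the dynamic-groups capability. These APIs landed in PMIx v4, which was about to be released at the time of this talk; the group name, the NULL directives, and the overall shape are illustrative assumptions rather than a definitive recipe.

    /* Hedged sketch: dynamically assemble and tear down a PMIx process
     * group ("mygrp" is an arbitrary placeholder name). */
    #include <pmix.h>

    void form_and_drop_group(const pmix_proc_t members[], size_t nmembers)
    {
        pmix_info_t *results = NULL;
        size_t nresults = 0;

        /* Collectively construct a named group from the given procs. */
        PMIx_Group_construct("mygrp", members, nmembers, NULL, 0,
                             &results, &nresults);

        /* ...use the group, e.g. as the basis for a communicator... */

        /* Tear the group down without ending the whole job. */
        PMIx_Group_destruct("mygrp", NULL, 0);
        PMIX_INFO_FREE(results, nresults);
    }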
So if you look at the role of the PMIx client sitting inside this application, it looks a heck of a lot like a container, and this is what the container people latched onto. They said, hey, wait a minute: if this really became a container, then maybe PMIx becomes the abstraction by which we can interact with the outside world. Where that gets really interesting is when you want to be able to port your containers. Central to doing that, you have to be able to ensure that this standardized interface, from inside the container to outside the container, can operate across different versions: the PMIx outside may change as versions go by, but the PMIx inside the container will be static. So this cross-version situation becomes the norm rather than the unusual case.

What PMIx has done is basically make a pledge to maintain cross-version compatibility. When a client connects to its local server, there's a handshake that allows the two of them to select the highest common messaging protocol as the versions change. If the client is at a higher version than the server, the client will pick the highest messaging protocol the server supports; if the reverse is true, the server will pick the highest messaging protocol the client supports. The two of them agree on what that's going to be. We didn't have that early on, so some of our earliest releases don't support it, but from version 2.1.1 and above (and we're about to release version 4), client and server are completely interchangeable: it doesn't matter which one is higher or lower, they will negotiate properly and select the proper protocols.

For the container folks, rather than making them put some kind of agent in between to even things out, because the external environment might not even have a PMIx server in it, and you want to keep the container as standardized as possible so the application doesn't wind up having to say "if PMIx is available do this, if not do that," we introduced the concept of a PMIx daemon inside the container that basically serves as a leveling agent. If the PMIx client sees a PMIx server outside that does everything it wants, this daemon does essentially nothing; it just relays things back and forth. If it turns out there are no PMIx services outside, the daemon becomes a full server: it does everything it can to support that client, including talking to the various system-management-stack elements that are out there. So it's just a leveling agent inside the container.

And why do all that? It's strictly for portability, and this is something you'll see talked about a lot in the PMIx community. We want the ability to take a container, or an application that doesn't have to be in a container, run it on something like Kubernetes, and then take that exact same thing and run it on an HPC system under Slurm or PBS, or Shasta if it's Cray, or Flux, or whatever it is, and the same the other way around, without having to make any changes in order to execute it. That's what we're working towards with PMIx.

So if we put all that together and talk about how you launch a gigantic job: I'm not going to walk you through this in detail because it takes too long. I have done it in other presentations, and if people want, I'd be happy to do a dedicated half-hour to 45-minute presentation that we can tape that walks you through every step of it. But it's an orchestrated launch. In other words, every step of the way, the resource manager and the scheduler, the workload manager, are working hand in hand with the various system-management components to prepare each stage for the next stage, so that there's minimum time lost doing something serially that could have been done in parallel. Everything happens in an orchestrated fashion: from the beginning, incorporating how long it's going to take to get my files into the scheduling decision, so your job isn't scheduled ten minutes before the files can get there if they're coming from cold storage, but we actually know when the data is going to arrive; to caching your files and libraries somewhere network-near the nodes that are going to be allocated, so they're ready to go as soon as the allocation starts; to giving the job every piece of information required to operate and communicate at the time the process starts. So there's no wire-up protocol, no all-gather exchange of endpoints or anything like that: every piece of information the job needs in order to execute is given to it at time zero, when it starts to execute.

That's a pretty complex orchestration. PMIx doesn't actually do the orchestration; it enables the orchestration. We've been working with the resource-manager and scheduler communities to build each of these stages into their code, using PMIx as the glue for making those communications occur, and most of them are pretty much through stages two through four at this point. Stage one we're still working on with them, because it's a little trickier: you're trying to figure out what files and libraries a given application depends on without necessarily making the user tell you everything. We have to get some things from the user, but we're trying to make it as automated as possible.

We also, again, support tools. I'm not going to go through this in too much depth, but the tools basically all work off a set of rendezvous files that are put out by the server, telling you how to connect to it. You can tell the tool "here's the URI for that particular server, go connect to it," and that's fine, but a lot of times you don't actually know the server's URI. So you want some kind of mechanism for discovering how to connect, and what we've done is create a set of rendezvous files that contain that information. The reason there are several of them is that you might know the namespace, the job ID, of the server you want but nothing else about it; or you might know the PID of the server you want to connect to but nothing more; or you may know nothing about it at all, and so you just do a generic search.
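As a hedged illustration of that rendezvous, here is roughly what attaching a tool by server PID might look like. The use of the PMIX_SERVER_PIDINFO attribute for this case is my assumption from the PMIx tool interfaces, and error handling is omitted.

    /* Hedged sketch: a tool attaching to a PMIx server by PID; the
     * library locates the matching rendezvous file and connects. */
    #include <sys/types.h>
    #include <pmix_tool.h>

    pmix_status_t attach_by_pid(pid_t server_pid)
    {
        pmix_proc_t myproc;
        pmix_info_t info[1];

        /* Direct the rendezvous at the server with this PID. */
        PMIX_INFO_LOAD(&info[0], PMIX_SERVER_PIDINFO, &server_pid, PMIX_PID);
        return PMIx_tool_init(&myproc, info, 1);
    }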
You'll see, when you do the installations, that a set of rendezvous files gets created every time one of these servers starts up, just to allow a tool to go search them and find out how to connect to that server.

So here's a list of some of the current support. There's all the usual startup stuff: you can put data and get data, you can execute barriers, you can spawn processes, you can group things with connect and disconnect, and you can publish data to a kind of central key-value store and then look it up (there's a short sketch of those basic calls at the end of this list). This is all the kind of stuff you typically saw from PMI-1 or PMI-2. But then there's all the tool-connection stuff, which allows, for example, generalized queries about the system. You can forward standard I/O for tools; PMIx will do that for a tool so you don't have to write all that code yourself. There's generalized query support, so you can find out things like the status of your job, how it got laid out, the status of the scheduling queues, all that kind of stuff. I talked about the event notification system; that's all in there. There's the logging capability I mentioned, where you can put status reports or error output into things like syslog, or drop them into your job record with the resource manager. And the allocation stuff is all in there, to request and release resources and do preemption notification and the like.

We also have some things that were just added for network support: you can ask for security keys and credentials of various types; we can set up the local drivers (this is more for the launch capability); and you can query things like the state of the fabric, how congested it is, whether you can get a traffic report showing where the congestion areas are, what the capabilities are, that kind of stuff. All of that is now supported. I talked about the cross-version support, which is there, and the container support that it enables. There's also a set of job-control features, so you can ask that individual processes, or an entire job, be paused or killed or hit with a signal of some kind; you can ask that your processes be monitored for heartbeats; and there's checkpoint/restart coordination, where you can issue a job-control request asking the resource manager to checkpoint your job, or at least tell your job to checkpoint. And, as I mentioned before, there's the asynchronous ability to roll up and tear down process groups.

On programming-model support: PMIx has the ability to automatically forward environment variables. We have plugins for each of the major programming models, so for Open MPI we look for the specific Open MPI syntax in environment parameters; those get picked up and forwarded for you. We also set up the MPI-3 environment variables required by the MPI Forum, and a bunch of other things. Basically, we tailor the environment to that particular programming model: whatever the model expects to see in the environment, we try to do that on behalf of the resource manager. The same for OpenSHMEM, we have a plugin for that; and then there's the hybrid programming-model support that I just discussed.
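Here is the promised minimal, hedged sketch of the classic client-side sequence from the start of that list: put/commit to stage data, fence to exchange it, get to read a peer's value. The key name and endpoint string are made up for illustration, and error handling is mostly omitted.

    /* Hedged sketch: basic PMIx data exchange. */
    #include <stdio.h>
    #include <string.h>
    #include <pmix.h>

    int main(void)
    {
        pmix_proc_t myproc, peer;
        pmix_value_t val, *result;

        PMIx_Init(&myproc, NULL, 0);        /* connect to the local server */

        /* Stage a key/value pair for our rank. */
        val.type = PMIX_STRING;
        val.data.string = strdup("tcp://10.0.0.1:1234");
        PMIx_Put(PMIX_GLOBAL, "my-endpoint", &val);
        PMIx_Commit();                      /* push staged keys to the server */

        /* Barrier that also makes everyone's data available. */
        PMIx_Fence(NULL, 0, NULL, 0);

        /* Read rank 0's key back out of the generalized data store. */
        PMIX_LOAD_PROCID(&peer, myproc.nspace, 0);
        if (PMIX_SUCCESS == PMIx_Get(&peer, "my-endpoint", NULL, 0, &result)) {
            printf("rank 0 endpoint: %s\n", result->data.string);
            PMIX_VALUE_RELEASE(result);
        }

        PMIx_Finalize(NULL, 0);
        return 0;
    }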
Now the architecture, getting into the brass tacks of OpenPMIx. It's an MCA component architecture; it looks just like what Jeff described for Open MPI, because we borrowed it from them, and it's the same build system, all Autotools. Probably the biggest difference between us and Open MPI is that we do not have any embedded libraries in the system; we chose not to do that. So you have to provide an external version of either libevent or libev (we support both), and hwloc. We also have some optional dependencies for certain features: for example, if you want access to Cray's Slingshot fabric manager, then we need curl and Jansson to be there. We also have Python bindings for PMIx, but you need Cython to enable those, because we use it to build them. If you want us to support the Lustre file system, with all that storage-query capability, then we need access to the Lustre client. And we do use libz, which we try to auto-detect if it's available; we use it for compression on some of these data areas, because, as you can imagine, with this much data flowing around, it can get a bit big.

I've listed here a couple of the key frameworks. The PTL is basically the PMIx transport layer, if you will; it mostly uses TCP today. We have a usock component, a Unix-socket one that we used for local communications; we've deprecated that, but it's still there to support people whose containers use that component. Then, like I said, there are the rendezvous files. The other one you really want to look out for is the GDS, the generalized data store. There are three components in there in particular: the hash component, which is always on, and a shared-memory component with two variants, ds12 and ds21. It's worth keeping an eye on those two in particular. And I've given you the URI there for a list of instructions on how you get the implementation and how to build it.

So, some build tips. If you want to build an external PMIx to use with Open MPI, one of the real things you've got to watch out for is that you then have to use an external libevent and hwloc for Open MPI as well. The reason is that you have to use the exact same libraries for PMIx as you used to build Open MPI; otherwise, with PMIx depending on those two libraries, you'll get library and symbol confusion between them. So if you're going to use external PMIx, you've got to use external libevent and external hwloc, and you need to make sure all three of those match each other (there's a concrete configure sketch after the Q&A below).

If you're going to do a direct link for applications, so applications link directly against PMIx: you can call PMIx directly from an application, that's normally how it's used, and it is reference-counted. So if Open MPI is involved in that application, Open MPI is going to call PMIx_Init; if the application calls PMIx_Init too, that's fine, no harm done. You just need to balance that with the same number of PMIx_Finalize calls to clean up. If you're using Open MPI with its embedded PMIx, then starting in version 5 those symbols are exposed, so you don't need to do anything else if you want to use PMIx from inside an Open MPI application. If it's version 4 or below, you're going to have to use an external PMIx to avoid confusion, and that means, again, you've got to use the external libevent and hwloc. Either way, the Open MPI wrapper compiler knows how to do the right thing and make sure you link to the right place. If it's a non-Open-MPI app, we do provide a pmixcc wrapper compiler, just to make sure you get hooked up to the right PMIx installation and the corresponding libevent and hwloc.

There's a set of tools that comes with PMIx. There's a pattrs tool that reports what attributes we support for that particular implementation; there are three levels you can look at, the client, the server, and the host environment, and it tells you not only what each attribute is but also a description of what it actually does. You can use the pevent tool to inject a PMIx event into the system: if you want to be able to tell your application "hey, I'm hitting you with a SIGUSR2," for example, pevent will do that using a PMIx event, so you can generate events on the fly. There's also a plookup capability, so if you want to look at what's going on inside the generalized data store, you can do that. pmix_info is just like ompi_info for Open MPI; it tells you all the build information. There's a pps that will contact the local system and use PMIx to query what jobs are running and what their status is. There's a pquery tool that lets you say "here's an attribute I want to query" and it will go do it; think "I want to know what the storage capacity is on this system" or "what's the fabric situation in terms of how busy it is," and pquery will let you do that. And then, like I said, there's the wrapper compiler.

There are a couple of conflicts you need to watch out for when you install this. Slurm and Cray both have PMI-1 and PMI-2 libraries of their own, and those libraries are completely incompatible with PMIx; they have their own communication protocols. In fact, those libraries are incompatible across the two environments: you can't take a Slurm PMI-1 library and use it on a Cray; it won't work. The problem is that PMIx provides backward-compatibility libraries for both PMI-1 and PMI-2. In other words, if somebody is making PMI-1 calls in their application or programming library and doesn't want to change them to PMIx calls, they can still link against the PMIx library, and it will translate their PMI-1 or PMI-2 calls into the corresponding PMIx calls and execute them. The difficulty is that if you install our backward-compatibility libraries for PMI-1 and PMI-2 into a default location, you can overwrite the PMI-1 and PMI-2 libraries that Slurm and Cray installed, and then you'll have broken anybody who's trying to use those libraries to interact with the resource manager. So what we recommend, to avoid that, is to use the --disable-pmi-backward-compatibility option on the PMIx configure line, and then we won't build our own PMI-1 and PMI-2 libraries. It means people won't be able to link against PMIx and call PMI-1 or PMI-2 and have it work, but if you've installed PMI-1 and PMI-2 for Slurm or Cray, you probably don't want them doing that anyway.
We're probably going to make this the default, because we've seen usage of those backward-compatibility libraries kind of die off as everybody converts over to PMIx. But just as a heads-up: until it becomes the default, if you're going to install in these environments, you probably want to turn those libraries off. And I'm going to stop there. Kenneth, I don't know if there are any questions for me.

Yeah, we have a couple of questions; let me scroll back and find them. The first one is related to the async and cross-model notifications that PMIx has: are there any examples of open-source applications using the async and cross-model stuff?

There are a couple of research papers that were published on it. Why don't I pass those to Kenneth after the meeting, and then he can maybe include them in the minutes or something like that.

Yes, okay, that makes sense. Another question: you mentioned that PMIx has integration with Lustre; does it also have integration with GPFS or other parallel file systems?

Not currently. We are working with those teams to try to get them to do that; we just haven't gotten them to take the time yet. The best thing to do, if you're interested in that, is to poke your vendor and ask them to please step up and do it.

Okay, that's a good suggestion. And then, in terms of the libevent and hwloc libraries: you mentioned they have to be in sync between Open MPI and PMIx so you don't get any nasty linking issues, but is that the same for Slurm? Do you have to link both Slurm and PMIx against the same libevent and hwloc?

No, you don't, and the reason is that your application is not going to link against Slurm. The interaction between the resource manager and your application is strictly through the PMIx communication protocols, not through any kind of inter-library function calls. So the version of PMIx that you're using in your application doesn't have to be the same version that's configured into Slurm at all; in fact, usually they are not. You just need the cross-version capability to support whatever combination you have, so as long as you're at 2.1.1 or above on both sides, it doesn't matter what the other side is; we'll just negotiate the proper thing.
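Pulling those build tips together, here is a hedged sketch of the two configure invocations involved. Every path and version number is a placeholder for your own installation prefixes, not a recommendation.

    # 1. Build PMIx against an external libevent and hwloc, with the
    #    PMI-1/PMI-2 backward-compatibility libraries turned off (per
    #    the Slurm/Cray tip above):
    $ cd pmix-3.x.y
    $ ./configure --prefix=/opt/pmix \
          --with-libevent=/opt/libevent --with-hwloc=/opt/hwloc \
          --disable-pmi-backward-compatibility
    $ make -j 8 install

    # 2. Build Open MPI against that same PMIx and the very same
    #    libevent/hwloc; as noted below, configure aborts if you mix
    #    an external PMIx with the embedded copies:
    $ cd openmpi-4.x.y
    $ ./configure --prefix=/opt/openmpi --with-pmix=/opt/pmix \
          --with-libevent=/opt/libevent --with-hwloc=/opt/hwloc
    $ make -j 8 install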
Ralph, let me throw in a little extra color here too. The issue really is the interaction of these libraries inside a single process. We're usually talking about the MPI process here, and the MPI process potentially interacts with libpmix, hwloc, libevent, and MPI, so you need to completely disambiguate exactly which instance of each library you're talking about. If you're not careful (and this can actually happen; sometimes it's even useful, but almost always it's more confusing than useful), you could accidentally end up with two different copies of hwloc, or two different copies of libevent, inside the same Linux process, and that's where the types of problems occur that Ralph is warning away from. So if you're doing external PMIx, then do external everything with Open MPI: don't use the embedded hwloc, don't use the embedded libevent, don't use the embedded PMIx. Have external versions of all of those, have both PMIx and Open MPI refer to those external ones, and then you're guaranteed there's only one hwloc and one libevent in your MPI process. I realize this is kind of abstract and confusing, but that's just the crazy world we live in; our computing systems are a bit complex. This point was very definitely worth bringing out, because we've seen users get confused about this exact point.

And just to be clear, Open MPI's configure system checks for this. We've had enough problems with it that if you specify an external PMIx and have not specified an external libevent and hwloc, we will error out and tell you that you can't do that. So we try to protect you: enough people ran into this problem that we decided to just make configure fail, so that you don't get confusing, weird segfaults when you run your applications later.

Yeah, I've seen that pop up; it works well. One more question, maybe, about turning off the compatibility with PMI-1 and PMI-2: should we do that so we can use both Intel MPI and Open MPI?

Yeah, you probably want to at this point in time. Intel MPI has now done their PMIx integration, so going forward you won't necessarily need to do that, but it's not in widespread use yet, so your distribution probably doesn't have that capability at the moment. A year from now that's probably not going to be the case, but today you're probably safe just disabling that backward compatibility.

Okay, that's all we have in the way of questions for Ralph, so let me pass the presenter role to Jeff.

Yeah, so we're in kind of an awkward position here: we're at ten minutes before the hour, and I have a bunch of slides to do. What do you want to do here, Kenneth? Do we continue on to my part?

I think it makes sense to continue, yes. I think most people have planned for an hour and a half, so we should still have some time for questions. Does that work for you?

Cool. Well, now we get past all this boring PMIx stuff and into the real good stuff, the MPI stuff. Ralph's probably on mute, but he's swearing at me right now; this is what happens when you work with somebody for 15 years. Okay, so in Open MPI 4 we have all these frameworks. I'm not going to talk about Open MPI version 3.x; let's talk about the current generation.
4.0.x is the current generation, the current version of Open MPI that's out there. We literally just released the 4.1.0 release candidate yesterday, and I would say it's an early release candidate; there are still some things that aren't even in there yet, but we needed to do some infrastructure work. Anyway, 4.0.x and 4.1.x are what we're talking about here, so I've generalized them together and called it 4.x. These are the frameworks in the MPI layer, with the exception of BTL, which is technically in a different layer, but we always talk about it in the context of MPI, so I've lumped it in here as well. All of these actually do have meanings; I'm not going to read them all out, you can read them yourself, and they'll be in the slides later too. Some of these are more popular and more meaningful than others; some are just behind-the-scenes things that users and system administrators never see. In particular, these are the ones people typically care about. The others are important too, in their own way, and there certainly are cases where they matter, but these are the big ones people usually talk about. So let's go into a little detail about them.

This first one is a little less popular, but we do get this question at least once or twice a month. "io" covers the top-level MPI file operations, the MPI APIs such as MPI_File_open, MPI_File_read, MPI_File_write, and so on. For many years there was a reference implementation out of Argonne National Laboratory, in MPICH, called ROMIO. It implemented all the MPI I/O APIs, so in Open MPI we just had a ROMIO component and dispatched off to ROMIO to do all the work. Several years ago (I wish I could remember exactly which year) we came up with our own, called OMPIO, for "Open MPI I/O." That was primarily work done at the University of Houston by Dr. Edgar Gabriel and his group. OMPIO is actually the default these days in almost all situations; it's still continually being developed and improved, and they do good stuff down there; this is part of the fantastic part of being an open-source community. I believe Lustre support is among the last things they're finally picking up; I don't remember offhand whether that's going to be in 4.1 or whether the Lustre support is coming in 5.0, I'm sorry, I just don't remember. In most cases OMPIO will just select itself, if it can work on your file system and environment; otherwise it transparently falls back to ROMIO. So, for example, in a Lustre environment today, if you do MPI_File_open and whatnot, it'll automatically select ROMIO for you.

The next one is "coll," and that is for the MPI collective operations: broadcast, barrier, reduce, scatter, gather, and all the rest. Which collective algorithm gets used is a really complicated decision. Network collective operations have been an active area of research for two decades, if not longer, so there are a number of well-known algorithms that are "the" way to do things: what's the right way to do a broadcast, what's the right way to do a barrier, what's the right way to do a reduce. Picking which algorithm to use is a multivariate decision: it can depend on how big the message is, how many peers there are, the architecture of the network, the architecture of the nodes; there are a lot of factors,
and it's not a fully solved problem. In Open MPI 4.1 we have all those algorithms; actually, we've had them for a long time, but picking among them is the trick. In 4.1 we did a bit of tweaking there: we improved our algorithm selection at runtime, so at the point where you invoke MPI_Reduce, we do a little bit of thinking about which algorithm to use and then invoke it behind the scenes. We tweaked how that works to get some nice performance improvements, and this is in preparation for Open MPI 5, where we have new components coming that represent many years of research at the University of Tennessee. So there are some optional collective components coming in 4.1. They're not going to be the default, but they'll be there, because we want to get some real-world usage; this is also part of the strength of the Open MPI community, that we have researchers involved who come up with awesome proof-of-concept code, and we as a community harden it up and turn it into a product that can be used by everybody. These new components from the University of Tennessee are pretty good, but we need to get them into the real world, shake out the bugs, and get them nice and stable, so that hopefully in Open MPI 5 we have a whole new generation of collective operations that is stable, robust, and gives a bunch of performance improvements over what's available in the 4.x series. More details about how to use them are forthcoming; I don't have that information today, but potentially we'll have it ready for part three.

The next one is the PML, and we see a lot of questions about the PML. PML stands for "point-to-point messaging layer," so it's the back end of things like MPI_Send, MPI_Recv, and so on. There are three main PMLs in Open MPI, and these are really the engines: even if you do a three-gigabyte MPI_Send, the PML is the one that makes sure the message gets chunked into multiple fragments if necessary, sent across, and reassembled on the other side, whatever is needed to get reliable message transmission to the other side. So we have three different engines. OB1 is the first one, our oldest; it was among the first ones we did in Open MPI, and it is an inherently multi-device, multi-rail engine. Underneath, OB1 uses BTL components, otherwise known as the Byte Transfer Layer; I'll talk more about that in a minute. We have another engine called CM, and that is for a specific type of network called a matching network; I'll explain what exactly a matching network is in a moment. Underneath, CM doesn't use BTLs; it uses MTLs, the Matching Transport Layer. And finally we have one called UCX, which uses the UCX communication library. UCX stands for Unified Communication X; I think the X is meant to be a wildcard, meaning it can talk to anything. That PML just uses the UCX communication library, which does a lot of offloading.

So let's talk about these in a little more detail. I said OB1 is a multi-device, multi-rail engine, one of the first ones we had. What happens is that at startup, and throughout the life of the process, if you say "hey, I want to MPI_Send to rank 17," OB1 will go figure out which BTL instances can talk to rank 17 on that communicator.
It will potentially find all of the BTLs that can talk to rank 17. So let's say you have a good Cisco UCS server with two Ethernet ports that have the usNIC communication protocol on them; I have to do a little plugging for my own product, I apologize for that. OB1 will say: oh, you have two devices out there, you have two usNIC BTL instances, so I will stripe the message across both of them. It can be a bandwidth multiplier in that way, and the same thing happens even with plain vanilla TCP and the other network transports we have BTLs for. It's the inherently multi-device, multi-rail kind of engine.

Like I said before, OB1 was one of the original transports in Open MPI, and going back to the first session: we are very terrible at naming things, so yes, OB1 actually is a Star Wars reference. There are a bunch of Star Wars references throughout Open MPI. So when you look at OB1, that's why it's not named something more intuitive like "multirail" or whatever; it's just called OB1 because that's what it is, and we've been stuck with it for years, so sorry about that. In the picture on the right you can see semantically how it happens: MPI_Send calls OB1; OB1 has found, say, three BTLs that can talk to the other side and gives a chunk of the message to each of them; on the other side, the BTLs all receive their chunks, the receiving OB1 reassembles them into a coherent message, and hands it to the matching MPI receive.

Now, as for which BTLs are available in the 4.x series, here's a list of them. The ones that are common are really self, sm, tcp, and usnic. I put stars on the ones that are least commonly used, though actually some of the others are not very common either: portals4 is not commonly used, smcuda is not commonly used (I'll get to the CUDA stuff in a few minutes), and so on. The uct one is kind of an alternate path for UCX, and the ofi one is an alternate for libfabric; the same with ugni, which is an alternate path for the Cray uGNI interfaces. I'll get to the preferred ones in a minute.

Here's what CM is. I mentioned before that it's for matching networks, meaning networks that were created for MPI. In the last 10 or 15 years, some network vendors have added things natively into their fabric that match the abstractions MPI wants, and that's what's called matching: the concept of having a communicator, an MPI tag, and a payload is built into the network itself. This is as opposed to something like TCP sockets or shared memory, where we have to figure all of that out ourselves ("oh, here's an incoming message; let's look at the tag, let's look at the communicator ID"); in a matching network, that stuff is an inherent part of the network's own interfaces. A bunch of networks are like that these days. So CM is actually a super-thin shim over whatever the underlying communication library is; I tried to represent that with the thin purple bar there. We talk to the underlying library via an MTL plugin, so an MTL is really just a conduit to one of these underlying matching-network libraries: "here's a big old message; do whatever you're going to do to get it to the peer." That also includes shared memory,
because if the network itself supports matching, it has to support matching for both local and remote recipients; that's not a separate namespace, so the matching network fabric itself has to handle both local and remote transports. We don't even have a separate entity for shared memory here: it is assumed that the matching network library handles shared memory for local recipients as well.

Now, in this case there can only be one. In the OB1 world you can have a bunch of BTLs available, and OB1 is the engine that splits messages across them and reassembles them. Here in the CM world, if you're going to do matching at the network layer, you have to know who all the peers are: you can't have some peers handled by one matching fabric library and other peers handled by a different one. They all have to be handled by the single network library itself. Hence there can only be one MTL, and that's the genesis of the name. Again, apologies for the horrible names; this one means nothing outside the Open MPI developer community. CM is a reference to a movie called Highlander, which has a character named Connor MacLeod, CM, and one of the big themes of that movie is "there can be only one." I'm not going to say anything more about the plot; it's an old movie, if you care. But "there can be only one" is a reference to the fact that only one MTL component can be used at runtime. In the 4.x series there are three MTLs that are used: ofi, which is how most libfabric-based networks are used; portals4, which is used in some of the US national labs; and psm2 for single-threaded Omni-Path (PSM is the abbreviation for Performance Scaled Messaging).

The last PML is UCX. The UCX community went a different way: they said, actually, we're going to hide everything from you, we're going to have an entire engine ourselves, and we're going to up-level that to the PML. So the UCX PML is itself a multi-rail, multi-device engine, and it handles whatever the transports are. From Open MPI's perspective, we have no concept of what's behind it: we have one UCX PML talking to another UCX PML. There is a UCX library and some transport in between, but we have no visibility into that. So this diagram is almost a little misleading in that we're not showing the actual transport libraries, but that's what we see from the MPI perspective, and that is actually a specific design point of what the UCX community was going for.

With all of these, you get to a complicated question: which network stack is actually used at runtime? Sometimes there's ambiguity: just about every cluster out there has TCP, but I might have some kind of accelerated transport that I want to use as well; how can I know which one is going to get used? All right, let's take the TCP question out for the moment and assume Open MPI sees your high-performance network: which stack gets used? Here are three general rules which pretty much sum it all up.

First, if you have InfiniBand or RoCE, where RoCE is RDMA over Converged Ethernet (say you have a Mellanox Ethernet card that supports RoCE, or some other vendor's RoCE-capable card), then
by default you will use the UCX PML, because the UCX PML is kind of the next generation of InfiniBand support, and RoCE is really just the InfiniBand wire protocol wrapped up in Ethernet frames; they're kind of the same thing, so you'll end up using UCX for both of those.

Second, if you have a matching network like I described before, or if you have iWARP, you will end up using the CM PML and the relevant MTL. A matching network means all the ones we listed before; iWARP is kind of lumped in here by itself, which is a little weird, so let me explain. iWARP made a big explosion on the HPC scene a bunch of years ago, and it has since faded back a bit. We actually had an iWARP user ask us about iWARP support recently, and we said: yeah, you end up using the CM PML and software emulation in libfabric. That's because the iWARP vendors have kind of faded away from the HPC community, so there isn't an optimized path for it. There is a deprecated optimized path, the openib BTL, but we're discouraging that because the whole openib BTL is going away in Open MPI 5. So in the interim, if you are an iWARP customer, please encourage your vendor to get involved in the MPI and HPC community if you want a better transport under your MPI layer.

Third, if you don't fall into rule one or rule two, you fall into rule three: use the OB1 PML and the appropriate BTLs. BTLs is plural here, as opposed to rule two where you use only one MTL. This includes TCP for "plain Ethernet" environments, and it also includes shared memory, say if you're running on a laptop or something like that; OB1 is going to be used, and that's cool. There are a bunch of others here too.

Shown pictorially: UCX is used for InfiniBand or RoCE. The CM PML plus a couple of different MTLs covers the matching transports that are out there: OFI, remember, is OpenFabrics Interfaces, the formal name for libfabric, so CM is used for libfabric-supported matching networks, in particular Amazon's EFA, Cray's uGNI, and software emulation of iWARP. And the OB1 PML plus BTLs is used for all the others, for example self (which is process loopback), shared memory, TCP, my product usNIC, and a couple of other, less common BTLs. (That last bit got truncated on the slide; I'll make sure it's corrected in the PDF we publish afterwards.)

Here's also a flashback to UCX and libfabric. This was shown in part one, showing all the networks supported by the two libraries and where they overlap. They both have shared memory and TCP support, but I grayed out the ones we don't use libfabric or UCX for in Open MPI; the dark black ones are what we actually use those libraries for. As for the others: Network Direct is a Windows technology, and Open MPI doesn't even support Microsoft Windows these days. We also don't have a transport that uses raw UDP sockets; we could, but there isn't much of a performance gain, at least in our use cases, so we don't. And for shared memory and TCP, we don't use that support directly from UCX or libfabric; usually you'll use the OB1 PML and the appropriate BTLs for that kind of thing. So this is just a comparison of where these libraries sit and how they fit into the overall jigsaw puzzle.
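Before forcing any particular stack (the next topic), it can help to check which of these components your own build actually contains. One illustrative way to do that is with the standard ompi_info tool; the grep patterns are just one convenient filter:

    # List the network-related plugins compiled into this Open MPI build:
    $ ompi_info | grep " btl"           # BTL components (tcp, self, usnic, ...)
    $ ompi_info | grep -E " (pml|mtl)"  # PML and MTL components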
So that's a comparison of where these libraries sit and how they fit into the overall jigsaw puzzle. Now, a common question is: what if I want to use a different network stack? Or, to put a different title on it: what if I want to force the use of a particular network stack? Let's look at our three PMLs. If I want to force the use of OB1 and the BTLs, I use what are called MCA parameters — I believe we touched on these briefly in part one, and Ralph mentioned them earlier in this session. MCA stands for Modular Component Architecture, and MCA parameters are how we pass information to the system at runtime about the user's intent, like "I want to use the OB1 PML." We have an MCA parameter called pml whose value is which PML you want to use, so you can say --mca pml ob1. You can also specify the btl MCA parameter with a comma-delimited list saying "these are the ones you are allowed to use, Open MPI" — for example tcp,sm,self — and then Open MPI will restrict itself to the OB1 PML and only the TCP, shared memory, and self BTLs. Nothing else will be considered at all: even if you have a different network and a different network stack that Open MPI supports, Open MPI won't even open those components; it will only open the OB1 component and the tcp, sm, and self BTL components.

With that long-winded explanation, the same applies to CM: you can say --mca pml cm and then --mca mtl with the one MTL you want to use. Using either of these will force Open MPI onto the specific network stack you want — either a set of BTLs with OB1, or a single MTL with CM — and in this way you can guarantee which network Open MPI is using. This is helpful in a troubleshooting sense. If you're seeing lower performance than you expect and you don't specify the pml and btl or mtl parameters, it could be that something is wrong with your network stack, Open MPI didn't choose it, and it just fell back to TCP — so you're getting lower performance because you're not using the network stack you expect. A typical troubleshooting step is: okay, let's force the use of CM and the OFI MTL, because I have a matching network and I know I should be using it. If that errors out because no matching network is available, that's a telling clue: I should have a matching network available, so why did Open MPI fail to use it? Then you can troubleshoot from there — what's wrong with my matching network, did I not build Open MPI with matching network support, and so on.

Finally, the last one: there are no sub-parameters, at least in Open MPI, for UCX — you just say "I want to use the pml ucx," and again, that one's for InfiniBand and RoCE. Now, it is harmless but useless to specify BTLs with CM, or MTLs with OB1. You can do it; we're not even going to complain about it — you won't get any warning message — those values just effectively get ignored, because if you're using CM we're not even going to look at the BTLs you told us to use, since CM doesn't use BTLs, and similarly for MTLs with OB1. But we get this question a bunch, so I figured I'd throw it on the slide.
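To make that concrete, here's a minimal sketch of all three cases; ./my_mpi_app and the process counts are placeholders, and note that in the 4.x series the shared-memory BTL is actually named vader, as we'll get to near the end of this session:

    # OB1, restricted to the TCP, shared-memory, and self BTLs
    mpirun --mca pml ob1 --mca btl tcp,vader,self -np 4 ./my_mpi_app

    # CM, with exactly one MTL (here, OFI/libfabric)
    mpirun --mca pml cm --mca mtl ofi -np 4 ./my_mpi_app

    # UCX, which has no BTL/MTL sub-parameters
    mpirun --mca pml ucx -np 4 ./my_mpi_app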
All right, let's talk about CUDA — CUDA is a popular source of questions these days. The UCX and PSM2 stacks support GPUDirect RDMA. Remember that there are multiple flavors of GPUDirect; GPUDirect RDMA is the one HPC tends to care about. There are fine distinctions that I'm not going to go into, but GPUDirect RDMA is typically the one you want. There are a whole bunch of tunable options and parameters that are technically outside the scope of Open MPI: we expose them, but they're really underlying options of UCX and/or PSM2 and/or CUDA — we're just providing a mechanism for you to get at them through Open MPI. So I'm not going to go into detail about them; I'll just show you in an upcoming slide how you can see them.

Now, a common question we get is: can I put CUDA code in my MPI application? My answer is yes, but with care. It is complicated and kind of tricky to get right — we really only advise this for experts, so I'll say it is not for the meek. It is possible, and we do have an FAQ section about this on the Open MPI website with some very detailed instructions, so go have a look there if you want more details.

Further on CUDA: we got this question in the last session and I didn't have a good answer, so here's a more complete one. Can I run a CUDA-built Open MPI on a node that has no GPU? I'm actually going to clarify that question to "on a node with no CUDA libraries," because those are technically two different questions. In general, it is certainly easiest if you have the CUDA libraries installed across all the nodes in your cluster. Say you have 100 nodes and only 10 of them have GPUs, because GPUs are expensive: it is certainly easier if you have the CUDA libraries installed on all 100 nodes. Some people don't like to do that — they only want the CUDA libraries on the 10 nodes with GPUs. You can do that, but you may very well need two different Open MPI installs, or more specifically two different UCX installs, because UCX and PSM2 are the things that link against CUDA; you may need different installations to match the fact that you have CUDA on some machines and not on others. That's why I say it's just easier to have the CUDA libraries installed everywhere. UCX can detect "I have the CUDA libraries but no CUDA devices, no GPUs," and that can be handled gracefully at runtime. But if you don't have the CUDA libraries installed and UCX is expecting to find them, the UCX PML will fail to load because of a linker error. That's what the "no" means in the last big bullet there: if UCX was compiled with CUDA support and the CUDA libraries aren't on that node, the UCX PML fails to load with a linker error. That being said — again, just put the CUDA libraries everywhere if you can, and UCX and PSM2 should gracefully handle "I don't see any GPUs present, so I'll ignore the CUDA functionality" — but it's still a runtime linker dependency that has to be there.
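One easy way to check whether a given install was built with CUDA support at all is the mpi_built_with_cuda_support parameter that ompi_info exposes — a small sketch, following the pattern described in the Open MPI FAQ:

    # Prints ...:value:true if this Open MPI was compiled with CUDA support
    ompi_info --parsable --all | grep mpi_built_with_cuda_support:value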
So this gets into the deeper question of how we interface to external libraries — libfabric, UCX, CUDA, and others. I mentioned the parameterized system a couple of slides ago, the Modular Component Architecture parameters. A lot of what Open MPI does is act as the glue to these external libraries; we don't necessarily control all of their knobs, and sometimes we don't even try — we just pass the parameters through. Many of them have corresponding MCA parameters you can use to get at those knobs, for libfabric and others. UCX, for example, chose to go a different direction: they don't use our MCA parameter system, they use UCX-specific environment variables, so you need to refer to their documentation to find out what those are.

That being said, if you are going to set MCA parameters, you can do it in one of three ways. You can set them on the command line: mpirun --mca <key> <value> — in this case foo is the key and bar is the value, and baz is another key with yow as its value. You can list --mca multiple times on a command line, and we'll treat them all as individual key/value pairs. Don't list the same key multiple times — don't say --mca foo a and then --mca foo b — it can be confusing as to which one will actually be used. Just don't do it; keep unique keys and values on the command line. You can also use environment variables, if that floats your boat and you don't want a long, crazy command line: export OMPI_MCA_<key>=<value>, and those get automatically slurped up by mpirun and the other runtime machinery Ralph talked about earlier. And you can have text config files, which is really helpful for site-wide defaults: it's an INI-style key=value text file, and the default location is <prefix>/etc/openmpi-mca-params.conf. There's a truckload of comments in there that explain how it works. This is really great if you have users who don't know or care how your Open MPI works, but you want to set a PML and a set of BTLs and make sure everybody uses them: put them in that site-wide file, and your users will just do mpirun a.out and pick up the defaults you've set. I want to talk more about this in part three, because it gets a little more complicated in Open MPI 5, but that's for a future date.

Also, the ompi_info command — I mentioned it in part one — can show you all kinds of things about your Open MPI installation, including which MCA parameters are available. You can do ompi_info --all, optionally with --parsable, which gives machine-parsable output instead of a pretty-printed format, and it will show you all of the available parameters. These parameters are exactly what's available through the MPI_T programmatic interface as well, so the MPI_T APIs can be used to read and write these values from inside your application — go look at the MPI specification for the MPI_T APIs; we expose all of our MCA parameters through that interface.
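Here's a minimal sketch of the three mechanisms side by side, using the pml and btl keys from earlier as example values; the prefix path is wherever your install lives:

    # 1) On the command line (repeat --mca once per key):
    mpirun --mca pml ob1 --mca btl tcp,vader,self -np 4 ./a.out

    # 2) As an environment variable (OMPI_MCA_ plus the key name):
    export OMPI_MCA_pml=ob1

    # 3) In the site-wide INI-style file <prefix>/etc/openmpi-mca-params.conf:
    #      pml = ob1
    #      btl = tcp,vader,self

    # And to browse every parameter that is available:
    ompi_info --all --parsable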
Now, MPI_T has this concept of levels — there are nine of them — and we've associated those nine levels with our MCA parameters as well. They're basically three sets of three: the first three are for the end user, the next three for the application tuner, and the last three for the MPI developer, meant to be of increasing complexity. Level one is the simplest stuff and level nine is the most advanced stuff. If nothing else, remember: three groups of three, level one is the simplest, level nine is the most complex. We typically put things at level nine if we don't expect users to touch them at all; levels one to three hold the relatively simple stuff; and levels four through six are for more advanced users, people who are really trying to tweak the system. That's the general scope of what we've got there. And with that, I'm at the end, and I think we've still got a couple of minutes for questions — Kenneth, I don't know how many came in.

Yeah, we have several questions, and I think we can take the time to answer them. The first one: does OB1 know about the speeds of multiple channels — for example, InfiniBand versus Ethernet — so that it doesn't put chunks on both the slow and the fast wire, slowing down to the speed of the slowest one? Good question, and there are multiple parts to it. We do have a ranking system of BTLs, so to speak. It's not always possible — actually, with modern Linux kernels it is possible, but back when we wrote OB1 and the TCP BTL it wasn't really possible — to get the line speed of an Ethernet device. But even if you have, say, 100-gig Ethernet, you may not want to use raw TCP sockets, because the latency is terrible; you might still want usNIC, or RoCE, or some other Ethernet transport with much better latency characteristics, even if the bandwidth ends up roughly the same. So we basically have TCP as one of our lowest-priority BTLs — it's generally the fallback if nothing else is selected. There's a complicated mechanism inside Open MPI to ensure that happens, but generally that's the behavior. So if you have usNIC or RoCE, almost certainly a different BTL will be selected if you're forcing OB1 — RoCE will default to UCX, by the way, but usNIC, for example, is an Ethernet transport we handle with OB1, and it will naturally be selected over TCP sockets. You say it will default to UCX, but only if UCX is there, correct? Yes, I'm sorry, that's an excellent point: you have to have compiled your Open MPI with the UCX library. If the UCX PML is available, then it will be the default; otherwise it falls back to the openib BTL in the Open MPI 4.x series — and the openib BTL is going away in the 5.x series, so if you don't build with UCX you will not get RoCE or InfiniBand support. Please start using UCX now; that is what the vendors want you to use.
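Related to the multiple-wires point: if you want to steer the TCP BTL toward (or away from) particular interfaces rather than rely on the ranking, there are MCA parameters for that. A sketch, with eth0 as a placeholder interface name:

    # Restrict the TCP BTL to one interface; btl_tcp_if_exclude does the inverse
    mpirun --mca pml ob1 --mca btl tcp,self \
           --mca btl_tcp_if_include eth0 -np 4 ./a.out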
Okay, next question: is there any performance benefit to forcing a particular PML over just letting Open MPI figure one out? Usually we will pick the right thing. We took a lot of pains with these internal ranking mechanisms and with probing the system at runtime to figure out what to do, so usually Open MPI just does the right thing. And usually that's easy, because you typically have Ethernet plus one other network that you paid extra money for because you needed an HPC-class network — so we see standard TCP and some other thing, and obviously we should use the other thing. Sometimes that doesn't happen, though: sometimes you're in a heterogeneous environment, sometimes complicated stuff arises. That's why we gave you the MCA system and the command-line parameters — if you need something more complicated, you can do it. Usually that's for one of two reasons: one, you have something more complicated, like two different networks and you want to force the use of one or the other; or two, for whatever reason Open MPI just made the wrong choice. Like I said, the code usually chooses the right thing, but sometimes it doesn't — we're writing code that has to run in basically an infinite number of environments, all just slightly different, so sometimes we choose wrong, and then you need to specify it on the command line, in an environment variable, or in a config file. I think the question is also partially about what kind of overhead there is for Open MPI checking at runtime, because if you tell it what to do, it won't have to check. Good question: the overhead is just during MPI_Init. We basically make these decisions at the very beginning of the application — it's not like the decision is made on every MPI send and receive; it's made once at the beginning of time, and it's a fairly negligible overhead. Good question.

One more question: any particular reason to do TCP and shared memory with the BTLs instead of UCX? Is there a performance reason, or did you just have to pick one and it doesn't really matter? Okay, I'm going to disclaim my comments here: I am not part of the UCX community, so I'm going to make a supposition — but I am part of the libfabric community, so I can say what our experience was there. You could use TCP with libfabric and the OFI MTL over CM. The performance is not as great as you might think, because you're adding a couple of layers of software abstraction: we're emulating a matching network over TCP. TCP is not naturally a matching transport, so there's software emulation of the MPI-style matching, and that adds overhead when doing TCP through libfabric. Whereas the TCP BTL in Open MPI is native, customized, exactly what we need and nothing more, so it's a bit more optimized for our use case. When we want to do pure TCP we prefer OB1 and the TCP BTL, because we know it's fairly well tuned for our environment. That's how it is with libfabric, and I suspect a similar case would be true on the UCX side as well.
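If you want to measure that difference yourself, you can pin each TCP path explicitly and compare — a hedged sketch, where osu_latency stands in for whatever point-to-point benchmark you prefer, and where the libfabric provider name may be tcp or the composite tcp;ofi_rxm depending on your libfabric build:

    # TCP through Open MPI's native TCP BTL
    mpirun --mca pml ob1 --mca btl tcp,self -np 2 ./osu_latency

    # TCP through libfabric's tcp provider, via the CM PML and OFI MTL
    mpirun --mca pml cm --mca mtl ofi \
           --mca mtl_ofi_provider_include tcp -np 2 ./osu_latency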
I should also say that on the libfabric side, one of libfabric's primary design constraints is supporting high-speed, HPC-class networks, so at least in libfabric's early life there wasn't a lot of emphasis on TCP performance. TCP was meant for "I want to develop on my laptop, then go over to the big iron and run the exact same application, just with a different transport under libfabric" — so the TCP performance had to be correct and had to work, but nobody was squeezing out every possible cycle or driving the latency as low as possible. Later there actually was quite a bit of emphasis on getting good TCP performance out of libfabric, but still, the TCP BTL we have in Open MPI is a bit more performant for our particular use case. I don't know how the UCX community views their TCP component — whether they've tried to drive performance out of it, or whether it's the similar situation of "I just want TCP so I can develop on my laptop and then go to the big iron and run my same UCX application without having to recompile or change my code."

Yeah, okay, I think that's clear. There was another question which maybe already got answered, but I'll ask it anyway: how would one get the equivalent of --mca pml ucx — specifying the UCX PML — when using srun instead of mpirun? Do you just set the environment variable to specify that UCX should be used, or is there another way? Yeah — Ralph, I'm going to pull you in on this one too; I think the right answer is you should put it in the config file. Ralph, what do we do for environment variable forwarding and... right, Ralph left already; he had another meeting. Oh, super sad. Okay. So, when you're doing direct launch — direct launch is when you use something like srun, i.e., the resource manager's launcher rather than mpirun — there are differences in how things run. We are playing in their sandbox at that point. With mpirun we control everything: we do process binding for you, environment variable forwarding, standard input/output forwarding, things like that. With something like srun, it's their sandbox and their defaults apply. For example, srun does not bind processes for you by default — this was a matter of contention for a while — so you could actually get lower performance if you srun without specifying a binding and run your MPI job, because mpirun binds by default and srun (and potentially others; I'm not an expert on all of them) does not. So there can be a noticeable performance difference depending on how you launch and what options you pass to srun. All of which is to say: I don't remember offhand whether srun automatically forwards environment variables for us — I think it doesn't, but I'm not going to swear to that. That being said, it is safe to just put it in the config file, because the config file doesn't have to be forwarded: it's sitting there on your network filesystem, and we can open it up on your MPI nodes and see it. Sorry, long answer — all these things end up being unexpectedly complicated.
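As a concrete sketch of the two routes just described — the config-file route, which needs no forwarding at all, and the environment-variable route, which relies on your launcher forwarding the environment; note that the right --mpi plugin name (pmix, pmix_v3, ...) depends on how your Slurm was built:

    # Option 1: a site-wide default in <prefix>/etc/openmpi-mca-params.conf
    #   pml = ucx

    # Option 2: an environment variable, if your srun forwards the environment
    export OMPI_MCA_pml=ucx
    srun --mpi=pmix -n 4 ./a.out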
Okay, we have one more question: using CUDA-aware MPI basically allows the developer to pass device buffers to the MPI API. If such a program is run against an Open MPI compiled without CUDA support, is there any fallback to non-RDMA communication? Can you repeat that question? Say Open MPI is compiled without CUDA support — is there a fallback to non-RDMA communication when the application uses CUDA-aware MPI? Oh, okay. So this gets into very overloaded marketing terms. Let's say you're using UCX, UCX was not compiled with CUDA support, and you're on InfiniBand. In that case, if you try to MPI send and receive from GPU device memory, I think it's just going to fail, because without CUDA support inside UCX the memory mapping is going to go wrong — I think that will just flat-out fail. But if you're running a CUDA application and you copy things back up into main memory yourself, and then do MPI send and receive, we'll just do normal RDMA kinds of things across your UCX/InfiniBand network. The reason I mention overloaded marketing terms is: be aware of the difference between RDMA and GPUDirect RDMA. RDMA is InfiniBand-style communication offload — sending directly from memory through the network card without going through the operating system, specifying the source and target addresses, all that kind of stuff. GPUDirect RDMA means sending and receiving from GPU device memory, regardless of whether the underlying transport is actually RDMA or old-style send/receive. It's a bit confusing of a term, but that's what it means: GPUDirect RDMA is about device memory, no matter what the underlying transport is, even if it's not RDMA. Did I answer the question? I think so — we're getting a response that the question is answered, so that's good.

One more question, and I have a related one: is there an easy way to know which network BTL has been selected? And my related question: is there a way to query what Open MPI sees in terms of available BTLs? Ah, good one. This has been a perennial ask forever — "can you just show me which networks you're using?" It turns out to be surprisingly difficult, because it's a distributed question: it's a per-process decision. Almost always every process makes the same decision, but the real problem is, say I'm running on 100 nodes and node 77's HPC-class network is down, so it falls back to TCP. Everybody made the same decision everywhere, except when talking to node 77 — and node 77 made a different decision for everyone. Noting the exception is what becomes difficult, because we'd basically have to do a global gather of everybody's communication endpoints to figure it out, and particularly when running at scale, that is not scalable. Fairly recently, IBM contributed a patch upstream to the open-source Open MPI — as opposed to their internal Spectrum MPI — coming in Open MPI 5; I forget the exact name, but you can basically put a thing on the command line and we'll output an abbreviated graph of what the connectivity was. There are a couple of disclaimers with that, and it's kind of complicated — let's take a note, Kenneth, to make sure we mention that one in part three. The second half of the question — can you programmatically query Open MPI for that information — is a no right now. The only thing planned for Open MPI 5 is this command-line parameter that emits something on standard out. If an API for that is something people are interested in, we can talk about it — although that gets kind of tricky, because the information goes back to mpirun; it doesn't really go back to an MPI process. My question was actually about getting it on the command line as output — so you're saying that's not available yet, but it's planned? Yes, on the command line it's coming in Open MPI 5. Right now, what we tell people is: if you need to know which one it is, your best bet is to specify the PML and either the BTLs or the MTL explicitly on the command line, and see if you get a performance difference compared to when you don't specify them — so, discovered by inference rather than by a specific listing. There is some debugging output you can do, but it's fairly messy: you can set MCA parameters that turn on verbosity in OB1 or CM, but it gets pretty hard to read, because you get a truckload of output and have to decipher it — "oh, OB1 finally decided it can't use usNIC" — buried 300 lines deep.
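For reference, the verbosity knobs being alluded to are the per-framework _base_verbose MCA parameters — a sketch, where 100 is just a conveniently high verbosity level:

    # Very chatty selection and connectivity output from the BTL and PML frameworks
    mpirun --mca btl_base_verbose 100 --mca pml_base_verbose 100 \
           -np 2 ./a.out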
Okay, there's a question popping up that we may be answering in part three, but I'll raise it anyway: are there any advantages or disadvantages of using srun versus mpirun — direct launch versus mpirun? It's a pretty good question in terms of what we should advise our users. Yeah, again, I wish Ralph were here to answer this one. I think the biggest win is for people who don't want to learn mpirun — no, that's not fair; there are pros and cons on both sides. Let's talk about the pros of using srun directly. If your users or customers are used to srun because they launch non-MPI things, great: don't make them learn mpirun; they can launch their MPI processes through srun. They're probably already familiar with all the command-line options and environment that srun has, so you can keep them in that ecosystem and they don't have to learn something new — "what is this mpirun thing, why is it different?" So if you like that ecosystem, keep doing that. There used to be a scalability argument as well — that direct launch was more scalable — but I don't believe that's true anymore, with all the advancements in PMIx and modern integration. If you have recent versions of Open MPI and recent versions of Slurm, I think your scaling capability is going to be pretty much the same whether you use direct launch or mpirun, because underneath they're using the same mechanism. So I think that is not so much a factor — but we should get confirmation from Ralph on that, so please take that as "Jeff thinks that's true," and we'll talk about it in part three. The pros of using mpirun are that ours is more tailored towards MPI, and Open MPI in particular: we bind by default, we provide access to the MCA parameter system, we do the things that are relevant to the whole MPI ecosystem. It's a more specific launcher for what you're trying to do, whereas srun is a generalized launcher that can launch a million different things and therefore isn't super customized to MPI-specific things. So it's really not that one is worse than the other — it's which one you want to play in, accepting the pros and cons of that environment.

Okay, good. One more that popped up: you covered OB1 versus UCX for TCP — is it the same argument for shared memory as well? Why does Open MPI prefer OB1 rather than UCX for shared memory? Very much so. We have spent a ton of time optimizing the hell out of our shared memory, so I would definitely use ours. In Open MPI 4.x the shared memory BTL is called vader — another terrible Star Wars name, I apologize (and I need to go back and make sure my slides say the right thing about 4.x) — and in 5.x we're finally renaming it back to sm, which is a little more intuitive for shared memory. The alias vader will still be there, so if you have scripts that use --mca btl ...vader... they'll still work; it'll just be an alias for sm. But in 4.x it's vader. That BTL has been tuned and optimized crazy hard for the MPI use case: it does single-copy if you have KNEM or CMA or one of the other single-copy mechanisms available in Linux these days, and I believe it is generally ahead of the shared-memory performance in both libfabric and UCX.
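For the curious, the single-copy behavior in the 4.x vader BTL is selectable through an MCA parameter — a sketch, assuming a 4.x install on a kernel where CMA is available:

    # Ask the vader (shared memory) BTL to use Linux CMA for single-copy
    # transfers; other accepted values include knem, xpmem, emulated, and none
    mpirun --mca btl vader,self \
           --mca btl_vader_single_copy_mechanism cma -np 2 ./a.out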
Okay, good, I think we have all the questions covered, so thank you very much. Let me also throw a disclaimer on that last answer: go test it, and make sure I'm not lying to you — test the shared memory of libfabric, test the shared memory of UCX, and if we don't beat them, we need to fix that, because we should be more customized for this environment. If you've got a problem, let us know and let's fix it. Okay, good, yeah, you have it covered, so I think we're good to wrap up — we're a little bit over time, but I think we have all the questions answered. There will indeed be a part three, on August fifth, about four weeks from now. Thank you very much, everyone, for joining us and raising all the questions. Oh man, I apologize — I'm going to go back to the previous slide; I forgot to update the last one. This is the right date: it's August fifth; ignore what was just there on the last slide. Yeah, August fifth is part three of the session. Thank you, everybody. Okay, thank you, everybody.