Welcome to another edition of RCE. This is Brock Palen. You can find us online at RCE-cast.com, where you can find over 100 episodes about research computing, scientific computing, and other topics. I have with me again Jeff Squyres from Cisco Systems, one of the authors of Open MPI. Jeff, what's going on?

Hey Brock. Today we actually have one of my colleagues from the Open MPI community here to talk about something that started as a sub-project but has grown into an entire community and project unto itself. So I'm just going to jump right in and say: Ralph, I wonder if you could introduce yourself.

Yeah, sure. My name is Ralph Castain. I'm a principal engineer at Intel Corporation. I've been working on Open MPI with Jeff for a long time now, and we started this PMIx community that Jeff asked me to come talk about today.

All right, so Ralph, you said the keyword right there: PMIx. Why don't you give us the two-minute version? What is it?

Yeah, so PMIx stands for Process Management Interface for Exascale. Basically, it's a standardized way for applications to interact with the system management stack, the resource manager and things like that, to request services of various types and get a response back.

So what's the history of PMIx? I think there are some other PMIs out there.

Yeah, there are. PMI originally started quite a while back as a way for applications to wire themselves up, to exchange addressing information, basically to say "how do I talk to you?" But over time the needs grew to where applications needed to interact over broader topics, and PMIx grew out of that as a response: let's give you more ways to interact.

Okay, so you said a second ago that this deals with resource managers. I assume you're talking about Torque and Slurm and LSF and all the others that are out there. But what exactly does that mean? What does an application need to get from a resource manager?

There are really two types of things they do. First off, when you're launching, a resource manager can provide you with all kinds of information about your job that's really helpful when you're trying to optimize communications and collective operations. It can tell you where all your peers are located, what addresses they have, and so on. That's information you can have at the very beginning, so you don't have to exchange it afterwards. The other type of thing you want to do is, for example, ask for additional allocations of resources, or spawn additional processes, or ask what the status of the queues is. Those are the kinds of services that applications, as they evolve, really want to be able to use.
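For readers who want to see what "interacting with the system management stack" looks like in code: the sketches scattered through this transcript are minimal illustrations against the open-source reference implementation's C API, not canonical usage. The most basic client simply connects to its local PMIx server and disconnects again; the three-argument PMIx_Init shown here is the v2-era signature, so adjust for the version you actually have.

    /* Minimal PMIx client: connect to the local PMIx server and
     * disconnect again.  Sketch against the v2-era reference
     * implementation; error handling trimmed for brevity. */
    #include <stdio.h>
    #include <pmix.h>

    int main(void)
    {
        pmix_proc_t myproc;   /* filled in with our namespace and rank */
        pmix_status_t rc;

        rc = PMIx_Init(&myproc, NULL, 0);
        if (PMIX_SUCCESS != rc) {
            fprintf(stderr, "PMIx_Init failed: %s\n",
                    PMIx_Error_string(rc));
            return 1;
        }
        printf("Running as %s:%u\n", myproc.nspace, myproc.rank);

        PMIx_Finalize(NULL, 0);
        return 0;
    }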
So, in my admittedly limited worldview here, I think of this as applicable to MPI applications. When you talk about wire-up and communications, you're referring to MPI applications of any scale, honestly, from two processes to two million processes: when they start up, they need to exchange MPI addressing information, whether that's their Ethernet addresses, their InfiniBand addresses, or whatever type of networking addresses, so that if I MPI send to you, I know how to open a network channel to you. That's the kind of stuff you're talking about?

Part of that is certainly in there, but it's also applicable to non-MPI processes. Let me give you an example. Say I'm in a cloud, running an application; it doesn't have to be an MPI application, any application in the cloud. There are things like, for example, when I get started I can use PMIx to communicate to the cloud manager that I'm willing to be preempted. That can be a service kind of thing: if you're willing to be preempted, you get a different rate on your charges. So I can announce to the cloud manager that I'm willing to be preempted, and I can use the PMIx mechanisms so the cloud manager can tell me, "Hey, I need to preempt you now," and wait for me to say back that I'm ready. I can go ahead and checkpoint my job or do whatever I need to do to prepare for it, and then I tell the cloud manager, "I'm ready to be preempted now." There's a whole bunch of things like that you can do that have nothing to do with MPI.

So this operates almost like a message bus, where different clients can basically say what they're able or willing to do?

Exactly right. One of the mantras we have in the PMIx world is that PMIx does nothing. All it does is communicate your request to the local system management stack and return the response. That management stack always has the right to say, "Nice, I'm glad you asked, but I don't support that," and your application has to have some mechanism for dealing with a "not supported" response. But all the things people talk about, flexible workflows and the ability to manage their own environment better, we just provide the hooks by which you can do that.

So is this done as a core plus extensions? If my client doesn't know about something... It seems like you could keep adding more and more information that you could announce you're capable of doing or communicating. How do you make sure things remain compatible as more and more things get mentioned or announced?

Well, we adopted an architecture with very, very simple APIs. Take job control, which is how you announce, for example, that I am preemptible. There's just one job-control API, and you provide an array of key-value attributes to that API that describe the operation you want to perform. So if we want to extend the kinds of things you can do, say for job control I want to add something beyond just announcing that I'm preemptible, we don't change the API and we don't add another API. We just define a new key-value attribute that you can pass. That's how we maintain backward compatibility. Our policy is that we do not add APIs unless there simply is nothing that fits, and only then will we add one.
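To make that concrete, here is a rough sketch of what announcing preemptibility looks like through that single job-control API. The PMIx_Job_control_nb call and the PMIX_JOB_CTRL_PREEMPTIBLE attribute appear in the published standard, but treat the details, including the spin-wait completion handling, as an illustrative sketch rather than a definitive recipe.

    /* Sketch: announce to the system management stack that this
     * process is willing to be preempted, using the single
     * job-control API plus a key-value attribute. */
    #include <stdbool.h>
    #include <pmix.h>

    static void ctrl_cbfunc(pmix_status_t status,
                            pmix_info_t *info, size_t ninfo,
                            void *cbdata,
                            pmix_release_cbfunc_t release_fn,
                            void *release_cbdata)
    {
        /* The host RM replies here; PMIX_ERR_NOT_SUPPORTED is a
         * perfectly legal answer that the app must handle. */
        if (NULL != release_fn) {
            release_fn(release_cbdata);
        }
        *(volatile int *)cbdata = 1;   /* flag completion */
    }

    void announce_preemptible(void)
    {
        pmix_info_t directive;
        volatile int done = 0;
        bool flag = true;

        PMIX_INFO_LOAD(&directive, PMIX_JOB_CTRL_PREEMPTIBLE,
                       &flag, PMIX_BOOL);
        /* NULL targets => the directive applies to our own job */
        PMIx_Job_control_nb(NULL, 0, &directive, 1,
                            ctrl_cbfunc, (void *)&done);
        while (!done) { /* spin, or use a condition variable */ }
        PMIX_INFO_DESTRUCT(&directive);
    }

Note how extending the interface never touches the function signature: a new capability is just a new attribute loaded into the directives array.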
We just we don't we have just a policy We do not add API's unless there simply is nothing that fits and then we'll add one So what exactly is the big deal here because way back at the beginning and again I'm admittedly taking a limited worldview of MP of MPI applications, but I understand this applies well beyond that as well, but we used to just do SSH and Pass things on the command line like if I want to start a 32 process or 128 process MPI job We could just do SSH and maybe do that a little smartly maybe via a tree or something so that it wasn't linear Why why do we need all this extra control stuff? Of what used to be a relatively straightforward process Well, okay, so there's there's two different parts to that one is you know if you're looking at an MPI job Which I know is your your focus Jeff the The problem is there's only so much you can pass on a command line And so as your job gets bigger or as the amount of information that you want to pass gets larger You just can't fit it on the command line anymore And so you need some secondary mechanism for for making that for passing that information around what we used to do Was was the PMI approach, right? Which was to say well, we'll take you know Everybody will since we have a limit on the on the command line We only use the command line for what we have to and then everybody simply broadcast that and there's some you know out of band Mechanism by which it gets exchanged And that has all the scaling issues and so that's why we went this way is saying well Let's just get back to that basic thing give a mechanism by which the resource manager can convey that information to you Outside of the command line limitations Okay, so this is a way for applications to talk to the resource manager But how does that address that I'm process a and I want to communicate with process B and Therefore I need to know some kind of network address for process B. How does that work? so Yeah, so the way we did that in the past right was that process a would discover a network address He would broadcast it to everybody and process B would then receive that broadcast say oh, okay I could communicate to you What we've done is that when the resource Manager is getting ready to start the job. We've given them a an API a function They can call that will talk to the network and find out what are the addresses that are going to be used by the different Processes what nodes are they going to be on and what addresses are they going to be assigned to and then we include that in the Information that's given to every process when it first starts up And so when the process first starts up process a starts up it can ask What's the address for process B and it already has that information there? 
So is this something for end users, or is this just for creators of MPI libraries like Open MPI and other higher-level tools that users interact with? Is this just between the creators of those middleware projects and the resource managers, or is this something a user can directly interact with?

It's really both. The libraries, the Open MPI libraries, the OpenSHMEM libraries, and so on, embed interactions that use these PMIx interfaces to do their basic wire-up and the other operations they perform. But application developers themselves are using it too, because they're the ones who know their application's workflow requirements: they want to be able to allocate more resources, or whatever it is they want to do, and they use it directly. And then you're seeing people embed it in tools. For example, say I write a tool that lets me launch jobs, with certain command-line options and things that I really like. In the past you'd write that, but it would be specific to the resource manager you were locally using. Now you can write it with PMIx instead, take that same software, and simply move it from one resource-manager environment to another without having to change any of the code.

Okay, so this is really about having a portable, standard API that multiple resource managers and middleware developers can develop to?

Correct.

Okay, so this does not necessarily replace something like the Task Management API, tm_spawn, inside one of the PBS derivatives out there, because PMIx, again, doesn't actually do anything. But now Open MPI doesn't develop to tm_spawn and all the other different interfaces out there; it just develops to PMIx.

That's exactly correct. Open MPI calls PMIx_Spawn, and PMIx takes care of the abstraction for it.

So just to nail this down: you're saying PMIx is a library. Is that right?

It's actually three pieces. First, there's an actual standard, and that just defines the APIs and a set of key-value attributes that we all agreed we would support, or at least recognize. It says nothing at all about implementation, and people are always free to implement it themselves. The second piece is a reference implementation. That's a complete library, both client and server, that fully implements all the PMIx function calls. Again, all it does is communicate: the client function calls talk to the server function calls, which then call the relevant back-end resource-manager functions to actually do something. The third piece is a reference server. It's a runtime, if you will; it looks just like a resource manager that supports PMIx, except that it doesn't have a scheduler in it. So if somebody wants, for example, to develop some PMIx-based code in an environment that doesn't yet support PMIx, they can run this reference server there, and it operates just as if they were sitting on a PMIx-enabled resource manager, except, like I said, it doesn't do scheduling. Those three elements are what we mean when we talk about the PMIx community.
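Ralph mentioned above that Open MPI simply calls PMIx_Spawn rather than tm_spawn or any other RM-specific call. A rough sketch of what that portable call looks like, using the standard's PMIx_Spawn signature; the "./worker" command and the process count are made up for illustration, and the host RM is free to answer "not supported."

    /* Sketch: portably spawn 4 copies of a worker across whatever
     * resource manager happens to be underneath. */
    #include <string.h>
    #include <pmix.h>

    pmix_status_t spawn_workers(void)
    {
        pmix_app_t app;
        char child[PMIX_MAX_NSLEN + 1];  /* namespace of spawned job */
        pmix_status_t rc;

        PMIX_APP_CONSTRUCT(&app);
        app.cmd = strdup("./worker");    /* hypothetical binary */
        app.maxprocs = 4;                /* argv/env left NULL here */

        /* PMIx relays this to the host RM, which does the real work */
        rc = PMIx_Spawn(NULL, 0, &app, 1, child);
        PMIX_APP_DESTRUCT(&app);
        return rc;   /* may be PMIX_ERR_NOT_SUPPORTED */
    }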
Okay, so you're then also implying here that the resource managers out there are supporting PMIx directly. Is that a correct inference?

They are. Some of them already do: Slurm, for example, has been doing it for a year and a half now, and IBM has been developing their job step manager, which is really PMIx all across the board, and others are coming along at various stages. So yes, eventually we do hope that most, if not all, of the resource managers out there will in fact provide that support directly.

That's actually a pretty fascinating place to be: applications have a standard they can write their large-scale codes to, MPI being one of the biggest but not the only one, and the runtimes also have a standard for talking to the back-end schedulers. That would be fantastic, because as you and I both know, maintaining an MPI implementation that talks to all these different resource managers, each with different abstractions and different APIs and whatnot, was kind of a nightmare. So, since this is a fairly large disruption in this community, how did you manage to pull it off?

Well, having some role in Open MPI obviously put me in contact with a lot of these people to begin with; like you said, we had to talk to them because of all the different interfaces we had to support, so there was some personal contact involved. But the real thing was that this is a need that all of us involved in these communities, the resource-manager folks and so on, knew we had to do something about. Some of the resource-manager folks had already started writing proprietary responses to it, which was causing a lot of consternation in the user community, because you then had to lock your application into that environment, which is something people really don't like to do. So it was a recognized need, and the real key that I believe let us get adoption by the resource-manager guys was the stipulation that the resource manager always has the right to say "not supported." If you don't want to support a particular back-end capability, you just provide a NULL for that function pointer, and if the PMIx server gets a request for it, it sees the NULL on the back end and simply returns "not supported" for you; you don't have to do anything. If you do want to support that capability, that function, but maybe not every option somebody could pass, you have the right to look at those options and reply, "Sorry, I don't support that." I think that "not supported" ability was one of the key things.
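On the resource-manager side, that "just leave it NULL" design shows up directly in the reference implementation's server API: the host fills in a module of callback function pointers, and any entry left NULL is automatically answered with "not supported." A hedged sketch, with field names as they appear in the v2-era reference headers; a real RM would implement far more than a fence.

    /* Sketch: a host RM daemon initializing the PMIx server library.
     * Only the callbacks the RM wants to support are filled in;
     * requests hitting a NULL pointer come back PMIX_ERR_NOT_SUPPORTED. */
    #include <pmix_server.h>

    static pmix_status_t my_fence(const pmix_proc_t procs[], size_t nprocs,
                                  const pmix_info_t info[], size_t ninfo,
                                  char *data, size_t ndata,
                                  pmix_modex_cbfunc_t cbfunc, void *cbdata)
    {
        /* A real RM would aggregate data across nodes here, then
         * return the payload via cbfunc.  Stubbed for the sketch. */
        if (NULL != cbfunc) {
            cbfunc(PMIX_SUCCESS, NULL, 0, cbdata, NULL, NULL);
        }
        return PMIX_SUCCESS;
    }

    static pmix_server_module_t mymodule = {
        .fence_nb    = my_fence,
        .spawn       = NULL,   /* we choose not to support spawn,     */
        .job_control = NULL,   /* or job control: clients simply get  */
                               /* "not supported" back                */
    };

    void start_pmix_server(void)
    {
        PMIx_server_init(&mymodule, NULL, 0);
    }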
So, a second part of this inference about the resource managers supporting PMIx natively: are they writing their own implementations from scratch, or are they using your reference implementation? And as a consequence of that, is there a standardized network protocol that you use? Because then you could divorce the software implementation from what is communicated across the network.

They're all free to write their own code. So far nobody has done that, and there's no indication that anybody wants to. We did not standardize the protocol between the client and the server, so if somebody writes their own and it's incompatible, you just have to make sure you link against the same client library they used. They'd have to provide both a server and a client library, but the APIs would be the same, so the code stays compatible, let's say. So far nobody has gone that route; it's a lot of code to write, and nobody has seen a competitive advantage, I should say, in implementing their own. What's happened is that the competitive basis between the resource-management environments has shifted from the APIs to which capabilities they support on the back end and which ones they don't.

Okay, so one of the goals of PMIx: you've got exascale right there in the name. What exactly does PMIx bring to the table, if it doesn't actually do anything, that gets you that extreme scale?

It's really in two pieces again. Exascale requires, first off, that you be able to launch the job in a reasonable amount of time. If you just took the current method of broadcasting and sharing things, on an exascale machine it might take tens of minutes to start a job that size, and that's obviously something we would really not like to see. Part of the answer is having the resource manager share the information it already has. We went back and looked at what was actually being broadcast around, and it turned out that more than 90 percent of that information the resource manager already knew; it just didn't have a standardized mechanism by which it could share it with the application. So either the application had to come up with a resource-manager-specific way of getting that information, or we had to standardize it so the library could be portable. We took the approach of saying, look, the resource manager already knows a lot of this stuff; let's just give it a standardized way of communicating it. The second half was going to the resource-manager guys and saying there's an additional 10 percent of the information that we would need, and if we had it, we wouldn't have to broadcast anything at all. We had to get them to agree to provide that 10 percent, and that's what we were finally able to do, to get a commitment from them. So we created a list, it's on the web, of all the stuff we need, and the resource-manager guys are going through and filling that list in.

Aren't you just moving all the startup time from the MPI runtime, the wire-up, whatever you call that very first step, over to the resource manager? Normally the resource manager has a lot of this information, not the individual nodes, right? I'm not yet quite seeing how this actually benefits things if all the information is still just held in one place on one part; it's just now in the resource manager.

Okay, well, let's take the endpoint information as an example. The way it works today, or has in the past, is that the application process starts and discovers a resource.
Let's use sockets as an example. It opens a socket and gets a socket number, and then it has to broadcast that, because none of its peers know what socket it's listening on. One way you could address that would be to assign static sockets to your processes; then you don't need to exchange the socket information anymore, because everyone can compute what socket everyone else is on. But the problem is you might not be the only application running on that node, so you don't know which sockets you can actually take. What we now do is say: okay, resource manager, you use the PMIx plugin for the fabric. That plugin will manage a pool of sockets, based on its knowledge of what's being run across the different jobs and nodes, and it will use that to assign static socket numbers for this application. Those numbers are then included when the resource manager sends its launch message out to the compute nodes; that information goes along with it. So instead of having an exchange, the daemons are given all that endpoint information before they even start the application processes, and they simply convey it down to their local clients. That eliminates the need for each client to broadcast its information. So you go through and look at which pieces of information the different libraries are asking for, you make a laundry list, and then you ask the resource manager, the workload manager, to include all that information for every node in the launch message it sends out.

Now, you just used TCP sockets as an example, and that's kind of a baseline, but many more high-end HPC environments use other types of networking, and there are several available. I just want to clarify that TCP was just an example: you could do the same thing regardless of the back-end type of network, right?

That's correct, and we already do support those, at least all of the most popular ones. The network interface support in PMIx on the server side is a set of plugins, so there are plugins for all your favorite flavors of fabric. We've worked with the network manufacturers to get those plugins available, and those plugins are all now capable of creating those addresses for you.

So, just to drill down on this a little more, and to make explicit an inference there: when you say the information is sent out to the daemons, you mean it's sent once to each node, or server, on the network, even though that's kind of an amorphous term. It's sent once, and then if you've got 20 or 30 or 40 or more cores on that server, the daemon receives the information once and can locally give it to all of the processes that start up, whether that's 20 or 30 or 40, using local IPC, not networking. Right?

That's one of the wins. That is one of the wins; that's correct.
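For contrast, the explicit exchange that this pre-staging eliminates looks roughly like the following in PMIx client terms: each process publishes its own endpoint, everyone synchronizes, and only then can peers be looked up. A sketch using an illustrative, application-defined key name; when the resource manager pre-supplies the endpoints, the Put/Commit/Fence steps simply disappear and a PMIx_Get alone suffices.

    /* Sketch: the old-style wire-up that PMIx pre-staged endpoint
     * information makes unnecessary.  "myapp.endpoint" is an
     * illustrative, application-defined key. */
    #include <pmix.h>

    void old_style_exchange(const char *my_endpoint)
    {
        pmix_value_t val;

        /* 1. publish my own endpoint */
        val.type = PMIX_STRING;
        val.data.string = (char *)my_endpoint;
        PMIx_Put(PMIX_GLOBAL, "myapp.endpoint", &val);
        PMIx_Commit();

        /* 2. global synchronization so everyone's data is visible;
         *    this collective is exactly the scaling cost avoided */
        PMIx_Fence(NULL, 0, NULL, 0);

        /* 3. peers can now PMIx_Get "myapp.endpoint" for any rank */
    }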
All right, and then additionally, on top of that, you also do some compression techniques in the launch message that's sent via PMIx, right?

Well, PMIx doesn't send the launch message; the resource manager does. We just provide the information for them. We say, look, this information needs to go along; here's a payload that has everything we need you to take with you. And you'd be surprised: the information is not as big as you might think. A typical PMIx-enabled launch message, from Slurm for example, is less than a megabyte. We obviously compress anything we're going to provide up to Slurm; Slurm just gets it as a blob and sends it along. And all the resource managers use things like regular expressions to describe where the processes are located, to try and keep the launch message down. So it really is only about a megabyte in size or so, even for an exascale-class machine.

Ah, that was my next question. So, a megabyte for what scale? How many processes are you talking about there?

In our biggest test cases we've been launching a million processes on something like 30,000 nodes, and that's about a one to one-and-a-half megabyte launch message. So it's not very big.

So what about use cases besides MPI? I know that's where this started, but you also talked about the cloud, where you could say you're preemptible. Have there been any implementations that touch on one of those other examples?

Yeah, there are, and I apologize that I'm not at liberty to give you names and details, because those companies haven't taken it public yet, but there are people working on cloud interactions like we talked about earlier. There are also people working on different kinds of tools, debugger tools and so on, that can use these interfaces to do more than they do today. For example, in a debugger today you get a node-level representation of where everything is, and there's a limited amount of information it can provide, because the interface is limited. With a PMIx interface you could ask, for example, to show the processes in a network-based layout: where they are relative to each other on the network. You could ask the fabric for traffic reports and show where the choke points are, because the interfaces allow you to make that query to the system management stack and get that kind of information back.

Now, you said something a minute ago that I want to dive into a little bit. You said you were testing at a million processes. How do you test at that scale?
Well, we have friendly users at facilities that have these kinds of big machines, and they will generally take a little bit of time out and run some tests for us, which has been very much appreciated. We also have ways to simulate scale. For example, we can launch multiple processes on a given node and make them look like they're sitting on different nodes. One of our collaborators was kind enough to do that on Amazon, where he takes a small number of Amazon nodes but makes it look, from a PMIx standpoint, like a much, much bigger cluster, and we can do scaling tests on that. They may not be fully realistic in terms of a cluster, but they give some pretty good scaling-law measurements. So we have ways of getting that information even when we can't get hold of the big clusters.

So could PMIx help in heterogeneous environments? By this I mean, hybrid systems are popping up all over the place: sometimes you have accelerators in nodes, FPGAs are coming back on the market as an alternative, and for me at the University of Michigan, our cluster has different machines of different architecture types with different accelerators. Is this a way a given application could basically optimize for whatever machine it landed on, by asking the system, "Do you have these things I know about?"

Yeah, you could query what resources are available to you. There are coordination mechanisms in PMIx that might allow you, for example, to say, "I only have four GPUs on my machine; how many have you got on yours?" Then maybe we handshake and you go ahead and run something over there. So those mechanisms are in place. I don't think anybody, ourselves included, has really fully understood everything we can do with PMIx. The community has been trying to enable people to experiment with it, and the expectation is that people will do things with it that are beyond anything we had in mind. That feedback will come in, and then we'll be able to offer better mechanisms for some of those things. We've never had this kind of capability before, to be honest with you, and we're feeling our way at the moment, asking, "Is this useful to you, and how might you use it?", and hopefully the community will have a chance to try and test those things out.
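The query side of those capabilities goes through a single general-purpose call in the reference implementation, PMIx_Query_info_nb, where the keys express what you want to know. The early standard has no GPU-specific key that we can vouch for, so this sketch uses queue status, one of the services mentioned earlier, purely as an illustration; which keys get answered is entirely up to the host resource manager.

    /* Sketch: ask the system management stack a question via the
     * generic query interface.  "Not supported" is always a legal
     * reply that the application must handle. */
    #include <pmix.h>

    static void query_cbfunc(pmix_status_t status,
                             pmix_info_t *info, size_t ninfo,
                             void *cbdata,
                             pmix_release_cbfunc_t release_fn,
                             void *release_cbdata)
    {
        if (PMIX_SUCCESS == status && ninfo > 0) {
            /* info[] holds the answers, keyed by what we asked */
        }
        if (NULL != release_fn) {
            release_fn(release_cbdata);
        }
        *(volatile int *)cbdata = 1;
    }

    void ask_queue_status(void)
    {
        pmix_query_t query;
        char *keys[] = { PMIX_QUERY_QUEUE_STATUS, NULL };
        volatile int done = 0;

        PMIX_QUERY_CONSTRUCT(&query);
        query.keys = keys;
        PMIx_Query_info_nb(&query, 1, query_cbfunc, (void *)&done);
        while (!done) { /* wait for the asynchronous answer */ }
    }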
You mentioned earlier, too, that you could ask the fabric how congested it is in different places. With more complex fabrics, especially on the bigger systems, that's definitely useful, but even on regular systems you might have fabric islands, or a dual fat tree, or something like that. Is there a way, or is anybody doing anything, where you could effectively choose an optimized collective operation between hosts based on the fabric architecture, which, again, the application can find out about by asking through one of these standard interfaces?

I don't think anybody has done that yet. Again, that's one of those things where, given these tools, we expect researchers in particular to start asking those kinds of questions. I believe there are in fact groups that are starting to ask exactly that kind of question about collective optimization, but I don't think they have published anything yet.

And this really only goes one direction right now, right? By that I mean, if an application encounters an error condition that could be system-related, think of something like the PBS health check, I can't use the PMIx interface to send information back to the resource manager, or to some sort of metric system or something like that?

Well, you can. There are two ways you can do that. One is you can just raise an event. We support complete binary payloads, and we actually take care of heterogeneity for you, so you can put binary numbers in there or whatever you want to do. When you raise the event, you have the ability to pass however much information you want along with it. So you could raise an event to the resource manager saying, "Hey, I saw something; here's a complete blob of information about it," and the resource manager can do something with that. Obviously you have to have some kind of agreement with the resource manager that (a) it's going to listen for that event and (b) it has some idea what it's going to do with it. But you also have the ability to log. One of the things people asked for was: "If I see something, or if I just want to record that I've made a certain amount of progress, I want to stick that in my job record." Every resource manager keeps a record of the job that you can go back and look at to see what happened. You can actually insert messages into that log from the application, so when you get the official job record, what the job did, how much resource it used, and so on, those messages will be there for you. So you can record them for yourself, or you can try to communicate them to the resource manager.

Now, one thing you've been very consistent about through this whole discussion is saying "we," and "the community" this and "the community" that. Who is involved in the community?

Well, Intel, obviously, through me; and then Mellanox is with us, and IBM. Those three are probably the biggest code contributors at the moment. We have a list of others: Fujitsu is involved, folks from RIST, and we have Livermore and Los Alamos from the national-lab community. And then the Slurm folks from SchedMD have been there, and the Altair PBS guys. I'm probably leaving some folks out, and I'll have to apologize to them afterwards, but there are about 12 to 15 active members at this point in time.
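Circling back to the event and logging interfaces Ralph described a moment ago: in the reference implementation, stamping a message into a log channel goes through the log API, with key-value attributes again selecting the behavior. A sketch assuming the v2-era PMIx_Log_nb call; the progress message is made up, and PMIX_LOG_SYSLOG is used here only as a stand-in, since the attribute for a job-record destination varies by resource manager and PMIx version.

    /* Sketch: log a progress note from inside the application via
     * the PMIx log interface. */
    #include <pmix.h>

    static void log_done(pmix_status_t status, void *cbdata)
    {
        /* status may be PMIX_ERR_NOT_SUPPORTED if the RM declines */
        *(volatile int *)cbdata = 1;
    }

    void log_progress(void)
    {
        pmix_info_t msg;
        volatile int done = 0;
        /* made-up message; destination key is illustrative */
        char *note = "checkpoint 3 of 10 complete";

        PMIX_INFO_LOAD(&msg, PMIX_LOG_SYSLOG, note, PMIX_STRING);
        PMIx_Log_nb(&msg, 1, NULL, 0, log_done, (void *)&done);
        while (!done) { /* wait for delivery confirmation */ }
        PMIX_INFO_DESTRUCT(&msg);
    }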
And these members, like you mentioned, the first several of them are code contributors to the open-source PMIx code base itself, but the other members are helping to shape the standard, right? It's not just about contributing code?

That's right. That's right. Everybody participates; all those people participate in the standards process, which is based on the IETF mechanism. There's always an RFC that has to be backed up by a prototype implementation, which is usually in the reference library, though it doesn't have to be; it just has to be visible. Then it goes through review and comments, and we have weekly teleconferences where we schedule and conduct reviews of those proposals, and then they get accepted, usually accepted, or rejected, into the standard. So all those groups participate in that.

Okay, so this really is quite a bit more than just Ralph's little toy. This is a full-on community, full of all the interested parties across this vertical, right?

Yeah, it really is a lot more than me, and I should give credit to everybody involved, because it's a lot of work from everybody's standpoint. Part of it, which you have to understand, is that you can create this channel by which two parties can communicate, but there's a lot of work behind the scenes that has to happen. The resource-manager guys have to agree on what they're going to provide. You have to go to the fabric people and say, "Hey, we need you to provide this endpoint-information blob," and that's work on their part to provide that information. Or you go to the file-system guys and say, "We need you to be able to pre-cache files for us, and tell us how long it's going to take for them to be retrieved," and that's work they have to do. So it's really a collaboration across all these different elements that makes it possible.

It seems very flexible. What's the strangest, or probably in this case most unexpected, use of PMIx you've seen so far?

The one that surprised me the most was a request from some people working in the cloud world, where they wanted to be able to loan resources back. They actually have workflows where they need an envelope of resources at certain points in their computation, but there are times when they don't need all of it. I had never anticipated somebody actually loaning resources back to the system and getting them back later. That, to me, was a surprise.

That last point was what I really wanted to get across: it really involves all these different parties actively collaborating. That's what's really different in my mind with PMIx versus before. It's the first time, that I'm aware of at least, that the resource-manager, fabric, file-system, and library guys, the language-library guys, are all getting on a weekly telecon, collaborating on how they're going to orchestrate this application environment.

All right, so there's a lot of open-source code here. You must have a fairly permissive license that you distribute this under, because some of these resource managers are closed-source and proprietary. What license are you using?

We use the three-clause BSD license. So it's absorbable by people using a GNU license, as well as by proprietary folks; they're all welcome to use it.

So Ralph, thanks a lot for your time. Where can people find out more about PMIx and get involved?
Probably the first place to start is GitHub; everything for PMIx is on GitHub. The reference page for PMIx itself is pmix.github.io/pmix, or you can go to the code repositories themselves; there's a group there, because we have both the implementation and the reference server, and that's at github.com/pmix.

Yeah, Ralph, thanks a lot for your time.

Thanks, Ralph.

Hey, yeah, thank you very much for having me.