Welcome to another edition of RCE. Again, this is Brock Palin. You can find us online at rce-cast.com. We're only a few weeks away from SC, and both Jeff and I will be there. We will also have limited-edition RCE t-shirts, so if you stop one of us and say you like the show, and we still have a shirt, have yourself a free t-shirt.
That's the real reason to go, right? Forget all these computers and vendors and colleagues and stuff. It's to get the t-shirts and the swag. That's the real reason to go to SC. We all know this.
Yes. Yes, it's all about the swag. But I guess if you really don't want another free t-shirt, you can just stop and say that you like the show anyway. It's always nice to hear from somebody, because this really is just two random guys doing a podcast. We get no backing from our corporate overlords at all, so this is truly a spare-time thing. So let us know if you like the show, or if you hate it and have some suggestions. So there you go. Okay, Jeff, who's our guest today?
Well, we have the primary developer of a project whose pronunciation is, from what I understand, very hotly debated. I've always said "Pad B" myself, but apparently that is wrong, and I think we'll be coached in our wrongness today by Mr. Ashley Pittman. So Ashley, could you give us a brief introduction of yourself?
Hello, Jeff. Yeah, my name is Ashley Pittman. I'm based in the UK, which probably explains the disagreement in pronunciation. I've been working on MPI and HPC, typically communications libraries, for about 10 years now, and padb is one of the things that's come out of that. So there it is: P-A-D-B rather than "Pad B". But yeah, you got me right off the bat there.
You have to forgive me, because it's a year or two of conditioning and I'm going to keep saying it the other way. Well, anyway, let's just jump right into it. What is padb?
Very difficult question.
It's a Unix command-line tool for querying and giving you information about a parallel job that you might have running on your supercomputer, which is similar to, but not the same as, a parallel debugger. It will give you things like parallel stack traces and job-wide state of parallel jobs.
But it still has things like, as you said, stack traces in there. So is the main extra feature of a parallel debugger being able to look at source while the program is running?
Um, it doesn't allow you to look at source. So there's some overlap with a parallel debugger, although it doesn't have a lot of the in-depth features that you'd expect a debugger to have. It's more for a brief overview, or for looking at the parallel side of things rather than necessarily the debugger side.
Okay. So what's the history? What was the motivation for writing padb in the first place?
Um, it was, oh, six or seven years ago now. I was working for Quadrics, and my main role there was programming the collectives. We used to write all this fancy new code and all these wonderful algorithms and send them off to people, and they'd come back and send me an email saying, you've broken barrier again. What happened nine times out of ten is that I hadn't broken barrier; they had a bug in their code, and they had 511 processes waiting in barrier and one process that wasn't. And we'd go through this process, and they'd say, well, what you need is a parallel debugger, we've got a site license.
Why didn't you fire it up? And the answer is, well, that might be fine for you sat at your desk, but it's no good for me 2,000 miles away. So padb is a lightweight way of viewing what's happening within a parallel job without the overhead of a full GUI session.
Okay, that was the next thing I was going to ask. So this is completely a command-line tool? It doesn't hook into a GUI at all? Has anybody written a GUI to go on top of it? Have you written a GUI?
Um, there have been several attempts at GUIs, and it's one of the most common questions I'm asked, if I'm honest. I don't think it's particularly well suited to a GUI. I think it has strengths elsewhere, and if a GUI were put on it, it would lose some of those strengths. In a way it would become a cheaper alternative to some of the existing parallel debuggers out there, and it would lose its unique selling point.
All right, so let me jump back a little bit. As you said, it is kind of difficult to define exactly what padb is, so let me try to give the elevator pitch, and you tell me if I've got it right. It's more of a point-in-time query kind of thing than a full-fledged debugger. It's something that can go out and say, hey, what are the stack traces right now?
Yeah, it gives you a snapshot of state.
Okay, and your chosen delivery mechanism for this really is the command line. And one of the reasons I would imagine that could be useful is that some of the other debuggers could pick up padb and use it in the background. Is that ever a use case that you would envision?
Yeah, that's a use case that's been discussed several times.
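As a rough illustration of what an aggregated snapshot of stack traces looks like, here is a minimal Python sketch (not padb's own code, which is written in Perl). It assumes one stack trace per rank has already been collected, for example by running GDB against each process, and groups ranks with identical traces under compressed rank ranges, which is what makes a single rank outside the barrier stand out. The function names and the snapshot data are hypothetical.

```python
from collections import defaultdict


def compress_ranks(ranks):
    """Render a non-empty list of ranks compactly, e.g. [0, 1, 2, 5] -> "0-2,5"."""
    ranks = sorted(ranks)
    parts = []
    start = prev = ranks[0]
    for r in ranks[1:]:
        if r == prev + 1:  # still inside a contiguous run
            prev = r
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = r
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def group_stacks(stacks):
    """Group ranks whose stack traces are identical.

    stacks maps rank -> tuple of frame names (outermost first).
    Returns [(rank_range_string, frames), ...], largest group first.
    """
    groups = defaultdict(list)
    for rank, frames in stacks.items():
        groups[tuple(frames)].append(rank)
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), min(kv[1])))
    return [(compress_ranks(ranks), frames) for frames, ranks in ordered]


if __name__ == "__main__":
    # Hypothetical snapshot: 512 ranks, 511 of them waiting in a barrier.
    snapshot = {rank: ("main", "solve", "MPI_Barrier") for rank in range(512)}
    snapshot[137] = ("main", "solve", "compute_halo")
    for ranks, frames in group_stacks(snapshot):
        print(f"[{ranks}] " + " -> ".join(frames))
```

Run as a script, this prints `[0-136,138-511] main -> solve -> MPI_Barrier` followed by `[137] main -> solve -> compute_halo`: two lines instead of 512 stack traces.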
The other way you can use it is to look at a job that may be hung or isn't doing what you expect. It gives you a very brief overview, and then you can decide whether you want to fire up a full-featured parallel debugger and look at it that way, or whether it's something you recognize and can put to one side and move on.
So in relation to a parallel debugger, do you have to compile your application with debugging symbols, or does padb never even touch that part of the code?
I try very hard, and I believe I succeed, in making padb very easy to run. There's no recompilation or relinking or anything like that required. It works purely at the process level, so there are no additional steps that you need to take to run padb.
That being said, though, let me clarify: if you did compile your application with debugging symbols, the normal backtrace functionality would give you more information, like file and line number, in addition to just the function name?
Yeah, if you compile with -g, the standard option, then padb will give you line numbers. If you don't, then it won't, and that's a limitation of GDB. There are no special compiler wrappers or libraries that padb ships with.
Okay, so it doesn't require debugging symbols; it just works at the process level. What interface is it actually talking to? MPI? You said this was for Quadrics. Now, Quadrics was hardware. Is it actually looking at the hardware, or is it looking at the libraries? And if so, does it have to be coded specifically for each library?
Um, so nothing needs coding specifically. padb will work with any parallel job.
It's not MPI-specific. There are a couple of features that only work in the presence of MPI, but it will give you stack traces for things like Global Arrays or NWChem that don't necessarily involve MPI. It gets this information either by grubbing around in /proc and collecting it back to the user, or simply from GDB, so anything that GDB can get, padb can potentially get as well. The additional caveat is that there are some MPI-specific features: there's an MPI debugger DLL, which I'm sure Jeff knows far too much about, that padb can interface with to give you some very deep and specific information about MPI in particular.
Yeah, so actually let's step back and clarify one thing. None of this is Quadrics-specific, right? You were talking about how the motivation came about while you were working at Quadrics, and that was eons ago, right?
Yes, it's an open source project.
The license is LGPL. It happened to start when I was at Quadrics, purely out of our needs at the time. Quadrics is no longer with us, but the beauty of open source is that this code has evolved and now supports a wide range of hardware and software. Well, it doesn't really support any hardware; it interacts purely with the software.
Cool. Okay, so tell us what kinds of things you get when you use padb on an MPI application.
Um, the specific feature for MPI is something called the message queues. When you send a message, it goes in the send queue, and when another process sends you a message, it goes in the receive queue. As you know, with MPI these queues are asynchronous: you've got asynchronous sends and receives, and sometimes you don't know in advance where a message is coming from. So padb can come in and look at a process and say, well, this rank is waiting to receive 12 messages, this rank is waiting to receive three or four messages, and this rank over here is waiting to receive 128 messages.
And you can also see who they're waiting for, and what tag and what communicator and stuff, right?
Yeah: the message size, the MPI datatype, where they're from, the communicator; the whole receive signature.
So this is actually really great, because the premier debuggers have offered this feature for a long, long time. Those premier debuggers are great, but they're a little expensive and out of the reach of some people, and now there's a nice open source way. So having a tool like padb is, I can say from personal experience, tremendously useful. You could say, oh, whoops,
I have a mismatched send over here, because I can see the message went to rank 38 instead of 37, and that is why my application is hung.
Yeah, and when you look at padb output, that kind of thing sticks out like a sore thumb. If you've got 127 processes with one message, one process with no messages, and one process with two messages, it becomes very, very apparent very quickly, maybe not how it came about, but how you need to go about looking further at the problem.
So if you're not actually having to write specifically to the library, where's this information coming from? This isn't something that's normally available to a user. So how are you actually getting this information out of the MPI libraries?
Um, this is a very specific and obscure feature within MPI, which is currently going through the process of becoming part of the specification. There's a debugger callback DLL that's actually shipped with the MPI library, which padb then dlopens itself, and which provides a sort of RPC mechanism into the library. So it's something that your MPI library has and supports, and has probably had for a number of years. It was originally done by TotalView many years ago, and I believe there are now three tools out there that use it, this being the open source one.
Yeah, let me explain that a little bit further, being the resident MPI wonk here. What Ashley's talking about is functionality that was developed in the 90s by the MPICH guys in conjunction with the TotalView guys. They developed this kind of add-on plugin that allows third-party tools to query, hey MPI, what's in your message queues?
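To make the "message went to rank 38 instead of 37" example concrete, here is a minimal Python sketch that matches pending sends against posted receives and reports what is left over. It is a simplification under stated assumptions: the queue records are made up (in practice they would come from the MPI library's message-queue debugging interface), ranks are assumed to be already translated to MPI_COMM_WORLD ranks, `ANY` stands in for MPI_ANY_SOURCE and MPI_ANY_TAG, and MPI's real matching rules, such as message ordering, are ignored.

```python
from dataclasses import dataclass

ANY = -1  # placeholder for MPI_ANY_SOURCE / MPI_ANY_TAG in this sketch


@dataclass(frozen=True)
class PendingSend:
    src: int   # sending rank (assumed MPI_COMM_WORLD rank)
    dst: int   # destination rank (assumed MPI_COMM_WORLD rank)
    tag: int
    comm: str  # communicator name, purely illustrative


@dataclass(frozen=True)
class PostedRecv:
    rank: int    # rank that posted the receive
    source: int  # expected sender, or ANY
    tag: int     # expected tag, or ANY
    comm: str


def matches(send, recv):
    """True if this receive could accept this send (simplified rule)."""
    return (send.dst == recv.rank
            and send.comm == recv.comm
            and recv.source in (ANY, send.src)
            and recv.tag in (ANY, send.tag))


def find_unmatched(sends, recvs):
    """Greedily pair sends with receives in list order.

    Returns (sends_with_no_receiver, receives_with_no_sender).
    """
    open_recvs = list(recvs)
    orphan_sends = []
    for send in sends:
        for i, recv in enumerate(open_recvs):
            if matches(send, recv):
                del open_recvs[i]  # safe: we stop iterating right away
                break
        else:  # loop finished without break: nothing matched this send
            orphan_sends.append(send)
    return orphan_sends, open_recvs


if __name__ == "__main__":
    # Rank 0 sent to rank 38 by mistake; rank 37 is still waiting for it.
    sends = [PendingSend(src=0, dst=38, tag=7, comm="world")]
    recvs = [PostedRecv(rank=37, source=0, tag=7, comm="world")]
    orphans, waiting = find_unmatched(sends, recvs)
    for s in orphans:
        print(f"unmatched send: {s.src} -> {s.dst} tag={s.tag} comm={s.comm}")
    for r in waiting:
        print(f"rank {r.rank} still waiting on source={r.source} tag={r.tag}")
```

The greedy pairing is only for illustration; the point is that a leftover send and a leftover receive that differ by one rank are exactly the kind of anomaly that jumps out of queue output.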
This interface isn't part of the MPI standard, at least not yet, but it is something we're working to add for MPI-3, simply because just about every MPI supports it these days. It's very ad hoc; it kind of grew up around a C header file that came out of this effort in the mid-to-late 90s. But since everybody's got it, we figure it's the right time to standardize it. So it will potentially be another chapter in the MPI-3 document, or more likely just a companion document that says: if your MPI wants to do the message queue stuff, do this. So there's my little plug for the MPI-3 Forum.
So this has no relation, then, to PMPI or any of that, because that's just for timing information? Jeff, you can correct me on this, because I've never really used it myself.
No, I think that's absolutely right. There's no relation between PMPI and padb.
I have to jump in. He's correct, it has nothing to do with PMPI, but PMPI is not just about timing; PMPI can be used for lots of things. Let's not digress over there, but I did just have to say that.
Okay, okay, maybe we can talk about that some other time.
This is a sort of extra feature that padb can use when targeting MPI jobs. Nothing in the core of padb is MPI-specific; it works just as well with jobs that don't use MPI, and it doesn't really care what it's talking to. It's just that if you are looking at an MPI job, there's some extra state that you can get that probably wasn't available to you before.
Now, that being said, I know you've worked with the various MPI implementations out there to get padb hooked in nicely. Could you give us some examples of the things that you needed from MPI implementations to make them play nice with padb? Like, what did we do for you in Open MPI, for example?
I'm going to say something really offensive now, Jeff, and say that I get confused about what I put in Open MPI and what we did in MPICH2.
Oh, fair enough. We're all the same to the outside world. Give us any example.
One of the things is when it comes to an unmatched receive on a communicator. The receive carries the rank that the message came from, but that rank is local to the communicator, not global to the job. So you need a way of reverse-mapping communicator-local ranks to global ranks, which is possible depending on your interpretation of this vague spec that's been around for 15 years or so. One interpretation of the spec makes this reverse mapping possible; another interpretation makes it impossible. So I've had long conversations with people about what the spec actually means, trying to persuade them to add what they see as new features and what I see as fixing a bug.
Does this have anything to do with intercommunicators? Or not intercommunicators, but actually creating your own communicators inside your job?
So if you create a sub-communicator, or create a communicator of the same size but reorder the ranks, then you need a way of converting from your rank in any given communicator to your rank in MPI_COMM_WORLD.
Yeah, and sorry, I keep jumping in with all these little MPI tidbits, but Ashley is saying things exactly right. You know, when you say, oh, what's that process's rank? Well, that's not accurate.
What most people really mean is, what's that process's rank in MPI_COMM_WORLD? Because you can have lots of communicators, and your rank might be different in all of them. So it's actually a little pet peeve of mine when people say, hey, what rank are you talking about? I know that what they really mean is, what rank in MPI_COMM_WORLD are you talking about? So thank you for saying it correctly, Ashley.
He's taught me well, Jeff.
Yeah, I've been chewed out by him a couple of times.
So just to clarify, I'm pretty sure you mentioned this: padb is something you run when you already have a parallel job running. If it hangs, or you're just curious where it's at, you can run padb from the command line. So it is an external tool that attaches, gets information, and detaches from the running job; it doesn't sit in between the whole time.
Yeah, that's exactly right. The other parallel debuggers out there tend to have two different modes of operation: either you run a job and then attach to it, or you start the debugger and start the job inside that debugger. That's not a mechanism that padb works with, and I don't think it maps well to the model. padb is entirely about this: there's a job running, and it will query it and tell you state about that job.
So does that mean you need to support resource managers, so that I can say connect to my job, and it can figure out where all the different MPI processes are spread across this large system?
Yes. So as I said before, padb is MPI-agnostic.
It doesn't even care if you're using MPI or not What it absolutely cares about is what resource manager that you use a sort of resource manager Scheduler is a bit of a blurred field at best Um, but yeah, you need it needs to find a way of finding out what what hosts are involved in the job What their PIDs are so what the PIDs of the the processes or ranks are Um, and some way of launching itself as a parallel job to then Do the attach So what's been the you know, Jeff was asking about what things have the MPI guys Needed to do for you or it's been the biggest pain for you What's the been the biggest pain or the biggest help you've gotten from a resource manager? um I think the biggest pain is when we saw resource managers change their output format So I can't I can't pass it automatically and find the PIDs anymore That tends to be the the common one So PADB itself, uh, how hard is it to add support for a new resource manager say I've written my own or I've got Uh, it's not one does PADB supports. How hard is it to add? um If somebody comes to me and asks me that question the major The major guys sort of turn it around on the head and say the major thing that I need to know is How does the resource manager export the information that PADB needs? Then what that is is a Basically, you need to you need to know a list of jobs. 
padb needs to know which jobs it can target, which jobs are running, and then, for any chosen job that padb is told to target, it needs to know the number of processes, or the number of ranks, and the host name and PID of every one of those processes. If that information is available from the resource manager in a format that stays constant across different versions of that resource manager, then there's a fair chance that you can get padb working in a day or so. If that information is not available, then you're looking at patching the resource manager, and that's where it gets difficult.
So you mentioned host names and PIDs. That means you must actually be going out to those hosts and attaching to those PIDs in some way or fashion. How do you actually launch your helper daemon, or whatever it is you need, out on the back-end nodes? Do you have some MPI magic for that, or do you just use rsh, ssh, or the native resource manager?
So of course, at the back end, padb is itself a parallel job, and there's a lot of parallel code within it to do what it does. To attach, there are two ways: either it can rely on the services the resource manager provides and piggyback on top of that, or it can simply layer on top of pdsh and basically just use ssh out to the nodes. The pdsh approach is nice and simple, but it's limited to the nodes it supports. For Open MPI, for example, it simply writes a host file, one host per line, and just mpiruns itself. It then uses its own runtime; it doesn't link with MPI.
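To make the resource-manager requirements concrete (per job: the number of ranks, plus each rank's host name and PID, and for Open MPI a host file with one host per line), here is a minimal Python sketch. The "rank host pid" listing format is entirely hypothetical, since every resource manager has its own, and listing each host only once in the host file is an assumption (one helper per node). The parser deliberately fails loudly on unexpected input, because a silently changed output format is exactly the pain point mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RankLocation:
    rank: int
    host: str
    pid: int


def parse_listing(text):
    """Parse a hypothetical resource-manager listing of "rank host pid" lines.

    Blank lines and '#' comments are skipped; anything else malformed raises
    ValueError, so a changed output format fails loudly instead of silently.
    """
    locations = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 'rank host pid', got {line!r}")
        rank, host, pid = fields
        locations.append(RankLocation(int(rank), host, int(pid)))
    locations.sort(key=lambda loc: loc.rank)
    ranks = [loc.rank for loc in locations]
    if ranks != list(range(len(ranks))):
        raise ValueError(f"ranks are not 0..n-1: {ranks}")
    return locations


def hostfile(locations):
    """One host per line, each host listed once, in order of first appearance."""
    hosts = dict.fromkeys(loc.host for loc in locations)
    return "".join(f"{host}\n" for host in hosts)


if __name__ == "__main__":
    listing = """
    # rank host pid
    0 node01 4711
    1 node01 4712
    2 node02 5120
    3 node02 5121
    """
    print(hostfile(parse_listing(listing)), end="")
```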
padb has its own parallel runtime and communications library that it uses.
Right, okay. So you're just taking advantage of the fact that we can launch non-MPI executables, as a convenient way to get out to the back-end nodes.
Yeah, yeah, simple as that. And what it does there depends very much on the resource manager. For Slurm it sruns itself, for MPD it mpdruns itself, for Open MPI it orteruns itself, but for PBS it just writes the host file and uses pdsh.
So is that the complete list of supported MPIs and resource managers, or is there a pretty long list for this?
Um, there's a pretty long list, actually, and some of them are better supported than others. The ones I didn't mention there: there's RMS, which is the Quadrics one and is not so widely used anymore; Hydra, which is the new MPICH2 one; and then there's PBS and LSF, which are based on submissions that I've had rather than on running them myself. I happen to know that they've been submitted and that they work for a number of customers, so I believe they work, but they're harder for me to support.
Now, we've been talking all about the MPI aspects of padb here, and an obvious question that comes up, because I know padb is used in some of the US DOE labs, is: how scalable is it? There's a lot of talk these days about making our tools work at hundreds, thousands, tens of thousands of processes. What's the biggest you've heard of padb being used?
Um, so I would say padb is a sort of typical open source project: I sit at home working on it. But the major difference between this and the standard open source project model is that, because of the Quadrics history, at one time it had maybe 10 users, but all of those users were at the top 20 or top 25 supercomputers in the world. So it really did start off at scale, and it's probably more widely used on the very large systems than on the smaller ones. I've never used it above 4,000 cores myself, but I know of people using it at two or three times that, and I've had positive reports of it running at that scale.
So you mentioned that padb doesn't really care about MPI: it does some special things with MPI, but it can be used with anything. Have you seen any really strange or unique use of padb, something that was kind of unintentional but definitely works with the model it works under?
Um, so my interest is HPC, and the model it works under is where you have a defined parallel job with ranks nought to n minus one within that job. I don't actually have a supercomputer in my basement, so when I'm developing it at home, what I tend to do is run it against Firefox, which is a multi-threaded program. I just query Firefox and say, well, Firefox has got seven threads; pretend each one of them is a rank in a job. It comes back and shows me the stack traces of Firefox, neatly aggregated across threads rather than across processes, and it works really well in that model as well.
Okay, I never would have thought about using it to just look at all the threads of something. So you can look at both processes and threads.
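For the Firefox trick, basic per-thread information is easy to get on Linux from /proc. The Python sketch below is Linux-only and purely illustrative (padb itself gets stack traces via GDB): it lists a process's threads from /proc/<pid>/task and numbers them 0 to n-1 in thread-ID order so they can be treated like ranks. The helper names are hypothetical.

```python
import os


def thread_states(pid):
    """Return {tid: (thread_name, state_letter)} for a Linux process.

    Each /proc/<pid>/task/<tid>/stat line looks like "tid (name) S ...".
    The name may contain spaces or ')', so we split on the *last* ')'.
    """
    task_dir = f"/proc/{pid}/task"
    result = {}
    for entry in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, entry, "stat")) as f:
                stat = f.read()
        except (FileNotFoundError, ProcessLookupError):
            continue  # the thread exited between listdir() and open()
        close = stat.rindex(")")
        name = stat[stat.index("(") + 1:close]
        state = stat[close + 2]  # e.g. R (running), S (sleeping), D (disk wait)
        result[int(entry)] = (name, state)
    return result


def threads_as_ranks(pid):
    """Number threads 0..n-1 in TID order so they can be treated like ranks."""
    states = thread_states(pid)
    return {rank: (tid, *states[tid]) for rank, tid in enumerate(sorted(states))}


if __name__ == "__main__":
    for rank, (tid, name, state) in threads_as_ranks(os.getpid()).items():
        print(f"rank {rank}: tid={tid} name={name} state={state}")
```

On Linux the main thread's ID equals the process ID, so a single-threaded program shows up as one "rank" whose TID is its own PID.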
So does padb then work with mixed-mode MPI plus threads?
Um, yes, it does, actually. This is a patch that someone sent to me earlier. If each process in your parallel job has multiple threads, it will come back and give you (this is only in the stack trace mode) the stack trace for each thread individually, and it will do the aggregation based on the thread ID rather than the rank. So it does the sensible thing in most cases.
So what language is padb written in?
Um, it's written in Perl, with a small C component to query the MPI message queues.
And a question that I love to ask developers, for no apparent reason, is: what do you use for version control, and why?
Um, I know you always ask this, and I wanted to answer it. I use Subversion, for two reasons: partly because it's better known and more widely used than Mercurial, and partly because I started padb before I learned about Mercurial. What I think is probably more interesting to this conversation, actually, is that most of the development I don't do at home, rather than doing it on clusters, I do on the Amazon compute cloud. I fire up 14 or 16 virtual instances, run a parallel job in the compute cloud and debug that, then shut it all down and just get a monthly bill from Amazon at the end of it. For me, it certainly works out to be a very good way of working.
So I hate to keep coming back to MPI, but sorry, this is just what I know and work on. I know that some of the guys at Snoracle, which would be Sun plus Oracle, worked with you for a while to get padb to do something special in our nightly regression testing in the Open MPI project.
I wonder if you could describe that a little bit.
Yes. So one of the beauties of being command-line and just a snapshot is that you don't actually need to be a human to run it. You can automate it, run it, and save the results away for offline analysis. A fine example of this is the Open MPI test suite. Some of the sites that do Open MPI testing have this hooked up, so when an MPI test hits its timeout, before it gets killed, padb is run against that job, and its output is saved to a file somewhere and then published on the web for all to see. Then the job gets killed and the next test goes on, obviously, and the developer can come back later, look at it offline, and see what was actually happening in the job to cause it to hang.
Does that mean there's, like, a kitchen-sink option with padb? I can just say give me every piece of information: give me where they were stuck in MPI, give me their stacks, give me... yeah, I don't know.
Yeah, there are some very detailed options, actually, that you can give padb.
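The automated-testing setup described here (run the test, and if it hits its timeout, take a snapshot of the job before killing it) boils down to a few lines. Here is a minimal Python sketch; `snapshot_cmd` is a hypothetical hook that should return whatever diagnostic command you would run against the hung job (padb's actual command-line options are not shown), and killing only the direct child is a simplification, since a real MPI launcher may need its whole process tree cleaned up.

```python
import subprocess
from pathlib import Path


def run_with_snapshot(test_cmd, timeout_s, snapshot_cmd, report_path):
    """Run test_cmd; if it outlives timeout_s, save a diagnostic snapshot, then kill it.

    snapshot_cmd(pid) -> argv list for your diagnostic tool (hypothetical hook).
    Returns the test's exit status, or None if it had to be killed.
    """
    proc = subprocess.Popen(test_cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        try:
            # Take the snapshot *before* killing, while the hung state still exists.
            snap = subprocess.run(snapshot_cmd(proc.pid), capture_output=True,
                                  text=True, timeout=120)
            Path(report_path).write_text(snap.stdout + snap.stderr)
        finally:
            proc.kill()
            proc.wait()  # reap the child so it doesn't linger as a zombie
        return None
```

Keeping the snapshot step as a plain function of the PID means the same harness works whether the diagnostic tool is padb, a GDB batch script, or anything else that can be pointed at a running job.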
You can get it to generate a tremendous amount of output if you want, or very, very detailed output, but the one that people really need to get started with, and the one for this, is the full-report option. What that does is say: give me all the information that's likely to be useful. That's what I recommend, and that's what people normally run in automated testing, or even if you're doing remote support, like I was when working for a network company. You say, run this in full-report mode and email the output, and the person looking at it has all the information that's likely to be useful to them right there.
So what are some new features that are coming up, stuff that you're working on that you haven't released yet?
There are a couple of things. I'd like to do more with the collectives. It does tree-based output, where it looks at the ranks and condenses, or reduces, the hundreds of stack traces you get into a very small number, and this shows you where they diverge. It currently does that on the rank within MPI_COMM_WORLD, and I'd like to do it in terms of the rank within a communicator, which would help with problems in the subset of jobs that use custom communicators. There are also a couple of usability points I'd like to address. It's actually quite good as a sort of job-wide top: you can run it in a repeat mode and get it to show you all the ranks, sorted by CPU usage or memory usage, things like that. There's a better interface coming for that.
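The tree-based output can be pictured as a prefix tree over stack frames: ranks share a path until their stacks diverge, so hundreds of traces collapse into a handful of branches. Here is a minimal Python sketch of that idea, assuming each rank's stack is a tuple of frame names (outermost first) keyed by MPI_COMM_WORLD rank; it is one plausible way to build such a view, not padb's actual algorithm.

```python
def build_tree(stacks):
    """Merge per-rank stack traces into a prefix tree.

    stacks maps rank -> tuple of frame names, outermost frame first.
    Each node is {"ranks": set of ranks, "children": {frame: node}}.
    """
    root = {"ranks": set(), "children": {}}
    for rank, frames in stacks.items():
        node = root
        node["ranks"].add(rank)
        for frame in frames:
            node = node["children"].setdefault(frame, {"ranks": set(), "children": {}})
            node["ranks"].add(rank)
    return root


def render(node, depth=0):
    """Return one indented line per frame, with how many ranks reached it."""
    lines = []
    for frame, child in sorted(node["children"].items()):
        lines.append(f"{'  ' * depth}{frame} ({len(child['ranks'])})")
        lines.extend(render(child, depth + 1))
    return lines


if __name__ == "__main__":
    stacks = {
        0: ("main", "solve", "MPI_Allreduce"),
        1: ("main", "solve", "MPI_Allreduce"),
        2: ("main", "solve", "MPI_Recv"),
        3: ("main", "write_output"),
    }
    print("\n".join(render(build_tree(stacks))))
```

Here the four ranks agree on `main`, split into `solve` (3 ranks) and `write_output` (1), and inside `solve` one rank sits in `MPI_Recv` while the other two wait in `MPI_Allreduce`. Keying the tree by communicator rank instead, as Ashley would like, would only change which rank numbers go into the `stacks` dictionary.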
What really interests me, actually, is automating more of the debugging process: having padb know what the common types of deadlock and the common types of problem are, so it can go out and look for anomalies in the job and come back and say, well, almost all your processes are doing this, but there are a few odd ones over here that are behaving differently. If padb can automate that and take a lot of the legwork out of it, then I think that's where the future lies.
So who's actually implementing this? What does the padb community look like?
Um, well, initially it was just me at Quadrics. Then I left Quadrics a few years ago, and they're not with us anymore, and it got put to one side for two or three years. I picked it up again probably 18 months ago and ran with it almost on my own for a long time, getting back in touch with the old users, most of whom, I was delighted to hear, are still using it and were glad to hear that it's been ported to the other systems they've got. A fair few vendors have got in touch now and improved the support in there for their specific resource managers. Like Jeff mentioned, Sun got in touch, and there's actually a Solaris port for it as well. And then there are a few tier-one sites that have gone beyond just making sure it works on their kit and actually come up with new features that everybody can benefit from.
So what's the website, and is there any contact information, like a mailing list, where you can download padb or maybe submit patches?
Um, yeah, the main website is padb.pittman.org.uk. It's an open source project hosted on Google Code, so you can get the source code from there, and there are two mailing lists, a developer list and a user list, or you can simply email me directly:
ashley@pittman.co.uk.
Okay, Ashley, thank you very much for your time.
Yeah, thanks, Ashley. This was great.
Thank you very much.