I've lost my page. Your name is Rand. Rand, it's right there. Rand Housseau, who is a senior research software developer at GS Science... GNS Science. And I'll stop embarrassing myself. Take it away.

Thank you. Good afternoon, everyone. If you're interested in my background, credentials or anything, you can find me on LinkedIn, places like that. So let's get right into it: Message Passing Interface and inversion of control. What is it? First of all, MPI is used on clusters, supercomputer clusters, or even on your desktop. It's how an application communicates with other instances of itself, and it's pretty much the de facto standard. If you're writing software to run on clusters or on your own multi-core machine, you're probably going to need MPI. There are other ways, but this is not a shared-memory thing like OpenMP is; this is separate instances running with their own memory. IoC is inversion of control. It's the idea that I'm waiting for an event to happen, and I'll process it when it happens. I'm not really in control, the framework is. And you're already doing this with your UIs and that sort of thing.

I'm a senior research software engineer, and that is, in fact, the same shirt that I'm wearing. But you see, I do like my monitors. That photo was taken back in 2009. Seismic data is what I deal with; it's what I'm hired to process. So what is seismic data? If I give you a little view here, would you understand it? To me, it doesn't mean a whole lot, because my side is software, not the geophysical side of it. But apparently that's layers of different densities, different velocities in the ground, and that's how the collected seismic data looks; that's one of the displays that the guys use. But I don't really... sorry, Ryan, can you?

The seismic data is either collected on land or it's collected out in the seas, out in the ocean. There are these specially designed big ships that go along at about five knots, and they have air guns that go off in a sequence of some sort. These air guns go off, and the sound from that goes down into the layers of the earth underneath the ocean floor and comes back up. Some of it, of course, bounces right off the ocean floor, and some of it travels directly over to the streamers that have the sensors on them. There are different velocities for each of these layers, and the hydrophones that are receiving the data are recording typically at about a four millisecond sample rate; for how many seconds depends on what type of survey they're doing. So this ship is pulling a number of these cables, and they'll usually go back and forth over the area they're covering, and sometimes they'll go at angles to get larger azimuths, more directions, to try and resolve things a little better.

So how much data are we talking here? Well, with the most recent sensor recordings, if you want to work this stuff out, it's pretty intense. If you're doing a 16 second recording at four milliseconds, that's a lot of data. These up-to-12-kilometre streamers have these sensors, these hydrophones, every 12.5 metres. So if you have 12 towed streamers that are 37 metres apart, or some such, and the ship is going along at about five knots, and there's an air-gun shot every 16 seconds or so, 16 to 20 seconds, and you're recording that much data, and you're covering an area of, say, 6,000 square kilometres, which is a medium-to-large survey, how much data are we talking?
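To get a feel for the arithmetic, here is a rough back-of-envelope sketch in Python. The recording and geometry figures are the ones just mentioned; the 4-byte sample size and the 400-metre sail-line spacing are assumptions of mine, so treat the result as an order-of-magnitude estimate only.

```python
# Rough order-of-magnitude estimate of marine seismic data volume.
# Parameters are from the talk; bytes-per-sample and sail-line spacing are assumptions.

sample_interval  = 0.004          # seconds (4 ms sampling)
record_length    = 16.0           # seconds per trace
bytes_per_sample = 4              # assume 32-bit samples

samples_per_trace = int(record_length / sample_interval)            # 4000
bytes_per_trace   = samples_per_trace * bytes_per_sample            # ~16 kB

streamers          = 12
streamer_length    = 12_000.0     # metres
hydrophone_spacing = 12.5         # metres
traces_per_shot = streamers * int(streamer_length / hydrophone_spacing)  # 11,520

shot_interval = 18.0              # seconds between shots (16-20 s)
vessel_speed  = 5 * 0.514         # 5 knots in m/s
shot_spacing  = vessel_speed * shot_interval                        # ~46 m along the line

survey_area  = 6_000e6            # 6,000 km^2 in m^2
line_spacing = 400.0              # metres between sail lines (assumed)
sail_line_m  = survey_area / line_spacing                           # ~15,000 km of sail line
total_shots  = sail_line_m / shot_spacing                           # ~325,000 shots

total_bytes = total_shots * traces_per_shot * bytes_per_trace
print(f"traces per shot : {traces_per_shot}")
print(f"per-shot volume : {traces_per_shot * bytes_per_trace / 1e6:.0f} MB")
print(f"total raw volume: {total_bytes / 1e12:.0f} TB")   # on the order of tens of TB
```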
Well, 20 terabytes is a relatively small collection; 20 terabytes up to, say, a petabyte of data. A fair amount of data, yeah. And the algorithms that we use for processing this, well, here are a few of them that I've written: 2D and 3D Kirchhoff time migration, pre-stack depth migration, SRME, 5D interpolation, and most recently broadband. Now, I'm not the one writing the whole of the application. There's another guy I work with, and he spends his time doing all the research, looking in the papers, finding out what the best algorithm is, what really needs to be done to do this tomography, to resolve this 3D image, say. He does that part, and then I write the application that does all of the other things: the communication, getting the data in and out of his algorithm, and saving it.

One of the most recent ones I did was this 3D SRME application. The requirement I was given was 500 million traces. A trace is that recording, 10, 15, 16 seconds' worth: what a single receiver records from one source over that period of time. So there are millions of these. This particular survey only came out to five terabytes, or at least that's the subset I was using. Running on 100 cores in one week was the requirement, so this is a little bit longer an execution than most of what you're probably doing. Our final throughput was 12 traces per second per node, which is something like five times faster than the requirement we were given.

So the algorithms that I've written, the programs that I've made, are running on the world's largest supercomputers. One company called and asked us whether one of these things would have a problem running on, say, 85 or 86,000 cores. Is it gonna have a problem? No, it won't. Apparently they had some software from somebody else that wasn't capable of doing that, but this stuff is. If you take the Top500 list of the world's largest supercomputers and you remove all of the government sites and the universities, what you have left, for the top 100 or so, is petroleum exploration. These guys are serious about the number crunching, serious about running these applications, because they're looking for oil. The software can also be used for other things like carbon sequestration and fault finding, but these are the big players, these are the guys with money. Enough money that they can buy their own rather large machine; I mean, look at the figures on this: a 2.2 petaflop machine, 2,200 trillion calculations per second, 23 petabytes of disk space. It's a little larger than your typical work environment computer. And this one, I don't believe it even made the top 100; it's not on the top 100 right now. This is small.

So, on to some MPI. MPI is a pain, all right? Has anybody here ever used MPI? Come on, it's a pain, right? Do you agree? It's a pain. And there are a lot of people trying to find better solutions to it, and hopefully I'll convince you that this is a good solution. mpiexec -n 8 hostname: this right here is how you would run the hostname application on eight nodes. So it's one application that is being spread onto all these nodes and run. Now, in the case of hostname, it doesn't have to communicate with itself. This other one right here is 86,000 copies of Stitch, which is one of the applications that I've done, plus some extra parameters for it.
Usually these things are submitted into queues; you don't have anything to do with it. You submit the job to a queue, you say how many processes you want, how many cores and whatnot, and then a queueing system like Torque, PBS, or SGE handles it.

MPI for Python: mpi4py is the one that I've used. It's in fact the only one I've used. Let me throw in here that most of the work I'm doing is in C, and Python is like my second language, so I just want to be clear on that. But there are alternatives to mpi4py. A very simple MPI program for Python just gets the name, rank, and size. Name is obvious: it's what the machine is called. Rank is MPI's terminology for which number am I? Of those 86,000, which number am I? Because sometimes the program needs to figure out what to do based on its number. And size is how many of them there are overall. Very simple program there.

Then the send function, because I want to send a message of some sort from one node to another. Remember, this is one application running, so the application has to send information to other copies of itself, and that makes the coordination really difficult. I mean, imagine writing a program that's gonna run on a whole lot of nodes and has to communicate with itself. In this argument list right here, you'll notice there's a tag. The tag is a number that you decide on: I'm going to send this type of information to a particular node. The destination there is the node it's going to. The receive has the similar parts: I'm going to receive this type of information from that source. Now, the source can also be a wildcard, minus one, which means I'll receive from anyone.

I'm not gonna show you all of these, but I thought I'd put up one page of what would be 13 pages of MPI functions in C. This should frighten anybody off doing MPI the hard way, because, good grief: MPI_Abort, well, that makes sense, right, I can abort the application; but MPI_Start, that's an odd one. We don't have to deal with that one in Python. And this is only one page of 13. There's a whole lot of functions that have been made just to try and get around the problem of: how do I take this job, split it out to run on all this hardware, and get my result back? Send and receive in Python, of course, are a whole lot easier, but there's still that tag and the destination or the source. The comm there is a communicator; it's probably not worth getting into communicators right now.

So, message passing. If node M, whatever number that is, wants to send a message to another node, N in this case, then N receives a message from M of type tag. So imagine the receive side: I'm gonna receive a message of this type, from anyone or from that number. I can only receive this type of information from that node, or just from anyone. And then there's the problem of the rendezvous: if M is going to send a message to N, I get to the point in the code where this node is now sending a message over to that one, and I have to wait, because I can't reuse that buffer space; MPI won't let me until that one gets to the point where it can receive a message of that type from me, or from anyone. So in this case M sits idle. You can flip it around and say M is now waiting for a message of that type from node N, and there's again a delay where one of the nodes is not actually doing anything, because you have to wait for everything to come together.
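A minimal sketch of that "very simple program" with mpi4py looks like this (the print wording is mine); it gets launched the same way as before, with something like mpiexec -n 8 python hello.py:

```python
# Minimal mpi4py program: report this process's name, rank, and size.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()              # which number am I?
size = comm.Get_size()              # how many of us are there overall?
name = MPI.Get_processor_name()     # what is this machine called?

print(f"{name}: rank {rank} of {size}")
```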
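And point-to-point send and receive with a tag, including the receive-from-anyone case and the blocking behaviour just described, looks roughly like this; the tag value and the payload are made up for illustration:

```python
# Point-to-point messaging with a tag. comm.recv blocks until a matching
# message arrives; for large messages comm.send may also block until the
# receiver posts a matching receive -- the rendezvous described above.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

TAG_TRACES = 7   # arbitrary tag chosen by the application (illustrative)

if rank == 0:
    payload = {"shot": 42, "samples": [0.1, 0.2, 0.3]}   # made-up data
    comm.send(payload, dest=1, tag=TAG_TRACES)
elif rank == 1:
    # source can be a specific rank, or ANY_SOURCE to receive from anyone
    data = comm.recv(source=MPI.ANY_SOURCE, tag=TAG_TRACES)
    print(f"rank 1 received {data}")
```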
And if you're simply trying, as most people do, to take your work and divide it up into a whole bunch of pieces, I'm working on this part of the data, you're working on that part of the data, and we don't even have to communicate, then you have the problem that you're sharing this cluster with how many other people? And it could be that one of those nodes is going to keep running for quite a bit longer and isn't gonna finish. The bottom two nodes are already done, but the application isn't done until the last one is done.

Because of these problems there are alternatives to MPI, a number of others I won't get into. UPC looks kinda neat, but I've never done any programming in any of those, so don't ask questions about them. But there are a bunch of alternatives that people have come up with, because I think they're doing it all wrong. All right, and that's where IoC comes in. You all know what IoC is: don't call us, we'll call you. I've registered a delegate so that when an event happens, I do this processing. It's just like your browser, just like your file manager, whatever: I click on that button and it does something. Why not map that idea onto your MPI?

So the combination of the two, the MPI and IoC solution, allows me, and I'll have to explain a little of this, to send a message to any node at any time without any prior coordination, because the application is just waiting for events. When the application is running, the first thing it does is determine which number am I, and if I'm rank zero, determine the size and type of the problem. This is how I'm actually doing my applications. I register methods so that if this message comes in, I'm gonna execute this function: give me the data here, in this function. Standard IoC, standard delegates. And rank zero is then also in charge of terminating when it's done.

The framework that I've written, when it starts up, gets the information about this node. What is this node? What are its capabilities? What's its name? How many cores has it got? How much memory? And it sends all that to the control node for its use in determining how to break up the problem. And in this framework, a little tweak up front lets me add the additional flexibility, and that is a control tag. A node is going to send a message of type tag to another node, and the first thing the layer does is say, well, I'm gonna send this control message first, and then I'll send the data. That control message tells the receiver to switch over to receiving that tag from that node. So on the receiving side, I get this control tag, I switch over to receiving just from the one that sent me the control tag, and I get the data. And that's the key, right? That's the key. It allows me to send any message to any node at any time without prior coordination, because everything is sitting in a wait state waiting for events to happen. If I get event one, I go and do this; if I get event two, I go and do that. This is actually implemented here, and there's example code I can show you where, if I get a text message, I append it to the view; if I get a number message, I append it to the view; if I get a UI click, I send that type of message. Think of it as exactly the same thing. It's IoC on MPI. I know it's ugly, come on.
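The control-tag handshake just described can be sketched in plain mpi4py roughly as below. This is my illustration of the mechanism rather than the framework's actual code; CONTROL_TAG, register, send_message, and event_loop are hypothetical names.

```python
# Sketch of the control-tag idea: every node waits on one well-known control
# tag, learns the real tag and the sender from the control message, then
# receives the data and dispatches it to the handler registered for that tag.
from mpi4py import MPI

comm = MPI.COMM_WORLD
CONTROL_TAG = 0                          # hypothetical well-known tag

handlers = {}                            # tag -> handler function

def register(tag, func):
    handlers[tag] = func

def send_message(dest, tag, payload):
    # Announce what is coming, then send the data itself.
    comm.send({"tag": tag, "source": comm.Get_rank()}, dest=dest, tag=CONTROL_TAG)
    comm.send(payload, dest=dest, tag=tag)

def event_loop():
    # Sit in a wait state; any node can message us at any time.
    while True:
        ctrl = comm.recv(source=MPI.ANY_SOURCE, tag=CONTROL_TAG)
        data = comm.recv(source=ctrl["source"], tag=ctrl["tag"])
        if handlers[ctrl["tag"]](data):  # a handler returning True ends the loop
            break
```

Rank zero would register its own handlers the same way, kick things off by sending the first messages, and eventually send whatever shut-down message terminates everyone.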
And the code. Okay, first of all, I'm not good at naming things, right? And guys usually come in after me and rename things and, you know, take care of all that. I use tabs, right? So they put spaces in and take care of all that stuff. But anyway, I create this object and then I can register all these classes: if I get one of these things in, I do that. And then when I call start, that's where the framework takes over. The framework is in charge, it handles all the communications for me, and the special rank zero, the first node, is the one that is given a message saying, here, start the whole thing running. So it gets the ball rolling.

The pros and cons: sitting in a wait state just seems counterintuitive. I'm not doing anything, I'm just spinning my wheels, right? I'm sitting in a wait state, I get a message in, I process it, and I send the result back out. There's also the overhead of sending this little control message up front. But in exchange you get greatly reduced maintenance. Here's an example from the Kirchhoff 3D time migration. I was taking the whole region of the survey we were doing and breaking it down by how many cores I'd got, you know, a couple of thousand cores: I'd divide it into a couple of thousand different pieces and send those out to be worked on, you're working on this section, you're working on that section, and they come and ask for data, ask for the traces, and I give them the data. But because you need an overlap area, it became extremely inefficient. Each area kind of affects the one next to it, so the smaller you make these pieces, the more overlap you have and the worse that gets. So I did a refactoring: I changed the entire application so that everyone is working on the same region. That way it's only bound by how much memory you have, and your nodes have already told me that. So it's a much larger region, and then I can merge all the images at the end. Because I had built it this way, the refactoring took me almost a day. Now admittedly, this is all number crunching stuff, right? So your mileage may vary with what you're doing, but there you go.

You can go out to GitHub and see the source code that I'll show you now, and the presentation is out there too, these slides. So, I've already tried this once. How many nodes should I run on? Let's just do three. Simple. Here is this little application. This is rank two here, and I'm gonna take a number, two, and send it over to one. Let's just send this number over to one. Now let's send a text message, one of my favorites. I'll send this to everyone, everyone else actually. I can broadcast this number to everyone, I can send to zero. I didn't send it to myself. So what you're seeing here is the similarity between registering your delegate, or your method, so that when that comes in, I run this function; you're seeing the similarity between the UI version of inversion of control and the MPI version.
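Something like that demo could be wired up as below, reusing the register, send_message, and event_loop helpers from the earlier sketch. Again, this is a guess at the shape of it, not the code from the repository; the tags and handlers are made up.

```python
# Demo flow, sketched with the hypothetical helpers above: rank two sends the
# number 2 to rank one and a text message to everyone else; the handlers just
# print ("append to the view") whatever arrives.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

TAG_NUMBER, TAG_TEXT = 10, 11                 # made-up tags

def on_number(value):
    print(f"rank {rank} got number {value}")
    return False                              # keep waiting for more events

def on_text(text):
    print(f"rank {rank} got text {text!r}")
    return False

register(TAG_NUMBER, on_number)
register(TAG_TEXT, on_text)

if rank == 2:
    send_message(1, TAG_NUMBER, 2)            # send the number 2 over to rank one
    for dest in range(size):
        if dest != rank:                      # everyone else; I didn't send it to myself
            send_message(dest, TAG_TEXT, "hello from rank 2")

event_loop()                                  # everyone sits waiting for events
```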
Alrighty, I think that's about everything I have to show. That is my talk, thank you.

Yep, thank you very much, Rand. We've got about five minutes for questions. I'll play the running-around game. Do you have an idea of the amount of overhead for the control messages you talked about, what sort of factor it is? Well, I can't speak for the Python overhead, but in the C code... and the Python code here is actually very, very small, it's less than 100 lines. Yes, but in terms of bandwidth, how much overhead? Like I said, I can't speak for the Python, but I can say that on the C code, and if you've been to, like, the Cython presentation, you know they say you can make your Python almost as fast as the C code, well, in the C code, I'm told that these applications I've written are the industry leaders; they are as fast or faster than anything else anyone is doing. So I would say that for the type of thing we're doing, with all the optimization, cutting down to sending as little as possible, the control-message overhead is not as important as other things. Yes, we do tell people that you need a gigabit backplane if you're gonna be doing more than a couple of thousand cores; bandwidth is important.

How do you solve the balancing problem? Does node zero keep track of who is occupied or not? You're right that, unfortunately, node zero sits doing almost nothing during this. It has determined how big the nodes are: if I get four responses that all have the same name, then I look and see how much memory they have to divide up, and whatnot, so that I'm not overloading any of them. But aside from doing that, and maybe sending out some work, or telling this one over here, you're doing the reading and the buffering for these nodes, aside from that, node zero, rank zero, doesn't do a whole lot. But think of it from the worker nodes, and those are most of them, 99% of them are actually doing the number crunching themselves. And while one of those is doing work, it's flat out, it's at 100%. By the way, don't trust the monitors, because when you're doing MPI it shows flat out 100% anyway, even if it's not doing anything. But as it's working, as it's got work to do, it's got the processor going. And when it's done, it says, give me more work. So more work comes in, it does that work, it gets more work. If it gets a, you know, give me your image, then it gives you the image. If it gets a shut down, then it shuts down. But because it's only responding to these messages, it basically sits there flat out processing, except for the time spent on communication: give me the next bit of work. So if I'm sharing a cluster with somebody else and a couple of these nodes are running really slowly, it doesn't matter, because the others are doing the work.

Great talk, thanks, a really interesting idea. My question is: do I have to refactor my whole existing MPI model? And if so, can you do it for me? Is it appropriate for the type of task that you have? Are you doing a number crunching thing? Yes. Then you help me, and we make this a public domain thing; it's already out on GitHub, you know? And think about it: you've already got the core algorithm, the number crunching that you're doing, right? Yes. So turn it into a state machine. If I get this thing right here, I do this function. You've already got that function, you've given it the parameters, and it's doing its work, right? Turn that into being called from a state. So it shouldn't be too hard. All right. It shouldn't be too hard. The code is terribly simple. Now the C code, I can't give out all the secrets, right? The C code is 2,500 lines; it's doing a little bit more, but this'll get everyone started.

Any further questions? Come on, got time to kill? Everyone understood everything, we're all good, we're ready to write it. Cool. Thank you very much, man. That was awesome.
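To illustrate the "turn it into a state machine" advice from the questions: with the hypothetical register, send_message, and event_loop helpers from the earlier sketch, wrapping an existing number-crunching function as a worker handler might look something like this. The kernel, tags, and messages are all made up.

```python
# Hypothetical worker: the existing crunch function becomes a handler that is
# called whenever a work message arrives, sends its result back to rank zero,
# and asks for the next piece of work.
TAG_WORK, TAG_RESULT, TAG_MORE, TAG_SHUTDOWN = 1, 2, 3, 4   # made-up tags

def existing_kernel(traces):
    ...                                        # the algorithm you already have

def on_work(traces):
    image = existing_kernel(traces)            # do the number crunching
    send_message(0, TAG_RESULT, image)         # hand the result back to rank zero
    send_message(0, TAG_MORE, "more, please")  # ask rank zero for the next piece
    return False                               # keep waiting for events

def on_shutdown(_):
    return True                                # returning True ends the event loop

register(TAG_WORK, on_work)
register(TAG_SHUTDOWN, on_shutdown)
event_loop()
```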