Welcome to another edition of RCE. This is Brock Palen. You can find the entire collection of all of our back episodes on rce-cast.com. There's an RSS link there as well as a link into the iTunes library. I have with me again Jeff Squyres from Cisco Systems, one of the authors of Open MPI. Jeff, thanks again for taking some time out.

Yeah, and I would just like to point out that all of our shows are wardrobe-malfunction-free and misbehaving-singer-free as well.

Well, it kind of helps that we're audio-only and we don't even try to sing.

Yeah, I think that would be pretty ugly, actually.

So there's other information off the website. You can find a link to Jeff's blog as well as my new blog, and our Twitter feeds and all the usual stuff on there. I also throw out some questions and stuff about the podcast; you guys can give us any questions you'd like to have included about upcoming stuff. You can follow me at BrockPalen, all one word, and you can find Jeff off of mine, actually.

Yeah, there you go. Good enough. Feedback is always appreciated, and we always take suggestions for new topics for podcasts and things like that. And by the way, Brock, you've got to say the name of your blog, because it is just kind of funny.

Oh, Failure as a Service.

Failure as a Service, yeah. I've got to hat-tip Matthew Britt, who came up with that one and was nice enough to actually let me use the name. He's a great guy I work with here at U of M. And that name was just like, wow, can I use that?

All right, well, let's go ahead and get on with today. Brock, who do we have today?

Who we have today are two people, both at Argonne in the Mathematics and Computer Science Division: Rob Latham and Rajeev Thakur. We'll give them an opportunity to introduce themselves, but they're here talking about ROMIO. So Rob, why don't you go ahead and give us a little rundown.

Hi, I'm Rob Latham. I've been at Argonne for 10 years.
I work on the ROMIO library and some other I/O libraries, and I often work with applications to help them use this stuff effectively.

I'm Rajeev Thakur. I'm a senior computer scientist at Argonne. I've been here almost 17 years now, and I work on MPI, MPICH, parallel I/O, and so forth. I was the original author of ROMIO, which is what we're going to talk about today.

Yeah, so this stuff is obviously near and dear to my heart. I know these guys. Rajeev I see all the time at the MPI Forum meetings, and as of this recording at least, we'll see each other in about a month at the next Forum meeting in Chicago.

Yeah, if I could interject a little something here: this topic, MPI-IO and ROMIO, is something that a lot of users don't know exists. If you're a user or a writer of MPI applications, this is something to be aware of: it is actually possible to do parallel I/O, and it can make your life a lot simpler.

So actually, let's roll right into this. Can one of you give us a rundown of what the motivation behind MPI-IO was, and what it is?

So MPI-IO is an interface for parallel file I/O from MPI programs, and the motivation behind it was similar to the motivation behind MPI in the first place. Before MPI-IO there wasn't a single portable way of doing parallel I/O from a program. MPI-1 had just been released, so it made sense to think about whether something like that could be done for I/O. Marc Snir from IBM Research first came up with this idea, I think in '94, to explore the use of an MPI-like interface for I/O.

So I do have to ask, where did the name ROMIO come from?

Yeah, many people have asked me about it. That came much later, because I was implementing MPI-IO, and before I could release it, it needed a name as such, and I couldn't come up with a good name. The MIO came from MPI-IO,
and the R potentially from my name, and all that was needed to complete a word in some way. So it doesn't mean anything, but it has "MIO," which looks like MPI-IO.

Gotcha. So we kind of avoided the question till now, but what is ROMIO?

So ROMIO is an implementation of the MPI-IO interface. MPI-IO is now an official part of MPI, and has been since '97, so it's a good 15 years. It's an implementation of that one chapter, and it's a portable implementation: it works on many file systems, it works with many different MPI implementations, and it's actually included as part of many MPI implementations, so users don't see it separately as such anymore.

Because of its liberal licensing and popularity, it is common to use them as weak synonyms: MPI-IO and ROMIO often get used interchangeably, but one is a standard and one is an implementation.

So before MPI-IO was defined or ROMIO existed, did people even do parallel I/O?

Yes. Parallel I/O was in its early stages in those days, but it was used. There were a few parallel file systems: there was one from Intel, and there was IBM's Vesta, and they all had their own interfaces. They were POSIX-like interfaces, but they had their own extensions. There were also some higher-level libraries, like the PASSION library that I worked on as a graduate student. So there were different ways of doing I/O; there was no one standard way of doing it.

So actually, before we go too far into this, what does MPI-IO give you as a user in terms of benefit?

It gives you a portable interface that is like MPI, so if you're in an MPI program it's quite natural to use. And it allows you to express your I/O
in the form of collective operations, and in the form of non-contiguous operations expressed using MPI datatypes and so forth. And it provides well-defined, clean semantics for what it means when multiple processes are writing to the same file. So you get the semantics and the features that you need for accessing files in parallel, and potentially you can get the performance from an implementation of MPI-IO.

So what is the difference? Why can't I, as a sharp application-developer guy who's been doing MPI codes for years, roll my own parallel I/O? What's in it for me to use the one that's built into MPI?

There have been a lot of application people, especially recently in the last five years it seems, who have done just that: tried to do these optimizations by themselves and use their own way of doing parallel I/O. Sometimes it can work well, but often there are certain file system characteristics that need to be dealt with, and one of the benefits of ROMIO is its file system abstraction layer. We can put in an optimization that's specific to GPFS, specific to Panasas, specific to Lustre, and those details can be hidden from the application user. Then when the application person does his I/O on machine A or machine B or machine C, he doesn't have to worry about retargeting and reinventing those optimizations. Some of the optimizations can be hard to implement, so a library writer might want to do it once, but for an application person it's probably a lot of work to redo all of that in his application.

Can you give us an example of one of these optimizations?

I think the most desired optimization right now is this idea of aggregation. We have machines that scale to hundreds of thousands of MPI processes now, and people find pretty quickly that if you do I/O from all 100,000 processes you'll swamp the I/O
system and just get really bad performance. So instead you pick a subset of these processes and use them to do I/O on behalf of everybody else. It does a couple of things for you, but presenting a more friendly request workload to the file system is really the biggest benefit. Now, you can do this as an application person, but sometimes there are machine-specific topologies that you can take even more advantage of. For example, on Blue Gene that's done on behalf of the user, and if you were to implement it yourself you might end up putting all your processes on one or two network links versus being more evenly distributed across the system.

So when I was reading up on this, I ran across something called hints. Is that specific to MPI-IO, or is that specific to ROMIO?

That's a tuning feature defined in the MPI standard. A lot of these interfaces have some way of hinting; even if you open a file in POSIX you pass in flags saying what you're going to do with it: I'm going to read the file, read-only or write-only. Hints are something similar, where you can provide some indication of your intention to the MPI-IO library. Hints are very simple to use: they're string-based key-value pairs, and they have the defined behavior that you can set whatever keys and values you want and the implementation can ignore them. So for example, if you set a bunch of hints that are specific to one file system or one implementation, and implementation D doesn't know what they are, it doesn't affect your program; they will just be quietly ignored.

An example of a hint would be how many disks you want to stripe your file across, the size of the striping unit, and so forth.

So how important is parallel I/O to any given application?

Well, applications find parallel I/O
pretty important now. You've got bigger and bigger machines and more and more computation, but the storage side improves in performance less rapidly, so you need to get a lot more simultaneous I/O operations going. Parallel I/O becomes important for analysis, to study these data sets and make sense of them; it's important for the initial data sets that feed these simulations; and then of course there's defensive I/O: checkpoint I/O and periodically outputting a history of what's going on. Those are all pieces that have to happen in order for computation to continue, or to not waste computation, so the faster those pieces can happen, the more science can get done.

There were a number of talks I've been at where they discuss trying to scale I/O operations on these 100,000-plus-core systems, and there was a lot of talk about moving away from POSIX and other things. Is that something that's more in the file system layer, or does ROMIO have a part in that solution?

Well, I think ROMIO could be a part of that solution. Of course, in computer science we just put in more and more abstraction layers, so deep underneath MPI-IO is a parallel file system with POSIX I/O calls. But Rajeev mentioned this file system abstraction layer inside ROMIO, so you could very well put something in there that didn't do POSIX but instead spoke in terms of objects or scientific databases or whatever else was appropriate. And even though these drivers have since gotten a little bit rusty, we do have drivers in ROMIO for things like GridFTP, and some experimental drivers for logistical networking and other things that aren't really file systems but have benefited from being underneath an MPI-IO interface.

So do I have to do anything as an admin to be able to support this, or do I just have to provide a file system?

Well, the file system does have a bearing on how much parallel I/O
performance can be provided. Certainly being able to support simultaneous connections without data corruption or loss of performance would be great, and if a file system has some super crazy optimizations for non-contiguous I/O or concurrent I/O, that would be beneficial too. We've done this in ROMIO for different file systems, like PVFS, or some of the more ideal distributions for Lustre, and we've even found ways to work around NFS's consistency semantics. So for an administrator, the tricky part is making sure that ROMIO is built with support for all these different file systems. Sometimes the MPI that comes with your distribution or vendor may need a little more tuning, but that's usually easy to fix with a little bit of cooperation from whoever provides you the cluster.

So here's a weird case: can you actually use ROMIO without a single-namespace, shared parallel file system?

Probably not, because we expect to be able to directly read or write. But you could write a new device underneath, one that talks to a file system that can deal with such a thing, so you'd be implementing this shared-file abstraction one layer below ROMIO.

All right, so Brock actually threw out a bunch of buzzwords there; let me try and disentangle one of them. He mentioned parallel file systems. Now, that's got the same buzzword, parallel, that MPI does. So is MPI-IO a one-to-one mapping to parallel file systems, or are they similar or complementary technologies? What is their relation to each other?
So parallel file systems provide just the basic read and write of contiguous data in parallel. They support concurrent reads and writes to a single file, and they try to give high performance for that. They may have a few additional optimizations, but ROMIO has a lot more functionality than that. Parallel file systems don't have a notion of collective I/O; they don't know that the multiple processes of an MPI program are part of one application and might need to access one big data set in parallel, maybe different parts of it, but really one three-dimensional array or so forth. They don't have that level of knowledge; they're not designed for that. So they are similar to the POSIX API, maybe slightly more than that. But underneath, what they can do is stripe the file across multiple servers or I/O nodes or disks, so they can give you good I/O bandwidth and performance for multiple clients accessing different parts of the file simultaneously.

So how important are new hardware technologies, things like solid-state drives or caching front-ends for file systems and such?

All those are important in that they'll improve performance, so the file system can do a better job of managing those devices or tuning for those devices, and probably some things can be done at the MPI-IO layer also to do better with that. I don't think we're doing anything right now specifically for...
No, we don't have anything planned. It's worth noting that some features of MPI-IO make optimizations like this possible. There's this idea of consistency semantics, which are the rules for when data is visible and accessible and permanent on disk, and those rules actually lend themselves really well to having, say, a burst buffer or some other solid-state device handy, holding these intermediate requests before writing out to the more permanent storage. So these are all details that, as the hardware gets more sophisticated, the MPI-IO layer can be improved to hide from the application user. But as Rajeev said, we don't do that yet; it's one of the areas we can look at down the road.

Now, are file system vendors, parallel file system vendors, even hardware storage vendors, being influenced by the MPI standard? We've seen networks that were pretty much purpose-built for MPI communications. Is the same thing happening on the storage side, or is HPC mostly the recipient of innovation that happens over in storage?
We don't see very much attention from the vendors, except as a secondary concern. When customers buy machines they're buying CPU cycles and high-performance networking, and the storage is sometimes a secondary concern, and the vendors prioritize accordingly: if the customer is giving them money for X and Y, they're going to focus on X and Y. But we have had some good relationships with vendors like Cray and IBM to help get the best performance out of these file systems. As to the second part of your question, that makes MPI-IO and ROMIO more reactive to parallel file systems than driving their development, I think.

And it could be that in some procurements there are applications or benchmarks that are run that internally use MPI-IO through libraries, through HDF5 or netCDF or whatever, and then ultimately hit the file system, so the vendors have to make sure they meet whatever the performance requirements might be.

That's a very good point: the requirements are often specified in terms of application behavior, not specifically "MPI-IO must do this," but more "high-level applications must achieve these science-oriented benchmarks."

So we talked a lot about abstraction in there, and performance, but during the hints discussion you also mentioned we might want to say how many hard drives we're going to go over. What are some of the common hints and such? Should a user ask their administrator what the settings should be?

You know, for the most part, in today's ecosystem the hints are often meant for library people or other specialists to use; the application person probably doesn't need to worry about hints at first. Probably the most important avenue of optimization for the application person is to just keep on using collective routines, to use datatypes or high-level libraries to describe the I/O workload, and in these ways provide enough context for the libraries and the MPI-IO implementations to do the right thing. But hints are really useful for folks like me
who come in and work with the application people and can help make these suggestions based on what's going on. But if the application person has described the I/O with collective calls and datatypes, or is using a high-level I/O library like HDF5 or Parallel-NetCDF, those are really the biggest things an application person can do to get good performance.

And there are some hints that can be used to change some parameter settings within ROMIO, such as the buffering parameters, maybe the sizes of buffers used for collective I/O or other optimizations, and some other algorithmic parameters. So that's in addition to the I/O hardware or the striping and those kinds of things, and these are all for advanced users, or library writers, or those who want to tune their I/O performance.

Okay, so for something like Lustre, where the default setting is one stripe or whatever the admin sets it to, is ROMIO going to use a sane default, or is it just going to use the file system's default?

Well, that's a good example. For a long time there were not many Lustre-specific smarts inside of the ROMIO Lustre driver; it just used the most simple, common interface to Lustre. But recently extensions have been added so that the ROMIO driver will request a larger stripe size even if the default stripe size is quite small, and these are things that the ROMIO driver does on behalf of the user without any intervention.

Now this naturally leads to the question: what file systems does ROMIO support?
Well, ROMIO supports the big popular parallel file systems right now: GPFS, Lustre, PanFS, PVFS. As we mentioned earlier, it supports some non-file-system file systems like GridFTP, although that's been a little bit long in the tooth these days. It also supports some very old file systems which I don't even know are up and running anymore, such as the old HP parallel file system and a legacy NEC scalable file system. These are all options that have kind of gotten old over the years, but for any parallel file system that folks will run into today, ROMIO will work with it.

Let me ask a derivative of that. You mentioned several parallel file systems in there, and earlier we were talking about how ROMIO uses a POSIX read/write interface as the lower layer. Do the parallel file systems offer different-than-POSIX semantics? Like, when you know a parallel write is coming, do they have a specific API for that rather than just plain vanilla POSIX read and write?

Well, a lot of file systems don't. The more research-oriented PVFS project, a collaboration between Argonne and Clemson, did develop a set of non-POSIX semantics and API calls which were designed and implemented with MPI-IO access in mind. They're not at all tailored towards clients ever using them directly, but are meant instead to map almost directly to the ROMIO MPI-IO calls. In those cases it's not so much the scheduling of operations; it's more being able to use some very rich descriptions of the I/O patterns. You mentioned multi-dimensional arrays, which are quite common in scientific applications, and the POSIX interface for non-contiguous and strided I/O is okay but kind of primitive in many respects. Sometimes file systems like PVFS have a much more robust way of describing these data types, allowing for much more concise representation and better performance in some cases.

So back on the supported file systems: what if ROMIO
doesn't have an explicit driver? Am I completely out of luck using MPI-IO?

No, there's a catch-all generic driver, using POSIX with no optimizations and no assumptions about anything special, which in many cases is the first, and sometimes the only, driver used for some of these file systems. When we say supported file systems, we mean that someone's gone in and made additional efforts to exploit optimizations like direct I/O, or any particularly sophisticated interfaces or tuning that those file systems might support. But there is a catch-all POSIX-like interface, so if you can do open, close, read, and write to a file system, there's a basic driver that will work.

Now, we mentioned earlier in the show that ROMIO is used with lots of MPI libraries. It's actually used with Open MPI, and before Open MPI I used it in LAM/MPI as well. How did this happen? How did you guys happen to take over the world like this?

Well, ROMIO was implemented using the MPI-2 external interfaces, the features in the external interfaces chapter such as MPI_Type_get_contents and MPI_Type_get_envelope, which allow you to understand what an MPI derived datatype is, to parse the derived datatype in a portable way. That allows you to hook up with any MPI implementation. And also the generalized requests: for the non-blocking I/O we use generalized requests so that we can use the test and wait functions. These were added in MPI-2, and ROMIO was being written at that time, so I took advantage of that and wrote it in a way that it can work with any MPI implementation. And there was no other MPI-IO implementation, and people didn't want to re-implement everything, so they just took ROMIO and added it. There were many MPI implementations, like SGI's and HP's and whatnot, and they all just added ROMIO.

We were all lazy, is really what it came down to, and you guys did a great job, so why reinvent the wheel?
A derivative question, though: this is obviously MPI-specific technology, but is there a core engine inside ROMIO, or something like that, that is useful in a different, potentially non-MPI context?

Yeah, the basic collective I/O optimizations, the data sieving and so forth, can be used outside of an MPI context, and they have been used in the I/O community outside of MPI. But right now the code uses MPI for communication and so forth, so if you didn't have MPI it would be hard, or you would have to use something else for communication to get that piece.

Is there anything like that around?

From a software engineering standpoint, there's no library magic you could pull out of ROMIO and use somewhere else; there's a pretty tight assumption that MPI will be around. But as you said, the ideas have been around for a while: the idea of two-phase I/O and some of these data sieving type operations. Certainly places can take the ideas that have been proven helpful in ROMIO and re-implement them in different contexts.

So say I'm a file system vendor, or some sort of database vendor, or some sort of large data warehouse vendor, anything like that, and I wanted to make a ROMIO interface to my data. Who should I contact about that?
That would be me, Rob Latham, robl at mcs.anl.gov, and I will work with you. We've got a pretty good history of working with vendors on file systems and other cases. Just a year ago we incorporated a whole bunch of Cray- and Sun-generated patches for Lustre to make it better and perform well, and before that we took a bunch of patches from Panasas that took advantage of some of Panasas's tuning optimizations. That's how it goes. I don't know if we should talk about this much, but ROMIO is a fairly mature project at this point. The standard hasn't changed in 15 years; MPI-3 is coming around, but the code is fairly stable, so adding a new driver is about the only kind of change that happens in ROMIO these days, maybe an extra file system change.

That's a perfect lead-in for my next question: what's coming up in I/O in MPI-3? Is there anything new being discussed? You seem to imply that there is.

The main thing in MPI-3 is probably going to be the non-blocking collective I/O functions, which follow naturally from the non-blocking collective communication functions. There have been discussions of some other features, but I don't think they're going to make it. So the main thing is that the interface is mostly stable, and most probably the non-blocking file I/O: instead of the split collective I/O functions, which were not real non-blocking collective I/O functions, there will be truly non-blocking collective file read/write type functions.

The HDF guys are proposing some other things as well that we're giving them a very hard time about. We'll see if that makes it, but like you said, Rajeev, I don't have strong faith in those, given the short deadline for MPI-3.

So let me ask the flip side of that: what's coming in ROMIO? You said it's mature and whatnot. Do you guys anticipate implementing whatever new comes in MPI-3? What is the rate of change these days? Is it pretty slow?
It's fairly slow, but of course we'll work with the HDF5 guys on the proposed changes, not just the non-blocking collectives but any of the proposed shared-file changes; we can work with the community to incorporate this stuff. The other big changes are some algorithmic changes, possibly, as machines get even larger. On the scale of, say, Blue Gene, that's 160,000 MPI processes, and we're starting to find algorithms inside ROMIO that are not as memory-efficient as they could be: they allocate some data structures that scale with the number of processes. We can use things like the MPI-3 neighbor collectives to maybe come up with some more scalable algorithms, but we need to pay a little more attention to how we perform at the very largest scale. It's kind of amazing that ROMIO has progressed for 15 years without any more significant changes to the algorithms it has had, but it's been a good run, and it needs a little more attention at the larger scales right now.

So I'm a user, and I want to start using MPI-IO. Should I look to ROMIO's documentation, or should I talk to my MPI vendor?

Well, maybe I should let Rajeev answer, because he wrote the book. But you should look at the MPI spec, or the MPI tutorial material on how to use the MPI-IO functions. That's all you need to worry about, and as long as you have an MPI implementation which is MPI-2 compliant, it'll just work for you.

The best resource is Using MPI-2; we just call it the purple MPI book around here, and Rajeev is one of the co-authors of that book. It's been a great resource for all the MPI-IO features, from the very simple to the very complex, and it lays it out in a pretty straightforward way. It's a great place to get started.

However, maybe the other point to make is that these days MPI-IO is more often a foundation layer. Most people use MPI-IO through another library, like HDF5 or Parallel-NetCDF or ADIOS or something else, so depending on your needs you may not
even need to use MPI-IO directly at all; there may be some better way to go about it.

Okay, guys, well, thank you very much for your time. What's the website and contact information for ROMIO?

ROMIO is hosted at www.mcs.anl.gov/romio.

Okay, well, thank you very much, and we will have this up soon. Thanks again for your time.

Thanks, guys.

Thank you.

Thank you.