Welcome to another edition of RCE. Again, this is Brock Palen. You can find us online with the complete back catalog at RCE-cast.com. You can also find a link to the iTunes store where you can subscribe on your Apple products, and a link for any other podcatcher you're possibly interested in using, an RSS feed, and all that goodness. Once again, I have Jeff Squyres from Cisco Systems, one of the authors of Open MPI. Jeff, thanks a lot for your time.

Hey, Brock. And you know, it's that time of year again. Everybody is starting to gear up for Supercomputing. And I can tell you, one of the things that I'm involved with at Supercomputing is the Student Cluster Competition again, which you and I have been involved with before. But this year I'm on the other side of the table. Cisco is directly sponsoring a team, the Illinois Institute of Technology, and we're going to outfit them with a bunch of Cisco servers and some of our switch gear and the low-latency product that I work on. So it's actually going to be pretty cool, pretty fun stuff, even more so than usual for me.

Yeah, that's a really cool program. As you mentioned, we've been involved before. We've been judges for it twice. And we've had the competition organizers on the show, and we've actually talked to a winning team once before. And it really is a great place where these kids are getting good training, learning how to do stuff, so that, you know, basically if you're looking to hire a cluster admin, a fresh one, these are the people you want to talk to.

And developers, yeah.

Yeah, and developers, because they do the applications, the systems, everything all in one shot, specific to this industry. All right, okay. Today — so, our guests today come from Oak Ridge National Lab, and we'll be talking about ADIOS. We have with us Scott Klasky and Norbert Podhorszki. Scott, won't you take a moment to introduce yourself?

Sure. Thanks. So my name is Scott Klasky. I'm a distinguished scientist here at Oak Ridge and a group leader for scientific data. So I work generally with very large, or big data, problems. We've designed the ADIOS framework for one part of the solution, and we do a lot of other things in my group for data management, data analysis, and visualization.

Hi, my name is Norbert Podhorszki. I'm a research scientist in Scott's group. I'm the lead developer of the ADIOS software, and I work a lot with application users who run codes at large computers, mostly at the Oak Ridge leadership facility, but also some other users worldwide. Research-wise, I'm mostly interested in how to enable moving data from the memory of an application to other places and processing it on the fly.

Cool. So I wonder, could you give us a quick intro? What is ADIOS, to start off here?

Well, that's a really good question. So in the very simplest sense, just think of it as an abstraction framework for data-intensive science, or what some people call big data. It really relies on the fact that we wanted to work with both data in transit and data at rest. So the win for us is that as we start thinking about data being objects — say, variables in a simulation or objects in an experiment — you can just view this as self-describing chunks of data. So whether we're working with something for post-processing or during an in situ calculation, we're trying to do these things very efficiently. And ADIOS works on essentially the largest systems in the world in order to achieve this.

So how did this get started? What was the driving force?
So you can call this something which is a very long-running project. I was a student in relativity at the University of Texas at Austin. And when I was a PhD student, I was really working on coalescing black holes. And I needed to sometimes write my data to the file system for post-processing. But a lot of times, as I was trying to understand the numerical methods, I was trying to analyze and visualize the results. So we wanted to abstract this framework, and this was done in conjunction with one of my advisors, Professor Matthew Choptuik, who is now at the University of British Columbia. So we designed this system, but it was a point solution. We progressed later on through fusion, where I worked with Professors Zhihong Lin and William Tang. And they were trying to run — and this is going back approximately 15 years ago, when we had the IBM SP2 at NERSC, one of the DOE resources — we were running, at the time, the largest simulations, 1,000 processors. We wanted to analyze and visualize the data in real time. It was a large amount of data, about a terabyte being produced in a day. So we had to stream process, create the workflows to do this. Then ADIOS grew further as we had larger systems coming out. So as I moved over to Oak Ridge from Princeton, what I saw was the need to really formalize this. So I worked mainly with Rutgers University and Georgia Tech — i.e., Professors Manish Parashar, Karsten Schwan, and Matthew Wolf — to really formalize this and create something real. And we started to get a lot more collaborators from all over the world to bring this together. So now we actually have seven releases, and we're trying to make this something which is fully open, not just open in terms of the source, but open in terms of how people develop. So more or less delivering something which is similar to service-oriented architectures, for the largest computers in the world as well as for smaller systems.

Okay, now one thing I think we left out at this point is: what does ADIOS stand for?

The Adaptable I/O System.

So with all of this, you know, with this long pedigree and long history and evolution over time, you said you have seven releases. Are these seven major releases that have, you know, kind of built up over time — major functionality improvements or new features? Or is it more adapting to the different I/O landscape over time, the different architectures and how we get data from RAM out to stable storage and things like that?

Excellent question. So pretty much, I'll say, all of the above. They were all major releases. I think when we started, we started out saying we're going to work with 90% of the codes. What that meant was, if you're working with a code that was different, okay, great — we're not going to care about it right then. So we wanted to have a solution for certain codes. So we started out, we really wanted to speed up checkpoint restarts. That was a particular problem on our Lustre file system, going back to when the Crays were first put in for leadership-class computing at Oak Ridge. But then we started to expand. We wanted to be able to read data for more general analysis. We wanted to use data staging. We wanted to use, if you want to say, burst buffers, wide-area network movement. We wanted to have data which was even more self-described, injecting code into the data streams, injecting workflow into the data streams. So as the technology increases — of course, as the core counts go up by orders of magnitude — we are always seeing ways to improve. There's always bug fixes.
So it's a combination of new features, new bug fixes, new types of simulations or experiments which are doing things that we never thought of.

Now let me pick on two things you said in there, because they really caught my ear. You said "even more self-described" over time. Does that mean you support new types, or finer-grained types, or new patterns? And the second thing that I heard was you inject code into the stream. Are you talking about something like serialization and deserialization? Or what does that mean?

Good questions. So we have, of course — ADIOS is a large project, many people. So I'm now talking about different research that has now gone into production, although we can't say there are that many users for some of these features. But in terms of, let's see, code: the code injection is done in two ways. One, we can actually inject the code, or pointers to functions, which can then execute routines. This was a paper that Manish Parashar and some of his collaborators and students published. So that was pointing to different codes and executing them. So both of these are more about how we move work to data, versus always moving data to work. So that was one point.

Okay, so in terms of more metadata. So when we think about what's really important — I mean, we get a publication and people are talking about putting a DOI on there. And great, what does that mean when I, for instance, have a simulation? It produced a petabyte of data. We had very complex workflows reduce that data into that contour plot that's in that publication. Is it that you want to just see the X-Y points of those contour plots? Or do you want to be able to see and understand: what was the code? What was the workflow? So we're trying to create this whole metadata so that we can pay attention to all the provenance. So we're trying to put in more information. And here's another example of that. Let's suppose that your code writes its output, and some collaborators of ours in China come and they say, you know, the I/O wasn't really good — "Scott, help us." Our first question is, what do you mean it wasn't good? So it's a lot of back and forth. So we're trying to have all this performance data put back in with the output. We're trying to have the data that you have from an output — how can I create a mini-app for the I/O? Recreate the I/O so we can play that back. So what I mean by more metadata is the ability to use the metadata from an output to understand what was done, not even just for the one simulation, but during the lifetime of that data being used. Is that clear?

Yeah. Awesome. Okay, so let's start with the smaller-scale user. We've talked about hitting the leadership-class machines, but let's talk about the average graduate student who's developing code. Why would they be interested in using something like ADIOS, and what would set it apart from other I/O libraries and data management libraries?

Yeah, okay. So now we talk about ADIOS in the role of an I/O library that people use to write out data to the file system. ADIOS was designed from the very beginning for parallel systems, to be scalable. So the question is, if you have a large simulation, then you usually hit some scalability problem with the I/O, and ADIOS gives a good solution for that. If you don't care about that, then what ADIOS provides is more flexibility, in the sense that the API is separated from the I/O strategy — that is, how we write out the data. And ultimately it's separated from the file format that the data ends up with on the file system.
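[Editor's note: to make that separation concrete, here is a minimal sketch of an ADIOS 1.x XML configuration, loosely patterned on the examples in the ADIOS user manual. The group name "restart", the variable names NX, G, and O, and the buffer size are invented for illustration, and attribute spellings have varied across releases. The point is the method line: changing it from MPI to, say, POSIX or a staging method redirects the same output without touching the source code.]

    <?xml version="1.0"?>
    <adios-config host-language="C">
      <!-- What the code writes, declared once: a 1D global array
           of size G, where each process owns NX elements at offset O -->
      <adios-group name="restart" coordination-communicator="comm">
        <var name="NX" type="integer"/>
        <var name="G"  type="integer"/>
        <var name="O"  type="integer"/>
        <global-bounds dimensions="G" offsets="O">
          <var name="temperature" type="double" dimensions="NX"/>
        </global-bounds>
      </adios-group>

      <!-- The I/O strategy: swap this one line (e.g. POSIX,
           MPI_AGGREGATE, or a staging method) without code changes -->
      <method group="restart" method="MPI"/>

      <!-- Per-process output is buffered in memory before the I/O -->
      <buffer size-MB="40" allocate-time="now"/>
    </adios-config>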
So if you write your code using the ADIOS API to output your data, then you can choose different methods, for your needs, to write out the data in different forms. That means that you don't need to change your source code again. And with all the focus from the very beginning on staging, that forced us to even go away from files: you can choose a method that was developed for staging and run your code without modifications, and the data will go through some network to other applications. So that's the flexibility that ADIOS provides compared to other libraries.

And let me just inject a couple of statements here. For one, ADIOS is a different type of API for a user who's used to something like POSIX — say, Fortran or C writes — or MPI-IO. The data is always self-describing, unlike those. So think of it as closer to something like NetCDF or HDF5. Anything you do in serial, there's no change in your code to go to parallel, besides, of course, having to put MPI in there. If you want to use different compression techniques, you don't change your code. If you want to put in new attributes, depending upon how you put ADIOS in, you don't have to change your code. So it's a way to try to simplify it: think about the objects, not about what's going into the file. So it's a different way of looking at things than many of these other things that were done in the past.

So an obvious bias of mine, of course, is MPI. And you mentioned MPI in there. Can you explain how ADIOS interacts with MPI?

Yes, yes. So the ADIOS library is implemented using MPI calls, and a lot of the I/O methods are based on MPI-IO somewhere. We can compile them with a dummy MPI version into a separate library to allow sequential codes to use the same thing. But the methods, if they need to communicate between processes, use MPI calls. And the methods that use MPI-IO use the MPI-IO open, write, close functions to write out the data. But some other methods use just the simple POSIX API, or other methods use the parallel HDF5 interface to write an HDF5-format file.

Okay, so one extra thing on the interaction of ADIOS with MPI. So as lots of people know, MPI has many different types of communication depending upon levels of concurrency, and there's a lot of optimization. With ADIOS, we have all these different methods, some of which use MPI, some of which don't. And depending upon what you're trying to do with the data, we can optimize for all these different types of situations. If a machine, for instance, has a foo-bar way of doing collectives on a very large number of cores, which we do see, we have ways to actually write with this. So what I'm trying to say is: we care about resiliency, we care about different optimizations for different types of patterns. So we use MPI, we interact with it, and depending upon exactly what we're trying to do, we have closer or further-away ties than just pure MPI.

Okay, now that's very interesting. Something that I'm inferring — tell me if this is correct. It sounds like, since you're using MPI or another parallel mechanism underneath, or at least have the capability of doing so, that allows the programmer or the application to think about a parallel object in itself. So I'm not writing just my portion of the data; we're all writing our portions of the data for a larger object, perhaps as if it were a single process. Is that correct?

Yes, it's correct. So we should have started by talking about the API, where basically we try to keep the write part of the API as simple as possible.
So the API is basically a declarative set of functions. You say which process wants to open what data set, then write what variables, and then close. And within a write — say you have an array in the whole code, a three-dimensional array called X, and it has a given size in 3D space — every process can just put its own piece, with an offset, into the global space. And that's it. You don't need to use seeks or compute where this data should be located. That's what the self-describing data format is for. Every process just declares where its own piece is located in the global space, and then ADIOS will take care of the rest.

Now, you had mentioned earlier, though, that you can use these different — I don't remember if you call them drivers or what — so you don't have to modify your application and you can move between these different types. When you're working with this parallel setup, what happens when you don't, say, have MPI-IO available to you for some reason?

Let me first answer it — I think Norbert wanted to jump on it — but I'll first say our fastest methods actually don't use MPI-IO. Our fastest methods do use MPI; we do different forms of aggregation, trying to think of what's topologically close to you to aggregate. The aggregation can be just simple concatenation: if I know how variables are laid out in the domain decomposition, I can aggregate what's closer, but we have to pay careful attention to those memory movements and the memory copying. So at the end of the day, if you don't have MPI-IO, it's okay. We have pure POSIX methods which don't even use any MPI, and then we'll write a metadata file which describes this. So yes, we have to do some sort of collective, but that could be done with sockets; currently we do that with MPI. So we do have MPI in our mix, but you don't need MPI-IO in order to use ADIOS at all. And Norbert wanted to say a few more words. Norbert?

What I'd say is that on the write side, every process simply declares what it has and how it wants to logically organize that in a global space. It doesn't say anything about the location of bytes on a file system or anywhere. So it really doesn't matter what method you use: all methods will place the data however they want, and the metadata describes how to find it when someone asks for a specific subset of that space. And if you use a POSIX method, which means every process creates its own file — so you end up with a lot of files — we still have one metadata file which points to all the data pieces in those files, and you can still read the whole thing as a global data set.

Okay, so I think this is a good time to ask: what are, you'd say, your most popular or most powerful I/O drivers?

Yes, there is one. So, I mean, for file systems, it's proved that aggregation is the method that everyone is using. When we hit scale, the point with the different ADIOS methods was that at scale we pretty much just have to avoid the bad practices. And the one method that Scott mentioned, aggregation, is the one that avoids all of the known bad practices, and that's the method which scales very well to hundreds of thousands of cores. So that aggregation does a lot of buffering to avoid all the latency problems with small writes. It decreases the number of writers, so it doesn't hit the file system with a DoS attack, and it places the data on the parallel file system in a manner such that the processes don't step on each other's toes.
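[Editor's note: before the conversation moves on to the aggregation method just mentioned, here is a minimal sketch of the declarative write API Norbert describes, using the ADIOS 1.x C write functions (adios_open, adios_group_size, adios_write, adios_close) together with the XML sketch above. The names and sizes are illustrative, and exact signatures have varied slightly between releases. Note adios_group_size, where the caller states up front how many bytes it will write — knowledge that, as discussed later, ADIOS exploits for buffering.]

    #include <stdint.h>
    #include <mpi.h>
    #include "adios.h"                /* ADIOS 1.x write API */

    int main (int argc, char ** argv)
    {
        int rank, nproc;
        int64_t fd;                   /* ADIOS file handle */
        uint64_t total;

        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);
        MPI_Comm_size (MPI_COMM_WORLD, &nproc);

        int NX = 1000;                /* local piece on this process */
        int G  = NX * nproc;          /* global size of the 1D array */
        int O  = NX * rank;           /* this process's offset in it */
        double temperature[1000];
        for (int i = 0; i < NX; i++)
            temperature[i] = rank + i * 0.001;   /* dummy data */

        adios_init ("config.xml", MPI_COMM_WORLD);

        /* Declarative write: open, say how much, write pieces, close.
           No seeks, no byte offsets -- each process only declares where
           its chunk lives in the global space. */
        adios_open (&fd, "restart", "restart.bp", "w", MPI_COMM_WORLD);
        adios_group_size (fd, 3 * sizeof(int) + NX * sizeof(double), &total);
        adios_write (fd, "NX", &NX);
        adios_write (fd, "G",  &G);
        adios_write (fd, "O",  &O);
        adios_write (fd, "temperature", temperature);
        adios_close (fd);     /* after close, the app may reuse its buffers */

        adios_finalize (rank);
        MPI_Finalize ();
        return 0;
    }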
So that's called the MPI aggregate method: we use MPI communication to move data between processes to the aggregators, and then use single-file writers from the aggregators. And then for staging, we have three different methods — two sibling methods from Rutgers University and one from Georgia Tech — that can move data around between applications, and there the choice is more about which one supports a given network infrastructure.

Yeah, so you have MPI-IO, you have this MPI aggregate, you have straight-up POSIX, which kind of ends up being file-per-process. Am I interpreting that correctly?

Right. We also have — I mean, we have actually many methods. Recently we got a GRIB2 file output that was written by some of our colleagues at Tsinghua University; that's used in weather. We have NetCDF4, we have HDF5, and we also have converters from ADIOS-BP to HDF5 and NetCDF3. So we have a fairly easy way to have someone develop a new I/O method. So we have some methods which are tuned if you want to write one file for all of your processes to the file system. The aggregate method that Norbert mentioned — one of the things is that it can write out subfiles. What we've shown is that by writing out essentially one to a few files per storage target, and not really using the striping capability of something like Lustre, you can get much higher bandwidth. You avoid things like lock contention. So there are many different methods, and it really depends upon your system configuration.

So you said something there, and let me explore that a little further. How would I write support in ADIOS for my brand-new awesome file system that doesn't have POSIX or support any of the others, but has some super-optimal low-layer API that I can expose to you? Do you have a modular architecture in ADIOS such that I could extend it to support my file system?

Yes, we do. So there is the API that you don't touch, and then there is a common layer that just calls the selected method's functions. So for the initialization and the open, write, close, and read parts, you can implement whatever you want to do with the data, and ADIOS will just call your method if that method is selected. That's how the staging methods are completely different from the file-based methods: they don't touch any file system, for example. So as I said, the API is really just declarative, in the sense that open and write and close don't necessarily mean that you open things, write out data, and close something. These things can happen anytime. The only thing that's specified is that when you do an ADIOS close, your data in the application is free: you can remove it and reuse it for whatever you want. It's gone for the purposes of I/O.

So you mentioned that ADIOS can get better performance than MPI-IO natively, even though you might use MPI-IO under the covers, or you might just use raw MPI for sending the aggregated messages around and things like that. Why is that? Because a lot of people have spent a good amount of time trying to optimize MPI-IO. Does it have to do with the fact that ADIOS is a higher-level abstraction, and therefore you can know more about the overall operation, and therefore you basically just have more to optimize with? Or are there other reasons as well?

Okay, so that's a complicated question, and you halfway answered it. But no — ADIOS doesn't do any magic.
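[Editor's note: to illustrate the "no magic" point that follows, here is a rough sketch — not ADIOS's actual implementation — of the aggregation pattern described above: every process buffers its output as one big chunk, a few aggregator ranks gather those chunks with MPI, and only the aggregators touch the file system, each with a single large write to its own subfile. The group size and file naming are arbitrary choices for the example.]

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    #define RANKS_PER_AGG 4           /* illustrative aggregation ratio */

    int main (int argc, char ** argv)
    {
        int rank;
        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);

        /* Each rank buffers its whole output first: one big chunk,
           no small writes. */
        int n = 1000;
        double * buf = malloc (n * sizeof(double));
        for (int i = 0; i < n; i++)
            buf[i] = rank + i * 0.001;

        /* Split ranks into groups; rank 0 of each group aggregates. */
        MPI_Comm agg;
        int group = rank / RANKS_PER_AGG;
        MPI_Comm_split (MPI_COMM_WORLD, group, rank, &agg);

        int arank, asize;
        MPI_Comm_rank (agg, &arank);
        MPI_Comm_size (agg, &asize);

        double * big = NULL;
        if (arank == 0)
            big = malloc ((size_t) asize * n * sizeof(double));
        MPI_Gather (buf, n, MPI_DOUBLE, big, n, MPI_DOUBLE, 0, agg);

        if (arank == 0) {             /* few writers, one subfile each */
            char fname[64];
            snprintf (fname, sizeof fname, "out_%d.bin", group);
            FILE * f = fopen (fname, "wb");
            fwrite (big, sizeof(double), (size_t) asize * n, f);
            fclose (f);
            free (big);
        }

        free (buf);
        MPI_Comm_free (&agg);
        MPI_Finalize ();
        return 0;
    }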
So anyone can write their application using MPI-IO, do the same, and reach the same performance — maybe even better, because there is no intermediate layer of ADIOS calling functions. But what ADIOS does, with the MPI in the MPI aggregate method, is really to avoid all the problems that people put into their applications. It's hard to think about large, scalable solutions for I/O. So that's what we do; that's why we, as I/O experts, do it the right way, and that's what we are trying to do. But on the other hand, you are right that ADIOS is a higher level, and we have something at hand that pure MPI-IO or the other libraries don't have: we designed the API in a way that when you do an open, we force you to tell us how many bytes you are going to write. And at that point we know what variables — what kinds of variables — you are going to write, and how much data you are going to write, and with that knowledge we can optimize what we are doing. But since we are doing just MPI-IO or POSIX I/O calls, someone could replicate the whole work if they had the time and effort to do it.

Right, and I can inject that, you know, there is no magic. I mean, for instance, if you guys sat down and said, "I want to write my own communication layer," you could recreate MPI. You could do anything you want with software. So ADIOS just, if you want to say, has a lot of what we call best practices — a lot of them developed by collaborators, a lot of them developed with the community, a lot of them developed by us. So think of it as a collection of best practices and a bunch of services, like compression and different state-of-the-art techniques, where the time for research to get into ADIOS can be shortened, because it is easy for someone to come in and just inject that into the ADIOS framework.

So you talked about knowing about these variables and such. Do you do any type of aggressive buffering of small variables? Because small writes are one of the worst things you can do to these large-scale file systems. How do you handle those?

Yes, we are very aggressive. With the API, too, we only give a declarative hand to the user; they cannot control the I/O. Similarly, we require that the per-process output can be buffered in memory before the I/O. We know that that's a little bit limiting, but we've worked with large applications over the years, and that works very well for everyone. So we do just one big write from a process. We want to rewrite this in the future to be more flexible, but it actually works beautifully for all the applications.

So if you're really aggressive about buffering stuff, can I actually kind of — well, does ADIOS almost let me overlap my actual I/O to disk with my computation? Like, I say "write this," it really just moves it to memory and then starts writing, but then I can start modifying what I told it to write.

So again, I think one thing should be discussed, because this is the complicated part. The answer is of course yes, but in reality you have to be really careful. I mean, staging does precisely this: the data gets written, it gets buffered, it gets moved to a staging resource — that could be a separate set of nodes, or even, in our research implementations, the same set of nodes. But you have to be very aware — and again, I think you guys are extremely aware of this — that when you have any asynchronous communication, and when you're talking to the file system or over the network, it's asynchronous.
It can block MPI communication, and collectives can really behave rather poorly. So one of the things that we've spent a tremendous amount of time on is understanding how we actually move data, and if you are doing this asynchronously, we have to be able to stage pieces around your communication layer. So if you're doing a first-in, first-out, then you're going to get killed. So we can do that, but you try to put extra layers in your code. So we have APIs that can keep track of — here's, if you want to say, a simulation producing data in time steps; here's the end of a time step; here's my computational phase — so we know when to push data out. So that requires more sophistication. In other words, yes, we can do this, but you have to be aware: if you measure the time for your entire run doing everything asynchronously, and you're not doing something very fancy, then even though your I/O time may look like it's practically zero, your actual total wall-clock time may be a lot longer than if you did something with just synchronous, poorly-behaving I/O.

All right. Now, with all that being said, there is a huge amount of value, obviously, in having experts do all the behind-the-scenes things for you. But still, even with the most intelligent middleware on the planet, applications can still do terrible things and get horrible performance. What are some general rules of thumb that you would advise users to follow, particularly when using ADIOS, to get good performance?

Okay, this is probably the toughest question you've asked, and it's difficult because there are a lot of things that people have done, and I really don't want to start pointing at particular examples. But for example, what we found is that a lot of people do a lot of different memory rearrangements, because the assumption is: when I write to that file system, and it's a parallel file system, I have some variable, say temperature, which is distributed across 100,000 cores, so they do a lot of different things to make that data look like it's logically contiguous. So there's a whole, if you want to say, community that says write it out logically contiguous, and the parallel file system will then take advantage of that. We're of kind of the opposite view, which is: write out these data chunks, which are contiguous, and those data chunks can then be laid out to the file system so you get concurrency from it. So don't do too much aggregation in memory, because you'll pay a high cost. In other words, where people screw up, and we have to fix it, is when people spend a huge amount of time doing all this stuff before they do I/O, because that's what they believe the popular I/O systems want, and we have to come in and rip everything out, back to their bare-bones code. So if you just had your code and didn't do anything special, we in general don't have too much of a problem. The biggest challenge we have is when someone has a large number of processors — let's say a million processors — and each processor wants to write one kilobyte of data that comes from ten different variables. So it's small data, but they're going to do it every ten seconds. The bandwidth's not a lot, but that will kill every system. So what we try to do is tell them to aggregate in time, or do other things. So unless you're one of these extreme cases — very small amounts of data and lots of opens and closes — then obviously you don't have to do anything special.

Now, what if I want to bypass this completely and get extreme performance and just leave it all in memory? ADIOS — it looked like it had some feature for handling this.
Yes. So, well, you have to specify exactly what you want. What do you want to do? Because "leave the data in memory" — what do you mean by that? Leave it in the location where it's generated and do some other processing on it, on the same compute nodes or same processor or same core, synchronously or asynchronously? Or do you want to move the data out to the memory of another processor, so that you can run some other code on that processor to process the data? And so you don't need to change the application. The data is generated, and at some point you tell ADIOS to get rid of that data — I mean, do output — and that can happen in different ways. It can go to the file system, or it can go to a staging method, and the staging method may buffer it locally and offer it to another process, using shared-memory segments to exchange data on the same processor, or use the network to push it to some server. Then you can read it lazily from another process, like in an interactive visualization, and get the data later. Or you can just push it to another processor where an analysis process is running, and process the data there. For those solutions, you simply change the method — the output method — that you are using. So, I mean, the point is that you can use the API thinking you'll forever use files and not care about it, or you can work a little bit to use the generic API, which then, without modification, can work with staging as well as with files. Right, so your code has to be able to process the data time step by time step with the simulation to be able to use staging. That's all the requirement there is.

All right, going in a slightly different direction: this is a question we like to ask a lot of guests who come on the show. What's the largest type of system that you're aware of, that you can talk about, that ADIOS has been used on?

So "largest" is a complicated thing — I could ask what you mean by large — but let me try to answer several different ways. In terms of the largest number of cores, I think we've run on pretty much the full system on Mira at the ALCF. So we've had codes run on that — I can't remember how many cores, but you can look it up — and I know that some of the fusion codes we run with ran the full system, along with this quantum turbulence code. We've run on all of Titan; we've run on, I think, all of Tianhe-1A. And when I say run, I mean these were real simulations, not us running the code for other people. So with ADIOS, I believe ADIOS codes have run on most of the largest systems in the US and Asia.

Okay, another question that I like to ask on our show, particularly of those who have software-based projects: which version control system do you use, and why?
Okay, so we use GitHub right now. And as for the why: simply that we used an SVN repository before, inside Oak Ridge, and that didn't allow access from outside for collaborators. So we had to find a place where other collaborators could get to it and push their changes into it. So we just chose GitHub. It works fine, so we don't need to choose anything else.

Okay, so you've had these seven major releases of ADIOS, I think you said. What's kind of the plan for the next major release?

So ADIOS 1.8 is coming out around Supercomputing conference time — we have two releases every year — and the new big thing will be the query API, to support queries. So with an I/O library, you produce a lot of data and you can read that data back in, but that's usually not efficient, because you have to read in everything to find the data you are interested in. So the query API is designed for expressing your queries, where you are looking for some data based on some other variables — like, you look for pressure where the temperature is above a threshold. So you express your query, then a query engine will evaluate that query and return the regions that you will need to read in to get the data. And that could greatly improve the read performance, if the hits it finds are much smaller than the whole data set.

Let me just inject a couple of things. The first is that a lot of that work has been done by some of the researchers and developers at Lawrence Berkeley Lab and North Carolina State University, and some of this works with even compressing the data and indexing the compressed data. The other thing that Norbert didn't mention, a particular favorite of mine that's coming out, is wide-area network staging. And that's where — just again, as we said, you can use separate nodes, you can go on the same cores for doing these in-memory operations — we've been working with some of the people at KISTI in Korea, we've been working with many people, to now have it so that when you stage, the data can go over to another system, assuming again you go through all the security models. But it allows you to generate data in one place and then stage it to another place, in memory.

Okay, that all sounds great. So if I were either a new user or a new researcher or developer — someone who wanted to get involved in the ADIOS project and use it — how do I do that? Where do I find you guys?

We sent you a link to put on the podcast page, a link to the binaries and documentation. So the OLCF has a webpage maintained for the ADIOS software downloads; that's where users can go and download ADIOS. And developers can find ADIOS on GitHub. It's a public project, and anyone can download it, fork it, and work on it. And if someone wants to develop something new that would be worth including in future ADIOS releases, then they just have to talk to us — Scott and me — and we can start collaborating. We do that with several people all over the world.

Okay, thank you very much for your time.

Thanks, guys. Appreciate it.

Thank you.
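[Editor's note: the query API discussed above had not shipped at recording time, so no example of it is given here. But the reading side it builds on can be sketched with the ADIOS 1.x read API: a bounding-box selection reads back just a subset of a global array, which is exactly the kind of "region you need to read" a query engine would return. File and variable names are carried over from the earlier sketches, and exact signatures may vary between releases.]

    #include <stdint.h>
    #include <stdlib.h>
    #include <mpi.h>
    #include "adios_read.h"           /* ADIOS 1.x read API */

    int main (int argc, char ** argv)
    {
        MPI_Init (&argc, &argv);
        adios_read_init_method (ADIOS_READ_METHOD_BP, MPI_COMM_WORLD, "");

        ADIOS_FILE * fp = adios_read_open_file ("restart.bp",
                                                ADIOS_READ_METHOD_BP,
                                                MPI_COMM_WORLD);

        /* Read a sub-region of the global array, not the whole thing;
           a query engine would supply these regions, here we pick one. */
        uint64_t start[1] = { 100 };
        uint64_t count[1] = { 50 };
        ADIOS_SELECTION * sel = adios_selection_boundingbox (1, start, count);

        double * temperature = malloc (count[0] * sizeof(double));
        adios_schedule_read (fp, sel, "temperature", 0, 1, temperature);
        adios_perform_reads (fp, 1);  /* 1 = blocking */

        /* ...scan for values above a threshold here, then schedule
           reads of the matching regions of another variable... */

        adios_selection_delete (sel);
        free (temperature);
        adios_read_close (fp);
        adios_read_finalize_method (ADIOS_READ_METHOD_BP);
        MPI_Finalize ();
        return 0;
    }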