Research Computing and Engineering, Episode 3, with Anshu Dubey from the ASC Flash Center at the University of Chicago. Welcome to the show. Thank you.

So, first off, could you give us a quick summary of what Flash is and exactly what it does? Flash is publicly available software that was primarily designed to simulate the reactive flows found in astrophysical phenomena. From there it has grown into much more general, component-based software that has found uses in a wide variety of applications. People use it for computational fluid dynamics, they use it for cosmology, and they use it, of course, for what it was originally intended for, the reactive flows and so on. It really is a component-based code: it is not a single application code. You can generate many different applications by combining different components, and even different implementations of the same component. So it is general-purpose computational scientific software.

And those modules, that's why you always have to run setup, and you kind of make that directory where you say, I want this module, this module, this module? That's right. Okay, so it really is flexible, and you just pick which things you actually want to explore. I actually got exposed to Flash when somebody was using it to watch a very high-intensity laser interacting with a metal plate, which was basically specified as a very high-intensity deposition of energy at one point, instantaneously. Right. It has been used in many unusual circumstances. One of the funnier ones: we conducted a user survey a few years ago, and some of the responses were quite interesting. There was one person at HP, for example, who used Flash every time he swapped his disks in and out, because if he had done something wrong, Flash would find the problem for him. So it was a way of testing his disk setup? Yes.

So this code truly requires what you would call terascale resources, really large systems. Flash is not something you run on a relatively small system. That's not true. You can run it on a workstation also. You can run simple, small problems on your workstation. You can download Flash and use it on your Apple laptop with a G4, and it will run as long as you have HDF5 and MPI installed on it. But you can only do a very small set of problems in that kind of environment. The scale of the machine that you need really depends on the problem you are trying to solve rather than on the code itself.

So for the kind of problem Flash was actually designed for, how large a system do you normally see people using? At the Flash Center, for example, we do exploratory 2D runs on a simple cluster with 16 nodes. But the real 3D simulations for one set of problems take about 4,000 to 8,000 CPUs on the Intrepid system at Argonne, and another set uses 8K to 16K CPUs on the same system. So it scales pretty well with problem size. Right, and we have actually scaled the code all the way to the full machine. That's pretty impressive. Yes, but of course when you get to that point, IO becomes a serious issue, so we don't really run the production runs at that level.

So the IO was actually something that was interesting about Flash. Flash was my first introduction to parallel IO, specifically MPI-IO.
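As a quick illustration of what MPI-IO does, here is a minimal sketch in Python with mpi4py; Flash itself does this from Fortran and C, and the file name and array sizes here are invented. Every rank writes its own slab of a distributed array into a single shared file at a computed offset.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# each rank owns one slab of the global array (sizes invented)
local = np.full(1024, rank, dtype=np.float64)

fh = MPI.File.Open(comm, "snapshot.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
# collective write: every rank participates, each at its own byte offset,
# so one file holds the whole distributed array
fh.Write_at_all(rank * local.nbytes, local)
fh.Close()
```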
And you guys actually implemented that, especially in the new version. You had a major release at the beginning of last year, when Flash 3 came out. Yes. And you made IO a plug-in module, just like the different physics units: there is serial HDF5, parallel HDF5, and now parallel netCDF as well as plain Fortran IO. Yes. Was it mostly to deal with the IO problem that you did it this way? Well, at one time we got access to a really large Blue Gene/L machine at Livermore, 64,000 processors, with less than three weeks' notice, and none of our IO packages scaled. They did have HDF5 implemented on it, they did have netCDF, and we even attempted straight MPI-IO, but not one library scaled. If you were running on 1,000 processors, IO would take X amount of time; on 2,000 it would take 2X; on 4,000 it would take 4X. We didn't have a choice. On that short notice, we had to come up with a direct IO, meaning direct writes from Flash. And that was a pretty hairy experience, because when you run on that many processors (we actually ran on 32,000), a single time snapshot would produce 32,000 files. And on the fly, we had to keep modifying things. So from that point on, we have always kept that available with the code as a last-resort backup.

That's pretty crazy. So you would have 32,000 files per time snapshot, because you don't output every time step. Yeah, you write a plot file every so often. When I had Flash going, we were writing four-to-five-gigabyte files, and that was only on about 200 CPUs, every 15 minutes. So imagine writing 32,000 files every 15 minutes. So when you said you weren't scaling, you were just waiting on IO all the time? If we hadn't done the direct IO, yes. But in this case, what happened was that we didn't realize initially that a directory can only handle so many files. So we ran into situations where you simply type ls, and 15 minutes later you're begging the systems guys to kill the job, because the directory can't handle it. But the direct IO itself was very fast. We then had to write tools to stitch it all back together. That still has to be pretty hairy, it sounds like.

Okay, so Flash scales: the computation scales to a large number of CPUs, and with IO you sometimes have to be careful what it scales to. What differences have you actually seen between the parallel HDF5 implementation and the parallel netCDF implementation? For the longest time, we weren't able to use parallel netCDF in production mode because it had a limitation on the file size: you couldn't write a single file greater than two gigabytes. But it so happened that when we originally did experiments comparing HDF5 and PnetCDF, PnetCDF was significantly faster. Since that time, we have realized that we weren't fully exploiting the collective operations when we were using HDF5, and when we do, the performance of the two is comparable. But we still haven't succeeded in using PnetCDF in production mode, because the fix for the file size limit is relatively new.
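The collective-operations point is worth making concrete. Below is a minimal sketch of a collective parallel HDF5 write, using Python's h5py built against a parallel HDF5 (Flash does the equivalent from C); the dataset name and block sizes are made up for illustration.

```python
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# one shared file, opened through the MPI-IO driver
f = h5py.File("plotfile.h5", "w", driver="mpio", comm=comm)
dset = f.create_dataset("dens", (size, 16, 16, 16), dtype="f8")

# collective mode: all ranks coordinate the write as one operation
# instead of issuing independent writes one by one
with dset.collective:
    dset[rank] = np.random.rand(16, 16, 16)

f.close()
```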
Okay, so you would recommend that users stick with parallel HDF5 in Flash 3 for larger jobs? I would say that if their output files are two gig or less, they might want to try PnetCDF, but otherwise they should stick with HDF5.

Okay, what about the future? The netCDF project has actually switched to using HDF5 as its underlying file format. Do you see that affecting the direction you go in the future? To the best of our knowledge, that's only the netCDF project, not the PnetCDF project. Okay, and you don't see those getting rolled together in the future? I don't think so, because, at least from my discussions with the IO people here at Argonne National Lab, the people behind PnetCDF, there are issues with the way HDF5 handles metadata on really large-scale parallel file systems. They have identified those issues, and they are trying to get around them in their own implementation, which rather speaks against using HDF5 as an underlying base. That's interesting, because right now using HDF5 as the underlying format is an option in netCDF-4, so you can either use it or not. So whether they continue down the HDF5 path could change in the future. It's interesting; I should get those guys on and talk to them. Yeah, I think those two projects, netCDF and PnetCDF, are fairly independent of each other in some sense. I know they're at two different institutions; I've not looked into it very closely.

So back to Flash: PnetCDF support was an addition in Flash 3, the new major release you did at the beginning of last year. What is significantly different about Flash 3 versus Flash 2? The most important thing to notice about Flash 3 is that it actually realized the vision of Flash 2. In Flash version 2, there was an attempt to go to a fully component-based model, but it wasn't very successful, because I think when you start out with legacy codes whose pieces are very intertwined with each other, it takes a lot of work and a lot of iterations to get it right. In Flash 3, I think we have finally done what had been attempted right from the beginning. The other major difference is that in Flash 2, the data management was completely centralized: there was a central database, and everybody queried it and got whatever data they needed. It was felt that that left the question of who should have modification rights to what data ambiguous. So we went ahead and clearly defined the data ownership by different components, and each code unit now clearly states what data it owns and who can modify it. That's at the very basic architectural level. In addition to that, we introduced the concept of subunits in Flash, which allows related functionalities to be grouped more closely together, so there is a sort of hierarchy in how closely you put together certain kinds of functionality. For example, when you're dealing with Lagrangian tracer particles, or Lagrangian particles of any kind, there are pieces that you want to have multiple implementations of, and there are times when you want to turn certain of those capabilities off. We let that happen with subunits, as opposed to bringing everything up to the same level of the hierarchy, in which case we would just have a proliferation of units and the code would not be very manageable. And the third thing is that with Flash 3 we really started getting exposed to really large-scale systems, and we found that some of our parallel algorithms had limitations in scaling. So there are several new parallel algorithms in Flash 3.
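To make the component-based picture sketched above concrete: conceptually, an application is assembled at setup time by choosing one implementation for each unit. The sketch below is a toy Python illustration of that idea, not Flash's actual setup tool, and all of the unit and implementation names in it are invented for illustration.

```python
# Toy illustration of component-based assembly (not Flash's setup tool).
# Each unit has interchangeable implementations; an application is one
# consistent selection across units.

UNITS = {
    "Grid":  {"UniformGrid", "Paramesh"},
    "Hydro": {"PPM", "MHD"},
    "IO":    {"hdf5_serial", "hdf5_parallel", "pnetcdf", "direct"},
}

def assemble(choices):
    """Validate one implementation per unit and return the 'application'."""
    for unit, impl in choices.items():
        if impl not in UNITS.get(unit, ()):
            raise ValueError(f"unit {unit!r} has no implementation {impl!r}")
    return dict(choices)

# e.g. an AMR hydro problem writing through parallel HDF5
app = assemble({"Grid": "Paramesh", "Hydro": "PPM", "IO": "hdf5_parallel"})
print(app)
```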
Okay. So mostly it was an update to make it really do what you wanted it to do, but a lot of the physics is still the same in the system? Yes. For the MHD solver, for example, you now have two options, and one of them is completely new; it also changes the time-advancement capabilities that Flash had. But by and large, the physics solvers released in Flash do not differ very much from 2.5.

So you have all these different modular plug-ins people can use to set up their own problems, to make Flash solve different types of problems. There must be many combinations then, because you can put these things together in different ways and do different things. How do you verify that one module doesn't break another module? Do you have some sort of test system? Yes, we do. We have a built-in unit-test framework in Flash 3 whereby every unit tries to test itself as independently as possible, and wherever possible it does so against an analytical solution, or at least a semi-analytical solution. In cases where that's not possible, we have tests that increase in complexity. For example, there would be a test that just tests the hydrodynamics in the code, and then something more complicated, such as the cellular detonation problem, which uses the burning capability along with the hydrodynamics. So the tests build upon one another. And we have a tool that does regression testing nightly, on multiple platforms, using multiple compilers, running all of these tests: like I said, a combination of unit tests, simple tests, and increasingly complex tests, with which we try to cover the entire functionality of the code. And there is a web tool that then lets you see the results. So we take the verification of the code very seriously.

That's a pretty in-depth system; most people don't have anything like it. That's quite neat. Is this a system that, like, I could actually get a copy of? It's freely available. Unlike Flash, you do not even have to sign a license agreement. And as long as you follow the model of configuration, build, and comparison, it isn't tied to Flash. Okay, so I could pretty much make it work for any other piece of software I had? Yes. As long as you have these steps, a configuration step, a build step, and a run step, and you have a benchmark that some human being has actually looked at and approved as being correct. The test suite then does regression testing against approved benchmarks. Okay.
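In spirit, the comparison stage of such a test suite reduces to checking a run's output against the human-approved benchmark within a tolerance. Here is a minimal sketch in Python; the file names, dataset names, and tolerance are assumptions for illustration, not the Flash test tool's actual interface.

```python
import h5py
import numpy as np

def matches_benchmark(run_file, benchmark_file,
                      variables=("dens", "pres"), rtol=1e-12):
    """Compare selected datasets of a test run against a
    human-approved benchmark file (names here are invented)."""
    with h5py.File(run_file, "r") as run, \
         h5py.File(benchmark_file, "r") as bench:
        for var in variables:
            if not np.allclose(run[var][...], bench[var][...], rtol=rtol):
                return False
    return True

# a nightly driver would do: configure -> build -> run -> compare
if not matches_benchmark("sedov_run.h5", "sedov_approved.h5"):
    raise SystemExit("regression detected against approved benchmark")
```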
So then, all these modules, do you write them all? How do you pick which new modules get written? Do you get feedback from users, or do you pretty much have a goal of, we want to solve X type of problem? It works both ways. The highest priority is given to modules that are needed for the work of the scientists within the center. If they need something new, we develop the module in collaboration with the scientists; sometimes the scientists themselves develop a module and give it to us to clean up. Then we have done these surveys a couple of times, and also in meetings, when we go and present Flash to people, we get feedback on what capabilities they would like to see. And if enough people want to see a capability, we consider developing it. We are absolutely open to, and welcome, contributions from external users, and there are quite a few capabilities in the code that have come from them. For example, there is a self-gravity solver using a multigrid algorithm that came from Paul Ricker at Urbana, there is a module to handle ionization that came from a group in Italy, and there are several other such contributions.

So, something from my own background: my background is in nuclear engineering, and I have a question about something Flash does. You say hydrodynamics; would that be related to a traditional CFD-type problem, computational fluid dynamics? There are certain kinds of computational fluid dynamics you can do with Flash, and there are certain others for which the capabilities are somewhat limited. You can handle compressible fluids with Flash, the ones that have shocks and so on; it isn't very good at doing incompressible fluids, for instance. But in some of these bodies you're simulating, I would assume that relativity comes into play. So do you have the ability to do CFD while taking relativity into consideration? No, not yet; it hasn't come up so far in any of the discussions with the scientists. Oh, no, actually, let me take that back; that was a blooper on my part. We do have a relativistic hydrodynamics implementation in the code as well. I would assume in some of these cases things are moving quickly enough that relativity should come into play. Yes, there is absolutely a relativistic hydro module in the code.

Okay. Now, something Flash does that is kind of interesting: it uses a piece of software called PARAMESH, which is an adaptive mesh refinement package. That is an external library that someone else wrote, correct? Didn't the author work briefly here at the Flash Center, too? He did. Okay. So it's actually quite neat: it creates your mesh, and it refines or de-refines the mesh as needed to save computation time. Yes. That is just really neat. So many things use a static grid: you create your mesh, you partition it, you throw it on, you solve your thing, and you come back off. Here, the mesh actually changes as the problem progresses. I just find that ridiculously neat. Yes, although AMR codes have been around for a while; we are not the only ones.

Okay. But one thing about the AMR, and other parts of Flash: would you say Flash requires a low-latency network or a high-bandwidth network? All of those things depend completely on the problem. For example, there is also a uniform grid supported in Flash. If you have a problem where things are going on all over the place, then there isn't any point in tolerating the overheads of running in the AMR mode, and there are substantial overheads. In those cases, neither latency nor bandwidth actually matters very much, because you can stick to strictly nearest-neighbor communication. But when you're doing AMR, there are two kinds of communication that happen. There is the communication that fills the guard cells, because Flash is an explicit code, and there is the communication related to the redistribution of blocks as they come into existence or go out of existence when the mesh refines or de-refines. Now, if you have a problem where things change very rapidly, latency might be the important issue, because if changes happen very rapidly, the blocks need to go far and wide. Whereas if the changes are less rapid and more localized, you will have more blocks changing in a small region of space, and then bandwidth is more important. So really, it's problem-dependent. Ideally, you want the best of both worlds: low latency and high bandwidth. Everybody wants that, right? Yeah.
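Guard cells (often called halo or ghost cells) are the layers of neighboring-block data each block needs in order to advance its own interior with an explicit scheme. Here is a minimal 1D sketch of that exchange in Python with mpi4py; the block and guard-cell sizes are invented, and the real Flash/PARAMESH code does this in Fortran across an adaptive 3D block structure.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

nguard, nint = 4, 16              # guard layers and interior cells (invented sizes)
u = np.zeros(nint + 2 * nguard)   # one block's 1D data, guard cells at both ends
u[nguard:-nguard] = rank          # fill the interior this rank owns

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# swap boundary layers with spatial neighbours so each block can
# apply its explicit stencil near its edges
comm.Sendrecv(u[nguard:2 * nguard].copy(), dest=left,
              recvbuf=u[-nguard:], source=right)
comm.Sendrecv(u[-2 * nguard:-nguard].copy(), dest=right,
              recvbuf=u[:nguard], source=left)
```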
So it's actually quite neat, because as you're refining and de-refining your mesh, the balance across CPUs actually changes. You mentioned you're moving blocks around, so you have to repartition things almost as the code runs, right, to try to keep the computation balanced across CPUs? Yes. We use a space-filling curve to figure that out. See, it's not just that you want to keep the workload balanced; you also want to maintain the spatial proximity of blocks, because you need those guard cells filled from your nearest-neighbor blocks in the spatial domain. If those blocks go away to a very distant processor, you incur more expense in getting the guard cells filled. So it has to balance the need for spatial proximity against load balance, and that's done using a space-filling curve. Okay.
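The trick is that sorting blocks along a space-filling curve gives a one-dimensional ordering in which spatially nearby blocks tend to stay close together, so cutting the curve into equal pieces balances load while preserving locality. Here is a small sketch of the idea in Python, using a Morton (bit-interleaving) curve of the kind PARAMESH uses; the block coordinates and counts are invented for illustration.

```python
def morton_key(ix, iy, iz, bits=10):
    """Interleave coordinate bits: blocks close in space get close keys."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b)
        key |= ((iy >> b) & 1) << (3 * b + 1)
        key |= ((iz >> b) & 1) << (3 * b + 2)
    return key

def partition(blocks, nprocs):
    """Sort blocks along the curve, then cut it into contiguous,
    nearly equal pieces, one per processor."""
    ordered = sorted(blocks, key=lambda b: morton_key(*b))
    n, extra = divmod(len(ordered), nprocs)
    out, start = [], 0
    for p in range(nprocs):
        stop = start + n + (1 if p < extra else 0)
        out.append(ordered[start:stop])
        start = stop
    return out

# e.g. an 8x8x8 arrangement of blocks spread over 5 processors
blocks = [(i, j, k) for i in range(8) for j in range(8) for k in range(8)]
parts = partition(blocks, 5)
```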
So Flash actually comes with a number of IDL scripts so people can visualize output files from Flash. I've also used a really good tool, which we will have on here: VisIt, from Lawrence Livermore, has a Flash plug-in reader. Yes. The IDL routines are used most often for debugging purposes, and the VisIt route is more recent. If you just have a 2D visualization, the IDL routines are very useful. But in the center, the scientists have almost exclusively switched to using VisIt for really analyzing the results and looking at them, because most of the simulations we do now are 3D, and in the IDL routines we provide, you can only see 2D slices of 3D data; you can't really visualize it in 3D. So would you say that VisIt is the more common tool used? For analyzing data, yes.

So we already touched on HDF5, but one thing I noticed when building Flash, and this was true in version 2, is that the HDF5 routines were all C files instead of Fortran. Was there a specific reason why you chose to implement the IO routines in C, while leaving everything else in Fortran 90, and not use the HDF5 Fortran bindings? Because our first foray into HDF was HDF4, and even though I wasn't directly working with Flash at that time, I seem to recall that either it didn't have Fortran bindings at all or people found them very hard to use. So the wrappers were written in Fortran, the C functions were used underneath, and we've just continued with that. Okay, so there's no specific reason; the Fortran bindings could be perfectly fine now, something someone could look at as an option? Yes, and we wouldn't know, because we've never attempted to do it directly with the Fortran bindings.

Okay, so what is the future of Flash and the Flash Center? That is a difficult one to answer. We are on ramp-down funding as far as the Alliance program goes, but of course we have written several proposals and we are looking for funding, and I'd say that with a user base of 300-plus and growing, it should be possible to convince people that Flash should survive. Many people have a vested interest in keeping Flash going. And I think the code itself will continue to be supported one way or another. There is a commitment from the university, but of course it would be nicer if one of the various grants that we have written comes through and we have funding for it directly.

Okay, well, thanks a lot. Thanks for taking some time out to talk with us.