You can find us online at rce-cast.com. There you can find links to the blogs, the Twitters, and the entire back catalog of every single episode we have ever done, of which we now have more than 100. Once again, I have here Jeff Squyres from Cisco Systems, one of the authors of Open MPI. Jeff, thanks a lot for your time. Hey Brock, it's always good to do another podcast. We've been a little slack recently, so it's good to get back in the groove here. We got this one up today, and we've got a few more queued up, but it's getting into the busy season for HPC. Yeah, so we have Supercomputing coming up in November. We're at the very end of September here, so we're only about a month away from Supercomputing, a little more than that. I will be there. The University of Michigan will be having its joint booth with Michigan State University again this year. I will be floating around there and floating around other places. Jeff, Cisco always has a booth. Yeah, we have a booth. Unfortunately, I have a fairly terrible position on the floor this year due to some silliness, but we'll be there. I've got my normal Open MPI State of the Union BoF with George Bosilca, my co-host, another one of the Open MPI core developers. I'm also helping out with a libfabric tutorial. Libfabric is the new up-and-coming low-level network API. We've got that on, I believe it's Sunday. It might be Monday; I'm afraid I don't remember offhand. But yeah, those are the big things that I'm doing at Supercomputing this year. What else have you got going on? So what we've got going on is we will be announcing a couple of interesting things at Michigan that you can come by and talk to us about at our booth. We have our new Institute for Data-Driven Computational Physics, which we recently got funded to create. But the big, big thing that's going on is the new Data Science Initiative coming out of the University of Michigan. This is gonna be really the kickoff event for the Michigan Institute for Data Science.
We're gonna be creating a graduate program, for which we already have students enrolled for the fall. The university is investing about $100 million, and we're gonna create 35 new faculty positions. We've put money into a bunch of medium and big data infrastructure, which is really stretching the sorts of things that we've traditionally done here at Michigan, and we're really excited about some of the things we can do. We're already getting our health system lined up for health data. We've got transportation data. We've got all sorts of crazy things going on. We're gonna try to bring it all together and finally build filters for all this data. So come by and talk to us about that. $100 million. Right, it's real money, real money. All right, well, what are we talking about today? So today we're gonna be talking about something that I actually know very little about, and I hope to get some clarification about my understanding of it. We'll be talking with a researcher who's created a piece of software called Conduit. So why don't we have our guest go ahead and introduce himself. Yes, hi, I'm Cyrus Harrison from Lawrence Livermore Lab, and I'm known primarily for my work on VisIt, which is an open source visualization tool, but today I'm here to talk to you about Conduit, which is a smaller effort we've been working on for the last few years. So can you give us an idea of what Conduit is? All right, so Conduit's something we've been looking at to sort of simplify our daily life and data exchange between simulation codes as a whole. So people who write big, big physics simulation codes usually have to deal with IO, and deal with mesh-aware IO, for example. And there's a lot of pain in dealing with static APIs and things of that nature. So Conduit was designed to sort of tug at that and make things a little bit easier with respect to passing data to and from, say, a visualization tool or within a simulation code.
So this was a little bit confusing when I was reading through the material. A name like Conduit almost sounds like it's a communication or some sort of networking thing, but then you talk about, you know, describing data structures, and you talk about communicating between other things. Can you give a little bit more clarification? Is it one? Is it the other? Is it both? It's both. So at its core, Conduit is really just about describing hierarchical data. The best way to think about it is if you cross JSON, JavaScript Object Notation, with NumPy. NumPy has been used very successfully in the scientific Python community for describing arrays and operations on arrays, and JSON's been used for dealing with hierarchical data structures and all kinds of data exchange on the web. So at its core, Conduit's about describing things in core, so in your memory, and then on top of that, you can build services like communication or IO or serialization. All right, due to unforeseen technical difficulties, apparently the internet fell apart in California. So we had to call Cyrus back on a regular old POTS phone, and we will now return you to your interview already in progress. Okay, so, Cyrus, you said something a second ago that confused me a little bit. You talked about how static APIs were bad, and I'm assuming that is in contrast to dynamic APIs. Could you explain that a little bit further? Yeah, so I guess in a lot of other areas like web development, or even in scripting languages, people are used to having a lot of flexibility with how they describe data and how they pass things around. And in HPC, typically, we have a lot of preexisting APIs for, say, describing what a mesh is, or something like that, or we try to solve the problem with code generation. So when we're building a big code, we have code generation that describes our complicated objects.
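The "JSON crossed with NumPy" idea above can be sketched in a few lines of Python. This is a hypothetical illustration of the concept only, not Conduit's actual API: the bulk numeric data stays as a flat array in memory, and a small JSON document describes its type, length, and byte order.

```python
import array
import json
import sys

# Illustrative sketch (not Conduit's real API): keep the numbers as a
# flat in-memory array, and describe them with a small JSON document.
pressure = array.array("d", [101.3, 99.8, 100.4])  # 64-bit doubles in memory

schema = {
    "fields": {
        "pressure": {
            "dtype": "float64",                     # element type
            "number_of_elements": len(pressure),    # how many values
            "endianness": sys.byteorder,            # byte order matters
        }
    }
}

# The description is hierarchical and human-readable, JSON-style...
schema_json = json.dumps(schema, indent=2)
print(schema_json)
# ...while the actual data stays where it is, NumPy-style.
```

The hierarchy (an object named `fields` containing an object named `pressure`) is the JSON half; the explicit width and endianness of the leaf array is the NumPy half.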
And the problem with code generation is that, while it gives you good performance, it becomes pretty burdensome software-engineering-wise. So if you look at a tool like VisIt, if we were to accept code-generated stuff from all the different codes that we want to read in to visualize, it would be a really unbearable process to get all their code and coordinate their various code generation schemes. So when we're talking about this stuff here, we're really not saying that static and dynamic are mortal enemies; there are just different benefits to doing a dynamic approach. And by dynamic here, I really mean runtime. So you can do full description at runtime and have introspection and things like that. So can you give us a little bit more history about how this came about? Was this trying to solve the needs of a specific code, like VisIt or some scientific code, or is this just a general problem you keep running into? Yeah, so it started out with some of the pain points with describing data. With VisIt, we deal with mesh data, and we use VTK, for example, for describing our meshes. There are a lot of other things, though, that go along with it that are a little bit more complicated, like, in a distributed domain-parallel context, how those meshes on all the processors connect to each other, for example. This is hierarchical data by nature. And we have probably 10 different mechanisms for doing this in VisIt, and some of them use packed binary data from specific codes, and some of them use APIs that were specifically designed to deal with this, but then we'd find, oh no, we have one corner case that kind of undermines our whole API. So what we're trying to do is let the simulation code describe this data structure to us, with enough introspection that when it gets to VisIt we can actually pull out the pieces and get what we care about and what we need.
And that's where it started out. It's also an important aspect of trying to deal with in situ visualization, where it's easier to pass things zero-copy, like passing a mesh zero-copy from another code. So a lot of jargon there. So I'll ask you to break those things down for us. You keep mentioning this is really for a tool like VisIt to be able to kind of figure everything out on its own. I don't necessarily want to go all the way to saying self-describing, but it sounds pretty similar to that. Is this already in VisIt today? Like, is there a version of this in VisIt today? So it's not released in VisIt yet, but what we've done, so again this project started in 2013, and over the last year we had a set of students from Harvey Mudd College working on it with us, where we used it to re-plumb aspects of different simulation mini-apps. So these are common proxy applications that are used in procurement instead of just regular benchmarks. So what we did, and it was pretty interesting, is we went in and had the students kind of re-plumb the IO, or re-plumb the MPI, and re-plumb how they collect performance data from the simulations, all just using Conduit, which is kind of a simple building block. So we did these experiments, and that gave us the confidence that, hey, this is gonna work well. So now we're starting to figure out how to roll it out in VisIt and solve the specific problems of the code we were looking at starting a couple years ago, and then also expand it out to be used for lots more things, I think. Well, let me ask you then about, Brock was just asking you about self-describing, and you were talking about the APIs describing the data itself and so on. These are also terms that are typically bandied about in the current hot, sexy realm of unstructured data and big data and things like that. Is this related to that at all?
Because you've also been describing this as a hierarchical type of description, which implies more regularity, so that you could potentially use a much smaller description to describe a huge amount of data. Is this the same kind of playing field? So I think it could be. The focus we really had here, though, was on numeric types. That's why I talk about JSON crossed with NumPy, right? We could describe things in the big data realm as well, but the important thing for us here is to make sure that if we have a double array, we know it's 64 bits, and we know the right endianness, and all these other things. So there's kind of extra special sauce that's specifically for scientific computing. So this is kind of interesting, because I feel like when I first started learning HDF5, and then later on ADIOS, it seems like they already do a lot of these things. So what's the real difference between these systems? Yeah, so Conduit's simpler than both of those things, and it's specifically built for data description in core, with other things built on top of it. So it's not actually, at its core, an IO library. It could be used with ADIOS; basically ADIOS or HDF5 could be used behind the scenes to do some IO functionality, but it's really for in-core data. So if you wanted to pass data in situ, basically between one part of a simulation and another part of a simulation, or between a simulation and a visualization package, this gives you a very simple way to do that. And that's sort of where the line's drawn. Now, obviously a lot of the same concepts are involved, with the hierarchical nature and keys and all these things, but at its core Conduit's about describing things as they already are in memory. So now, you mentioned that JSON is used as well, or JSON, or however the kids pronounce it these days.
Is that used when you're passing from point A to point B? So if you have a one-gigabyte array of doubles, do you pass it in a JSON format, or do you pass the description in JSON and then pass just a native array of doubles? Yeah, so the description would be passed in JSON. JSON allows you a really easy way to describe the data in core, and we have a schema, so it doesn't always hold everything in JSON; going from point A to point B, you would create a JSON schema that does the description, and you could send the binary data separately, for example. So, I'm trying to think about this. Do you allocate your arrays using a new Conduit array? I'm trying to figure out exactly how I would work this into my application, or if this is something I kind of write afterwards, almost like XDMF, where I write it afterwards so someone can understand something I wrote someplace else, or is this actually in my application, being passed around all the time? So it could sort of be both, I guess. That's a non-answer, but there are two ways you can use it. If you already have data in your application that's basically described through the basic set of types, again, think NumPy: floating point arrays or integers or strings, you can create this Conduit object that just describes the data that already exists and owns nothing. So a Conduit object doesn't own anything but a description of the data, which is nice because that allows you to zero-copy data. But it also has a dynamic API where you can build trees of hierarchical data on the fly, and, through the magic of some overloading, especially in C++, you can really easily build up these data structures and have your own copy of them. Once you have it codified into a Conduit object and this hierarchy, then the use of them really looks exactly the same, I guess.
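The pattern described above, where a small JSON schema travels as text while the bulk array travels separately as raw bytes, can be sketched as follows. This is a hypothetical illustration of the idea, not Conduit's real wire format:

```python
import array
import json
import sys

# Sending side: the description is small JSON, the data is raw bytes.
values = array.array("d", range(1000))          # stand-in for the big array

schema = json.dumps({"dtype": "float64",
                     "count": len(values),
                     "endianness": sys.byteorder})
payload = values.tobytes()                      # binary blob, sent on its own

# Receiving side: use the schema to reinterpret the raw bytes.
meta = json.loads(schema)
restored = array.array("d")
restored.frombytes(payload)                     # assumes matching byte order

print(len(schema), "bytes of JSON describe", len(payload), "bytes of doubles")
```

For a real one-gigabyte array the JSON stays the same few dozen bytes; only the binary payload grows, which is why the description can ride along essentially for free.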
So this sounds relatively similar to the MPI datatype concept, and I see that you have some Conduit pieces that speak to MPI types of entities. Is there a mapping between the JSON data types that you create and MPI datatypes, or an optional datatype mapping if you're talking to an MPI entity? So we've talked about this a little bit. It's something that we would do in the future, but we haven't gotten there yet. For MPI operations now, basically we know how the data is described and we know how it's laid out in core, so we can do serialization. So we do serialization for our MPI features currently, but we could do a more robust coupling with MPI datatypes in the future. So what kind of data types do you handle? So the MPI datatypes are both great and terrible at the same time, in that they can represent anything, and that's both good and bad, because there are a couple of shortcuts for describing simple things, but when you wanna describe complex things, you can, but it takes a little bit of work to do that. How does one describe data in Conduit? Is it an easy thing? Is it a hard thing? What's the interface like? Does it just figure it out via introspection? How does that work? So fundamentally it's meant to be very easy, and it's meant to sort of limit the data model. Basically, there are three different things you can have. You can have an object, which is really an associative array, so that's how you get some hierarchy. That object would contain names of other things, like a pressure field or something like that. You can have a list, which is just a linear ordered non-named thing, just a blob of a whole bunch of heterogeneous things. Or you can have a leaf type, which is one of these concrete arrays.
So it could be an array of floating point numbers, or it could be a string, and there are different striding swizzles and things like that on those, so that if you're doing structs of arrays or arrays of structs, you can play tricks with how you do the indexing to make it all look nice. But if you're just dynamically creating an object, you can just go and use what's basically an associative array or dictionary syntax, even in C++. So you just give it a name and assign it a value, and it'll figure out the right type for you, and then there's some more special sauce for dealing with complicated arrays if you have striding and things like that. So I wanna think about something like using external libraries, like BLAS libraries or FFTW or something like that, which sometimes have special behaviors, like, well, we know this is going to be symmetric, so I'm only gonna return half the data. Can I describe things like that, so that it knows it's symmetric, or is that kind of not really the intended use of this? So it's not meant to solve some of those higher-level problems. It's meant to just solve things in core. That said, you can go a long way with having this hierarchical notion to give a lot of context for things. You can, I guess, describe things starting at any offset you want to as well. So there could be a way, if you envision that it only returns half an array, or returns a pointer somewhere strange at an offset from the beginning of where you started, you could do that. You could encode that in a Conduit node and just return that back, and it would have a description of where that started and how to access it all.
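The dynamic, dictionary-style API described above, where assigning to a named path builds the hierarchy on the fly and the type is inferred at runtime, can be mimicked with a toy node class. This is a hypothetical sketch for intuition only; Conduit's real implementation is C++ and handles typed arrays, striding, and zero-copy, none of which is shown here:

```python
import json

# Toy tree node (not Conduit's real Node class): setting a slash-delimited
# path creates intermediate objects as needed, and leaves record an
# inferred runtime type alongside their value.
class ToyNode:
    def __init__(self):
        self.children = {}
        self.value = None
        self.dtype = None

    def __setitem__(self, path, value):
        head, _, rest = path.partition("/")
        child = self.children.setdefault(head, ToyNode())
        if rest:
            child[rest] = value                  # recurse down the path
        else:
            child.value = value
            child.dtype = type(value).__name__   # type inferred at runtime

    def to_dict(self):
        if not self.children:                    # leaf: report type + value
            return {"dtype": self.dtype, "value": self.value}
        return {name: c.to_dict() for name, c in self.children.items()}

n = ToyNode()
n["fields/pressure/units"] = "Pa"
n["fields/pressure/values"] = [101.3, 99.8]
print(json.dumps(n.to_dict(), indent=2))
```

The point of the sketch is the ergonomics: no schema is declared up front, the tree grows as you assign to it, and the whole thing can introspect itself back out as JSON.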
All right, so if I'm using Conduit in my application, should I really be describing every single one of my data structures using a Conduit describer object, or whatever you're calling them, or should I really only be doing it for the things I plan to either communicate to another application, to do something like in situ visualization, or to do IO to disk with, to be able to, again, describe it to another application? It's really about portability of data structures between applications and not within an application. Yeah, I think the fundamental focus is really between applications. It's nice to have all of your data self-described, but there's gonna be a performance penalty from doing that, right? So this is sort of about coupling. It's about getting data from A to B with context. Now, if it's for talking between applications, does that necessarily always mean that there is a disk in between, or can other flavors of IPC be used? Oh, no, definitely not. So I mean, the easiest way to talk between applications would be through a library, right? So that's actually like an in situ visualization use case. Now, why is that different from just passing a pointer? Well, again, it's the context. So it's easy for you to pass a pointer from one library to another, but this gives you a little bit more context. It could be as simple as passing an array from one part of your code to another. The difference here is the context. So if two parts of your code are sufficiently complicated, then you might wanna have some context between them. You could also use MPI to send things, right? IO doesn't have to be part of the picture. All right, now, it doesn't have to be part of the picture. Can it be part of the picture? Can you write Conduit nodes out to files and then read them back from files? Yes, yes, you certainly can.
So right now it just has a basic capability which saves out a schema and saves out kind of a binary representation, and it has some experimental capabilities for writing actual mesh-aware data out to Silo, which is a library we have here at Livermore. In the future, we're probably gonna deal with HDF5, kind of having a fundamental IO capability that uses HDF5 on the backend in order to get things out in a better, more performant way. Okay, so you have your own format, and you're also looking at using some other underlying formats. But there are approaches like file-per-process, where you have to describe everything back, and there are methods like MPI-IO or some other types where you can have something that's a little bit easier for a human to manage, a single file for a large partitioned problem across processors, but that has other issues involved with it. So does Conduit favor one of these, or does Conduit not care, or does Conduit not even bother with it? So I guess it doesn't care at a fundamental level, but one thing it will make easier, I think, is the in-the-middle case, where you're not doing one giant file and not doing one file per process, but you wanna be able to collate some data sort of hierarchically and get it out to a file system. So using many processes per file, I guess, and I think Mark Miller from the VisIt and Silo team has a multi-, I forget what it's called, there's a term for this, right? But because it can do things with MPI, and because it can serialize and help you serialize, it helps you get data around, so it'll help you be able to make choices like, well, I wanna collect this data together and maintain context. That's something it can help out with, but it's not going to fundamentally pick: you must do one file per process, or you must do MPI-IO with one giant file. It's not gonna try to take a stand there, I guess. I'll try to avoid controversy.
So what languages are supported here by Conduit? You mentioned NumPy, so are we talking primarily Python? So it's actually all developed in C++, and C++ is kind of where all the features come from, so it's underpinned by C++, but there's also a fledgling C API, a Fortran API, and there's a Python API as well. It also helps you solve this sort of language confusion issue where, again, a lot of times you have to do a lot of code generation for multiple languages. But since we're describing things in core with this kind of hierarchical model, the APIs are pretty sane across all of these different languages; they look very, very similar. Modulo syntactic sugar, C++ and Python probably look the best, but the C and the Fortran APIs are actually pretty usable for doing this stuff. So that's another goal of the project: once you have these things described, make it easy to use them in these different languages. Well, I think you get a gold star, sir, for being probably the first person on the planet to mix Fortran and JSON. So I had to learn Fortran to do this project. I'd avoided it up to this point, but actually it's been very, very good. Obviously Fortran's immensely important for HPC, and we had some success coupling a Fortran mini-app in order to do some in situ visualization stuff, and that was a big win, because they were looking at this API as customers and saying, yeah, this is actually usable. So it was good. All right, take a quick cut here, break. I sincerely hope you are using strongly typed definitions, because I hate to admit that I know quite a bit about Fortran now because of my MPI stuff, but the newer, modern Fortran, like Fortran 2008, actually has even stronger typing than C++, and you can do some really nice things with it. And I hope you guys are going in that direction rather than the mpif.h, totally implicit specification of everything kind of stuff.
If you want, I'd be happy to talk to you about it in five or 10 minutes when we're done. So we're actually using ISO C bindings to do a lot of what's working here. But I guess- That is good. Yeah, so that- On the Fortran interface side, though, I hope it's all explicit and all that kind of good stuff. I think I might be committing some sins there, so I won't lie. But again, this is sort of one data model for all these languages, and it's a limited data model. So I think people can use it without shooting themselves in the foot. All right, well, let's chat after recording, because I think Brock will probably want to shoot himself in the head if we talk about this anymore. Okay, so you said this started getting developed only, what was it, two years ago? So is the API stable? Is it something that, if we have needs where we want to be able to pass data between applications, maybe do some in situ work, coupling standalone applications together, is this something we should go ahead and start using? So I think the core Conduit library itself is pretty fleshed out in terms of functionality, the C++ API in particular. We're still filling out the Fortran API and still filling out the C API. There are a couple of other things that are new in there, for instance the Conduit MPI support and some of the IO stuff, which aren't fully fleshed out. So I don't think those would be ready for primetime yet. But I think the ideas and the basic library are ripe for using now. So actually, I had another idea that came up during this. You talked about using MPI to communicate, and it doesn't have to be a file to communicate between things. Does Conduit worry about any of that communication stuff when you have, say, VisIt running with Libsim over here? Well, maybe that's a bad example, because that's specifically what you're working on.
But say I have a standalone visualization application running over here, and I've got my simulation running over here. Does Conduit allow for that kind of pause, expose data, and it's available via TCP, shared memory, or verbs, something like that, for this other application to read, with Conduit providing all of that? So right now it does things over MPI only, with the MPI library. And then the other swizzle we have on this is it can optionally have an embedded web server that allows you to get things to a web browser, via WebSockets, which is kind of interesting. But we haven't done any other transport layers, like IPC or anything like that. If we did, that would be higher-level functionality. It wouldn't be built into the core of Conduit, but it would be a service built on top of it, maybe using nanomsg or some other sort of IPC thing, or some other exotic network transport that would work well. Now, as long as we're flashing back in time in the conversation, there's a question that I forgot to ask, too. Why JSON? Why not something else? How'd you end up there? So I picked JSON because it sort of infected my brain a few years ago, and it's been immensely useful for me in a lot of Python programming. I guess it's just very intuitive, and I think YAML would have also been a good choice. I think XML becomes more complicated because there are so many ways to interpret it. With JSON, there's one way to interpret it, and it works really well across multiple languages; I've seen that with my Python stuff and with how it's treated in C++. So that's why I picked it. And basically all the print functions for human readability shoot out JSON. So I think I've convinced myself that it's fairly intuitive, and I think I've convinced a lot of other people it is too, but it may not be the absolute 100% perfect choice. It was a good one, though.
Do you notice any performance issues in having to convert back and forth to JSON, or is the JSON always metadata, and therefore relatively small in comparison to the actual data that's being passed around as a blob? Yeah, that's kind of the sweet spot for where we wanna use this, right? The metadata should be smaller, and we're dealing with bigger arrays and larger data sets. If we wanted to target more fine-grained things, where the metadata approaches the size of the actual data we're passing around, we probably would switch to some specific packed binary format, which would probably look something like MessagePack or BSON or something like that, but would just be a different kind of schema description protocol. But again, the sweet spot is, sort of like XDMF, you have the metadata description, which is small in comparison to this big data that you're hoisting around. So let's talk about the scalability of this a little bit. So we can describe partitioned data across processors, so let's talk about the biggest simulation that you know of, or can talk about, that's used this, as well as maybe an example of a code that does lots and lots of descriptions, like it has lots of different data formats and lots and lots of descriptions, because you did mention that there's a small performance overhead when dealing with this. Yeah, so we haven't characterized it very well yet. It's a dynamic sort of runtime API, so a lot of the things that you're doing are going to feel like what you would get from a scripting language, right? It's probably not as painful as a full scripting language, but it's going to feel sort of like that.
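The "sweet spot" argument above is easy to make concrete: the JSON metadata stays a few dozen bytes regardless of array size, while the binary payload grows linearly, so the description cost vanishes relative to the data. A quick illustrative measurement (not drawn from Conduit itself):

```python
import array
import json

# For each array size, compare the size of a JSON description against
# the size of the raw binary payload it describes.
sizes = {}
for n_elems in (10, 1_000, 1_000_000):
    data = array.array("d", bytes(8 * n_elems))   # zero-filled doubles
    meta = json.dumps({"dtype": "float64", "count": n_elems})
    sizes[n_elems] = (len(meta), len(data) * data.itemsize)
    print(n_elems, "elements:", len(meta), "bytes of metadata,",
          len(data) * data.itemsize, "bytes of data")
```

At a million doubles the metadata is under 0.001% of the payload; it is only for very small, fine-grained messages that a packed binary schema format would start to pay off.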
We've used it to pass data in an in situ context, and I think we did it up to around 4,000 cores, and the description of the data, since we're passing it zero-copy, really isn't very punishing on the performance at all. As far as what we're aiming to do, one of the things, back to this original description of how processors talk to each other, is to try that on something like a 500,000-core simulation. And we will learn a lot when we do that, I'm sure, but we haven't had the chance to dive into that yet. It's on the horizon, so we'll see; we'll learn how well it works out. Now, what kind of other applications do you foresee on the horizon, right? You've talked a bunch about HPC-style applications with hierarchical structured data that can have small metadata Conduit descriptions, passing between libraries in the same process and things like that. What else do you see as being a useful scenario for this Conduit type of passing? So another thing we're playing with, and we hope will be successful, is sort of blowing open how to connect our simulation codes to the web, and particularly the client-side technology of the web, because every supercomputer usually has a web browser available, and sometimes that's easier to deal with than building X11 or getting it from the vendor, right? So there's a lot of great stuff you can do with a web browser now as far as analytics or visualization, and we would like to have a good way of describing data between the two and getting data, even if it's not big data, between them. That's also another key reason why JSON was used, because JSON will arrive at the web browser and is sort of at home there.
So some of the things that we're doing with the new in situ capability actually use Conduit to talk over a WebSocket to a web browser and just send an image from a simulation back, and it just makes things a lot easier if you can do this and you have a natural context that also works on the web. So actually, as we've been ramping up for this data science initiative at the university, we've been having a lot more requests from people who wanna access data at our hospital and data at our U-M Transportation Research Institute and such. And a lot of people are like, oh, well, just use web APIs, but we've actually been finding difficulty, and I'm curious if you'll run into this too, in that a lot of these web APIs are not good for handling, like, if you want the entire history of the entire stock market, it really doesn't work that well. It's still much better to kind of dump that offline and move it around. And that's also manually intensive. So I'd be really curious what you come up with over the next couple of years. Yeah, so I think for the web stuff, we're very much at the beginning here. We're trying to send smaller sets of data back and provide people context. But again, if you can describe things with Conduit, whether you go to the web or whether you go to C++, it sort of looks the same way. So I think there's gonna be some usefulness there in how you can curate data with a solution like this. Let me ask you one question that I ask all developers who come on our program: what version control system do you use for your software development, and why? So we're using Git for this, and we're publishing the code on GitHub. That's one reason for using Git, because it's fairly common, but we also have an internal Git repository system at Livermore, and we use the Atlassian tools for managing bug tickets and wikis for idea exchange and things like that.
As far as Git versus Mercurial or something like that, I think the more important thing is probably the distributed version control. It's kind of freed things up to where it's easier to work on an airplane when you're flying across the country, and then push things and get things right. So Git was the solution, and I could talk about some more of the software engineering choices that we also made for Conduit, if you're interested. So if it's on GitHub, I assume this is open source. What license is it under? It's a BSD-style license. That's our preference, so it can be used in commercial things as well; some of our commercial partners, particularly on VisIt, would be turned off by a copyleft license, so we use BSD. So how does one Google for Conduit? It's unfortunately a very common word, and unfortunately it has a name conflict with some kind of virus or something, too. Yes, I noticed that last week when I was dealing with my grandfather's computer; it was quite unfortunate that I discovered that. So its presence on the web is really new. It's only arrived on GitHub, with its GitHub documentation and such, over the last two months. So it's probably hard to Google for Conduit itself. If you search for scalability LLNL, you'll find a GitHub group that has a set of tools, and Conduit sort of lives under there right now. So that's scalability-llnl, LLNL for Lawrence Livermore National Lab, and that's where the Conduit repo is, and the current set of documentation is linked there as well. Okay, we'll put links to that stuff in the show notes, too. So assuming this is all open source and such, how does one actually contribute and communicate with the group? So obviously going through GitHub would be a good path, opening up issues and requests there and such.
We also track some things separately, internally at Livermore, right now, just based on how things are working, but you can email me, and I will be the one to respond. That's cyrush@llnl.gov, and I can give you more information on how we could attack things. Cyrus, thanks a lot for your time. All right, thank you. Thanks for having me.