Welcome to another edition of RCE. Again, this is Brock Palen. You can find us online at rce-cast.com, where you can find the entire back catalog of over 100 episodes about high performance computing, research computing, and other topics. Again, I have Jeff Squyres from Cisco Systems, one of the authors of Open MPI. Jeff, thanks again for your time.

Hey Brock, how's it going? It's getting to be hot in the summer here. I don't know if it's hot up there in Michigan, but it's certainly hot here in Kentucky.

Yeah, the humidity has been the issue here. It's like swimming outside sometimes. You would think it's hotter than it really is.

Fantastic. Well, let's distract ourselves by talking about something interesting, then. Okay, so today we're talking about something that was actually on the proposed list for a long time, but we only reached out to them recently: NetCDF. And we have Ward and Russ here with us to speak to us. So guys, why don't you take a moment to introduce yourselves.

Okay, this is Russ. I was one of the original authors and developers of NetCDF at Unidata, with a guy named Glenn Davis, who was tragically killed in an airplane accident in about 1999. After that, I supported and maintained the NetCDF library and utilities for several years, developed a proposal to NASA with an NCSA guy named Mike Folk to develop NetCDF-4, recruited and hired some NetCDF development team members, including Ward, and wrote some blog entries about NetCDF. And I am passionate about it still: my Colorado license plate has been NETCDF for the last 10 years, which is one of the geekiest license plates anybody has.

Great. This is Ward. I'm a computer scientist. I worked with Russ for several years on NetCDF before his tragic retirement. NetCDF has become a passion of mine. My background is computer vision and machine learning, but my work with NetCDF here at Unidata has been very interesting. It's something that's easy to be passionate about.

So what is NetCDF?

Well, NetCDF was originally developed to provide a standard interface between data providers and data users for scientific, array-oriented data and metadata, and for portable data that was machine-independent and platform- and application-independent. So the simple view is: it's a file format, a data model, some APIs, and freely available software that implements the APIs, so you can read and write NetCDF data. Together, those support the creation, access, and sharing of scientific data. There are some complications to that — there are lots of different APIs, and NetCDF has evolved over three decades of use, so it now actually has some variants — but users usually don't have to worry about all those complications, because of version compatibility and transparency. We've always developed NetCDF to keep it backward compatible with previously written data and previously written programs, so that when new versions are released, things don't break. Let me just mention a few of the language APIs that support NetCDF access: originally C and Fortran, then C++ and Java, and more recently Python. And then there are also R, MATLAB, and Ruby interfaces, and lots of third-party software and utilities that can sit on top of NetCDF for data analysis, visualization, and management.
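(To make the data model Russ describes concrete — variables defined on named dimensions, plus attributes for metadata — here is a minimal sketch using the third-party netCDF4 Python package; the file name, grid sizes, and variable names are invented for illustration.)

```python
# A minimal sketch of the NetCDF data model: dimensions, variables, attributes
# (file name, sizes, and variable names are invented for this example).
import numpy as np
from netCDF4 import Dataset

with Dataset("example.nc", "w") as ds:
    # Named, shared dimensions
    ds.createDimension("time", None)        # unlimited, so data can be appended
    ds.createDimension("lat", 73)
    ds.createDimension("lon", 144)

    # A variable defined on those shared dimensions
    temp = ds.createVariable("temperature", "f4", ("time", "lat", "lon"))

    # Attributes: the metadata that makes the file self-describing
    temp.units = "kelvin"
    temp.long_name = "surface air temperature"

    # Write one time step; the unlimited dimension grows as needed
    temp[0, :, :] = 288.0 + np.random.randn(73, 144)
```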
So what does the CDF part stand for in the name?

Okay, well, the whole thing is Network Common Data Form. It was not really "format", because we weren't really emphasizing the format; we were emphasizing the API originally. We wanted to be able to change the format underneath without people having to change their programs, while still supporting all previous versions of the format. But people often called the CDF "common data format". And actually there was original software from NASA called CDF, and theirs really did stand for Common Data Format. We met with them and used some good ideas. What they had was a Fortran-only library that was only meant for VAX/VMS machines. We thought there were such good ideas in it that we wanted to extend it to C, make it portable to other machines, and also create a single file format for it, because the original NASA CDF was a multiple-file format that stored multiple variables in different files. So that's where it originally came from.

Now, you mentioned network is part of the name there, but in the same breath you also say files. So which is it, or is it both?

It's really both. Files are containers for NetCDF objects, which are really simple: multi-dimensional variables with their dimensions and some attributes. But the "network" part means, first of all, that there is a network format, originally based on Sun's external data representation (XDR), so that you can access the same data over a network from machines that have different architectures and different ways of storing numerical and text data. And there's also remote access to NetCDF data using what's called the OpenDAP protocol, the open data access protocol. That's been developed quite extensively with NetCDF, so that you can access data out of huge archives remotely — small amounts of large data sets, efficiently — through OpenDAP protocol requests. That's all underneath the API, so it's really no different from accessing data on your local machine, except you give a URL instead of a file name.
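(For example, remote OpenDAP access can look just like local access — roughly like this sketch, again with the netCDF4 Python package; the server URL and variable name are made-up placeholders, and this assumes a netCDF build with DAP support.)

```python
# Sketch: with OpenDAP, a URL takes the place of a file name, and only the
# requested slice travels over the network (URL and names are hypothetical).
from netCDF4 import Dataset

url = "https://example.org/thredds/dodsC/some/dataset.nc"
with Dataset(url) as ds:                     # open the remote data set
    temp = ds.variables["temperature"]       # no data transferred yet
    subset = temp[0, 10:20, 30:40]           # only this small hyperslab is fetched
    print(subset.shape)
```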
So NetCDF is probably best known in the climate and other earth sciences communities. What historical artifact existed that caused that to come about?

Well, I'm the historical-artifact guy, so I'll take this question too. When it was first released, it became an ad hoc standard for sharing scientific data and metadata among modelers in the climate, ocean, and atmospheric science communities, because it was at the right level for representing that kind of data. It had simple abstractions for variables, dimensions, and attributes, and those three things were very important: a variable like temperature on a three-dimensional grid, dimensions like latitude and longitude and time, and attributes like what units the data are in were natural abstractions for the output of models and for earth science data. So it was a good fit for representing multiple variables on shared grids, and it even had the right abstractions to represent shared coordinate systems.

There were other reasons for its popularity. These early models were mostly written in Fortran, and C was becoming more popular, and these Fortran and C users were seeing the disadvantages of using Fortran I/O or byte-oriented C libraries to write scientific data, because it made their data not portable across platforms and languages. NetCDF just provided efficient, portable, language-independent I/O APIs for Fortran and C users. And it had some other desirable properties too. The data was self-describing: the file included ways to represent metadata about the data in it. Of course, it was portable. It was scalable, which means that a small subset of a large data set could be accessed efficiently — you didn't have to read through all the preceding data. You could append data to a NetCDF file without copying the data set or redefining its structure, so adding a little bit of data to a big data file was efficient. It was remotely accessible, as I've mentioned, through these OpenDAP protocols. And this guarantee of backward compatibility with previous versions of the software made it a good choice for keeping archives of data. So I think those were the most important things. Later on there was this development called the CF (Climate and Forecast) conventions for NetCDF metadata, which became an international standard for representing metadata in the output of forecast models and simulation models. So that was also very important.

Okay, so what exactly is the relationship between Unidata and UCAR and NetCDF? What was the cross-pollination there?

So I'll jump in and answer this one. UCAR is the managing organization for the National Center for Atmospheric Research, UCAR being the University Corporation for Atmospheric Research. UCAR maintains several community programs — the UCP programs — of which Unidata is one, and we have several other sister organizations, all of which support science and scientists in our community in their particular ways. Unidata primarily supports the community through development and maintenance of open source software, NetCDF being the most prominent software package that Unidata maintains.

Yeah, let me just add that Unidata has been around for about 30 years, providing data, software tools, and support to the Unidata community, which is a bunch of universities.

Okay, let's continue on technology a little bit, and I'll jump right into the probably slightly controversial one: you mentioned backward compatibility, but then there's NetCDF-4 and its relationship with HDF5. Can you talk a little bit about what the thought process there was and what you were trying to do?

NetCDF-4 adds some of the features of HDF5 in a backward-compatible way, because it's a layer on top of HDF5 that also supports the previous versions of the format, from before HDF5 was used, through APIs that simply have extensions — there's no incompatibility with previous versions. Basically, we saw that HDF5, from Illinois, had developed several advanced features like compression and data chunking, and we really wanted those, but we didn't really want to develop yet another format. We thought, why don't we try to do a kind of merger of NetCDF and HDF5 by adding some more APIs and using their storage layer underneath? That way we could get some of the advantages of HDF5 without creating yet another format and all the work that would involve. And it sort of worked. The HDF5 group worked with us; they had to add a few things that weren't there, and we had to represent some things that weren't there with kind of artifices built on top of HDF5. But the result was pretty successful: NetCDF-4 preserves the common characteristics of those two formats and takes advantage of the widespread use and simplicity of NetCDF and the performance and generality of HDF5.
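(Those HDF5-derived features — compression and chunking — show up through the same simple NetCDF API. A minimal sketch, again with the netCDF4 Python package; the compression level and chunk shape below are invented for illustration.)

```python
# Sketch: per-variable compression and chunking, HDF5 features exposed
# through the NetCDF-4 format (names, level, and chunk shape are invented).
from netCDF4 import Dataset

with Dataset("example4.nc", "w", format="NETCDF4") as ds:   # HDF5-based format
    ds.createDimension("time", None)
    ds.createDimension("lat", 73)
    ds.createDimension("lon", 144)

    # zlib=True enables deflate compression; chunksizes controls how the
    # variable is tiled on disk, which determines which read patterns are fast.
    temp = ds.createVariable(
        "temperature", "f4", ("time", "lat", "lon"),
        zlib=True, complevel=4,
        chunksizes=(1, 73, 144),    # one whole grid per time step
    )
    temp.units = "kelvin"
    temp[0, :, :] = 288.0           # write one (constant) time step
```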
Yeah, I would add that, in my experience, NetCDF-4 refers to the enhanced data model and enhanced file format. It doesn't mean that NetCDF-3 has been deprecated or has gone away at all. NetCDF-3 is now referred to as the classic file format and classic data model, and it is still actively maintained and developed. So the numeric naming convention can occasionally be a little misleading, I've found.

Yeah, I think that's right. I think they're actually about equally popular now: even though it's been 10 years since we developed NetCDF-4, NetCDF-3 is still very popular.

So given that a lot of the functionality exists in HDF5, why would someone choose NetCDF over using HDF5 directly?

Very good question. Basically, many users of NetCDF-4 think that its data model and programming interface are simpler, so using NetCDF makes programs shorter and easier to understand than the equivalent HDF5 programs. And that's not because it does exactly the same thing as HDF5 but with better interfaces; it's because there's this tradeoff between simplicity and power, and NetCDF-4 intentionally doesn't implement all of HDF5's complexity and power, only a subset of the most important features. But there is another important difference — simplicity is not the only thing. It's NetCDF's support for named, shared dimensions. This is an abstraction which was never part of the HDF5 data model, and so NetCDF variables that share a set of dimensions have a way to represent a shared grid or shared coordinate system that's not naturally provided in HDF5. HDF5 serves more as a container for all kinds of things and doesn't have the conventions for representing shared grids or shared coordinate systems. So that's probably one reason people use NetCDF-4, or even NetCDF-3, instead of HDF5: they want that capability, they want as simple an interface as possible, and they don't need all the stuff that HDF5 has.

To add to that answer: the other thing that the HDF5 library is lacking is the ironclad backward-compatibility, or archiving, promise, where NetCDF will never release a version that cannot read data written by old versions of the library. That is not a promise you get if you are using HDF5 directly. In fact, we encountered something along these lines — I want to say middle to late last year — with HDF5, where the two current NetCDF developers had to scramble to mitigate some changes in the HDF5 library which would have potentially broken backward compatibility. That was their highest priority for several weeks, working around this change. So in addition to everything Russ said, NetCDF provides this promise, to give scientists confidence in archiving their data in NetCDF.

Okay, so just to put that completely plainly: if I download NetCDF today and install it on some modern OS with a modern application, I can read, with that one installation of NetCDF, data sets that were written 10 or 15 years ago with NetCDF version 1. Is that a correct statement?

Absolutely. And it's not just that you can read the same data.
If you have old programs that created or read that data, they will also work, although you may have to recompile them and relink to the new library to keep them working. Sometimes they'll work without that. I mean, if the format changed underneath, you definitely have to relink to the new library, but you don't have to change a character of the program.

All right, let me go in a slightly different direction here. Being an MPI guy, I have an MPI-related question for you. There is a project out there called Parallel-NetCDF, or PnetCDF, but there's also an MPI-enabled version of NetCDF. What's the relationship between the two?

So I'll jump in with PnetCDF, if that's okay, Russ. Sure. Parallel-NetCDF is an independent third-party project, maintained as a collaboration between Northwestern University and Argonne National Lab. It works with NetCDF-3 — the classic library, data model, and file format — and it provides parallel I/O, which was not native to the NetCDF-3 code. For the other one, I assume you mean native MPI in NetCDF with a parallel libhdf5. When the HDF5 library has been built with parallel I/O enabled, the NetCDF library at configure time, before compilation, will actually probe your HDF5 library to see if it contains the parallel I/O operators, and if so, parallel I/O is just enabled and available through NetCDF — which is great, because it lets your program that relies on NetCDF achieve parallel I/O without really having to change your code. It's just inherent; it's used automatically because the underlying libhdf5 I/O is parallel-enabled.

Okay, so this is MPI underneath the covers to effect the parallelism. What about the other way around? Has anybody put the MPI-IO APIs on top, with NetCDF underneath?

Not that I know of. Same here, not that I've heard. I think MPI is kind of a lower-level library than NetCDF. It doesn't deal with abstractions like variables and dimensions and attributes, so I'm not sure an MPI program could make that great a use of NetCDF underneath.
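(Roughly what that parallel path looks like from user code, sketched with the netCDF4 Python package — this assumes netCDF4 and HDF5 were both built with MPI support and that mpi4py is installed; file and variable names are invented.)

```python
# Sketch: parallel writes through the ordinary NetCDF API when the underlying
# HDF5 is parallel-enabled. Run under MPI, e.g.: mpiexec -n 4 python demo.py
from mpi4py import MPI
from netCDF4 import Dataset

rank = MPI.COMM_WORLD.rank
nprocs = MPI.COMM_WORLD.size

# parallel=True opens the file for MPI-parallel access
ds = Dataset("parallel_demo.nc", "w", parallel=True)
ds.createDimension("x", nprocs)
v = ds.createVariable("v", "i4", ("x",))
v[rank] = rank          # each rank writes its own element of the variable
ds.close()
```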
So, a file format is only as good as the ecosystem that can read it. What are some of the common tools people use with NetCDF, going from their simulation code to their visualization to archiving? What are common tools that understand NetCDF?

I'll take a stab at this. The software that comes from Unidata includes three important generic utilities that have lots of uses by themselves, for conversions and extractions: these are called ncdump, ncgen, and nccopy. But there are lots of other tools, as you can guess for a format that's been around this long. If you look up NetCDF software on the web, there's a list of, I think, over 80 freely available packages now that have been adapted to access NetCDF data and visualize, analyze, and manage it. And there are some commercial packages too — about 25 or so licensed packages that use it. That's really too many for new users to have to choose from, but they can look at the descriptions and try to figure out what might be useful. There are a few large third-party collections of tools that are especially suited to NetCDF, and I'll just name those now. NCO, the NetCDF Operators, from Charlie Zender and his group at UC Irvine. NCL, the NCAR Command Language, which is a bunch of really good graphics and analysis tools and a kind of interpreted language that deals with NetCDF variables and such. And then there's one called CDO, the Climate Data Operators, from a group at the Max Planck Institute for Meteorology in Germany. They each have their own particular strengths and a large collection of users, so it's hard to say much more about them; you'd have to use them, or look at them more carefully, to see which one's most suitable. There are lots of other single applications for browsing NetCDF data, and NASA has some packages that are very good general mapping and analysis packages. That's about all I want to say right now about that.

Well, I — sorry, go ahead. I would also say, just from talking with our users and community members: for non-developers, people who just want to work with NetCDF data, there are the big commercial packages. A company out of California, Esri, has software that is commonly used for visualizing data stored in NetCDF format. MATLAB is another commercial package we get a lot of questions about, or that just comes up in conversation. But then also free tools like R and Python — both of which have NetCDF hooks, as well as inherent visualization capabilities — are also very broadly used. But as Russ said, we maintain a list of just dozens and dozens of commercial and open source packages that speak NetCDF.

Yeah, and since you mention Python, I have to throw in one more thing here too, because Python's model for multi-dimensional data is very compatible with NetCDF data. This package in Python called xarray — x-a-r-r-a-y — developed by Stephan Hoyer, is an open source project that really brings the power of pandas, another popular package in Python, to NetCDF data. It provides N-dimensional variants of the core pandas data structures, and it provides in-memory representations for NetCDF files. So it's really quite a good package to look at if you're going to be doing your programming in Python and you want NetCDF access.
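(A minimal sketch of what that looks like in practice — the file and variable names are carried over from the earlier invented example.)

```python
# Sketch: xarray loads NetCDF data into labeled, pandas-like structures,
# so you can reduce along named dimensions instead of numeric axes.
import xarray as xr

ds = xr.open_dataset("example.nc")       # lazily opens the NetCDF file
temp = ds["temperature"]                 # a labeled DataArray
time_mean = temp.mean(dim="time")        # average over the "time" dimension
print(time_mean)
```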
Well, that brings up a related question. You listed off a whole laundry list of languages that the NetCDF APIs are available in. How did you go through the typical quandary of exposing functionality in different languages? Are the bindings as close to identical in each of the languages, or did you take an effort to, you know, support Pythonic things in Python and C things in C, and try to emphasize the strengths of the particular languages? And could you cite an example?

Well, for the modern API bindings, Unidata maintains three directly: the core C library, then the Fortran and C++ APIs, which are separate libraries with hooks back into the core C library. We also help maintain the Python bindings, though that is not a project we spun up from scratch. All the other languages, of which there are many — R, as previously mentioned, Ruby, Perl if you like, any number of other languages — actually come from the community. These are bindings that we had zero involvement in creating, and for the most part they exploit the features of the languages for which they're intended. So we don't try to make everything look like the C interface.

We originally tried to do that with the Fortran 77 interface, but later on, for example, when Fortran 90 came along, a user contributed a binding that really exploited features of Fortran 90 and was much more comfortable for Fortran 90 users. Similarly, the Java interface, and the mental model you need to use it, is quite different from the C, Fortran, or Python interfaces. It's very "Java-onic", if you want — you know, like "Pythonic". It was written by a sophisticated Java user, so it knows about the idioms of the language and the way you represent things. I should also apologize to the Java team for forgetting that the Java bindings are also maintained internally.

Right. So what's coming in the future for NetCDF?

Okay, unless Russ wants to jump in, I'll answer this. I know nothing. So right now, the next step that we're looking at in the short term is extending the compression capabilities. Currently we leverage zlib, through HDF5, to achieve per-variable compression in the NetCDF enhanced file format. But libhdf5 provides an interface for adding in additional compression plugins, so to speak. My colleague, Dennis Heimbigner, has written an API that will let us leverage this, and we're also designing some experiments to provide compression results to our community, so that they can see what they can achieve with different compression schemes. Beyond that, with cloud computing having exploded the way that it has, block storage is something we would like to be able to leverage with NetCDF — to be able to read from and write directly to block storage such as that provided by Amazon and other cloud providers. And beyond that, largely we will be responding to the needs of our users, because our user community is who we serve, and whatever they need, we try to get there at least before, or at the same time as, they do.

I guess I'd also say: see the NetCDF GitHub site. I think NetCDF jumped on GitHub sooner than HDF5, for example — I think they still may not use it. But there are so many good developments going on there, and so many users have been contributing, that the future is somewhat being driven by what people contribute and how it proves to be useful and how popular it is. So I think there are some plans out there. Pull requests are welcome and encouraged, and any reasonable feature that is pitched, implemented, and submitted via pull request will be given full consideration.

Let me ask another forward-looking question, which you may or may not have answers to: what do advances in hardware mean for you? Faster CPUs, the advent of SSDs, faster access to storage, faster networks — do you use native network APIs? All these kinds of things that give acceleration possibilities in the underlying hardware — are there opportunities to use them in your implementation?

Well, currently, the faster the underlying storage is to access, the quicker the library can retrieve data locally. If we are talking about data stored remotely, accessed via the OpenDAP API, then with advances in network speeds and the underlying technologies and hardware there, we will see better throughput. The NetCDF library is a storage medium, not an analysis medium. There aren't any operations to, for example, request a matrix decomposition on data stored in NetCDF, and because it's really primarily just file I/O and a data model,
there's nothing increased CPU speeds or GPU-accelerated programming could really do that would benefit NetCDF at this point.

I would point out, though, that SSD availability is actually kind of important if you're doing compression and chunking. Data is written in a certain order, but most commonly people want to read it in a different order, and these are huge data sets. For example, you have something that's stored with all the data at each time step, and the users actually want to pull out time series at each point. That's often about the worst case: accessing data that was written one way when you want to read it in a different way. And SSDs turn out to be very helpful for rechunking the data — getting it into a form that isn't really, really fast in one order and really, really slow in another order, but is pretty fast for any way you want to access it, along any dimension. So I wrote a blog post about some experiments with SSDs: you could get huge improvements by rechunking your data, and the best way to rechunk it, if you knew how it was going to be accessed, was to use SSDs rather than spinning disks, just because you get much better performance for the kinds of operations you need for rechunking — if you have lots of memory and you have SSDs. But that's about as far as we went with that.
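(As a rough illustration of what rechunking means in practice, here is a sketch that copies a variable into a new file whose chunk shape favors time-series reads — again using the netCDF4 Python package, with invented names and chunk sizes; the nccopy utility mentioned earlier can also rechunk files from the command line.)

```python
# Sketch: rechunk by copying into a new file whose chunks are long in time
# and small in lat/lon, so reading a time series at a point is fast.
from netCDF4 import Dataset

src = Dataset("example4.nc")                      # written time-step by time-step
dst = Dataset("rechunked.nc", "w", format="NETCDF4")

# Copy the dimensions, preserving the unlimited one
for name, dim in src.dimensions.items():
    dst.createDimension(name, None if dim.isunlimited() else len(dim))

v_in = src.variables["temperature"]
v_out = dst.createVariable("temperature", v_in.dtype, v_in.dimensions,
                           zlib=True,
                           chunksizes=(100, 1, 1))  # favor time-series access
v_out[:] = v_in[:]                                  # fine for a small example

src.close()
dst.close()
```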
So what about licensing? What license is this library distributed under?

I'll let you answer that one, Russ, for historical purposes, and then I'll have something to add. Sure, okay. So Unidata NetCDF software from the start — the NetCDF C and Fortran and Java interfaces — was under a simple MIT-style license. We actually wanted commercial applications to be written with NetCDF, just to support it as an ad hoc standard, so we didn't really want to put any restrictions on its commercial use. It should just be open source, and that's the sort of thing an MIT-style license gave us. Later on there were some issues about whether to use a GNU library license, various versions of that, and NetCDF-Java, for example, is available under multiple licenses, including the GNU LGPL and the MIT-style license.

Okay, so adding to that: yeah, NetCDF, as Russ described, is currently licensed open source in the sense that anyone can use it for anything, which is how we would like it to be, and that is how many other Unidata products and projects are licensed as well. There has recently, in the last 12 months, been a push to adopt one of the big, more commonly known licenses, like a BSD 3-clause license, instead of what we have now, which is effectively a BSD 3-clause license. But whatever the license changes to, the spirit will remain the same: it will be free for anybody to use, be it for commercial or open source projects, with really no limitations on how anyone uses it.

Yeah, I guess an Apache license was another consideration, which I'd forgotten about, because of the patent issues. But we don't think there are any patent issues with NetCDF, and so, as far as I know, we decided not to use any of those Apache licenses.

Yeah. Okay, going in a slightly different direction again here: what's the largest data set that you have heard of NetCDF being used for?

So I sent an email out to our community mailing list recently asking this very question, and the response I got was someone who had single-digit petabytes — it was two or three petabytes — of data stored in NetCDF.

I think there were even single files stored in there that were several petabytes, right?

Yes. Because there are archives — for example, the IPCC climate data is multi-petabytes, I believe, from the fifth IPCC report — but that's stored in millions of files; it's not all just one unit. So this person was having a single container for petabytes. Yes, I was impressed, but that is correct, Russ: it was in a single file.

And then, an offshoot of that question that we like to ask a lot of our guests: what is the strangest or most unexpected use of your software that you've seen? Something where, when someone tells you they're doing it, you go, okay, wow, we never thought that would be a use case?

I have an answer for this, but I'm curious if Russ has one as well, because he's got the broader view. Well, I know that Ed Hartnett, who's one of our developers, always used to claim — and so did Rich Signell, actually — that they did their taxes in NetCDF because it was so convenient, but I'm sure that was a joke. There is some use of NetCDF in some standards for storing — what is it — there's an instrument that does spectral analysis of chemicals, nothing to do with meteorology or climate, and the standard is based on NetCDF. But I guess that's not very strange.

So the use case I'm thinking of: several years ago we had a support email from a gentleman who wanted to store all of his Linux system configuration data in NetCDF, and he had some very good questions about that. I was happy to help, although he never really answered my question of why you would do that. But that's none of my business; I was happy to help him.

In that same vein, there have been people who became enamored of NetCDF and said, well, why do I need relational databases? I'll take my relational data and try to store it in NetCDF. And that really kind of contorts the data model. NetCDF is not necessarily ideal for the kind of stuff you store in relational databases — it doesn't follow that data model at all, and you really have to contort things to do that very well. I think, generally, if something is well suited to a relational database management system, go ahead and use that; but for something that's closer to scientific data, observational data, or model data, NetCDF might be the way to go.

Okay, so you mentioned before we started recording that you were one of the original authors of this package. Could you give us the two- or three-minute history of NetCDF? How did it come about, and how has it gotten to where it is today?
Sure. In 1988, actually, we had some meetings among folks from NASA — with the CDF format I mentioned — some people from the University of New Mexico who had developed something called CANDIS, and a guy from an image processing company, all to talk about issues with developing something like CDF for Unix and for other languages. Anyway, out of those meetings came the desire to develop our own software for this, and not try to use the NASA stuff. We had support from the National Science Foundation, which was the primary funder of Unidata, so we just developed this in 1988 and 1989: NetCDF version 1 came out in '89, and there was a beta version in '88. It gained a lot of popularity, and all through the 2000s we talked about getting together with the HDF5 folks — we were certainly competing with each other and cooperating with each other, but we were still two separate developments. And then — I'm sorry, I have to go back and correct the dates — in 2003 we actually got together with the folks from NCSA who developed HDF5 and submitted a proposal to NASA to develop this kind of merger between NetCDF and HDF5, which would put a simple NetCDF layer on top, with HDF5 underneath as a storage layer. That was funded by NASA, and it supported basically three or four years of development that never would have happened without that grant, which involved work both from Unidata and from the HDF folks, so we'd like to thank them. And then there have just been so many contributions from the community of users — everything from bug reports to actual big code contributions, like the Fortran 90 or the Python software and some of the other language software — so we're very grateful that the community has provided so much of what NetCDF is and why it's still useful.

Okay, well, thanks a lot again for your time, guys. Where can people find out more information about NetCDF and get involved?

Well, a great place to start is the Unidata web page, which is unidata.ucar.edu. From there you can go to our GitHub page, which is github.com/Unidata/netcdf-c, and from that landing page you can find links to the Fortran, C++, and other landing pages, as well as a lot of information about NetCDF at the high level — the philosophy — and then the nitty-gritty API details. Finally, through the Unidata web page we maintain several mailing lists, and joining a mailing list or browsing through the 30 years or so of archives is a great way to find out more.

Okay, thanks a lot for your time. Thanks, guys.