All right, thanks Ming for the invite and thanks John for the talk before. My name is Nuno, I'm a research data management specialist at the National Computational Infrastructure (NCI) at the ANU in Canberra, and today I'm going to be talking about using NetCDF in Jupyter notebooks. My background is essentially geophysics, specifically magnetotellurics, which is a field that generally doesn't use these high-performance data formats; I come from a background of using 40-year-old data formats like ASCII or EDI, and I've only been working at NCI for a year now. So I'm going to go through how to use NetCDF, maybe not in as much detail as John went into with HDF5, but I'll walk through what NetCDF is, some software you can use it with, and how to access it in Jupyter notebooks and the like.

So I'll begin with an introduction to NetCDF. NetCDF, the network Common Data Form, is a data format plus a set of libraries for reading and writing that format in many commonly used programming languages. It's a de facto standard in the climate and marine science communities due to its simplicity of use, robust software and portability, and it supports the creation, access and sharing of array-oriented scientific data. It was developed and is maintained by Unidata, which is part of UCAR, the University Corporation for Atmospheric Research, a consortium of 120 or so universities in the States funded by the National Science Foundation. The project started in 1989; version 3, released in 1997, is still widely used; version 4, released in 2008, allows the use of the HDF5 file format underneath; and version 4.1 adds OPeNDAP remote-access support for the C and Fortran clients, which I'll come back to in a second. It's platform independent, and data are stored in a fashion that allows efficient subsetting. There have been a few formats over its history: the classic format, the 64-bit offset format, and now the NetCDF-4/HDF5 format. The NetCDF-4 project uses HDF5 as its data storage layer, allows parallel I/O for high-performance computing, has chunking and compression options, and provides both read and write access to all earlier forms of NetCDF.

The structure of a NetCDF file is made up of three basic components: variables, which is where you store the actual data; dimensions, which give the relevant dimension or axis information, things like latitude, longitude, time, depth and the like; and attributes. The attributes provide auxiliary information for each variable, so metadata about the variable (what is its name, how was it recorded, and so on), as well as global metadata (why the data were collected, possibly an abstract about the data set, etc.).

NetCDF files have various conventions associated with them. I'll just go through one of them today, which we often use at NCI: the CF (Climate and Forecast) metadata convention. These are guidelines and recommendations as to where to put information within a NetCDF file, and they allow the creator of a data set to include information about the data and the data set in a structured way, which makes it easier for others to use and retrieve that information.
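To make that concrete, here is a minimal sketch of how those pieces fit together when writing a NetCDF-4 file with the netCDF4 Python library, including the chunking and compression options mentioned above. The file name, variable names and attribute values are purely illustrative, not from the talk's data sets.

```python
import numpy as np
from netCDF4 import Dataset

# Illustrative file and variable names only.
ds = Dataset("example_cf.nc", "w", format="NETCDF4")

# Dimensions: the axes of the data.
ds.createDimension("time", None)   # unlimited, so records can be appended later
ds.createDimension("lat", 73)
ds.createDimension("lon", 144)

# Coordinate variables describing each dimension.
lat = ds.createVariable("lat", "f4", ("lat",))
lat.units = "degrees_north"
lat[:] = np.linspace(-90.0, 90.0, 73)

lon = ds.createVariable("lon", "f4", ("lon",))
lon.units = "degrees_east"
lon[:] = np.linspace(0.0, 357.5, 144)

# A data variable, chunked and compressed (NetCDF-4/HDF5 options).
temp = ds.createVariable("temperature", "f4", ("time", "lat", "lon"),
                         zlib=True, complevel=4, chunksizes=(1, 73, 144))
temp.units = "K"
temp.long_name = "surface air temperature"
temp[0, :, :] = 280.0 + np.random.rand(73, 144)

# Global attributes: why the data were collected, who recorded them, etc.
ds.title = "Example CF-style file"
ds.summary = "Synthetic data used to illustrate the NetCDF structure"
ds.Conventions = "CF-1.6"

ds.close()
```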
So, some of the properties of NetCDF. Files are self-describing: they include information about the data they contain. They're portable: they can be accessed by computers with different ways of storing integers, characters and floating-point numbers. They're scalable, so a subset of a large data set can be accessed efficiently. They're appendable: data can be appended to a properly structured NetCDF file without having to recreate the data set or copy it over. They're shareable, so one writer and multiple readers can simultaneously access the same NetCDF file. And finally they're archival: access to all earlier forms of NetCDF data is supported by current and future versions of the software.

Some example applications that use NetCDF — there are a lot of them, so I'll just go through a couple to give a general idea. NCO, the NetCDF Operators suite, is a set of Unix command-line utilities providing a range of commands for manipulating NetCDF files; you can do things such as concatenation, slicing and averaging fairly easily. ncview is a visual browser, so you can quickly visualise what a NetCDF file looks like; another option is Panoply, a NetCDF file viewer developed at NASA. Additionally you can use Python, MATLAB or R to read NetCDF files, and for the examples I'll give in a second I'll just be using Python. There are many more; if you take a look at the Unidata website they list a lot of different applications that you can use on your NetCDF files.

I'll just quickly go through some examples of how to use this. I'm jumping onto NCI's Virtual Desktop Infrastructure, which is essentially a desktop in the cloud, an eight-core computer, backed by 10-plus petabytes of research data, so you can access all different kinds of research data in the cloud. As an example, you can grab the OPeNDAP link of a specific dataset and load it in, and without having to download anything it will show you what's contained in the file. If you're interested in a specific variable you can see the metadata associated with that variable, or the metadata associated with the whole file, and if you want to visualise it you can quickly create a plot, zoom in, and see roughly what the data looks like, what area it covers and the like. So you get a quick visualisation of what's contained in the NetCDF file.

Additionally, if you just want to quickly look at the metadata from the command line, you can use a utility called ncdump and it will dump the metadata. Here we're looking at some magnetotelluric time series data: the dimensions, the variables, and the metadata associated with each variable, such as the units, the long name, the sampling rate, the dipole length and the like. There is also some global metadata, so you can include a title for your survey, a summary or abstract, who recorded the data, when it was created, the conventions used, as well as things like, for example in MT, where time series processing is typically done with a program called BIRRP that takes various inputs: you can record the exact inputs you used to create your data set, and someone else can reproduce it using the same parameters. So having all this metadata is really good for reproducibility.
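If you would rather stay in Python, roughly the same header information that `ncdump -h` prints can be pulled out with the netCDF4 library. This is just a sketch; the file name is a placeholder, not one of NCI's collections.

```python
from netCDF4 import Dataset

# Placeholder path; point this at any NetCDF file.
ds = Dataset("mt_timeseries.nc", "r")

# Dimensions and variables, much like the header that ncdump -h shows.
print("Dimensions:", {name: len(dim) for name, dim in ds.dimensions.items()})
print("Variables: ", list(ds.variables.keys()))

# Per-variable attributes (units, long_name, sampling rate, ...).
for name, var in ds.variables.items():
    attrs = {a: getattr(var, a) for a in var.ncattrs()}
    print(name, var.dimensions, attrs)

# Global attributes (title, summary, conventions, processing inputs, ...).
print({a: getattr(ds, a) for a in ds.ncattrs()})

ds.close()
```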
Anyway, I'll get back to the talk. This is the ncdump output we just went through. So, the collections we have at NCI: like I said, we have 10-plus petabytes of research data, and that gives you a variety of access options. There's direct access on the file system, so if you have access to the Raijin supercomputer you can just access the data directly, or you can use web and data services. We just talked about the THREDDS data server, which allows browsing and accessing of data as well as metadata, and there are various tools associated with it: OPeNDAP, where you can grab a link and plug it into, say, Python or R and work on the data; the NetCDF Subset Service, which lets you subset the data so you can work with a smaller subset of a larger file; and OGC Web Mapping Service and Web Coverage Service endpoints, which I won't go through.

I'll just quickly show you what I mean by the THREDDS data server. At NCI, going from the top of the THREDDS data server, we have a whole bunch of data sets from different communities, whether it be weather, geophysics or satellite data, and to access a data set we can go to whatever we're interested in. The files here are roughly 2.3 gigabytes each, so we don't want to be downloading those; we want to use some sort of service on them. I'll just go through a couple. If we want to quickly visualise the data we can use the Godiva2 data viewer: here we have all the different variables contained in the file, and we can click on one and get a quick preview — you might have to change the colour scale, but if you pick the right scale it gives a quick preview of the data. Another tool is the NetCDF Subset Service: here are all the variables contained in the file; we may only be interested in one variable, a smaller bounding box, say an area around here, and a specific time range, so we set those and make a request, and your 2-gigabyte file might be reduced to a few megabytes. And the OPeNDAP service: here you can see all the global attributes, the variables and their local attributes, and using this link you can plug it into, say, Python and extract the data on the fly; there's no need to download it all. I'll go through an example in a second.

Back to the talk. We can also access the data through data portals — for example the Virtual Geophysics Laboratory or the eReefs online analysis portal; there are a few data portals listed on the NeCTAR website and I encourage you to take a look — and we can access the data using the virtual labs, which is the virtual desktop I was just in.

So I'll just go through some example NetCDF notebooks. A quick word on Jupyter: it's an open-source web application that allows you to create and share documents that contain live code, equations, visualisations and narrative text, and it makes data analysis easier to record, understand and, importantly, reproduce. It supports over 40 programming languages, including Python, Julia, R and many others; the examples I have are in Python. We're going to open the files using the Dataset function, so typically you would import the Dataset function from the netCDF4 library, this would be your OPeNDAP link, and then you open it with the Dataset function. From there you can look at the metadata: the dimensions in the file, the variables, and so on.
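As a rough sketch of what that notebook workflow looks like, here is the pattern using the netCDF4 and matplotlib libraries. The OPeNDAP URL, the coordinate variable names (lat, lon) and the data variable name are placeholders; in practice you would take the real URL from the THREDDS catalogue and the real names from the file's own metadata.

```python
import numpy as np
import matplotlib.pyplot as plt
from netCDF4 import Dataset

# Placeholder OPeNDAP endpoint copied from a THREDDS catalogue page.
url = "http://example-thredds-server/thredds/dodsC/path/to/file.nc"
ds = Dataset(url)

# Browse basic information about the file.
print(list(ds.dimensions.keys()))
print(list(ds.variables.keys()))

# Read the coordinate variables, then subset by index so that only the
# requested slice is transferred over OPeNDAP rather than the whole file.
lat = ds.variables["lat"][:]
lon = ds.variables["lon"][:]
lat_idx = np.where((lat >= -45) & (lat <= -10))[0]   # illustrative bounding box
lon_idx = np.where((lon >= 110) & (lon <= 155))[0]

# Assumes a 2D (lat, lon) variable named "some_variable" for illustration.
data = ds.variables["some_variable"][lat_idx[0]:lat_idx[-1] + 1,
                                     lon_idx[0]:lon_idx[-1] + 1]

plt.pcolormesh(lon[lon_idx], lat[lat_idx], data, shading="auto")
plt.colorbar()
plt.title("Subset of some_variable")
plt.show()

ds.close()
```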
I'll just quickly go through a couple of examples that I prepared earlier. This is a geophysics example: like I said, you import the Dataset function from the netCDF4 library, grab your OPeNDAP link from, say, the THREDDS data server, and open the data set. Once we've opened it we can browse the basic information about the file, so here we're looking at the dimensions and the variables, and once you've found what you want you can start extracting and plotting data. Here we're extracting some data and plotting a gravity map of Australia; the great part about this is that everything's documented and there was no need to download anything, and there we have it, a gravity map of Australia. If we want a subset we can set some criteria for the subset and plot that — here we just have a small subset — and you can continue to do things like plotting transects and the like; there are lots of things you can do. That's one geophysics example, but it's essentially the same with other disciplines. This one uses ocean forecast data and satellite data: once again we grab the OPeNDAP URL, open it using the Dataset command, view the metadata, and extract and plot the data. Here are some examples; we can do the same for a small subset, and we can do the same with some satellite data, which we've done here, plotting an RGB composite of the satellite bands, and at the end we combine the two data sets — it's somewhere down the bottom here — a manipulation where we combine the ocean forecast data and the satellite data. So that's a couple of examples, and I'll get to the conclusions now.

The advantages of using NetCDF: it's an open format with open-source tools, so accessing data is easily done through common libraries. It's self-describing, so you don't need supplementary metadata files. NetCDF-4 allows storage of n-dimensional data and handles data sets of all sizes, and it can be optimised for HPC settings through its chunking and compression options. We have the THREDDS data server, which gives you many tools for accessing NetCDF files, so we can move away from the download era. It has strong usage in many research communities, climate and marine science especially, but other communities are adopting it now — the earth sciences in particular, like the geophysics communities, are starting to adopt NetCDF — and it supports parallel read/write access to NetCDF files. Some disadvantages: if you're like me and have never used it before, there's a steep learning curve involved, the documentation takes a lot of time and effort to understand, and users need to update their toolkit, so they have to learn new tools to view the contents and visualise the data in NetCDF files.

Here are some links to some websites. Some of the artwork was provided by an NCI employee named Jonathan McCabe; if you're interested, here's the link to his work. And that's it, thank you.