So, today we're presenting some of the work we did in the Oznome initiative, particularly looking at land and water and the data that we have in CSIRO, and how to make that interoperable according to some of the principles that FAIR espouses. We'll also talk about some of the implementations we have explored for turning the FAIR principles into actionable questions that address how FAIR your data is. If you haven't come across Oznome, it is a CSIRO-led initiative aiming to connect information ecosystems around Australia. The Oznome name was coined echoing the genome project: the Oz being Australia and the -nome making it a genome-inspired project. But really what we're looking at here is tools, services, products, methods, approaches, practices and infrastructure to support more connected information infrastructures. In the previous year, as Keith mentioned, we focused on environmental information infrastructures, and there are a couple of links there you can follow. Today we'll be talking about an example in the water space. Okay, so as part of establishing the Oznome architecture and infrastructure, we felt we needed to assist our potential data providers to understand what good data is, or, in the context of this seminar series, what FAIR data is. We call it Oznome data, and we developed a set of rating criteria and a tool to allow data providers to assess the data they're providing. On the right-hand side of the screen here you can see a screen capture of the kickoff page of the tool. You'll also notice that we've got a slightly adapted version of the FAIR criteria: Findable, Accessible, Interoperable and Reusable. But we also add the last line there, Trusted, which goes a little bit beyond what has been conceived in FAIR until now, but we suggest it would be a useful addition. And we're bundling Interoperable and Reusable together.
We see those as being very closely related. Obviously, it's teasing out some of the issues around what it is that makes data interoperable. Keith's given a high-level overview and indicated what some of the concerns might be. We've done our own take on this, leaning fairly strongly on our experience over a number of years, more than a decade now, of working in the data standards communities, in particular the geospatial data standards communities, and some of the learnings we've got from there were applied directly here. Environmental data, which is our heritage and where we've largely been working, is mostly geospatial, so it makes sense to be building on that. Just a bit of a reminder of the FAIR Principles: this is a summary slide from Michel Dumontier, who's one of the original authors of the papers and one of the developers of the FAIR Principles. They set out these guiding principles with the four key words, teased out into three or four sub-principles in each case under the F, A, I and R letters. We're looking at the Interoperable set here, which Keith has already shown. It's interesting that Michel has recently done a study evaluating a number of repositories, particularly in Europe, though some of them are broader than that. Here's the list of repositories that were evaluated and scored against the FAIR Principles. The data is available in this form; actually this table shoots off to the right of the screen, and there's lots more going on there. But looking at the summary of the results, it's fairly notable that the tallest red bar here is in the Interoperable category. So what this is saying is that, of the FAIR Data Principles, this is the one which is hardest to meet, the one that's hardest to conform to. And so really that's the focus of the approach we've taken, which is to lead people through how they can make their data more FAIR, more Oznomic, more interoperable.
The particular way in which we've broken out the question of interoperability is, if you look at the numbered terms here: is it loadable, is it usable, is it comprehensible, is it linked, as well as is it licensed? I'm just going to go through some of the details of those, and you'll see this is in a sense fairly repetitive of some of the concerns that Keith explained at the beginning. But we're putting some more concrete examples onto these criteria to indicate to our data providers that when we say a standard data format, we mean something like CSV or JSON or XML or NetCDF. The ones towards the left-hand end are fairly general, while NetCDF is one that's used a lot in the remote sensing and environmental science communities. So we've got a bit of a ladder here of different levels of conformance which you can reach on whether a data set is loadable. Is it in a unique file format? Well, that means you've got to have some unique software to load it. Or is it in a standard data format, which would normally be denoted by one of the standard MIME types? Best of all is for data to be provided in multiple standard formats, giving users a choice so that whatever their favourite platform for loading data is, they can use it. Next question: even when you've loaded it, can you use it? If the structures within the data set are unclear, even once it's loaded, it's not going to be very usable, and that comes down to the matter of whether a schema is provided which makes explicit the structures within the data set. A lot of traditional data has a structure in there, but the schema is not available independently of the data; the schema is implicit. It's not formalised, and the schema may be different every time.
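To make the "multiple standard formats" rung concrete, here's a minimal sketch: the same small table serialised as both JSON and CSV using only standard-library tools, round-tripping to identical records. The station numbers and field name are made-up placeholders, not values from any real data set.

```python
import csv
import io
import json

# The same small table, to be offered in two standard formats so users can
# pick whichever their platform loads most easily.
rows = [{"station": "410730", "e0_avg": "3.2"},
        {"station": "410761", "e0_avg": "2.9"}]

# JSON (MIME type application/json)
as_json = json.dumps(rows)

# CSV (MIME type text/csv)
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["station", "e0_avg"])
writer.writeheader()
writer.writerows(rows)
as_csv = buf.getvalue()

# Either serialisation round-trips back to the same records.
from_json = json.loads(as_json)
from_csv = list(csv.DictReader(io.StringIO(as_csv)))
assert from_json == from_csv == rows
```

Publishing both serialisations, each labelled with its MIME type, is what lifts a data set from the single-standard-format rung to the top of this first ladder.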
A lot of spreadsheets are done that way: a spreadsheet's got a lot of boxes, but if every time you use it you add different columns and use the pages in the spreadsheet in a different way, then it takes the user a little while to get their head around what's going on before they can use it. So there are various explicit schema languages: DDL, which is used for relational systems; XML Schema; something coming out of the Open Knowledge world these days called Data Packages, which essentially allows you to describe a schema for a CSV file; then in the RDF and Semantic Web space you've got RDFS and OWL; and JSON even has a schema language these days, although it's not broadly used. So it's good to provide data with a schema, but best of all is to be able to say: for this data, I'm using this community schema. For example, the Open Geospatial Consortium provides a number of community schemas for observations, for time series, for hydrology, for geoscience, and if you're publishing or attempting to share data in any of these disciplines then it's best to go off and find a community schema. Again, even when you've got it loaded and you understand what the structures are, you've still got the question of what the words and numbers inside the boxes mean. Are the column headings explicit enough to understand? Or are they just shorthand for something which the project leader, when they were developing the data, knew they would understand the next week, but which even they, coming back to it the next year, may not understand? Best, of course, is if the field labels are linked and do have explanations, probably in plain text. Better still is to use standard labels, for example the Unified Code for Units of Measure (UCUM) unit codes, or the Climate and Forecast (CF) conventions coming out of the fluid-earth community. So the ladder that we've got here asks: are you using standard labels?
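A sketch of what "a schema published alongside a CSV file" might look like, loosely modelled on the Data Packages idea mentioned above. The descriptor name, file path, and definition URI are illustrative placeholders, not entries in any real vocabulary:

```python
import json

# A minimal hand-rolled descriptor in the spirit of a Tabular Data Package:
# an explicit schema shipped next to the CSV, with a field definition that
# links out to an external vocabulary. The URI below is a placeholder.
descriptor = {
    "name": "daily-evapotranspiration",
    "resources": [{
        "path": "e0_avg.csv",
        "schema": {
            "fields": [
                {"name": "date", "type": "date"},
                {"name": "e0_avg", "type": "number", "unit": "mm",
                 "definition": "http://example.org/def/potential-evapotranspiration"},
            ]
        }
    }]
}

def field_names(pkg):
    """Pull the declared column names out of the first resource's schema."""
    return [f["name"] for f in pkg["resources"][0]["schema"]["fields"]]

# A consumer can now check a CSV header against the published schema
# instead of guessing what the columns mean.
header = ["date", "e0_avg"]
assert field_names(descriptor) == header

# The descriptor itself is plain JSON, so it can be published as-is.
print(json.dumps(descriptor, indent=2))
```

The point is simply that the schema is explicit, machine-readable, and separable from the data, which is the rung above "implicit schema" on the usability ladder.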
Are just some of the field names linked to standard, externally managed vocabularies, or are all of the field names linked to standard, externally managed vocabularies? And so you climb this ladder, getting better and better. Then there's the question of how well linked your data is. Well, if it's just a file sitting on a server somewhere, with no links in or out, you're lucky to find it. What you'd expect for most of the data sets in this community is that they're indexed in a catalogue or available from a landing page, and that's the situation where you've got inbound links to the data set. Best of all is when there are outbound links embedded or implicit in the data structures in a data set, which say exactly how it's related, and this links in with some of the previous concerns we had about field names and those kinds of things. So I'm going to hand back to Jonathan to work through a case study that we've got here, based on AWRA-L, the Australian Water Resources Assessment (Landscape) data set. So, Jonathan. Yeah, so as mentioned earlier, in the Oznome project we looked at a practical example and case study in the AWRA-L data set. This is a continental-scale data set that has historical time series from 1911, and the Bureau of Meteorology publishes an operational version online; you can find that on the website. But often, scientists basically have to deal with this data set by knowing where it is, knowing how to use it implicitly, knowing how to reference the requisite geospatial features, and understanding the field name values. So I've got an example in the next slide, which shows the assessment of it using our tool, just focusing on the interoperable side of things. We have rated it as a web service, so we can get it via the web. However, the reference definitions are text only and they are localised in the data set itself, and I'll give an example in the next slide.
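The rungs described so far can be summed up as a toy rating function: one star per rung climbed. This is an illustrative sketch only, not the logic of the actual Oznome five-star tool:

```python
def interoperability_stars(standard_format, explicit_schema,
                           standard_labels, outbound_links):
    """Toy ladder in the spirit of the criteria above: each rung earns a
    star. Illustrative only; not the real Oznome five-star tool."""
    return sum([standard_format, explicit_schema,
                standard_labels, outbound_links])

# A file in a standard format, but with an implicit schema, local labels,
# and no links in or out:
assert interoperability_stars(True, False, False, False) == 1

# The same data with a community schema, standard labels, and outbound
# links to externally managed vocabularies:
assert interoperability_stars(True, True, True, True) == 4
```

The design point is only that the criteria are cumulative: each rung assumes the ones below it, so a simple count captures the ladder.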
So this is coming out of the NetCDF metadata for this data set, which you can access online through THREDDS or via the NetCDF tools. This is a summary of the metadata that comes along with the data. We've got the long name here, potential evapotranspiration; we've got the name, which is a label for the field, e0_avg; units, mm; and a standard name, which is a convention in NetCDF to refer to the actual observed property, which here is e0_avg, following the CF conventions that are often used with this format. So if you are an expert in this area and you've used this data set many times, you'll know what this is. If you are a newcomer, you have to do a lot of work, well, a little bit of work, to understand what this data field actually means. In the Oznome project, what we did was enrich this with external variables. So if you go to the next slide, Simon: this is the same field, and you'll notice these added lines at the bottom here. They tease out what this particular data field means in the context of externally defined vocabularies. We've now enriched it with a scaled-quantity-kind identifier, potential evapotranspiration, and it's an HTTP URI which you can resolve to get a definition. Similarly for the substance-or-taxon ID, the unit ID, and the feature of interest. Just to talk about what they are: part of the project was to explore whether we could define vocabularies for these, from which we could reference outbound links from the data to the definition. This is just a summary of what we did in the context of the AWRA-L data set, and this is an example for potential evapotranspiration. We've got a conceptual model here with broader notions of potential evapotranspiration, and we've got linked relationships out to things like feature of interest, object of interest, and unit of measure. So this view provides a vocabulary entry for potential evapotranspiration.
It has not only the identifier and the description, but a richer model than you would get if you just had something inline. You've got outbound relationships from this concept to its related concepts, essentially. So this is a demonstration of defining the concepts externally, having them quite richly explained through this medium, and having the ability to link from the data set itself to this definition to make it more interoperable, so that if we have another data set that talks about potential evapotranspiration, it could potentially be linked and interoperable. Here is a revised Oznome maturity estimation using the Oznome five-star tool, again focusing on the interoperable field. Using the same tool and assessing against the same criteria, we've gone up from two stars to more than four stars in the interoperable space. The reason is that we now have reference definitions as linked data, and externally hosted observed-property vocabulary definitions rather than just inline labels. That provides more interoperability, and if the vocabulary were standardised we would have a higher estimation in that field. But it's just a demonstration of how we went about making something more interoperable through the Oznome project. And yeah, I'll just pick up at the end here and comment that when we were starting this data ratings exercise, we actually didn't look at FAIR at the beginning. We developed our own set of criteria, these keywords here, and then subsequently correlated them with the FAIR principles.
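The enrichment step described above can be sketched as a before-and-after view of the field's attributes. The attribute names follow the pattern described in the talk, but the HTTP URIs are illustrative placeholders standing in for the real externally hosted vocabulary entries:

```python
# NetCDF-style attributes for the AWRA-L field before enrichment: inline,
# text-only labels that only an expert user can interpret.
before = {
    "long_name": "potential evapotranspiration",
    "name": "e0_avg",
    "units": "mm",
    "standard_name": "e0_avg",
}

# After enrichment: the inline labels are untouched, and resolvable
# outbound links to external vocabulary definitions are added.
# (Placeholder URIs; the real ones resolve to hosted definitions.)
after = dict(before)
after.update({
    "scaled_quantity_kind_id":
        "http://example.org/def/potential-evapotranspiration",
    "unit_id": "http://example.org/def/unit/mm",
    "feature_of_interest_id":
        "http://example.org/def/feature/landscape-grid-cell",
})

# Every original attribute survives; enrichment only adds links.
assert all(after[k] == v for k, v in before.items())
# Every added identifier is a resolvable HTTP URI.
assert all(v.startswith("http://")
           for k, v in after.items() if k.endswith("_id"))
```

Another data set that links its own field to the same quantity-kind URI then shares a common, externally defined meaning, which is what makes the two interoperable.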
One of the interesting things was that there were three rows in this table, the ones in red, which didn't correlate with concerns that had been identified within FAIR. The first one might be seen as trivial, but we thought it was a question worth asking, particularly when working with research scientists and talking about making their data available: is your data intended to be used by anybody else? There's lots of data generated which is never shared. Now, that's not necessarily a good thing, and to a certain extent having the question there highlights the fact that there is a question to be asked, and that some scientists and researchers need to be encouraged to think about making their data available, about publishing it. So I think in terms of the FAIR principles, this one was the kind of implicit starting point: if it's published, yes, it's implicitly FAIR. A couple of other rows: one concern which comes up particularly because we've worked a lot with agencies that have systematic data collection processes, with systematic curation, maintenance and revisiting, where a data set is refreshed every day or every month or every year. That concern didn't seem to be particularly addressed in the FAIR principles as they stand, and so we'd say the concern about whether the data is expected to be updated and maintained is maybe a bit more than FAIR. And the bottom row there was the concern about, if you like, an elaboration of the assessment of data that you might do, which is to get some information about how well trusted it is. Now, a lot of that's about who else is using it; that's often the criterion you'll use. Who else is using it? How many times has it been used? What other products have been generated from this data set? And so, can I trust it?
So, just emphasising the row there that corresponds with interoperability, which is what we've really been focusing on today: the use of standards. Standards is a funny word; you have to be a bit careful with it. Capital-S Standards: sometimes people think that's just to do with ISO or Australian Standards or whatever. Really, the point about standards is that they are community agreements, agreements which are open for additional members of the community to join in on. But it's important to think of them as agreements: agreements to do things in a common way. So finally, just a slide with some links to some of the material that we've been showing today. We'll say thank you for listening.