Just to introduce myself, my name's Keith Russell. I work for the Australian National Data Service, and I'm your host for today. My colleague Suzanne Sabine is behind the scenes co-hosting the webinar with me. Just the usual bit of background: the Australian National Data Service works with research organisations around Australia to establish trusted partnerships, reliable services and enhanced capability in the research sector. We work together with two other NCRIS-funded projects, RDS (Research Data Services) and Nectar, to create an aligned set of joint investments to deliver transformation in the research sector. This webinar is part of a series of activities we are undertaking aimed at supporting the Australian research community in increasing our ability to manage our research data as a national asset.

As I mentioned earlier, this is the third in a series of webinars around FAIR. We've already had the webinars on findable and accessible; today is interoperable, and next week reusable. Today I will give a brief introduction to what interoperable means as described under the FAIR data principles from FORCE11, and then I'm very grateful that Simon and Jonathan have made themselves available to talk about what they did in practice in the OzNome project to make their data interoperable. I think it's a great example of how this quite complex topic can actually be carried forward in practice.

So this is what FORCE11 says about interoperable, and first of all a few things to keep in mind, just reiterating points I made in the very first webinar. As you look at these headings you'll see that they talk about data and metadata, so interoperable applies both to the metadata describing the data collection and to the actual data itself. Another point to keep in mind is that throughout the FAIR principles they think a lot about data being usable not only by humans but also by machines. That provides huge benefits in bringing together disparate data sets, in bringing together bits of knowledge that are distributed over different data sets, and interoperability is a key element in making sure that data can be brought together so that we can actually get those benefits: new knowledge discovery, new relationships discovered, new patterns recognised.

So, looking at the three headings listed under interoperable, the first one is that data and metadata use a formal, accessible, shared and broadly applicable language for knowledge representation. The thing to keep in mind there is that it's not only about you as the researcher who created the data: for another researcher who wants to understand and use the data, it's important that they understand the language you've used, and that it is a standardised language that other users can also pick up and use.
Ideally that is the case for the metadata — it definitely should be for the metadata — and ideally it would also be used in the actual data itself. A very basic example: if a researcher has observed a magpie, they can write "I saw a magpie", but it's much more useful for a researcher somewhere else in the world if you write that it was an Australian magpie, Cracticus tibicen. Using a standard language means a researcher on the other side of the world will actually be able to better understand what you meant and what that description is about. And it's not just about the actual wording, the vocabulary used; it's also useful to have a framework around it which allows the data to be machine readable, so it can be picked up, used and interpreted by machines. One obvious example that gets mentioned quite a lot is using RDF and ontologies; that is quite common in the life sciences, and a number of life science researchers were quite active in the FORCE11 group. But one thing they emphasise is that it doesn't have to be through RDF and ontologies alone; there might be other solutions, and they don't want to make it exclusively about those technologies. So that's something to keep in mind regarding making data interoperable, and that's what I've invited Simon and Jonathan to come and talk about in much more detail.

The second point here is around vocabularies. They emphasise that if you use a vocabulary, first of all try to use one that already exists and is agreed on by the community. If you have terms that are not in that vocabulary but it otherwise fits, try to get them added to that vocabulary. And only if that is not possible — then, and only then — start creating your own vocabulary. So please don't go out and create vocabularies for everything; rather, look at whether there is already a community-agreed vocabulary. Also make sure that the vocabulary itself is FAIR: findable, accessible, interoperable, reusable. So in your data set you should have a reference to the vocabulary you are using, and make sure that the vocabulary can be found for as long as your data set can be found.

The final point they make is that data and metadata should include qualified references to other data and metadata. What they mean there is that it shouldn't just be a reference to another data set, for example, but also an indication of what that relationship is. So it's not just "it's related somehow to this other data set"; perhaps it is a subset of another data set, or it builds on another data set, expressed using standardised terminology. A little more on qualified references: from the perspective of the metadata especially, it's valuable not only to refer to other players or other elements around your data set, but to do that using identifiers. For example, if you are describing your data set and saying somebody was involved in creating it, provide a qualified reference: that person was, for example, the author of that data set, and if possible also use an identifier to identify that person.
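To make that concrete, here is a minimal sketch (our own illustration, not something from the webinar) of qualified, identifier-based references in dataset metadata, written with Python's rdflib. Every URI in it is a made-up placeholder.

```python
# A minimal sketch of qualified, identifier-based references in dataset metadata,
# using Python's rdflib. All URIs below are illustrative placeholders, not real records.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("dcat", DCAT)

dataset = URIRef("https://example.org/dataset/magpie-survey-2017")  # hypothetical
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Australian magpie observations")))

# Qualified reference to a person: not just a name, but a typed relationship
# (creator) pointing at a persistent identifier (an ORCID iD, made up here).
g.add((dataset, DCTERMS.creator, URIRef("https://orcid.org/0000-0000-0000-0000")))

# Qualified reference to another data set: say *how* it is related (part of),
# again via an identifier (a DOI, made up here).
g.add((dataset, DCTERMS.isPartOf, URIRef("https://doi.org/10.0000/example.parent")))

# Point at the community vocabulary the observations conform to.
g.add((dataset, DCTERMS.conformsTo, URIRef("https://example.org/vocab/bird-taxa")))

print(g.serialize(format="turtle"))
```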
Using identifiers like that allows other relationships to be made, it allows connections to be made, and it allows that information to be picked up and used, especially when being analysed by machines. Here is a list of possible identifiers; these are just examples and there are more out there. For example, if you're referring to an author, include their ORCID; if you're referring to a publication, use the DOI for that publication; if you're referring to software, nowadays you can assign a DOI to a software package and refer to that DOI; and so on. Well, I think I've rambled on enough for now, so I'd like to hand over to Simon and Jonathan, and I'm very grateful that they've made their time available.

Just a brief introduction. Simon is a research scientist in CSIRO Land and Water's Environmental Information Systems research program. He specialises in distributed architectures and information standards for environmental data, focusing on geosciences and water. Jonathan is a research computer scientist specialising in information architectures, data integration, linked data, the semantic web, data analytics and visualisation. He's part of the environmental informatics group in CSIRO Land and Water. Together they've been very active in applying their thinking around making data interoperable in the OzNome project. One thing I want to point out is that in the OzNome project they did a whole series of work around the FAIR data principles in all their different aspects. Today I've asked them especially to focus on interoperable, but please keep in mind that they've also done a whole bunch of other work. So without any further ado I'd like to hand over to Simon and Jonathan, and I'm very intrigued to hear how they've picked up interoperability and used it in the OzNome project.

Okay, thanks Keith, and thanks for the introductions as well. Today we are presenting on some of the work we did in the OzNome initiative, particularly looking at Land and Water and the data that we have in CSIRO, and how to make that interoperable according to some of the principles that FAIR espouses. We will also talk about some of the implementations we have explored for turning the FAIR principles into actionable questions that address how FAIR your data is. If you haven't come across OzNome, this is a CSIRO-led initiative aiming to connect information ecosystems around Australia. The OzNome name was coined echoing the genome project: the "Oz" being Australia and the "nome" being genome-inspired. But really what we're looking at here is tools, services, products, methods, approaches, practices and infrastructure to support having more connected information infrastructures. In the previous year, as Keith mentioned, we focused on environmental information infrastructures. There are a couple of links there you can follow, and today we'll be talking about an example in the water space.

Okay, so as part of establishing the OzNome architecture and infrastructure, we felt that we needed to assist our potential data providers to understand what good data is — what, in the context of this webinar series, FAIR data is, or what we call OzNome data. Basically we developed a set of rating criteria and a tool to allow assessment by data providers of the data that they're providing. On the right-hand side of the screen here you can see a screen capture of the kick-off page of the tool.
You'll also notice that we've got a slightly adapted version of the FAIR criteria — findable, accessible, interoperable, reusable — but we also add in the last line there "trusted", which appears to go a little bit beyond what has been conceived in FAIR until now, but which we suggest would be a useful addition. We're kind of bundling interoperable and reusable together; we see those as being very closely related. Obviously this is teasing out some of the issues around what it is that makes data interoperable. Keith has given a high-level overview and indicated what some of the concerns might be. We've done our own take on this, leaning fairly strongly on our experience over a number of years — more than a decade now — of working in the data standards communities, in particular the geospatial data standards communities, and some of the learnings from there were applied directly here. Environmental data, which is where our heritage is and where we've largely been working, is in large part geospatial, so it makes sense to be building on that.

Now, just a bit of a reminder: the FORCE11 FAIR principles. This is a summary slide from Michel Dumontier, who is one of the original authors of the papers and one of the developers of the FAIR principles. You've got the guiding principles with the four keywords, teased out into three or four sub-principles in each case under the F, A, I and R letters. We're looking at the interoperable set here, which Keith has already shown. It's interesting that Michel has recently done a study evaluating a number of repositories, particularly in Europe, though some are broader than that. Here's the list of repositories that were evaluated and scored against the FAIR principles. The data is available in this form — actually this table shoots off to the right of the screen, there's lots more going on there — but looking at the summary of the results, it's fairly notable that the tallest red bar here is in the interoperable category. What this is saying is that of the FAIR data principles, this is the one which is hardest to meet, the one that's hardest to conform to. So really that's the focus of the approach that we've taken, which is to lead people through how they can make their data more FAIR, more OzNome-ready, more interoperable. The particular way in which we've broken out the question of interoperability, if you look at the numbered items here, is: is it loadable, is it usable, is it comprehensible, is it linked, as well as is it licensed. I'm just going to go through some of the details of those, and you'll see this is, in a sense, fairly repetitive of some of the concerns that Keith explained at the beginning, but we're putting some more concrete examples onto these criteria to indicate to our data providers that when we say a standard data format, we mean something like CSV, or JSON, or XML, or NetCDF. These are all important file formats: the ones towards the left are kind of general, while NetCDF is one that's used a lot in the remote sensing and environmental science communities.
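To make the "standard data format" point concrete before walking the ladder, here is a small, purely illustrative Python (pandas) sketch: one logical table written out in more than one standard, MIME-typed format. The file names, columns and values are all made up.

```python
# Purely illustrative: one logical data set offered in multiple standard formats,
# so users can load it with whichever platform they prefer.
import pandas as pd

df = pd.DataFrame(
    {
        "date": ["2017-07-01", "2017-07-02"],
        "e0_avg": [3.2, 2.9],  # e.g. daily potential evapotranspiration in mm
    }
)

df.to_csv("evapotranspiration.csv", index=False)         # text/csv
df.to_json("evapotranspiration.json", orient="records")  # application/json
```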
So we've got a bit of a ladder here of different levels of conformance you can reach for whether a data set is loadable. Is it in a unique file format? Well, that means you've got to have some unique software to load it. Or is it in a standard data format, which normally would be denoted by one of the standard MIME types? Best of all would be for data to be provided in multiple standard formats, giving the user a choice, so that whatever their favourite platform for loading data is, they can use it.

Next question: even when you've loaded it, can you use it? If the structures within the data set are unclear, even once it's loaded, then it's not going to be very usable, and that comes down to whether there is a schema provided which makes explicit the data and the structures within the data set. A lot of traditional data does have a structure in there, but the schema is not available independently of the data; the schema is implicit, it's not formalised, and the schema may be different every time. A lot of spreadsheets are done that way — a spreadsheet has a lot of boxes, but if every time you use it you add different columns and use the pages in the spreadsheet in a different way, then it takes the user a little while to get their head around what's going on before they can use it. So there are various explicit schema languages: DDL, which is used for relational systems; XML Schema; something coming out of the Open Knowledge world these days called Data Packages, which essentially allows you to describe a schema for a CSV file; then in the RDF and semantic web space you've got RDFS; and even JSON has a schema language these days, although it's not broadly used. So it's good to provide data with a schema, but best of all is to be able to say "for this data I'm using this community schema". For example, the Open Geospatial Consortium provides a number of community schemas for observations, for time series, for hydrology, for geoscience, and if you're publishing or attempting to share data in any of those disciplines, then it's best to go off and find a community schema.

Then, even when you've got it loaded and you understand what the structures are, you've still got the question of what the words and numbers inside the boxes mean. Are the column headings explicit enough to understand? Or are they just shorthand for something which the project leader, when he or she was developing the data, knew they would understand the next week, but which even they, coming back to it the next year, may not understand? Best, of course, is if the field labels are linked to explanations, probably in plain text; better still is to use standard labels, for example the Unified Code for Units of Measure (UCUM) for units, or the Climate and Forecast (CF) conventions coming out of the fluid-earth sciences community. So the ladder that we've got here asks: are you using standard labels? Are some of the field names linked to standard, externally managed vocabularies, or are all of the field names linked to standard, externally managed vocabularies? And so the ladder gets better and better. (A small illustrative sketch of an explicit schema with externally linked labels appears just after this passage.)

And then there's the question of how well linked your data is. Well, if it's just a file sitting on a server somewhere, with no links in or out, you're lucky to find it at all.
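As flagged above, here is a purely illustrative sketch of the "explicit schema plus standard labels" idea, in the spirit of the Frictionless Data Table Schema just mentioned, written out from Python. The field names, URIs and the extra "unit" property are our own illustrative assumptions, not anything from the OzNome tooling or any real data set.

```python
# A purely illustrative, machine-readable schema for a CSV, in the spirit of a
# Frictionless Data Table Schema. Field names, URIs and the "unit" property are
# invented for this sketch.
import json

table_schema = {
    "fields": [
        {
            "name": "date",
            "type": "date",
            "description": "Day of the gridded estimate",
        },
        {
            "name": "e0_avg",
            "type": "number",
            "description": "Daily potential evapotranspiration",
            # Point the column at an externally managed definition rather than
            # relying on the shorthand column heading alone (illustrative URI).
            "rdfType": "https://example.org/def/property/potential-evapotranspiration",
            "unit": "mm",  # ideally a UCUM code resolvable against the UCUM spec
        },
    ],
    "missingValues": [""],
}

with open("evapotranspiration.schema.json", "w") as f:
    json.dump(table_schema, f, indent=2)
```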
What most of the data sets in this community would be expected to have is that they're indexed in a catalogue, or they're available from a landing page; that's a situation where you've got inbound links to the data set. Best of all is when there are outbound links embedded or implicit in the data structures in the data set, which say exactly how it's related, and this ties in with some of the previous concerns about field names and those kinds of things. So I'm going to hand back to Jonathan to work through a case study that we've got here, based on the AWRA-L (Australian Water Resources Assessment, Landscape) data sets. Jonathan?

Yes, so as mentioned earlier, in the OzNome project we looked at a practical example and case study in the AWRA-L data set. This is a continental-scale data set with historical time series from 1911. The Bureau of Meteorology publishes an operational version online — you can find it on the website — but often scientists have to deal with this data set by knowing where it is, knowing how to use it implicitly, knowing how to reference the requisite geospatial features, and understanding the field names and values. The next slide shows the assessment of it using our tool, just focusing on the interoperable side of things. We rated it as a web service, so we can get it via the web. However, the reference definitions are text only and they are localised in the data set itself; I'll give an example in the next slide.

So this is coming out of the NetCDF metadata for this data set — you can access it online through THREDDS or via the NetCDF tools — and this is a summary of the metadata that comes along with the data. We've got a long_name here, "potential evapotranspiration"; we've got the name, which is a label for the field, e0_avg; units, mm; and a standard_name, which is the NetCDF convention (from the CF conventions often used with this format) for referring to the actual observed property, and here it is again just e0_avg. So if you are an expert in this area and you've used this data set many times, you'll know what this is. If you are a newcomer, you have to do a bit of work to understand what this data field actually means.

In the OzNome project, what we did was enrich this with external variables. If you go to the next slide, Simon: this is the same field, and we've added the lines at the bottom here. They tease out what this particular data field means in the context of externally defined vocabularies. We've now enriched it with a scaled quantity kind identifier for potential evapotranspiration, an HTTP URI which you can resolve to get a definition, and similarly for the substance-or-taxon and unit identifiers and the feature of interest, which describe what those aspects are. So part of the project was to explore whether we could define vocabularies for these, from which we could reference outbound links from the data to the definitions. This is just a summary of what we did in the context of the AWRA-L data set, and this is the example of potential evapotranspiration: we've got a conceptual model here with broader notions of potential evapotranspiration and evapotranspiration, and we've got linked relationships out to things like the feature of interest, the object of interest, and the unit of measure.
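As a small sketch of that enrichment idea (ours, not the exact OzNome implementation), Python's netCDF4 library can write both the conventional inline attributes and extra attributes whose values are resolvable HTTP URIs for externally defined concepts. The attribute names, URIs and values below are illustrative assumptions, not those of the published AWRA-L files.

```python
# Illustrative sketch: alongside the usual long_name/units attributes, add
# attributes whose values are resolvable HTTP URIs for externally defined concepts.
# Attribute names, URIs and values are invented for this example.
from netCDF4 import Dataset
import numpy as np

ds = Dataset("awra_l_example.nc", "w")  # hypothetical output file
ds.createDimension("time", 2)
var = ds.createVariable("e0_avg", "f4", ("time",))

# Conventional inline labels (human-readable, but only locally defined).
var.long_name = "potential evapotranspiration"
var.units = "mm"

# Outbound links: the meaning of the field is defined outside the file.
var.scaled_quantity_kind = "https://example.org/def/property/potential-evapotranspiration"
var.unit_uri = "https://example.org/def/unit/millimetre"
var.feature_of_interest = "https://example.org/def/feature/landscape-water-balance"

var[:] = np.array([3.2, 2.9], dtype="f4")  # illustrative values
ds.close()
```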
Coming back to the vocabulary view: it provides an entry for potential evapotranspiration — not only the identifier for it, not only the description of it, but a richer model than you would get if you just had something inline. You've got outbound relationships from this concept to its related concepts. So this is a demonstration of defining the concepts externally, having them quite richly explained through this medium, and having the ability to link from the data set itself to this definition to make it more interoperable — so that if we have another data set that talks about potential evapotranspiration, it could potentially be linked and be interoperable with this one. (A small sketch of how such an entry might look appears at the end of this passage.)

Here is a revised OzNome maturity estimation using the OzNome five-star tool, again just focusing on the interoperable field. Using the same tool and assessing against the same criteria, we've gone up from two stars to more than four stars in the interoperable space. The reason is that we now have reference definitions as linked data and externally hosted observed-property vocabulary definitions, rather than just inline labels, which provides more interoperability; and if the vocabulary were standardised, we would have an even higher rating on that field. But it's a demonstration of how we went about making something more interoperable through the OzNome project.

And I'll just pick up at the end here and comment that when we were starting this data ratings exercise, we actually didn't look at FAIR at the beginning. We developed our own set of criteria, these keywords here, and then subsequently correlated them with the FAIR principles. One of the interesting things was that there were three lines in this table, the ones in red, which didn't correlate with concerns identified within FAIR. The first one might be seen as trivial, but we thought it was a question worth asking, particularly when working with research scientists and talking about making their data available: is your data intended to be used by anybody else? There's lots of data generated which is never shared. Now, that's not necessarily a good thing, and to a certain extent having the question there highlights the fact that there is a question to be asked, and that some scientists and researchers need to be encouraged to think about making their data available, about publishing it. So in terms of the FAIR principles, this one was a kind of implicit starting point: if it's published then, yes, it's implicitly FAIR territory. A couple of other rows: one concern comes up particularly because we've worked a lot with agencies that have systematic data collection processes, with systematic curation, maintenance and revisiting — a data set that is refreshed every day or every month or every year. That concern didn't seem to be particularly addressed in the FAIR principles as they stand, so we'd say the concern about whether the data is expected to be updated and maintained is maybe a bit more than FAIR. And the bottom row there was the concern about — this is, if you like, an elaboration of the assessment of data that you might do — getting some information about how well trusted it is.
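And here is the sketch promised above: a minimal, purely illustrative SKOS-style rendering (using Python's rdflib) of a vocabulary entry for potential evapotranspiration, with a definition, a broader concept, and outbound links to a unit of measure and a feature of interest. All URIs and the relation predicates are our own placeholders, not the OzNome vocabulary itself.

```python
# Illustrative SKOS-style vocabulary entry; URIs and relation predicates are placeholders.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, SKOS

EX = Namespace("https://example.org/def/property/")
REL = Namespace("https://example.org/def/rel/")

g = Graph()
g.bind("skos", SKOS)

pet = EX["potential-evapotranspiration"]
g.add((pet, RDF.type, SKOS.Concept))
g.add((pet, SKOS.prefLabel, Literal("potential evapotranspiration", lang="en")))
g.add((pet, SKOS.definition, Literal(
    "The evapotranspiration that would occur if water supply were unlimited.",
    lang="en")))
g.add((pet, SKOS.broader, EX["evapotranspiration"]))

# Outbound, typed relationships to related concepts; these predicates are
# illustrative stand-ins for the observable-property relations mentioned in the talk.
g.add((pet, REL.unitOfMeasure, URIRef("https://example.org/def/unit/millimetre")))
g.add((pet, REL.featureOfInterest,
       URIRef("https://example.org/def/feature/landscape-water-balance")))

print(g.serialize(format="turtle"))
```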
Coming back to trust: a lot of that is about who else is using it — that's often the criterion we use: who else is using it, how many times has it been used, what products have been generated from this data set, and so can I trust it? And just emphasising that row there: it corresponds with interoperability, which is what we've really been focusing on today — the use of standards, I guess. Standards is a funny word; you have to be a bit careful with it. Capital-S Standard — sometimes people think that's just to do with ISO or Australian Standards or whatever. Really the point about standards is that they are community agreements, agreements which are available for additional members of the community to join in with. But it's important to think of them as agreements — agreements to do things in a common way. So finally, just a slide with some links to some of the material that we've been showing today, and we'll say thank you for listening.

Thank you Simon, thank you Jonathan. That was really interesting and a really useful way to see what it actually means in practice, because I think interoperability can be quite a complex, difficult subject, and sometimes also one that requires much more knowledge of the actual field of research you're talking about. So I think this is a great example of working in a specific field to try and make that data more interoperable. Thanks very much for your time; this was a really interesting discussion, and it is really starting to tease out a number of the issues and a number of the things that will probably need developing further.

I've just put up a slide which links off to a number of resources, some of which Simon already mentioned. ANDS has a service, Research Vocabularies Australia, which anybody around the country — or indeed internationally — can use; if you don't have your own tool to set up a vocabulary, that is a possible way of doing it. There are also already existing vocabularies in there, so have a look at that if it's of interest. We also have an interest group that works in this space. If you're looking at the metadata, and at having qualified relationships within the metadata and using identifiers there, there are a few links to places where you can find information about possible identifiers. We're also trying to pull the metadata describing data sets together and share it internationally through a number of hubs; that's taking place through the Scholix project. Research Data Australia is sort of an Australian hub contributing into that international effort, so have a look there if you're interested. We ran 23 (Research Data) Things last year, and two of the Things are relevant for our discussion today, so have a look if you're interested in digging in a little further: if you want to discover what vocabularies mean in practice, have a go at Thing 12, or if you're more interested in identifiers and linked data, have a look at Thing 14.

Finally, I'd like to thank Simon and Jonathan again for their time, for the excellent presentation and for the insights they brought to the table. We would also like to acknowledge NCRIS, the National Collaborative Research Infrastructure Strategy, the program that provides the funding for ANDS. So thanks again, and I look forward to seeing you all next week.