Thanks for the opportunity to talk to your group about what we've been doing with our understanding of our data quality, and particularly applying FAIR principles to some of our data. As you mentioned, GNS is a little bit like CSIRO in that we are a government-owned research organisation, and we specialise in the geosciences sector of the New Zealand science landscape. We have a mandate to work in things like geological resources, geological hazards, and environment- and climate-related themes, and underpinning that is a whole group of work which goes under land and marine geoscience, on the bottom left of the slide, which is essentially the underpinning geoscience for the more applied ends of our research. Anyway, that's a very quick précis of what we do at GNS. I'm a geologist; I worked in the geological mapping space for a long time, and that included a tour in Australia from the late 80s until the mid 90s. I worked in one of Geoscience Australia's former guises for a number of years, and a lot of what I brought back to New Zealand had its origins in what I learned in Australia. The work I'm showing here is largely the work of myself and my close colleague, now ex-colleague, Maria Mavridi, who was hoping to join this meeting but unfortunately can't. So just bear in mind that while I'm the one presenting, this is also very heavily Maria's work. Now, I'm sure you all know what the FAIR principles are, and I don't really need to explain them to this audience, but I guess the key thing to take from them is that they're about machine-to-machine understanding. It's not just about a human relationship with the data; it's also about machines being able to find things, access them, and so on. So I won't dwell on this slide too long.
And in the wider context of why we undertook this particular analysis: in New Zealand we've got some rather old principles relating to data and information management. These essentially predate the FAIR principles, but they talk about things needing to be open, readily available, reusable, and so on. So a lot of these concepts existed well before the FAIR principles and are expressed in government requirements. Around 2020, we were designing a new set of business plan measures. We do this every year, and we often want to find ones in the data space. What we were trying to do with this particular one was to say: well, the FAIR principles exist, so how can we actually use them to get an understanding of our current state of FAIR compliance, and our potential to improve it over the years? That's essentially the driver. It got embedded in our business plan, and once you do that, of course, you have to comply and go for it. But it did raise a few questions. What is FAIR principles compliance? How do you know when you are compliant, and can you actually quantify that in a way that lets you say, in a few years' time, I want an even greater degree of compliance? So that's the driver behind why we went through this process: we were looking for a practical solution to give us a measure of how good our FAIR compliance was. As a research organisation, we invest heavily in our geoscience data sets, and we've got a sense of what makes a high-value data set versus one which is maybe not so high value. You can see the criteria on the left-hand side there, in terms of what constitutes a high-value geoscience data set. And when I say data set, it's a pretty elastic term. A data set is just a collection of related information, but that in itself could contain many other data sets.
So this 120 number that we're pulling out is probably just the tip of the iceberg when you break things down to the finest level. In the example there is a geological map collection, the geological map of New Zealand. You can have a product line underneath it, and then one of the parts of the product is a geological map, or the GIS version of it, and then you can have that GIS broken up into individual bits and layers. Each one of those could technically be regarded as a data set, but in this exercise we're treating the whole lot as a single data set. So the reality is we've probably got many thousands, if not tens of thousands, of data sets to work with at GNS. For this exercise, of those 120 we've taken, I don't think it's quite on my slide here, but I think it's in the order of 50, did I say 50 at the top? Anyway, we've got two different classes, or groups, that we're dealing with. The first is what we call the nationally significant collections and databases. These are data sets which have an official government recognition status, and they are well funded and reasonably continuously funded, and they've had little or no reassessment over the last 20, getting on to 23, years or so. There are 33 of them right now, I should say. So these data sets have been locked in stone for a long time. They have grown internally, and their scope may have changed. Each item in the left-hand column is what we call a data resource, let's say, and within that sits a bunch of data sets. The geological map I was referring to is just one of those, at the top. Can you see my cursor when I'm moving my mouse? Yes? OK, excellent, so I can use my mouse for pointing if I need to. So we've got things like a rock collection and database, we've got the national seismic or earthquake database, et cetera, et cetera.
So these are all of the geoscience data sets that we're going to apply this FAIR analysis to. The other group here is our natural hazards ones. These represent the data sets which are not part of the nationally significant ones, but are still incredibly important, and they are generally not as well funded. So this is what we're running the assessment over: these two groups. We do have a whole bunch of other data sets in the natural resources, environment, and climate spaces and so on, but these are the ones we concentrated on for this exercise. And again, when you break down those FAIR principles, and this is coming out of the GO FAIR initiative, each one of the principles can be broken down into a number of sub-principles, if you like. These are things by which you can measure how well you are meeting that particular principle. I won't read them out; I'm sure many of you are already familiar with them, but they're certainly available from the GO FAIR site if you want to look at them. Some of you may even be authors of this for all I know, but the ARDC have made a tool, a FAIR self-assessment tool, they call it. This is a slightly old iteration of it now; it looks like it's changed since I last looked at it last year. But it's essentially the questions I showed on the previous slide, all those F1 to F4 and so on, expressed as these questions here. Then there's a set of answers, and the answers have a certain level of influence on where this green bar slides across the line. So the green bar for findable is solid, indicating a hundred per cent score for that particular set of four questions; accessible is slightly less, and so on. And then there's a total down at the bottom.
Now, the purpose of this tool, and we've probably ridden a little roughshod over its original intention: the intention was educational and informational, according to the website. It wasn't designed to be used in a quantitative, comparative way at this stage. However, as soon as you put something out on the internet and you've given it scores, then it's almost open slather, and that's exactly what we've done. We burrowed into the HTML code and extracted essentially the guts of how this works. The reason we used this one is, first of all, it's available, and it's one of the clearest and most effective ones that we saw. But it's also because people who know more about FAIR compliance than we do have actually thought about how they weight the scores. Somebody else, people who actually know what they're talking about, has done that work, so we were more than happy to borrow that expertise. If you look under the hood, you can see the answers to all of those questions and you can see the scoring. Of course, it's not linear scoring: the best answers, or the most complete ways of answering, give you higher scores, obviously, but not in a linear fashion. These are the scores that are applied under the hood in that particular application. So what we've done, or what Maria has done, I should say, is put that into something we could use in a more efficient way for our purposes. Basically, we built a spreadsheet version of it. The idea is that for every question you nominate one of the answers, and that tells you what score you're going to get for it. Then it adds them all up, and you get a percentage of compliance for each of those categories. So that's what we did.
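To make the spreadsheet mechanics concrete, here's a minimal sketch in Python of that weighted-answer scoring. The questions, answer options, and weights below are invented placeholders for illustration; they are not the actual values extracted from the ARDC tool.

```python
# Minimal sketch of the spreadsheet scoring logic: each FAIR dimension has
# questions, each chosen answer carries a weight, and the per-dimension score
# is the sum of chosen weights as a percentage of the best possible total.
# All questions, answers, and weights here are illustrative, not the ARDC's.
WEIGHTS = {
    "findable": {
        "Does the dataset have a globally unique identifier?": {
            "No identifier": 0,
            "Local identifier only": 3,
            "Web address (URL)": 5,
            "Globally unique, citable identifier (e.g. DOI)": 10,
        },
        "Is the dataset described by rich metadata?": {
            "No metadata": 0,
            "Brief title and description": 4,
            "Comprehensive, standards-based metadata": 10,
        },
    },
    "accessible": {
        "How is the data accessed?": {
            "No access": 0,
            "By request only": 4,
            "Open, via a standard protocol": 10,
        },
    },
}

def score(answers):
    """Return a percentage compliance score per FAIR dimension.

    `answers` maps dimension -> question -> the chosen answer text.
    """
    result = {}
    for dimension, questions in WEIGHTS.items():
        got = best = 0
        for question, options in questions.items():
            best += max(options.values())          # best possible for this question
            got += options[answers[dimension][question]]
        result[dimension] = round(100 * got / best)
    return result
```

So a data set with a DOI but only brief metadata would score 70% on findable (14 of a possible 20), mirroring the non-linear weighting the talk describes: the most complete answers pull the score up disproportionately.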
So for all of those 50 data sets that I showed you earlier, we basically ran the spreadsheet over them and attempted to score each one against all of these questions. Does it have a license? Has it got a metadata record? Has it got a DOI attached to it? All of those questions were asked of each data set. It was done in a very manual way. Maria would go to the data set manager, the person who understands the most about that data set, and she would go through the questions with them. She could, at some level, anticipate their answers or give them guidance about what they should be answering, because not everyone is comfortable with all of this terminology. So it was quite a manual, time-intensive process, but it was ultimately very good for the data set managers to be involved, because they got a sense of what is really important, from a FAIR point of view, for their particular data set. So it was actually a very good educational exercise. Here are the results for the nationally significant collections and databases. You can see five columns on the right, with all the data sets listed on the left: the findable, accessible, interoperable, and reusable scores, and then a combination of those, expressed as a percentage, as a total sense of FAIR compliance for each one. That whole raft of numbers is not particularly digestible in this form, so I'll show you something which is maybe a little simpler to follow. This is a selection from both groups of data sets, not all of them. We've got the nationally significant ones up here, and on the lower part we've got the high-value natural hazard data sets. Green is where the data set achieved the best possible score for that particular question, so you can get a sense of where it's green.
Where it's green, it's good; that's real FAIR compliance by those standards. Where it gets a pink score, that's the lowest possible score. You can see that some data sets within the nationally significant group score green quite consistently, whereas others show a few pinks, which means they're not nearly so compliant. Look at the natural hazard ones down below: there's a lot more pink showing. So basically this is saying that the high-value natural hazard data sets are not as FAIR compliant as the other ones. Here's another way of looking at it. You can see the individual question components coloured in here, but essentially these are the findable scores for the two groups: the one on the left is the nationally significant databases, and this is the natural hazard ones. You can see that the natural hazard ones are always lower in terms of their compliance score. I'm just going to divert a little into the importance of metadata, because this has been absolutely critical to getting a sense of compliance, and in two ways. First, the presence of metadata, in something that's harvestable and so on, means the data are findable. If you've got metadata, that makes it findable, so that's a great start. But metadata is also the way that you express how accessible the data are, and their interoperability: you can reference how it is interoperable, whether the data reference vocabularies, for example, or follow data models. And the reusable components too, because embedded in the metadata are statements around licensing and the provenance and lineage of the data. So metadata is absolutely critical to the data quality story. You know, I look at some of our sister organisations in the New Zealand science space, and some of them don't have well-developed data set catalogues.
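The point that a single metadata record carries the whole FAIR story can be sketched as a tiny example. The field names below are simplified placeholders, not ISO 19115 or any particular metadata standard, and the DOI and URL are hypothetical.

```python
# Illustrative metadata record showing where each FAIR element lives.
# Field names, the DOI, and the URL are invented for this sketch; they do not
# follow a formal schema such as ISO 19115.
record = {
    # Findable: an identifier and a harvestable description
    "identifier": "doi:10.0000/example-geomap",   # hypothetical DOI
    "title": "Geological Map of New Zealand (example entry)",
    # Accessible: how the data can be retrieved
    "access": {"protocol": "https", "landing_page": "https://example.org/data"},
    # Interoperable: the data model and vocabularies the data conform to
    "conforms_to": ["GeoSciML data model", "international geoscience vocabularies"],
    # Reusable: licence plus provenance/lineage statements
    "license": "CC-BY-4.0",
    "lineage": "Compiled from 1:250 000 field mapping; see report for detail.",
}

def fair_elements_present(rec):
    """Quick check: does the record say something for each FAIR dimension?"""
    return {
        "findable": bool(rec.get("identifier") and rec.get("title")),
        "accessible": "access" in rec,
        "interoperable": bool(rec.get("conforms_to")),
        "reusable": bool(rec.get("license") and rec.get("lineage")),
    }
```

The design point is the one made in the talk: the record is what a harvester sees, so if the license, vocabularies, or lineage statements aren't in the metadata, they effectively don't exist for a machine assessing FAIR compliance.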
And I think they're going to be struggling from a FAIR point of view until they do have something like this in place. You can see some of those elements here; I won't go into the details. Anyway, getting back to the results. We're seeing that the nationally significant databases are consistently scoring better than the natural hazard ones. Findability is high in both groups because we do have that metadata catalogue. Accessibility is a little bit variable: some of the natural hazard data sets don't need to be made public or open, whereas for the nationally significant ones that's a requirement, so good accessibility is really important there. Interoperability is lowest in both categories, and I think that reflects the lack of standards appropriate to some of those particular types of data sets. I mean, I work in the geological mapping space, and we've actually got some really good international standards that we are applying: the GeoSciML data model, and the international vocabularies that we bring to bear. But not every data set has that sort of resource available to it, hence the interoperability scores are generally on the lower side. Our reusable scores are lower in part due to not-so-great practices within our own organisation around defining licenses. We've been a little bit ad hoc in that space, and I think that's reflected in some of our reusability scores. And to be fair, some of the metadata content describing the provenance and lineage of the data is not as complete as it probably should be, so that's an area where we can make some improvement. And of course, I think it's pretty obvious with the nationally significant databases that their consistent and pretty good funding makes a big difference in being able to support FAIR, or data quality in general.
So these scores: we ran this process a couple of years ago, and it certainly marked out areas that we could improve, and we've actually gone a long way towards doing some of that improvement. For example, even through the process of doing the assessment, we realised: hey, if we create DOIs for all of these data sets, then our scores go up instantly. That was almost something we did as we were running the survey. Whereas improvements in terms of licensing and the wording around lineage and provenance and so forth are more of an ongoing thing. But through that process, if we were to run this again using the same parameters as we did two years ago, we would expect to see a climb in scores across the board. This is feeding into the wider data management maturity model work that we're doing; FAIR is an important component of that, and it's also giving us a sense of how good our data is and where we need to get our data to in the future. So, and I've just been touching on some of this, if you want to be pragmatic about the things that make a difference in improving FAIR compliance, these are the items within each category. I mentioned DOIs under the findable category; they make a big difference to the score if you don't already have them. If your metadata are harvestable, again, that works really well. Some of the accessibility items don't really change very much: if the data are on a server, they pretty much have to be under those conditions anyway, but APIs are important. And for interoperability, if there are standards out there around data models or terminology and so forth, and you adopt them, those are going to influence your compliance levels.
And then for reusable, as I said before, it's about making sure the license is clearly expressed. If, by default, you just put on Creative Commons Attribution (CC BY), you've gone a long way towards ensuring that. And then just increasing how clear you are about how the data have been put together, the provenance, makes a big difference too. So, GNS is one of what we call the Crown Research Institutes, and there are another five Crown Research Institutes. A relatively recent initiative is to try to get some of our data into a common platform, or at least findable through a common platform. This is the platform here; we're calling it the National Environmental Data Centre. It's a step in the direction of what you guys, the Australians, have already got with your Research Data Australia platform, where you can search across the spectrum of science data in the Australian science landscape. This is our early step towards that, and effectively we've got a bit of an online catalogue for doing it. The reason I mention it is that, as a collective group of Crown Research Institutes, we're also going through a data quality exercise, and this is some of the work we've been doing on that. Effectively it's built upon FAIR: virtually all of the terms in this checklist relate to things which address FAIR principles in one way or another. Does it have a link to the terms of use, the license, for example? Common format addresses interoperability to some extent. Reusable refers to the data lineage, and so on. So each CRI has put all of its important, accessible databases into the portal, and now we're running the ruler over them and asking: do they actually have data quality that is acceptable for them to be there, or do we have to improve the quality, or take them down?
So this is what we've done through a spreadsheet here. On the left-hand side are some of the data sets that we now have on this shared portal, and again the colouring is much the same: green is where the checklist item is present, so it has a license, or it addresses a data model, or it's got a provenance statement, et cetera. Pink is where it's not there. And orange is where it may be there, or it's partially present, or it's maybe a bit difficult to find. So this spreadsheet is just giving us a quick visual of the status of the data sets we're pushing up, and you could argue that maybe the Cenozoic Mollusca of New Zealand one has got too many orange or pink scores and needs some work to make it acceptable to lodge on the site; or, if we can't do that, then we simply take it down. So this is a parallel and related exercise. It's not as prescriptive as the scored version of FAIR that we're doing; it's much more of a qualitative assessment. This is my last slide, I think. Anyway, having gone through the exercise on the high-value data sets that we looked at, I think by and large we are measurably FAIR. We can actually tick the box to say we are pretty compliant, and what's more, we can say almost exactly how compliant we are. I think that's been a good step forward. As mentioned before, our data managers have actually learnt a lot through this process, so in terms of data management, being exposed to ways of understanding data quality is really important for them. It's not just about the accuracy of the numbers or anything like that; it's about the quality of the data, the things that matter to the end users in terms of being able to use that data down the track.
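That qualitative traffic-light checklist can be sketched in a few lines. The checklist items, statuses, and the threshold for flagging a data set are invented for illustration; the talk doesn't specify a cut-off for taking a data set down.

```python
# Sketch of the qualitative traffic-light checklist described above.
# Item names, statuses, and the flagging threshold are illustrative only.
STATUS_COLOUR = {"present": "green", "partial": "orange", "absent": "pink"}

def colour_row(checklist):
    """Map each checklist item's status to its display colour."""
    return {item: STATUS_COLOUR[status] for item, status in checklist.items()}

def needs_work(checklist, max_flags=1):
    """Flag a data set when too many items are partial or missing.

    The max_flags threshold is an assumed parameter, not one from the talk.
    """
    flags = sum(1 for status in checklist.values() if status != "present")
    return flags > max_flags
```

Unlike the scored FAIR spreadsheet earlier, this deliberately stays qualitative: it only surfaces which elements are missing, leaving the accept/improve/remove decision to a person.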
And then the last statement is really just that money does make a difference in terms of that compliance. So, there are a couple of links there. This report is publicly available; it's a GNS report, but not an internal one, so it is externally available for download. We can probably make this presentation available too, and there's the link to the cross-CRI one. So, with that, I'll stop.