Good morning, everybody, and welcome back to the FAIR Data 101 webinar series. I'd like to begin by acknowledging the traditional owners of the land on which we all are today; for me in Perth, that is the Whadjuk people of the Noongar nation. I'd like to pay my respects to their elders past and present.

Thank you for coming back on our irregular day. I hope everybody who had one enjoyed their public holiday yesterday, and you're all bright-eyed and bushy-tailed to learn a bit about interoperability. This is the first of two webinars on interoperability; the second one will be delivered by Liz Stokes tomorrow at the normal time. A quick reminder: this course is governed by a code of conduct, and if you observe any breach of the code of conduct, please let the ARDC know via the link in the code of conduct itself.

So today, as I said, I'll be talking about interoperability. Interoperability in terms of FAIR is a bit tricky to talk about, and I will get into why that is over the course of this talk. This talk will not be as technical as mine was two weeks ago, but it should give you a good overview of what the I1 through I3 guiding principles mean, and how it is that we can try to address them to the best of our ability.

So let's get right into it. My question is: what's the most expensive mistake you've ever made? In the late 90s, there was a pretty big mistake made by NASA and one of its subcontractors. This here is the Mars Climate Orbiter, an ill-fated spacecraft that was sent to Mars and was intended to relay messages from various other instruments on and around Mars back to Earth. When the Mars Climate Orbiter got to Mars, it was meant to perform a series of braking manoeuvres.
This is a pretty standard thing for spacecraft that are trying to orbit Mars. They go around it in big loops that get smaller and smaller, and the idea is that the spacecraft passes through the very upper atmosphere and uses that to slow itself down. It tries to get from Earth to Mars as quickly as possible, but that's too fast to maintain a stable, close orbit of Mars, so it really needs to slow down to bring itself closer.

Now, what happened is that NASA lost contact with the Climate Orbiter. Nobody knew right away what was going on, and then eventually they worked out what the problem was. The issue was between two components of the data processing software used to control the Climate Orbiter. One piece of software was written by NASA, and the other piece was written by a subcontractor. They were meant to exchange data, or rather, the subcontractor's component was meant to send data through to NASA's component. NASA's component was expecting to receive measurements in the metric system (I believe it was units of impulse), and the subcontractor's component sent them through in imperial units instead. So the subcontractor and NASA both made assumptions about what kind of data was being transferred, and therefore the software they built also made those assumptions.

This is really what kicks off the first of the interoperability guiding principles, I1: "(Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation." Now, you could interpret this statement in lots of different ways if you really look closely at it.
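The orbiter's unit mix-up can be sketched in a few lines of Python. To be clear, this is only an illustration, not NASA's actual software: the function names, the example value, and the dictionary-based unit tag are all invented for the sketch, though the pound-force-second to newton-second conversion factor is real.

```python
# A toy sketch of the Mars Climate Orbiter failure mode: one component
# emits a value in imperial units, the other assumes metric. Tagging
# every value with its unit lets the receiver convert (or refuse)
# instead of silently misinterpreting it.

LBF_S_TO_N_S = 4.448222  # pound-force seconds -> newton seconds

def subcontractor_component():
    """Produces an impulse reading tagged with its unit (illustrative)."""
    return {"value": 100.0, "unit": "lbf_s"}

def nasa_component(reading):
    """Expects newton seconds; converts known units, rejects unknown ones."""
    conversions = {"N_s": 1.0, "lbf_s": LBF_S_TO_N_S}
    if reading["unit"] not in conversions:
        raise ValueError(f"unknown unit: {reading['unit']}")
    return reading["value"] * conversions[reading["unit"]]

impulse = nasa_component(subcontractor_component())
print(round(impulse, 1))  # the imperial reading, safely converted to N·s
```

Because the unit travels with the number, the mismatch is caught at the boundary rather than corrupting every downstream calculation.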
There are lots of words. But when it gets down to the nitty-gritty, what the original authors of the paper are suggesting, and what this principle is after, is to try to address issues like this.

What we have here is a table of observations. We have a date, and we have a temp, and under temp is a series of numbers. As a human, I look at that data and I go: okay, we've got a series of dates, and I'm going to assume that column is temperature. So there's the first assumption: that that is temperature. My second assumption is that the date is written, for me in Australia, as day/month/year. So what we have is a series of observations made from the 1st to the 6th of January 2020.

Then, with the temperature, I actually have no idea. There are three different possible units of measurement for temperature that I certainly know of: we've got Celsius, we've got Fahrenheit, we've got Kelvin, and they would be wildly different values depending on the unit. If all of those numbers are in Kelvin, then those temperatures are quite low; if those temperatures are in Celsius, they're really high.

What would be really, really nice is if the data told us what it contains. It's using a particular schema here to record values and present them to us, but we don't know what that schema is. So we can improve this data set a little bit and give a bit more context. Now, with that information, the data set is telling us: okay, my assumption was correct, the dates run from the 1st to the 6th of January 2020. Because, of course, it could have been recorded by somebody in the United States, in which case it would have been the 1st of January, the 1st of February, the 1st of March, and so on. And the temperature, we now know, is in Kelvin. So they're cold.
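Here is a minimal sketch in Python of what "data that tells us what it contains" might look like. The schema keys and layout are invented for this example; they don't follow any particular standard, but they show how an explicit schema removes both of my assumptions.

```python
# The raw rows carry no meaning on their own, so we attach a small
# schema that pins down the date format and the temperature unit.
# A parser can then interpret the rows instead of guessing.
import datetime

schema = {
    "date": {"type": "date", "format": "%d/%m/%Y"},  # day/month/year
    "temp": {"type": "number", "unit": "kelvin"},
}

rows = [["01/01/2020", "272.6"], ["02/01/2020", "270.1"]]

def parse_row(row, schema):
    """Interpret a raw row using the schema rather than assumptions."""
    date = datetime.datetime.strptime(row[0], schema["date"]["format"]).date()
    return {"date": date, "temp": float(row[1]),
            "unit": schema["temp"]["unit"]}

first = parse_row(rows[0], schema)
print(first["date"].month)  # 1: January, because the schema says day-first
```

Swap the format string to `%m/%d/%Y` and the very same rows become January, February, March observations, which is exactly why the schema has to travel with the data.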
They're low temperatures below freezing point, not high temperatures well above boiling point. The reason why this is important, especially from a machine-readable point of view, is that when it comes down to it, computers are not very smart. Even as a human looking at that data set for the first time, I was making big assumptions about what the data contained, and I had to be sure that my assumptions were correct if I wanted to process that data and use it effectively. Computers aren't any different, except they can't actually assume. They require a human to make assumptions on their behalf, to program them in a certain way to expect certain inputs. And if a computer gets inputs that it doesn't expect, and it hasn't been taught how to handle those incorrect inputs gracefully, the computer crashes, or fails, or stops working in some way. Or, possibly a little bit worse, it keeps working, but in really unintended ways.

So, what can we do about this? Actually, before we get to what we can do about that, here is a quote from the original FAIR guiding principles paper, saying why we really want data to be able to communicate about itself. Every time a new data type is created, or a new way of recording data,
somebody needs to write a piece of software that can interpret that data. That's a parser: something that parses the data and makes it available to a program. And any time somebody creates a new data format, we need a new parser, and that parser will often only work with one language. Say they've written that parser in Python. If somebody then wants to use that new data type in R, or in C++, or in Fortran, they need to write a new parser for that as well. And there are literally hundreds of programming languages, if not thousands, and anybody who uses one of those languages and wants to use this particular data type would need a parser available to them. When it comes down to it, given limited time and limited human resources, it's simply not sustainable to keep writing new parsers for new kinds, or new formats, of data.

So what would be really nice is if data used a standard way to describe itself to whoever was trying to access it, or to the piece of software trying to access it. You could say there's some speed data-ing going on. All right, I apologise for that pun. Liz egged me on; I normally don't like puns at all, so I'll put the blame firmly on her.

What we really want to do is determine the attributes of data rather than assume the attributes of data, because if we determine those attributes, we then know what valid inputs are. We know, for example, that temperature in Kelvin can never go below zero kelvin, and in Celsius, temperature can never go below minus 273.15 degrees Celsius. So
if a data set told a piece of software, "Hey, this data field is temperature, and I've measured it in kelvin; by the way, one of my temperatures is minus two," the software will know that that piece of data is invalid, and it will be able to handle it gracefully rather than falling over and creating unexpected outputs.

Now, unfortunately, there aren't actually too many different self-describing data formats in common use around the world, especially for science. What I've got here are two examples. We've got DDI, the Data Documentation Initiative, which is an organisation that has a couple of different formats it's working on, and there's also NetCDF, the Network Common Data Form.

So first up, DDI, the Data Documentation Initiative, has a data standard called DDI Codebook (DDI-C). What a DDI-C codebook does is allow somebody to describe a data collection instrument for the social sciences, which is generally a survey. For those who aren't very familiar with surveys: a good survey will have a codebook created for it, and what the codebook describes is each question in the survey, and what the possible answer values for that question are. Is it multiple choice? Is it a Likert scale, where people indicate, say, one to five, how pleased they were with the service they received today? Or is it something more open-ended?
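Coming back to the temperature example for a moment, that graceful handling might look something like the following sketch. The function and the table of physical lower bounds are illustrative, not any real library's API.

```python
# "Determine, don't assume": because the data declares its unit, the
# software can check physical validity (nothing below 0 K / -273.15 °C)
# and handle bad values gracefully instead of crashing or silently
# producing nonsense.

ABSOLUTE_ZERO = {"kelvin": 0.0, "celsius": -273.15}

def validate_temperature(value, unit):
    """Return the value if physically possible, else None."""
    floor = ABSOLUTE_ZERO.get(unit)
    if floor is None:
        return None  # unknown unit: refuse to guess
    if value < floor:
        return None  # e.g. minus two kelvin is impossible
    return value

print(validate_temperature(-2.0, "kelvin"))   # None: below absolute zero
print(validate_temperature(-2.0, "celsius"))  # -2.0: a chilly but valid day
```

The same reading, minus two, is invalid or valid depending entirely on the declared unit, which the software could never have worked out on its own.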
A fully developed codebook will possibly also record the actual responses to the survey. These codebooks were originally human-readable: you could grab a document that has all the questions, all the answers that were given, what the possible responses were, how many questions were asked, and things like that. What DDI-C lets you do is record all of that in an XML-based language. That means you can create a codebook in a standard file format, in DDI-C, using XML, and by describing the data it contains (the survey responses, but also the questions and the valid answers to those questions), any piece of software that understands DDI-C can pull your codebook in, understand all the questions, and know all the valid responses, data types, and so on, and you can work with it that way. That standard has actually been around for a very long time; it was first developed in the late 90s, I believe.

Now, that's not the only thing DDI does. DDI also has a different metadata standard, DDI-Lifecycle (DDI-L). This is, certainly from my point of view, a very complex metadata standard that lets you track and describe research data throughout its entire life cycle. It's something I definitely need to learn more about, but that's out of scope for this webinar.

We then also have NetCDF, the Network Common Data Form. Like DDI-C, it's been around for a very long time, and what NetCDF does is let you create a self-describing data set that contains geospatial data. NetCDF first came out of atmospheric research, but these days it's also being used in other disciplines that need to store geospatial data and like to use self-describing data sets.
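NetCDF organises a file into dimensions, variables, and attributes such as units. That structure can be illustrated with a plain Python dictionary; note this is a toy mock-up of the data model, not real NetCDF I/O (for actual files you would reach for a library such as netCDF4 or xarray), and all the names and values below are invented.

```python
# A plain-Python illustration of NetCDF's self-describing structure:
# the data travels bundled with the metadata that explains it, so a
# reader never has to guess what a variable means.

dataset = {
    "dimensions": {"time": 3},
    "variables": {
        "time": {"dims": ("time",), "units": "days since 2020-01-01",
                 "data": [0, 1, 2]},
        "sea_water_temperature": {"dims": ("time",), "units": "kelvin",
                                  "data": [287.1, 287.4, 286.9]},
    },
    "attributes": {"title": "Toy mooring record"},
}

def describe(ds, name):
    """Discover what a variable means directly from the data set."""
    var = ds["variables"][name]
    return f"{name} [{var['units']}], {len(var['data'])} values"

print(describe(dataset, "sea_water_temperature"))
```

Any tool that understands this layout can open a file it has never seen before and still know the units, the dimensions, and what each variable represents.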
So, for example, the AODN, the Australian Ocean Data Network, is a fantastic source of data about Australian oceans: all sorts of readings, salinity, temperature, anything you can think of that's been collected by instruments, whether they stay still, or are ocean gliders moving around, or data from boats. It collects and presents data from the Integrated Marine Observing System (IMOS). Now, the AODN makes all of the data stored in it available as NetCDF. On top of that, they also have a bunch of web services. This is what I was going on about in my last webinar, on Accessible: the AODN has made their data available through a series of different web services, different interfaces and communications protocols, and you can use any one of these, or several of them, to get data as NetCDF. Then your data is not only machine accessible but also machine interoperable. So they're doing some really good work when it comes to having FAIR data.

All right, now, the problems with self-describing data, or certainly some of the problems that I perceive (you might disagree; feel free to do so in the questions, or during our discussions next week, or in the Slack as well). These data formats, certainly DDI-C and NetCDF, have been around for a very long time, so they're old. But at the same time, researchers in domains that are not geospatially based, not atmospheric, and not in the social sciences struggle with this idea of interoperability. We have these examples that are great for a couple of disciplines, but they're not necessarily that useful for other disciplines in the state they're in now. Now, I'm sure work is being done to make these standards more viable for other disciplines, but because the concept of interoperable, self-describing data is so new to so many people, trying to understand these very old formats, with their very long legacy, can take a lot of brain power.

Also, these self-describing data sets.
They're not as straightforward as a spreadsheet, and Excel remains to this day the most popular data processing and data analysis tool used in the world. It is unlikely to be unseated any time soon. And by default, these data standards aren't necessarily compatible with, or usable in, Excel. Although why you would want to edit a survey codebook in Excel, I don't understand. There are tools available that let you use both DDI-C and NetCDF, but they might be slightly harder to access, or slightly harder to learn, or take more time to learn, than Excel.

All right, moving on to principle I2: (meta)data use vocabularies that follow the FAIR principles. This one I actually think is relatively straightforward, or at least it is to me, but I have the privilege of coming from a librarian background, so this idea of vocabularies isn't very new to me. For those who would still like to learn a little more about vocabularies, I'll throw up an example.

Okay, here's some more fake data. We have some dates; okay, great, this time we know that they run from the 1st to the 6th of January 2020. I've got a species written down: magpie. Okay, great. And then numbers: one, two, three, two, one, two. My assumption (here we go again, assuming; maybe this data could be a little more interoperable) is that somebody sat down on a series of days and counted the number of magpies they saw. Fantastic.

Now, when I say the word "magpie", what comes into your head? Is it the bird on the left, a European magpie? Or is it the bird on the right, an Australian magpie? The European magpie is a species of corvid, so quite closely related to crows and ravens. If you're in Australia and you saw one of those in the wild, it might be cause for alarm: how did that escape from wherever it might be?
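That name clash is exactly what a controlled vocabulary resolves. A tiny sketch: the same common name maps to two different species once region and an agreed naming system are taken into account. The lookup table is invented for illustration (it is not a real taxonomic service), though the two binomial names are the real ones.

```python
# Two datasets that both say "magpie" may mean different birds.
# Mapping local names onto binomial species names makes them
# directly and safely comparable.

LOCAL_TO_SCIENTIFIC = {
    ("magpie", "AU"): "Gymnorhina tibicen",  # Australian magpie
    ("magpie", "EU"): "Pica pica",           # European magpie
}

def normalise(common_name, region):
    """Translate a colloquial name into the controlled vocabulary."""
    return LOCAL_TO_SCIENTIFIC.get((common_name.lower(), region))

# The same word resolves to two different species:
print(normalise("Magpie", "AU"))  # Gymnorhina tibicen
print(normalise("Magpie", "EU"))  # Pica pica
```

Once both data sets carry the scientific name instead of the colloquial one, a comparison script no longer needs to know, or guess, where each was recorded.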
And conversely, if you're in Europe and you suddenly heard the dulcet tones of an Australian magpie warbling its call, or heard the sound of a magpie trying to swoop you, again, cause for concern: how did that Australian magpie get to Europe? Australian magpies are also not actually very closely related to crows and other corvids; they're songbirds, passerines. So the magpie on the left, the European magpie, is more closely related to our Australian crows than the Australian magpie is.

Okay, so now, there is thankfully a well-established method for uniquely identifying species. So here we go: we've swapped out the common name, magpie, for the species name (I'm probably going to make a hash of this): Gymnorhina tibicen. That is the species of the Australian magpie. By swapping from the colloquial name of magpie to the established vocabulary of the binomial classification system for naming species, we know without a doubt that we're talking about Australian magpies here. Unfortunately, there are nine subspecies of Australian magpie, but I will not get into that. I knew there would probably be some birdwatchers in here saying, hey, there's more than one subspecies and it does matter which one; so here we go, I'm acknowledging that they do exist. Sorry for going off on that.

So the idea is: wherever possible, rather than coming up with your own vocabulary to describe things, consider finding a pre-established one that others might also be using. Say you're describing the colours of something: rather than defining your own series of colours for people to choose from, consider finding a pre-established list of colours. The real benefit of that is that it lets you, for example here, compare similar data with similar data more easily. So if I was trying to compare observations of Australian magpies and I found this data set,
I wouldn't know whether it would be useful to me. But I do know that this particular data set, with the species name, would be useful to me when comparing Australian magpies. Liz will be going a bit deeper into vocabularies tomorrow, including some places to find vocabularies.

Okay, principle I3 (sorry for skipping ahead too quickly): (meta)data include qualified references to other (meta)data. Now this, like the first interoperability principle, is a situation where I think the principle is quite aspirational, but in terms of technology and understanding and culture change, we're probably not quite there yet. What this guiding principle is trying to ask us to do is to make all data linked data.

Okay, so here comes a very quick primer on linked data. The idea is that in linked data, everything is described as a triple, and that triple consists of three pieces (it's a triple): there is a subject, there is a predicate, and there is an object. If you're a big fan of grammar, you might already know what I mean here, but I think I had to look up "predicate" myself. Here's an example of subject, predicate, object: Matias is employed by the ARDC. I am the subject. The predicate defines my relationship to the object, which is the ARDC. You can also flip that around: the ARDC is the subject, "employs" is the predicate, and Matias is the object. I'm fine being an object; it's something you have to live with occasionally. And then we can also say the ARDC employs Liz, and the back-reference, Liz is employed by the ARDC, could also be made. And then Liz and I could also have a relationship: Matias works with Liz, or Liz works with Matias.

So the idea of linked data is that everything has a link to something else. Now, not everything has a link with everything (that's a bit meta, "everything's connected"), but everything has a link to something else, even if it's only one thing.
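The subject-predicate-object idea can be sketched with plain Python tuples. Real linked data would use IRIs and tooling such as rdflib; bare strings are enough to show the shape, and the statements below are just the webinar's own examples.

```python
# Linked data in miniature: every statement is a (subject, predicate,
# object) triple, and a graph is just a collection of them that can
# be queried from either end.

triples = [
    ("Matias", "is employed by", "ARDC"),
    ("ARDC", "employs", "Matias"),
    ("ARDC", "employs", "Liz"),
    ("Matias", "works with", "Liz"),
]

def objects(subject, predicate):
    """Everything the subject is linked to via this predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("ARDC", "employs"))  # ['Matias', 'Liz']
```

Because every statement has the same three-part shape, generic software can traverse any such graph without knowing in advance what it describes.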
There are very few things that exist in total isolation, except those of us who have been working from home through the coronavirus. So, Liz works with Matias, as I said.

How can we do this with our data? To really fully follow principle I3, we would turn every single observation into a triple of some kind. In fact, what you can see here in this table is several observations at once: on the first day we recorded one observation of the Australian magpie, and on the second day we recorded two observations of the Australian magpie. So there are actually two separate observations there. It might be a little bit better if we had a time as well, but that can't be helped with this example. What we would like to do is say something like: an observation is of a bird. So this observation is of the Australian magpie, this genus and species.

Now, with linked data, what linked data really, really, really wants is for things to be linked via their PIDs, their persistent identifiers. Back in my description of "Matias works with Liz", we would really like to use PIDs for Liz and me, so we could possibly use our ORCIDs, so that my ORCID "works with" Liz's ORCID, and we make that assertion. Now, this is unfortunately where things start getting a bit tricky for linked data, and that is: what is the PID of a bird? We've got our Australian magpie, or our subspecies. How do we find the PID of that bird? Is there a PID for that bird?
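Taken literally, principle I3 would turn the table's observations into triples whose subjects and objects are identifiers rather than bare strings. The identifiers below are made-up placeholders, precisely because, as discussed next, the question of what the real PID of a bird should be is still open.

```python
# Observations as triples: the observation links to a species record
# by identifier, and the species record carries the actual name.
# All identifiers here are illustrative placeholders, not real PIDs.

obs_triples = [
    ("obs/001", "observed on", "2020-01-01"),
    ("obs/001", "is of species", "species/gymnorhina-tibicen"),
    ("obs/001", "count", 1),
    ("species/gymnorhina-tibicen", "scientific name", "Gymnorhina tibicen"),
]

def resolve(identifier):
    """Collect everything asserted about one identifier."""
    return {p: o for s, p, o in obs_triples if s == identifier}

# Follow the link from the observation to the species it references:
species_pid = resolve("obs/001")["is of species"]
print(resolve(species_pid)["scientific name"])  # Gymnorhina tibicen
```

The qualified reference is the point: the observation doesn't repeat the species details, it points at them, so anything else that points at the same identifier is unambiguously about the same bird.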
Maybe somebody's put a DOI on it. So I tried to find some PIDs for the Australian magpie, and on that very trustworthy and high-quality source, Wikipedia, I found this list of PIDs at the very bottom of the page about Australian magpies. What we have here is many different possible PIDs for the Australian magpie, many of which I've never heard of before. I do know of Wikidata; Wikidata is a great project (out of scope for this talk, sorry, but have a look into it), and there's also the Wikispecies database. So a community is building a database about species and giving them PIDs of some kind. Unfortunately, we have several other initiatives that are also trying to assign identifiers to birds.

And so you might ask: all right then, we've got all these identifiers, why don't we try to build some kind of new identifier that unites all of them? Well, a talk about standards is not a talk without an xkcd comic. Oftentimes the reason why we have so many standards is this: there are, say, 14 competing standards, and somebody says, "My god, there are too many standards. What we need to do is create one standard that covers everybody's use cases." And then we have 15 competing standards. And then a new use case comes up, and, yeah, it's standards all the way down.

So there is a bit of work to be done by the community in working out how to address some of these guiding principles. DDI and NetCDF are fantastic exemplars, and in fact there will be a webinar shortly, I think in the next week, about how DDI is working on more cross-disciplinary initiatives. We will share the registration URL; in fact, sorry, it has already been shared. So sign up for that if you're interested.
I'll certainly be watching that one myself to learn more about it.

Now, I've been talking about linked data, and some of you might have heard of linked open data, but I thought I'd quickly mention what the distinction is between the two. Tim Berners-Lee came up with this very pithy description: linked open data is linked data which is released under an open licence which doesn't impede its reuse for free. We'll be talking much more about licences and reuse in the week after next, but the idea is that linked open data is linked data; the key thing differentiating it is the openness, how available it is for others to reuse.

Now, I think I'm close to running out of time. In fact, I have run out of time for speaking, so I will ask if Nicola is around, because Nicola will be facilitating our Q&A session today. Okay, now, in order for me to do this, I'm afraid I'm going to have to ask Ash to make me an organiser so that I can see the questions; GoToWebinar threw a small tantrum. So we'll just give that a moment. Yep, thanks for posting those links in the chat.

Okay, it looks like you're an organiser now.

Yes, I think it is just rebooting for me. Sorry, everybody. And we're off to questions.

So we don't actually have any questions at the moment. So if anyone has any questions... We do have a few links that were posted by Steve and by Catherine Howard, but nothing for Matias to answer. Does anyone have any questions for Matias on this topic?
I know it's quite a complex one, so maybe not one where an easily formulated question comes to mind. But if there are some harder-to-formulate questions that come to you later, we will be monitoring the Slack, so we can get into some more nitty-gritty, difficult discussion there. We will also be asking the hard questions during the community discussions next week. As with the previous modules, we'll have the quiz questions, the activities, and the community discussion questions available for you later this week, after this webinar.

Have I killed enough time, Nicola? Have any questions come in?

No, no questions yet.

Okay. Ah, we have a couple of questions now. So: is there a cheat sheet we can give researchers to get started with interoperability? Are there any good beginners' resources?

Not that I know of off the top of my head, unfortunately. There is an ARDC web page on the FAIR principles that we have just updated, so I advise having a look at that one. But the struggle with this interoperability business is that in many disciplines the problem of making data interoperable hasn't necessarily been solved, and therefore it's not possible to write a cheat sheet that provides that solution. Certainly, for example, linked open data advocates will say, "Oh, it's easy, just turn all of your data into linked open data and then it's interoperable." They say "just", implying that it's a really easy thing to do, and look, I'm sure for a linked open data specialist and advocate it is easy for them to do. But when you're used to working with tabular data, converting that tabular data into the map of connected nodes that linked open data represents, it's this new paradigm of thinking about data. Well, it's a paradigm shift: a big change in how you need to think about your data in order to represent it that way.

Yes, as
someone who definitely was working with tabular data, I can see that that is a difficult leap. Now, I have two related questions here. Are there any examples of linked data in the measurement sciences, where the units are sent along with the data stream? And then we also have: do you have some good examples of the use of linked open data for supporting interoperability?

So, good examples. I do not, I'm afraid. Although, no, sorry, that's not entirely true. I didn't want to go into too many examples during this talk, because when you look at the source of linked data, the XML, the RDF, or the JSON that describes that linked data, it can be a bit overwhelming at first glance. But we'll share some examples with you for the activity. For example, Wikidata is a source of linked data that can be used for research, and what the Wikidata project is trying to achieve is to describe the world in terms of linked data. So, for example, Wikidata has an entry, a datum, about the city of Perth, and linked to that it has all of these other things that are about Perth: the suburbs of Perth, the streets of Perth, the people of Perth, the buildings of Perth, things like that. And so if you take a look at Wikidata, you'll see that it is representing and creating this knowledge network, this knowledge graph, of the world, in an attempt to make this information more useful to people who use linked open data.

Awesome. I have another question. This one calls back to last week's discussion: can all data necessarily be presented as linked and open,
thinking about Accessible?

If you ask a linked open data advocate, then yes, all data can be presented as linked and open. But remember, linked data doesn't have to be open. You could have a data set, or a data collection, that is represented as linked data, but, harking back to the Accessible side, it's not open, because it might be sensitive health data. So you have represented that data in a linked format, but you are keeping it closed. Therefore, it's linked data, but not linked open data. And in fact, we aren't advocating that all data be linked open data; I was highlighting that linked open data is linked data that's open. It doesn't have to be open: Accessible does not require things to be open.

Brilliant. And I have a question of my own, just because we have a few minutes. I'm quite curious: in terms of the triple, that predicate (predicate is the correct word?), they themselves would need to be defined terms, right? They need to have their own vocabularies?

Yes, exactly. So one example of that is the RIF-CS metadata standard. RIF-CS is used by Research Data Australia for the exchange of metadata about Australia's research data: the people who create that data, the people who organise and look after that data, and how that data is related to workflows, to services, to activities. And so RIF-CS, behind the scenes, is based on this linked data model, and those predicates are defined, and they are part of the schema. So when you are creating RIF-CS records, there are only certain predicates you're allowed to use for it to be valid RIF-CS, because, again, a computer can't make assumptions the way that a person can.

Yep, exactly. That makes sense.
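That constraint (only predicates drawn from a defined vocabulary are valid) can be sketched generically. The relation names below are illustrative only and should not be read as the actual RIF-CS vocabulary.

```python
# Why schemas constrain predicates: a computer cannot guess what an
# unknown relation type means, so anything outside the controlled
# vocabulary is rejected up front rather than silently accepted.

ALLOWED_PREDICATES = {"isOwnedBy", "isManagedBy", "isOutputOf", "supports"}

def validate_relation(subject, predicate, obj):
    """Accept only relations drawn from the controlled vocabulary."""
    if predicate not in ALLOWED_PREDICATES:
        raise ValueError(f"unknown relation type: {predicate!r}")
    return (subject, predicate, obj)

print(validate_relation("dataset/42", "isManagedBy", "party/7")[1])
```

A record using a made-up relation simply fails validation, which is exactly the graceful, early failure the talk has been arguing for all along.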
Well, that's all of the questions that we have for today. So unless anyone has anything they want to drop in in the last minute or so, I think that we are done.

Okay, great. So thank you very much for coming, and I would like to remind you that the next webinar will be tomorrow at the times on your screen. It will be going a bit more deeply into some of these things; Liz will certainly be talking much more about metadata, whereas I was trying to focus more on the data side of things. Although in this area it gets really messy, because when you think about it, metadata is just data. And I'd also like to remind you all of that link posted to the chat about the webinar from DDI on their Cross-Domain Integration (DDI-CDI) initiative. We can also share that link in the Slack, because I suspect this chat will be vanishing once this webinar ends.

Otherwise, thank you very much for coming. Thank you for fielding questions for me, Nicola; I appreciate it. I will see the people in my community discussion groups next week, and the rest of you will see me in a fortnight's time, when Liz and I talk about Reusable. Thank you.