My name is Martin Schweitzer. I'm currently seconded to ANDS; my substantive position is at the Bureau of Meteorology, where I work in the climate section and I'm responsible for the climate database. So I work extensively with data, and particularly with climate data. The moderators today will be Andrew White, Frankie Stevens and Keith Russell, so thanks very much to them. Last week we heard about FAIR and that it stands for findable, accessible, interoperable and reusable. Today we're going to look at the last two, interoperable and reusable. So what is FAIR about? The paragraph I've quoted here comes from the preamble to the FAIR principles. It says that one of the grand challenges of data-intensive science is to facilitate knowledge discovery by assisting both humans and machines in their discovery of, access to, integration of, and analysis of task-appropriate scientific data. So what it comes down to, from my perspective, is a set of principles for increasing the value of scientific data, and in particular of research data. And the authors go on to describe FAIR as a set of guiding principles to make data findable, accessible, interoperable and reusable. I'm going to start off with a little story. I listen to podcasts, and purely by coincidence, about two weeks ago while I was busy preparing these slides, I happened to be listening to one called Revisionist History by Malcolm Gladwell. The episode was called The Basement Tapes. At the top is a link to the podcast, and just below it a link to a transcript of the episode. While I was listening to it, I started thinking: wow, this is exactly about FAIR. I'm not going to play you the whole 30-minute podcast, obviously, but I've extracted some of the vital bits. It was about a doctor back in the 60s who was very interested in diet and who posited that cholesterol caused heart problems.
His hypothesis was that eating saturated fats pushed up people's cholesterol, and that therefore they didn't live as long as people on a diet of unsaturated fats. There are two ways one can do these studies. One is to observe a population and ask them what they eat, et cetera. The second is a controlled study: find two groups of people, give one group diet A and give the other group diet B. But it's very difficult to keep people on a particular diet unless you've got control over them. So what this doctor did was go to a number of mental hospitals around Minnesota and sign them up to do the trial. Basically there were six hospitals and one nursing home, and he had more than 9,000 research subjects. The trial, putting people on different diets, ran for five years, from 1968 to 1973, and then there was a long follow-up period to observe their health afterwards. He was incredibly meticulous about the way he ran the study. They give an example: at the time there was a law that if you served margarine, it had to be in the shape of a triangle. Because this was meant to be a double-blind study, he had the law changed so that the margarine would look the same as the butter. So the study ran, but ultimately the results weren't particularly convincing. However, 20 years later, another researcher called Ramsden had a hypothesis that people eating a lot of vegetable oils are ingesting a lot of linoleic acid, which in very small amounts is good for us but in large amounts causes inflammation. So he felt that by substituting vegetable fats for animal fats, people were pushing up the linoleic acid in their bodies, and that this was actually affecting the heart, which was possibly a reason why the results weren't as expected when people went off saturated animal fats.
However, to do a controlled study, as it says here, he needed a lot of subjects. What was he going to do? Somehow convince the government to give him hundreds of millions of dollars for the study, and then find thousands of people willing to turn their diets upside down? Then he realized: wait, those studies have already been done. So this brings us back to the son of the original researcher, Robert Frantz, talking to Malcolm Gladwell. He said, and this is going back about 20 years, that universities make a decision to throw out things that they think are just taking up space and that nobody's going to do anything with. It turned out his father didn't like that idea, so he kept all his research notes, all his data, et cetera, in his basement. Towards the end of the episode they go and look in the basement for the data, and sure enough, there it was, on magnetic tapes in a mould-green box at the back of the Frantz basement. I think it's an incredible use case about research data and about its reuse. What it doesn't go into, which I'd be interested in, is how easy it was to get that data off those magnetic tapes. But that's another story. So, just continuing this recap on FAIR, some things that FAIR is not. FAIR is not a standard. FAIR does not say "if you don't follow these rules, your data isn't FAIR". It's a set of principles or guidelines that says: these are best practices, try to strive towards these goals. FAIR is not equal to RDF (the Resource Description Framework), linked data or the semantic web. Although it makes use of these things and encourages their use, it's not equivalent to using them. FAIR is not just about humans being able to find, access, reformat and finally use data. A very important part of FAIR is that machines are also able to access the data, and to operate on it. And finally, FAIR is not equal to open.
FAIR acknowledges that there are sometimes restrictions on data, so it doesn't say that data has to be open. Of course, open data does make reuse a lot easier and simpler, and I've given a citation at the bottom. The first principle we're going to look at is interoperable. The preamble to interoperable says: to be interoperable, the data will need to use community-agreed formats, languages and vocabularies, the vocabularies being the pertinent one. The metadata will also need to use community-agreed standards and vocabularies, and contain links to related information using identifiers. Last week we saw things like PURLs (persistent URLs) and DOIs (Digital Object Identifiers), and it's a very strong theme throughout FAIR that all the data we're using is somehow linkable, that we can get to it, that we have these permanent links. The other thing is that this applies both to the data and to the metadata. Interoperable has three principles. I1: (meta)data use a formal, accessible, shared and broadly applicable language for knowledge representation. I2: (meta)data use vocabularies that follow FAIR principles. I3: (meta)data include qualified references to other (meta)data. We'll work through each of these in turn. For the first one, that (meta)data use a formal, accessible and shared language, the first thing we're going to look at is ontologies, and I have a definition here: an ontology is a formal, explicit specification of a shared conceptualization. The source goes on to explain exactly what each of those terms means. By formal we mean machine readable, or formally specified. Explicit specification means that it explicitly defines concepts, relations, attributes and constraints, and I'll give an example in the next slide. Shared means that it is accepted by a group. So it's not good enough if I come up with my own ontology and think, okay, I've got an ontology for my genetic data.
I'm just simply going to use that. It needs to be accepted by a group. And finally, conceptualization is an abstract model of a phenomenon. There are a lot of ontologies out there; the source I was looking at had at least a dozen for biological data alone. I've picked out the Gene Ontology Consortium (also known as GO), the Sequence Ontology, the Generic Model Organism Project and the Ontology for Biomedical Investigations. What I did was go to GO, the Gene Ontology, search for insulin regulation, and it came up with this term. Basically, the ontology is a set of terms, and a set of relationships for each term, that give us a well-defined context for that term. It has an accession number through which we can always access it. The name is "regulation of insulin secretion". It comes from the ontology "biological process". In this particular case there are no synonyms or alternate IDs, and there are a number of links. So that's one way of looking at the term in GO. There's a second, graphical method, showing it as part of a graph. At the bottom we've got regulation of insulin secretion, and then we've got these links. The yellow link means "regulates": regulation of insulin secretion regulates insulin secretion. There are also "is a" links: regulation of insulin secretion is a regulation of protein secretion, and it's also a regulation of peptide hormone secretion. In other words, each of those is a more general term, a more general concept, and we can see how we can climb up this tree, looking either at the regulatory relationships or just at the more general terms for different concepts. Another type of vocabulary we can use is what is called a controlled vocabulary. A controlled vocabulary is slightly narrower than an ontology, and it reflects the terminology used by a particular community or domain.
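As a small illustration of what "machine readable" buys us, an ontology term and its relationships can be represented as a graph that software can traverse. The sketch below is a toy Python structure, not the real GO database: GO:0050796 is the genuine accession for "regulation of insulin secretion", but the parent entries are simplified stand-ins for illustration.

```python
# Toy sketch of an ontology as a machine-traversable graph.
# Only the first accession is real; the rest of the structure is simplified.
ontology = {
    "regulation of insulin secretion": {
        "accession": "GO:0050796",
        "is_a": ["regulation of protein secretion",
                 "regulation of peptide hormone secretion"],
        "regulates": ["insulin secretion"],
    },
    "regulation of protein secretion": {"is_a": ["regulation of secretion"]},
    "regulation of peptide hormone secretion": {
        "is_a": ["regulation of hormone secretion"]},
}

def more_general_terms(term):
    """Climb the 'is a' links to collect every more general concept."""
    found = []
    for parent in ontology.get(term, {}).get("is_a", []):
        found.append(parent)
        found.extend(more_general_terms(parent))  # keep climbing the tree
    return found
```

Because the relationships are explicit and typed ("is a" versus "regulates"), a program, not just a human reading a web page, can climb the tree exactly the way we just did by eye.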
ANDS has a vocabulary service, and the diagram on the right is a visual representation of it. The second principle was that (meta)data use vocabularies that follow FAIR principles. What this is telling us is that not only should our data and metadata be FAIR, but the vocabularies we use for our data and metadata should be FAIR as well: findable, accessible, interoperable and reusable. The important ones there are really findable, accessible and interoperable; in other words, machine readable, easily available, et cetera. The third principle was that (meta)data include qualified references to other (meta)data. It's nice having a reference to other data, but a qualified reference is a reference that expresses its intent or relationship. In the insulin example we saw that "X regulates Y", which is a lot more informative than "X is simply associated with Y" or "X, see also Y". And in this context, all datasets need to be properly cited. So, going back to those boxes in the Gene Ontology graph: each one of those is also a term that appears in the Gene Ontology, and we can look at each of them and at their graphs. Now, another little use case, one that came up in the last couple of weeks with a work colleague, and I think it gives quite a nice idea of interoperability. My colleague had come across this data, which is from the CIA World Factbook. On the left-hand side we've got the names of countries, and on the right-hand side we've got exports. What this is saying is that 28.6% of Chile's exports go to China. For Croatia, 13.5% of their exports go to Italy, 12% to Slovenia, et cetera. And he thought it would be rather nice to plot this on a map, with arrows going from one country to another.
So from China we'd have a thick arrow going to the US, a similar sort of arrow going to Hong Kong, et cetera. The first thing about this data is that it was a web page; there was no obvious way to download it in a machine-readable format. Copying the web page and putting it into a format we could use took maybe 20 minutes to half an hour. But there was a second, bigger problem. We've got these countries on the left: China, the Democratic Republic of the Congo, the separate country Republic of the Congo, and so on, but we don't know their locations. Well, it wasn't too hard to find a web page that had the name of each country and its geocenter, the lat-long of the middle of that country. This was great: all we had to do was combine the two datasets. However, when we go back, what we see in one dataset is "Congo, Democratic Republic of the", and in the other there's nothing that matches exactly; it's "Congo DRC". We had to edit the data by hand to make them the same. The same with "Congo, Republic of the": in the other dataset it's "Congo" and then "(Republic)" in brackets. And there's a more subtle case: Côte d'Ivoire, or Ivory Coast. If you look carefully at that one and then at this one, I don't know if people have noticed, but in one there's no accent on the "o" and in the other there is. A machine doesn't recognize those two names as being the same. However, on the left-hand side here we've got country codes, the international country codes, also known as ISO 3166: the International Organization for Standardization has created two-letter codes for every country (they've also created three-letter codes). So the first thing we did was go back to the original dataset and, for all these countries, replace the name of the country with the country code. We then had to do the same thing on the right-hand side.
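The fix we applied can be sketched in a few lines. This is a minimal Python illustration with made-up numbers, not the actual dataset: a join on raw country names silently finds no match because of the accent, while mapping both spellings to the ISO 3166-1 alpha-2 code first lets the join succeed.

```python
# Two sources spelling the same country differently (accent vs no accent).
# The export share and coordinates are illustrative values, not the real data.
exports = {"Cote d'Ivoire": 11.8}          # source A: no accent on the "o"
centers = {"Côte d'Ivoire": (7.5, -5.5)}   # source B: accented spelling

# Joining on the raw names silently finds no match:
matched_by_name = [name for name in exports if name in centers]

# Mapping both spellings to the ISO 3166-1 alpha-2 code first fixes it:
to_iso = {"Cote d'Ivoire": "CI", "Côte d'Ivoire": "CI"}
exports_iso = {to_iso[name]: share for name, share in exports.items()}
centers_iso = {to_iso[name]: latlon for name, latlon in centers.items()}
merged = {code: (exports_iso[code], centers_iso[code]) for code in exports_iso}
```

The point is that the hand-built `to_iso` table is exactly the tedious work a shared vocabulary would have saved us: if both publishers had used ISO 3166 codes, the join would have worked on the first try.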
As I said, putting the data into a format that was reusable by a machine took about half an hour; replacing the country names with country codes took a lot longer than that. If the original authors had used a standard vocabulary, in other words ISO 3166, we would not have had that problem. Okay. The second thing I mentioned was that the data was only available as a web page, and how you access data matters to both accessibility and interoperability. One of the best approaches nowadays is the use of APIs, or what I prefer to call web APIs (application programming interfaces), and you can divide these into two broad categories. The first is standard web services. A good example is the OGC, the Open Geospatial Consortium, which publishes standards that say: we have particular services, and this is exactly how you will send a request and exactly how the data will be returned. An example of one of these is the Web Map Service, WMS. The second broad category is non-standard web services. Say I've got some data about the birds I've observed in and around Melbourne, and I'd like to present it through an API: the Latin name, the common name, the date observed and the lat-long of each observation. There's no standard service offering that particular thing, so that's a non-standard web service. It can still be documented, and there are ways of documenting it which we won't go into, but there's a documentation standard called OpenAPI if people are interested. The next method is file download: my data is available as a file, just download it. Once again, there are different kinds of files we can download. The first kind uses an open standard; PNG, CSV and NetCDF are all examples of open-standard formats. The second kind uses proprietary formats.
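Because WMS is a published OGC standard, a client can construct a request knowing exactly which parameters the server expects. This sketch builds a GetMap URL using the standard WMS 1.3.0 parameter names; the endpoint and layer name are placeholders for illustration, not a real service.

```python
from urllib.parse import urlencode

def wms_getmap_url(endpoint, layer, bbox, width=600, height=400):
    """Build a WMS 1.3.0 GetMap request URL from the standard parameters."""
    params = {
        "SERVICE": "WMS",       # which OGC service we're talking to
        "VERSION": "1.3.0",     # protocol version
        "REQUEST": "GetMap",    # the operation defined by the standard
        "LAYERS": layer,        # a layer name advertised by the server
        "CRS": "EPSG:4326",     # coordinate reference system
        "BBOX": ",".join(str(v) for v in bbox),  # min/max lat-long box
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",  # an open format, as discussed above
    }
    return endpoint + "?" + urlencode(params)

# Hypothetical endpoint and layer, covering roughly Australia:
url = wms_getmap_url("https://example.org/wms", "rainfall",
                     (-44, 112, -10, 154))
```

Any WMS client in the world can issue this request against any conforming server, which is exactly what makes standard web services the most interoperable end of the spectrum.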
This is where it gets a little bit gray, because some companies have published their format but the licensing is a little bit murky. Think, for example, of old Word documents, which used a proprietary format: if you decided to write something that parsed such a file and looked at the information inside, you were pretty much on your own. Microsoft reserved the right to change the format at any time without informing you; they made no commitment to staying true to the format. The third way data can be accessed is by special arrangement. In other words: we don't give you any way to download this data, but if you're interested, send us an email or phone us up, and we may send you a tape, or a piece of paper, or a file in a standard format. In general, this list runs from the most FAIR, APIs and in particular standard web services, to the least FAIR, where you have to find the data, contact the company, ask for it, and it may not arrive in any open format. Okay, we're going to move on to reusable. This comes from an ANDS document: reusable is about making data available in a way that maintains its initial richness. For example, it should not be diminished down to just what's needed to explain the findings in one particular publication. It needs a clear, machine-readable license, and provenance; the key words there are license and provenance. So we'll look at the three principles of reusable data. R1.1: (meta)data are released with a clear and accessible data usage license, and we'll get to that in a minute. R1.2: (meta)data are associated with their provenance, so we know where and how the data came about. And R1.3: (meta)data meet domain-relevant community standards. On licenses: some people assume that if data has no license, anybody can use it. In Australia, this is not the case.
In Australia, if there's no license on your data, it's regarded as all rights reserved; in other words, it's under copyright, and nobody can use it unless the license explicitly says they can. If you're trying to make your data open and you don't want to get involved in all sorts of legal issues, the organization Creative Commons has done a lot of work to make the open licensing of data as simple as possible. They've created a number of legal documents, a number of licenses, to allow people, as far as possible, to find a license that suits them and their purposes. We'll come to those in a moment. The first slide simply shows what components are needed to apply a Creative Commons license, using an example of a work that's under Creative Commons: a photograph of a birthday cake celebrating Creative Commons' 10th birthday. It has a title; it has an author; it has a license, and very importantly, that license statement is linked to the actual license text and also to a machine-readable version of the license. It also has a source, with the title linked to that source. And here's an example I copied from the bottom of a page on the MIT OpenCourseWare site. The CC tells us it's a Creative Commons license. This symbol is the attribution condition: if you use the work, you must give the author attribution, in other words show its source. The crossed-out dollar sign means it can only be used for non-commercial purposes. And SA means share alike: you can modify it or make a derivative work, but if you publish or share that derivative work, the conditions under which you share it can be no more restrictive than these conditions. In other words, you can't put extra restrictions on.
So here's the text version of what I've just said. Share means you're allowed to copy and redistribute the material in any medium or format. Adapt says you may remix, transform and build upon the material for any purpose, even commercially. Attribution means you must give appropriate credit. And share alike means that if you remix, transform or build upon the material, you must distribute your contribution under the same license as the original. It goes a bit further: there can be no additional restrictions, so you may not apply legal terms or technological measures that legally restrict others from doing anything the license permits. And here's a very nice diagram. As I mentioned, there are seven different license options; at the bottom we see the most restrictive, and at the top the most permissive. If an option has a cross through the dollar sign, it means the work can only be used non-commercially. These ones say that if you release a derivative, you've got to share it alike. All of these say you need to give the original author attribution. And the top one is equivalent to placing the work in the public domain, which essentially says there are absolutely no restrictions on this work: anybody is free to use it, it's in the public domain. The next topic we're going to look at is provenance. The term comes from the art world, where provenance on a painting means that if you buy a painting and spend a few million, you want to know that it's really the original. So when paintings are bought and sold, a record is kept of each person who's owned the painting, going all the way back to the original artist. It's similar with data: if we're using some data, we want to make quite sure of it, to understand where it comes from and how it was derived. I was looking at some data a couple of days ago.
It was about how much tax the top 50 or so companies in Australia were paying: for each company it had their annual earnings and their tax. One of the problems with this data was that there was no reference back to how it had been collected, or by whom. Was it collected by the tax office? By the companies themselves, from their annual reports? By some third party with an agenda of their own? Unless we know that, it's very hard to judge the data. This diagram is a model I'm familiar with: very roughly, it's how we produce a forecast at the Bureau of Meteorology. If we have a forecast that today it's going to rain and there are going to be gale-force winds, that forecast comes from a model. AWS is our automatic weather stations; they collect observations every minute, and there are 600 of them around Australia. Those observations are quality controlled and fed into the model. There are also manual observations, each of which is quality controlled and fed into the model. And then yesterday's run of the model is fed into today's run, so the model gets fed back into itself. Now, for us it's very important, if we make a forecast, to know where every piece of data went. So what kind of provenance are we interested in? Go back to the automatic weather stations. Say we've got a station measuring temperature. What kind of thermometer is being used? It makes a difference, even if the difference is a hundredth of a degree. What kind of rain gauge? Certain rain gauges have a bias at very high volumes of rain; some have a bias if the wind is blowing while it's raining, so in strong winds they may slightly under-read or over-read. When was the equipment last calibrated? Have there been any systemic errors with any of it?
When were the batteries last replaced? Because sometimes, as the battery starts running low in an automatic weather station, there can be spikes in some of the data. When it comes to the quality control, there are questions like: who was the person who did the quality control? The QC process never changes values; it just flags them. Say we've got a temperature of 10 degrees, and a minute later the temperature is 12 degrees. That value may be flagged as suspect, because temperature very rarely rises by two degrees in one minute. And if a value was flagged as suspect, what determination was made? When the data goes into the model, the questions are: which version of the model are we running? Which software? Which version of the operating system is the model running on? So that finally, when we get to the forecast, we can always work backwards and try to recreate all the conditions that gave rise to it. It also allows us, for example if we see a lot of systemic errors in the quality control for a particular station or dataset, to ask: is there any commonality in the person who flagged these values as worrisome? Is there anything about the equipment, any commonalities there? Is it a particular type of anemometer that keeps failing, et cetera? So, here's an example from one of the files we produce. This file is available openly, and part of the provenance record actually sits inside each of these files. This is a daily temperature file, and it records on what date the file was converted to the NetCDF format, a format quite common in meteorology. Here is the name of the script that converted it, the name of the file the raw data came from, the intermediate file that was created, and the parameters that were used with the intermediate file.
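The spike check I described can be sketched very simply. This is an illustration of the principle only, not the Bureau's actual QC code: readings are never altered, only flagged when consecutive one-minute values jump by an implausible amount.

```python
def flag_spikes(readings, max_jump=2.0):
    """Return the indices of readings flagged as suspect.

    A reading is suspect if it differs from the previous one-minute
    reading by max_jump degrees or more. Values are flagged, never changed,
    so the original observations survive for later reassessment.
    """
    return [i for i in range(1, len(readings))
            if abs(readings[i] - readings[i - 1]) >= max_jump]
```

Keeping the flag separate from the value is itself a provenance decision: if the threshold or the equipment later turns out to be at fault, the raw data is still there to re-examine.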
Then there's the general configuration file, and the specific configuration file that we used. And the provenance continues: the date the file was created, down to well beyond the millisecond; the version of the software we were using, and the version of this particular piece of software; the version of NetCDF we were using, et cetera. This one's interesting: it's where the keywords come from, and it goes back to that earlier point that vocabularies should themselves follow FAIR principles. The keywords we use in our data come from what's called the GCMD, the Global Change Master Directory, which is administered by NASA and is available at this URL. Once again, another story about provenance. This time it's a blog post I came across in the last week, and as I started reading through it, I realized it was very much about provenance, even though, as you can see from the title, it's about the limits of artificial intelligence, and the author never actually mentions provenance. I'll quickly run through it. The author's wife was pregnant; he's a statistician. They went for an ultrasound scan, and there was a geneticist at the scan who said: look, there are some white dots around the heart, and these are often symptomatic of Down syndrome, and typically what we do under these conditions is refer people for an amniocentesis. But with an amniocentesis, and I forget the exact figure, but I think it's something like a 1-in-300 chance, the procedure itself can kill the fetus. Yet an amniocentesis is the only way to get a more definitive reading on whether there's Down syndrome.
The author, obviously being a statistician and also being aware of these issues, was a bit reluctant, so he tried to get more information, and in particular got some information about the ultrasound machine being used. One of the things he found was that this was a newer, much higher-resolution machine than the previous one, and he surmised that maybe the white dots were just noise from the algorithms rendering the image. He spoke to the geneticist, and she said that yes, they had noticed an upturn in the number of cases coming up positive for Down syndrome since getting the new machines. Where the story ended was that they decided not to have the amniocentesis, and they had a healthy baby. However, I think the important thing is this: if people had simply been collecting the scan results, but not the type of machine, or the operator, or a whole lot of other provenance, they may never have been able to work back and find that these machines, or their algorithms, were giving false positives. So once again, with any kind of data, and particularly research data, provenance is hugely important. I think we've got a few minutes left, so to end on a lighter note, we'll watch Neil Armstrong and Albert Einstein discussing data management. "Hello Neil Armstrong, how are you today?" "Albert, good to see you. I am not good. NASA wants me to create a data management plan for my next moon mission, but I'd rather watch cartoons." "But Neil, data management plans are not difficult to make, and they are important." "More important than watching cartoons?" "Yes." "More important than eating macaroni and cheese?" "Yes." "More important than posting pictures of my cat on Facebook?" "No. But thinking about how to manage your data now will help you later on." "Okay Albert, tell me how data management will help me later on." "Well, my friend, it is quite simple."
"Data management plans help researchers manage, share and preserve their data. Good data management practices ensure the data is well organized and well documented, so that it can be shared with others." "Why would someone else want to see my data?" "They may be curious about a detail of your experiment, or they might want to use it in one of their projects. Many funding agencies now require that you have a plan for sharing your data before they will give you money." "I like money. I use it to buy macaroni and cheese, and cat toys, and a rocket ship to get to the moon." "Managing your data well adds value and makes it easier to keep and share with others. If other researchers find your data useful, it enhances your scholarly reputation." "That sounds good. Then NASA will give me more money and I can buy another rocket ship." "Yes, exactly. Just remember, you need to think about data management early in the research process. There are important things you need to prepare for, like capturing accurate documentation and storing your data in a safe place." "Okay Albert, I will go and start working on my data management plan now." "Check out the DMP tool at dmptool.org. It will help you with your data management plan." "That's great. I need all the help I can get. But before you go, let's dance." Okay, so that's all I'm going to say. I've prepared a little quiz so people can test their knowledge of the I and the R; it's at this link. I don't know if people want to spend five minutes on it now. I'll hand over to Andy. Okay, we might go straight into questions. Yes, let's see if there are any questions, and then people can do the quiz if we've got time. I'm guessing what Gareth is asking about is presenting your data with a share-alike license. With share alike, you can do things with the data, you can create derivative works, but whatever you create has to be released under the same license. So you're not allowed to create something derivative and then keep it to yourself and not share it.
The expectation is that whatever you create as a derivative work will be shared under at least that same share-alike license. That's straightforward if you're using one piece of work as a basis and creating a derivative from it: if the original was licensed under CC BY-SA, then the derivative work I'm creating will also have to be released under CC BY-SA. It also comes up if you're actually bringing together two different datasets with different share-alike licenses. If one is CC BY-SA and the other is CC BY-NC-SA, and you mingle the two to create a new derivative work, then it would be licensed under CC BY-NC-SA, that is, under the more restrictive of the two. Does that help? Okay. Thank you. Yes, let's answer the next question.