Hi everyone. Welcome to the fourth and final module of the FAIR Data 101 course. My name is Liz Stokes. I'm from the Australian Research Data Commons, and I would like to acknowledge the traditional owners of the land on which we are meeting today. For me, based in Sydney, that is the Gadigal people of the Eora Nation. I'd like to pay my respects to elders past and present and acknowledge that this land has never been ceded. I would also like to extend a warm welcome to any First Nations people who are joining us today for this final module, Reusable. Okay, so let's get into this, front matter business first. Please use the question or chat component for any questions, if you've got any tech issues at all today, or if suddenly you can't hear me. The NBN technician visited on the weekend and, I'm pleased to announce, repaired not one but three broken cables between us and the node, so hopefully today works. I do encourage you to use the channels in the Slack after this webinar if you have further questions. You can also tweet using Fair 101 or ARDC Training on Twitter. And there is also a link to the code of conduct that we have for this course, to ensure that it remains a friendly and accessible opportunity to get into the FAIR data principles. So today I'm going to start with an extended permacultural metaphor: excellent things growing out of compost, like mushrooms and other fungi. For my first visual metaphor, let's ground reusability in a permacultural lens, which concentrates on the health of the compost and soil so that excellent things can grow. This module on reusability is going to concentrate on practical applications. The ultimate goal of the FAIR data principles is to optimise reusability.
So the umbrella principle, that data and metadata are richly described with a plurality of accurate and relevant attributes, is further defined by three sub-principles, or we could call them vice-principles if you wanted, which highlight the importance of clear and accessible usage rights, data provenance, and domain-relevant community standards in supporting reusability. So how are we going to get into these? Here are some concepts I'd like to step through in the next 40 minutes or so, and then we can have questions and answers. I will probably keep these things at a fairly high level, and I know it's probably a bit of an oxymoron to go high level and practical, but that's probably the tension we live every day. However, Matthias on Wednesday is going to expand on some of these at a greater depth, shall we say, and look at FAIR beyond data, into the associated outputs in the glorious ecosystem that is research. So in practical terms, how do we talk about reusability, and what aids data reuse? In one sense, it's all about the metadata: keeping our eyes on the prize and looking at what the metadata is exposing and facilitating. So data that is available for reuse is accessible, and these are just some thoughts off the top of my head about what that might mean. I translate that into: at the click of a button, I don't have to go deep into scrolling or any convoluted processes to actually access the data. The data is also well described; it does what it says on the tin, for example, which makes it easier to search, find and retrieve. The data is also familiar. When I'm thinking about familiarity, I'm thinking about things like formats that are in current usage, and the way the data is expressed or encoded appearing in ways that are familiar to its users. It is also easy to cite.
So it's relatively painless for me to tell you where it came from, and it is also licensed, so that the providers, the creators of that data, are very explicit about how you or I are allowed to use it. So let's start pulling these FAIR data principles apart. At one level, rich description, that metadata are richly described with a plurality of accurate and relevant attributes, is an encouragement for the metadata author, whether they are humans, machines or data librarians, to be generous with their information: generous with volume, and specific with regard to the structure of that data. For me, this brings to mind two things. Firstly, a rich or thick description; I'd like to get into a little ethnographic story. And secondly, a certain enthusiasm for machine-readable metadata schemas, or rather, documentation of metadata schemas that is machine readable and FAIR. Damn it, I probably should have put a reference into the slides; I will add it afterwards, I'm making a note to myself right away. So rich description brings to mind Clifford Geertz's maxim for anthropologists to provide a thick description in their field notes, that is, to go beyond factual or literal descriptions. He provides the example of reporting a wink. Instead of describing an eyelid stretching over an eyeball, he encourages ethnographers and anthropologists to provide the context in which that wink might have occurred, looking at the social and cultural things that are going on as well as a literal description of what is happening. I bring this up because I'm talking now, in this extended metaphor, about anthropological research practices. These highly descriptive entries and monographs of anthropological research are all part of doing that kind of research. So for other researchers to glean insight requires deep and sustained reading.
And even if this does become tedious for the human, it actually becomes impossible for the computer, which is unable to filter strings by itself unless someone has manually marked up that text or provided explicit structure to the data or digital information there. So I'm just going to park that tension here for a bit and move on to unpacking attributes, which is that aforementioned enthusiasm I have for machine-readable documentation of metadata schemas. And another nice visual metaphor. Okay. So remember that the core value of metadata is that it is structured data about data; I'll remind you of those concepts of data models that we were talking about in our previous module. Metadata assumes that the research data we are concerned with is always already structured, and this principle goes for the metadata which describes or structures the research data as well as for the data itself. So for the fair sharing of research data, these accurate and relevant attributes point towards what gives us basic information about it: information about how to find it, what it's about, and the permitted usage. Descriptions should cover both the content... Hi, everyone. We're sorry about the technical difficulties. If you will please bear with us, I will attempt to take over from where Liz left off. Thanks, Matthias. Am I back? Oh, sorry, I do not need to take over, because Liz is back. All right, Liz. Where did I get up to? You had just gotten onto this slide with the rice paddies. All right. Okay. Excellent. On with the show. Thank you. Sorry, everyone, I appreciate your patience. Clearly I did not touch wood when I extolled the virtues of the NBN. Moving right along. Okay, so let's have a look at this metaphor that I have thrown up for you. As I was saying, we're talking about the content and the context, and giving a rich description of metadata and this research data.
Sometimes those practices that are familiar to us in research are not instantly translatable to computational processes. So when we do want to do something like that, we have to take a few steps to structure our data in ways that make it possible to harness the power of technology. For example, we're making decisions that weigh up the purpose of what we're describing against the utility of describing it. We might not need to describe everything in a rich semantic ontology; maybe it's only a few components. It really depends on weighing up the costs and time and effort that are available to us. So there is a balance between describing everything and what is fit for purpose for our users, whether they are researchers or data librarians, stewards, et cetera. Sometimes it might mean that not all of the detail goes into one long notes field, for example, which, even though it's tedious, or could be delightful for some humans, is generally impossible for the computer. So here we have fields which are arranged according to the context, in the shape of a landscape. Now I'm going to really go into this metaphor. Obviously the fields are also impacting the shape of the landscape and how that landscape is exploited for agriculture. In this image, not all the landscape is structured terraced rice fields; the farmers have made a decision about how to optimise the land for farming. As you can see, it's not all one uniform or level field, literally. And in some cases you can see it's not structured at all around the borders of these structured rice paddies; we can see maybe some banana palms or other palms around, obviously very rich ecosystems themselves. So now I'd like to move into an example around Darwin Core.
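The earlier point, that one long notes field is delightful for a human but generally impossible for a computer, can be sketched in a few lines of Python. The field names and values here are invented for illustration only; they loosely echo Darwin Core-style attributes but are not drawn from any real record.

```python
# Two ways of recording the same observation.
# Field names and values are invented for illustration.

# One long notes field: fine for a human reader, opaque to a program.
notes_record = {
    "notes": "Magpie seen near Sydney on 12 May 2019, recorded by L. Stokes."
}

# Structured fields: each attribute is separately addressable, so a
# program can filter, sort, and validate without guesswork.
structured_record = {
    "scientificName": "Gymnorhina tibicen",
    "locality": "Sydney",
    "eventDate": "2019-05-12",
    "recordedBy": "L. Stokes",
}

def records_from(records, locality):
    """Return only the records observed at the given locality."""
    return [r for r in records if r.get("locality") == locality]

# The structured record matches; the free-text record silently does not,
# even though a human can see the same information inside it.
print(records_from([structured_record, notes_record], "Sydney"))
```

The notes record is not wrong, it is just invisible to the filter: the information is there, but not addressable, which is exactly the tension between thick description and machine readability.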
So I've mentioned this a couple of times: Darwin Core is a metadata schema for describing biological things, using standardised terms and reusing elements from other standard vocabularies. What Darwin Core does is use the conventions of schema documentation, which are themselves specific standards, to aid machine parsability and human readability, once you know what you're looking for. Okay, so what I'm going to do is navigate to this website here. I trust that you can see the Darwin Core basic vocabulary documentation. This documentation lists some versions of the vocabulary, but what I want to draw your attention to first is the section for the term lists that are part of this vocabulary. You can see we have some fairly standard information about the different terms that are incorporated into Darwin Core. Darwin Core actually borrows terms from the Dublin Core legacy namespace and also from the Dublin Core terms namespace, so from their terms and their elements. There are also some other lists here. And then here you can see this IRI, an internationalized resource identifier, like a URI: a persistent thing. You can see that Darwin Core define their own terms for the purpose of biological description, but they also reuse some from Dublin Core. So I'm going to click over to the Dublin Core terms and show you here again that they are providing us with some information about what they've created and how, and under this section, again, the terms that are members of this list. Here they are starting to provide us with more information about what terms they are using and what has been borrowed. For example, there is a location term, which is called Location. They provide a definition, a spatial region or named place, and note that it actually replaces a previous term that they had specified.
Okay, the modification was in 2008. The same thing, incidentally, has happened with access rights. They're using the Dublin Core term accessRights to provide information about who can access the resource, or an indication of its security status. But this documentation is showing the term that it replaced, which was previously specified in the Darwin Core term list; now they have decided to reuse a term from Dublin Core. So you can see that you don't have to make all the right decisions at the start, and I'm sure they had some very good reasons for initially choosing to have metadata about access constraints rather than access rights. And in fact, if I follow this link, it takes us to the Darwin Core quick reference guide. This is where the machine-readable documentation is complemented by much more human-readable documentation, because here in this quick reference guide, when we look at some of these attributes, we see information that's going to be much more relevant to a human interpreter of the Darwin Core terms than to a machine. Humans like to have a comment, a definition and examples; examples are really good for humans, so we know how to apply a term and what to expect. We can see that here in, for example, the modified term: it relates to the date and time when a resource was changed, and it conforms to a specific ISO standard about how to represent that time. That's the YYYY-MM-DD date, with a little time section after it. So we've got standard ways of talking about metadata that are useful for humans to interpret, use and apply, and we also have, if I went backwards, and I'm not sure how I could go backwards right now, the kinds of terms that the computer is going to want to know.
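The ISO standard behind that YYYY-MM-DD format is ISO 8601, and producing a conformant value for a term like modified takes one call in Python's standard library. The example date here is arbitrary.

```python
from datetime import datetime, timezone

# ISO 8601: YYYY-MM-DD for a date, with an optional time section
# appended after a "T", e.g. 2019-05-12T09:30:00+00:00.
moment = datetime(2019, 5, 12, 9, 30, tzinfo=timezone.utc)

date_only = moment.date().isoformat()   # just the date
date_time = moment.isoformat()          # date plus time and offset

print(date_only)  # 2019-05-12
print(date_time)  # 2019-05-12T09:30:00+00:00
```

Emitting the standard form from code, rather than typing dates by hand, is a cheap way to guarantee every record in a collection is machine-comparable.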
So the computer wants to know a bit more about the structure: whether something is a class or a property, and, at the semantic level, how the terms relate to each other, because that determines what values can be assigned, how often a field or element can be repeated, and whether it's mandatory or not. So the usage of that metadata description has implications for the things we might want to do when managing research data at scale. Decisions that, for example, a data repository might want to make when considering bulk ingest of records, or transforming records so that they can apply some long-term preservation work: this is where it matters what the rules are for managing that data. So for metadata to aid reusability, we really like it to be very clear and very accurate when it comes to identifying attributes of that data. This is getting really circular, all this referencing, isn't it? Okay, so let's move on. Great, that's Darwin Core. Now it's time to talk about licensing. Oh, better speed up. First I'm going to address licensing, and then data citation. When the aim is actual reuse of research data, this principle encourages us to be clear about how people can do that. Where a data citation tells you where you got data from, and can aid provenance in that way, a licence sets out your expectations for others to follow. I just have to acknowledge that this is probably the most meta of slides I could ever give you right now, and I'll try not to get us lost. Many licences actually feature attribution as an expectation of usage, and as this slide from Quill West says, the purpose of attribution is to give credit to the original creator of something you are using; it relates to the thing, and is a legal requirement of using openly licensed works. This speaks to the fact that a licence is a legal instrument.
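Going back for a moment to those schema rules, which fields are mandatory and how often an element may repeat: they are exactly what a repository would check at bulk ingest. A minimal sketch of such a check; the rules and field names here are invented for illustration, not taken from Darwin Core or any real schema.

```python
# Invented schema rules for illustration: each field says whether it is
# mandatory and whether it may hold more than one value.
RULES = {
    "scientificName": {"mandatory": True,  "repeatable": False},
    "recordedBy":     {"mandatory": False, "repeatable": True},
    "eventDate":      {"mandatory": True,  "repeatable": False},
}

def validate(record):
    """Return a list of rule violations for one metadata record,
    where each field maps to a list of values."""
    problems = []
    for field, rule in RULES.items():
        values = record.get(field, [])
        if rule["mandatory"] and not values:
            problems.append(f"missing mandatory field: {field}")
        if not rule["repeatable"] and len(values) > 1:
            problems.append(f"field must not repeat: {field}")
    return problems

ok = {"scientificName": ["Gymnorhina tibicen"], "eventDate": ["2019-05-12"]}
bad = {"scientificName": ["A", "B"]}
print(validate(ok))   # []
print(validate(bad))  # repeated name, missing eventDate
```

At the scale of bulk ingest, this kind of mechanical check is only possible because the schema documentation states the rules explicitly, which is the point of machine-readable schema documentation.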
So although licensing research data tends to come up at publication points, research data could be licensed during any part of the research life cycle: during planning, or when negotiating with potential collaborators, for example. I'm going to concentrate on Creative Commons licences, after I make a few notes about Australian copyright law, drawing from the ARDC research data rights management guide. But as you can see, I am reusing this particular slide, which is from Citations versus Attributions by Quill West. They have licensed it under the Creative Commons Attribution licence, CC BY 4.0, which is an international licence. And you can see that even on this slide there is a picture of a LOLcat, which has been attributed and is under a Creative Commons Attribution-ShareAlike 2.0 licence. They even acknowledge that this was a derivation from the original work, which was pretty much the picture of the cat without "oh hi, I open source this for you". Okay, back to Australian copyright law. Look, it's complicated, but it's a fun time. The conventions of academia to comply with copyright have developed into citation and attribution practices. And while it is true that Creative Commons licences can only protect material in which copyright or similar rights exist, there are two important considerations at play. Firstly, strict determination of whether copyright subsists in a data set can be complicated, and some data sets will definitely attract copyright. Secondly, for those data publishers and researchers who wish to broadly share their data, protection is not the primary objective in their selection of a particular licence or rights statement. Rather, in that case, the dual objectives in the selection of a licence are, or should be, to unambiguously declare to everyone that the data can be reused, and to indicate that the licensor would like to be attributed when someone does so.
So the bottom line is: regardless of whether copyright exists or not, you can still apply a licence to instruct how people might use the data you are making available. As you can see on this slide, Australian law doesn't recognise copyright in machine-generated data, but it does recognise the impact of human authorship, which is demonstrated creativity in the selection and arrangement of data. So if you have some raw data and you have analysed it, corrected it, reformed it, made modelling choices, this may actually influence whether or not copyright subsists in that data set, but it is always decided case by case in Australian law. And the final important point is to know that rights in data usually rest with the creator of that data. This is why we advocate for the use of licences: when we want to be sure we have a method of giving people permission to reuse the data, we don't want to be resting on the conventions of citation alone. The Creative Commons suite of licences covers varying levels of usage, which you can bolt on to your data assets, and it's important to acknowledge that they don't waive or replace copyright. The image we have here is the Attribution licence, which is quite popular because it's relatively easy to apply and maps well to standard academic citation practices. Behind the CC Attribution licence is a legal instrument that works internationally; that's version 4.0 there. To apply the licence, you display this image and the words below, which link to a human-readable version, a machine-readable version and the legal instrument, which you are welcome to read. Okay, so let's have a look at what some of these licences look like in the wild. I'm using an example here from the Australian Ocean Data Network portal, and this is a data set about Australian phytoplankton.
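On a web page, "displaying the image and the words" usually means pairing the human-readable statement with a machine-readable link to the licence deed. A sketch of generating such a statement: the dataset title and author are placeholder values, while the linked URL is the real CC BY 4.0 deed.

```python
def cc_by_statement(title, author):
    """Build a licence statement whose rel="license" link lets machines
    discover the licence while humans read the words around it.
    Title and author are placeholders supplied by the caller."""
    deed = "https://creativecommons.org/licenses/by/4.0/"
    return (
        f'<a rel="license" href="{deed}">CC BY 4.0</a>: '
        f"{title}, by {author}."
    )

print(cc_by_statement("Example phytoplankton dataset", "A. Researcher"))
```

The same one-line statement serves both audiences the slide describes: a harvester follows the rel="license" link, a person reads the sentence.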
As you can see, they're using the Attribution licence, CC BY, and the portal actually licenses all data within their repository as CC BY in their data use acknowledgement statement. Over here in the metadata record, and I'm providing a screenshot of that, are some additional constraints on attribution, depending on the parts of the data set that you might be using, and how to provide attribution. Another example is from the Atlas of Living Australia, and this is the record for the Australian magpie, I should clarify, now that I know the difference, roughly: Gymnorhina tibicen. The Atlas of Living Australia gets data from a lot of different data providers, and they are all welcome to use different licences. This particular image, which is attached to the magpie occurrence record, is licensed under a CC Attribution-NonCommercial licence, that's what NC stands for, by a contributor called Wingspanner, and that is just for the use of the image; that is not the whole record, there are other components there. So if I do that special screen sharing thing again, okay, let's try not to make this terrible. Here we are. Looking at the actual record, let's scroll down so you can see on this page that the provenance of who is providing data to this record is shown by these little "provided by" links to the data sets being contributed by various data partners. Actually, if I scroll up, you can look over here under the data partners tab, I'll just click that now, and we can see which data partners are providing which data sets and under which terms, and you can see a whole range of Creative Commons licences: sometimes they have non-commercial limitations on them, sometimes they only specify attribution. Okay, cool.
So this is a nice segue into provenance, and I'm probably going to finish up with this look at provenance, and then I'll let Matthias take you deeper into the domain-relevant community standards. So what does provenance mean? Ultimately, I think it's about asking what is useful for the users, the researchers, to know about how the data was created. Often it's not until somebody goes to actually reuse someone else's data that they realise what is practically useful to know about how the data was created or generated, and what processes have been applied to it. So provenance is something that allows people to trust data: they know where it comes from and how it was created, and they can be aware of limitations. For example, if we're thinking about a temperature sensor, it might only take measurements in whole degrees. Now, we all know that temperature changes in fractions of a degree, so if you had a data set that only reported temperature at particular times in whole-degree terms, you would need to be careful about visualising that data, and about the implications for further analysis when the data had been normalised in that way. Here is a tale of two sensors, and I would like to acknowledge that this example is actually from Matthias, if you are wondering about its provenance, from when we were talking about how we talk about reusability and what is helpful to know about sensors that may have been used in the collection of data. These sensors collect data on humidity and temperature readings: the DHT11, the blue one, and the DHT22, the slightly larger one. They have different ranges of humidity readings, they are optimised for different temperature ranges, and they also perform differently: they have different rates of accuracy and sampling rates, and there is a slight cost difference.
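A sketch of what recording that sensor provenance alongside a reading might look like, using two commodity humidity and temperature sensors, the DHT11 and DHT22. The specification figures below are indicative values from commonly published datasheets and should be checked against the manufacturer's documentation before being relied on.

```python
# Indicative specifications only; verify against the datasheet.
SENSORS = {
    "DHT11": {"temp_range_c": (0, 50),   "temp_accuracy_c": 2.0,
              "humidity_range_pct": (20, 80), "sampling_hz": 1.0},
    "DHT22": {"temp_range_c": (-40, 80), "temp_accuracy_c": 0.5,
              "humidity_range_pct": (0, 100), "sampling_hz": 0.5},
}

def provenance(reading_c, sensor_id):
    """Attach the exact sensor model and its accuracy to a reading,
    so a reuser knows what error bars to infer."""
    spec = SENSORS[sensor_id]
    lo, hi = spec["temp_range_c"]
    return {
        "value_c": reading_c,
        "sensor": sensor_id,
        "accuracy_c": spec["temp_accuracy_c"],
        "in_calibrated_range": lo <= reading_c <= hi,
    }

print(provenance(21.5, "DHT22"))
```

With the sensor identifier travelling alongside every value, a reuser can look up, or compute, exactly how much precision the data can legitimately support.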
So when sensors, or any kind of research instruments, have similar names, the provenance is important, because different capabilities will produce different results. If we were using this particular sensor in a data collection activity, it would be very helpful to record the exact sensor name and then link out to the properties or attributes of that sensor, because then we would know what degree of accuracy we can infer from the results of the data the sensor collects. So our decisions, in terms of the FAIR principle of reusability, are about being clear and accurate when it counts. I'm going to skip this one, I think, in the interest of having a chat and answering any questions you might have, but I'll share the slides after this and you can have a little look through. This is a nice example of how changes in the actual processing and analysis pipeline mean that research teams can get very different results from the same data set. So it's probably time for me to wrap up right now. There are some links, and it's a good read. Hey, Matthias. Hey Liz, thank you very much for that, and thanks for a graceful recovery that saved me from having to deliver your presentation. We do have some questions in, but there is time for more questions than the number we have, so please do type your questions into the question module as Liz and I address these first ones. Okay, so back at about the 12-minute mark, when you were talking about the relevance of metadata attributes, someone in our audience says that sounds highly subjective, and asks whether we have any guidance for how broadly they should think with respect to assessing relevance.
Ah, I'm going to be very candid. My candid answer is: what can you be bothered with, right? Perhaps the more prudent response would be: what is fit for purpose? You don't have to collect all of the metadata, but what is the metadata that really counts? For example, for a repository to provide a reasonable finding aid to the contents of their data repository, how much "aboutness" do they need to know about the research data in order to make it easy for people to find the stuff that's in there? And also, this is a pun, it's going to happen: how fair is it on researchers and data stewards, data curators, people contributing data to repositories, to ask them to provide extensive descriptive metadata about what they're producing? So you've got to balance it up, and often I think you want to be looking first at what metadata you can automatically pull from other organisations or other enterprise systems; it's the last resort to ask the contributors to put in extra data themselves. And it also depends on the community standards, what people find acceptable to provide. Okay, great, thank you. Next we have a jargon-busting question. It's possibly a little confusing, in that list from the ALA, how there are all those different kinds of CC licences, so this particular question asks: is CC BY a different licence to CC BY 4.0? Yes, I will expand. There are different versions of licences. As the practice of openly licensing outputs develops, there are different instruments which work according to different legal jurisdictions. For example, the Creative Commons Attribution licence version 3.0 works in an Australian context; version 4.0 is the international version of that licence, and it also happens to be the latest. So people who've come to a
position where they've gone, "Well, you know what, let's just apply the international licence, because then it will just work everywhere and we don't have to worry about gatekeeping and geographic borders," that is just one step too far. Great, thank you. Okay, another question here: would you say there is a preference when choosing a Creative Commons licence for data sets, especially when we want data to be open? Does it depend on the researcher's preference or choice? Yes. What I haven't talked about at all are institutional policies around intellectual property, and how that plays into who is providing the data and for what purposes. This is a familiar tension for many of you publishing student and HDR theses, and managing the different rights there. But go back to the question, because I think I was coming to the point but I've forgotten it. Yep, so my understanding is: is there a preference for choosing a Creative Commons licence over perhaps any other kind of licence when it comes to data sets? Yeah, I would recommend the Creative Commons licences because they are straightforward and they work well for the purposes of sharing data. Sometimes your data, or your resources, could be really, really old, so in fact copyright may not even come into it, and you may be able to use something like a public domain mark, which is another thing the Creative Commons licences include, for putting things in the public domain. But hey, there are weeds there, and I can see Matthias starting to feel anxious. Yep, okay, so we've got one last question, and I might handle this one, if that's okay, Liz. So, Liz, you asserted that rights in data usually rest with the creator. Can an institution assert their right to IP for data generated by academics in their employ? Also, could a funding body assert the same as part of the employment or funding contract?
Now, the reason why I wanted to answer this one is because I have been through this process. For an academic, as an employee of a university, any IP they generate during the course of their work would naturally fall to their employer unless a contract has been signed saying otherwise. For example, many institutions will allow their academics to hold the IP of their research outputs, their publications, sometimes even their teaching materials, but these agreements don't generally cover data. And in fact, in the past I have signed an extra contract: I had my employment contract, but on top of that, when I worked on a particular project, I was asked to sign an extra piece of paper that explicitly stated that the outputs of the project belong to the institution. Now, strictly speaking, that second bit of paper wasn't necessary, but it was certainly an instrument that the institution wanted to use to protect its own IP. The same goes for funding contracts as well. For example, the ARC does specify that publications should be released under a particular licence, or should be made openly available, and other funding bodies do the same. They don't necessarily go as far as saying that they own the research outputs, but they certainly do stipulate a particular kind of licensing, or access, that should be used. Did you have anything to add to that, Liz? Nope, I think you handled that wonderfully. Okay, all right, that's actually all the questions we have, and I am sorry we ran a little bit over time, but I will hand over to you, Liz, to wrap up. Oh, that's it, everyone. Matthias will follow up on Wednesday with a bit more detail on community standards, looking at reusability with reference to reproducible workflows, so stay tuned for that. And we will have quizzes and activities ready for you for Wednesday, I hope. Okay, see you later. Bye.