Good morning, everyone. My name is Matthias Liffers and I'd like to welcome you to this webinar. Before we start, I'd like to acknowledge the traditional owners of the lands on which we all are today. For me in Perth, that is the Whadjuk people of the Noongar nation, and I'd like to pay my respects to their elders past and present. Another important thing to remember is that the FAIR Data 101 course is governed by a Code of Conduct. This Code of Conduct is really important to make sure that everyone has a fulfilling learning opportunity over the next eight weeks. There's a link there for you to see the Code of Conduct, and if at any time during this course you observe a breach, please contact us using the form that is linked in the code. OK, so welcome to FAIR Data 101. It's been a bit of a rollercoaster putting this together, taking advantage of this time. Now I would like to introduce my colleague Liz Stokes. Hi, Liz. She will also be presenting the coursework for this course, but will be largely presenting from Wednesday. So today, unfortunately, you have to put up with my voice for the next three quarters of an hour.
I'm just going to turn my video off and let you take the floor, Matthias, but I'll come back in at question time.
OK, great. Thanks, Liz. And it's not just Liz and me who are bringing this course to you. There are a number of other people at the ARDC who have all been working very hard, and you'll have an opportunity to meet them over the coming weeks. Now, what are we going to cover today? First up, there's a little bit of housekeeping about what to expect from the FAIR Data 101 course. I'll then be giving a quick introduction or overview to the FAIR Guiding Principles, which the next eight weeks are going to be all about, as well as why the FAIR principles came about. And then I will start talking a little bit about the first of the FAIR Guiding Principles, namely Findable. Liz will continue the presentation about Findable on Wednesday.
So, housekeeping for the course. Over the next eight weeks, there will be four modules, each module on one of the four aspects of FAIR, and each module will run over two weeks. In the first week of each module, there will be two 45-minute webinars. At the end of webinar two, we'll also give out an activity sheet, which will hopefully keep you busy for around 30 minutes as you work your way through it. There is also a quiz, and you should be able to find the answers to all the quizzes in the webinars, in the activities, and maybe in any readings that we give you. Then, in week two of each module, there will be a 50-minute community discussion. There are a number of options for that, and we'll be going through those options later. There is also a Slack workspace available, and you can join that at tiny.cc/fair-101-slack. Now, if you are not familiar with Slack, Slack is a chat tool used by many workplaces around the world, and not just workplaces, but also projects and simply groups of people who want to chat with each other. A single Slack workspace has more than one channel available within it. So, when you click on the Slack invitation link and sign up, you'll find yourself in Slack, and you'll automatically be made a member of two channels, #general and #introductions. Please introduce yourself in #introductions and say hi to everyone. Most of the conversation will probably be happening in the #general channel. If you're not so keen on keeping Slack open all of the time, which is perfectly okay, you can enable email notifications for Slack. In fact, if you go to that URL there, once you're logged in to Slack, that is, you'll be able to change your email notification settings. Oh, and all these slides will be available after this presentation, so you don't necessarily have to furiously write down any URLs that you see today.
Okay, let's get right into the material. So, the FAIR Guiding Principles were first proposed, I suppose, four years ago today, well, not today, about four years ago, when a group of researchers and data stewards got together and wrote this paper that was published in Scientific Data. It's fully open access, so you can have a read of it. They suggested that simply espousing good data management practices, so simply saying to researchers and research support professionals "you need to manage your data well" without providing any actual detail, wasn't enough. The authors suggested that good data management could be broken down into a number of principles, with some clear suggestions as to how you could fulfil each of these principles. Now, the guiding principles were an evolution of the open data movement, which has been around for longer than the FAIR principles, but simply calling for open data all the time was a little bit problematic, which is one reason why these FAIR principles came about. So, the FAIR principles are much more nuanced than calling for open data. In the next module, you'll learn more about the accessible component of FAIR, which is more along the lines of "as open as possible, but as closed as necessary", while making sure that data is available and accessible somehow, even if it's not openly available. The FAIR Guiding Principles are, well, there are quite a lot of words to them, but they are quite clear in suggesting the best practices for making data FAIR, and they are certainly a lot more useful to your average researcher than simply saying to them, "make your data open". Another thing that the FAIR Guiding Principles talk about quite strongly is that data shouldn't just be FAIR for humans, it should also be FAIR for machines.
Because as we find ourselves in an age when research is becoming more and more computationally intensive, it is important for machines or computers to be able to access data as cleanly and seamlessly as possible, so that the computers can just get on with their work, with their number crunching, without humans having to find the right data set, download it, upload it, change it to a different file format, what have you. So, what exactly are the four FAIR Guiding Principles? We have findable, accessible, interoperable, and reusable. So, four principles, four modules in this course, and today and on Wednesday, we'll get really deep into findable. So, to paraphrase the principles a bit, or maybe expand on these words. Firstly, findable. Data used in research should be findable somehow. You should be able to work out that it exists and should also be able to work out where that data might be. There are a number of ways in which you can, not necessarily ensure, but certainly take as many steps as possible to make sure that your data is findable. Secondly, data needs to be accessible. So, once you've found where the data is, or even found that the data exists, you need to be able to access that data somehow. It could be that the data is fully open and it's a case of downloading the data to your computer and working on it that way. Or it might be that there needs to be some kind of mediation, because it's not always appropriate to make data fully open, so you might need to contact a human to get permission to access that data. The accessible module will talk about that, as well as the different ways in which data can be accessed in technical terms. Thirdly, data should be interoperable, and interoperable is possibly the one that people struggle with the most. What does interoperable mean? What interoperable is trying to achieve is that you get two datasets that were collected in, hopefully, a nice systematic way.
And it is relatively straightforward for you to be able to combine them or use them together. Or perhaps you have a dataset that can be analysed in one piece of software, but it can also be analysed in another piece of software without much work, because that data has been recorded in a systematic and standard way. We'll get more into that in a few weeks. And finally, reusable. Once you've found and accessed your data, or even once you've produced your own data, hopefully that data can be made reusable, to maximise the investment made in collecting or creating that data. So that's something to look forward to in about seven weeks' time. Now, I apologise for having a wall of text, but this is something I occasionally have arguments with people about. The FAIR Guiding Principles, which many people call the FAIR data principles, weren't originally coined to be just about data. The authors of the Wilkinson et al. paper intended that these principles be applied to everything that leads to the data: the algorithms, the tools, the workflows, the software, the procedures and processes. All scholarly digital research objects benefit from the application of these principles. Now, in the four years since the authors came up with the guiding principles, it's become apparent that for some research objects it's not quite as straightforward to work through the FAIR principles to make them FAIR. In some cases, software for example, there needs to be a little more thinking, or a little bit of different thinking, because sure, software and data are both digital objects, but the way that computers use them and interact with them, and even the way humans interact with them, is different. Therefore, the principles possibly need a little bit of revision, and we'll be talking about these kinds of things in future weeks as well.
So that being said, a very big "but" is that we'll largely be talking about FAIR data over this course, except when I occasionally let myself be distracted. Okay, so findable, F, what does that mean? To expand on that a little bit, metadata and data should be easy to find by both humans and computers. Now, findable has been broken down further into four more specific guidelines. I'll be talking about one of those today, and then Liz will be covering the others on Wednesday. So here we go, F1. It's the first of the principles, so arguably the most important of them all, and many other principles tie into this one: metadata, sorry, metadata and data, are assigned globally unique and persistent identifiers. Which brings me on to a quick note. When this principle talks about metadata or data, we suggest it is acceptable for data and its associated metadata to have the same PID, or persistent identifier. So there's no need to create one PID for the data set and then another PID for the metadata, and Liz will talk more about how PIDs and metadata interact. Okay, so I said a globally unique and persistent identifier. What does that actually mean? So here's an identifier. It is, I've forgotten how many characters, it is 13 characters in hexadecimal. So each of those 13 characters has 16 different possibilities. I did the maths a couple of weeks ago, and 16 to the power of 13 is about four and a half million billion different possibilities. So with any randomly generated 13-character hexadecimal number, there's a pretty good chance that it's actually going to be unique, and it's going to take quite some time for the same randomly generated number to come up again. However, in and of itself, it's not guaranteed to be globally unique. And so what do I mean by globally unique?
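The arithmetic here is easy to check. The following is a minimal Python sketch, assuming identifiers are drawn uniformly at random; the function names are made up for illustration and are not part of any ARDC or DOI tooling. It counts the possible 13-character hexadecimal identifiers and uses the standard birthday approximation to estimate how likely a repeat is after a given number of draws.

```python
import math
import secrets

HEX_ALPHABET_SIZE = 16   # each character is one of 0-9, A-F
ID_LENGTH = 13           # length of the identifier described in the talk


def identifier_space(length: int = ID_LENGTH) -> int:
    """Number of distinct hexadecimal identifiers of the given length."""
    return HEX_ALPHABET_SIZE ** length


def random_hex_identifier(length: int = ID_LENGTH) -> str:
    """Draw one random identifier (hypothetical helper, for illustration)."""
    # token_hex(n) returns 2*n hex characters, so ask for enough and trim.
    return secrets.token_hex((length + 1) // 2)[:length].upper()


def collision_probability(n_ids: int, length: int = ID_LENGTH) -> float:
    """Birthday approximation: chance that n_ids random draws contain a repeat."""
    space = identifier_space(length)
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))


if __name__ == "__main__":
    print(identifier_space())                # 4503599627370496, about 4.5e15
    print(random_hex_identifier())           # a different value on every run
    print(collision_probability(1_000_000))  # roughly 1 in 9,000
```

Even with about 4.5 million billion possibilities, a million random draws already carry a small but real chance of a repeat, which is why randomness on its own does not guarantee global uniqueness.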
So, like a snowflake, a globally unique and persistent identifier, PID for short, really does absolutely have to be globally unique. The reason for that is so that we can be guaranteed that it is absolutely unique to a particular data set or other digital research object. That is incredibly important when you're trying to tell a computer about a data set, because computers, when it comes down to it, aren't very smart. They are quite fast, and they can process information much faster than a human can. But when it comes to things like looking at context and making what you could call an educated guess, computers aren't that great. Unless you're talking about machine learning algorithms, which we aren't, so let's just stick with that. So we really, really want our identifiers to be globally unique. That is to say, in the entire world, a particular identifier can only be assigned to one data set. So how can we do that? There is pretty good consensus as to how we achieve that these days, and that is in the form of a DOI. Now, hopefully everybody is familiar with the concept of a digital object identifier. They've been used for quite some time now when it comes to uniquely identifying research outputs like journal articles. And over the past, actually it's probably been close to 10 years now, they've been applied to data sets as well. In fact, the ARDC and one of its precursor organisations, the Australian National Data Service, have been providing a DOI minting service for Australian research organisations. So this DOI, for example, is from that service, and it combines a few different elements together. You'll recognise the number at the end there, because that was the identifier I showed before. But this time we've added a little bit more information to ensure that it is globally unique. So first up, we have this 10, or 10 dot. For people who are familiar with PIDs or even DOIs, this 10 dot something will stand out as being a DOI.
Now, you might also sometimes hear about persistent identifiers called handles. For people who like getting technical, DOIs are a kind of handle, but they're very specifically handles that have this 10 dot prefix. Now, within this DOI, after the 10 dot, we have a few more digits. So 4225 says that this is a DOI generated by the ARDC. Don't worry, this one's not in the quiz. You don't need to memorise this. The 06 means that it is a DOI from Curtin University. And then that number at the very end, you could consider to be a local identifier. That is to say, it is definitely unique within the local Curtin University context, because the system makes sure that if it does randomly generate the same number again, it won't use it, because it's already being used. By combining all of those elements together, we come up with this globally unique identifier. Now, I just realised I've been labouring on about globally unique for quite some time, but we also need to talk about this idea of persistence. So the identifier needs to be persistent in the same way as some of the smells that come from my dog. They need to hang around for quite some time. Now, that's not necessarily forever, because forever is a very, very, very long time, and it is quite unlikely that the DOIs that we're minting today are still going to exist in a billion years, or indeed whenever it is that the sun expands and consumes the earth. However, persistence means that the identifiers will continue to be available and will continue to reference the same material for the foreseeable future, and in fact for as long as is necessary, which requires a couple of elements in and of itself. So to guarantee the persistence of an identifier, you need to make sure that the infrastructure is there to handle that.
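The layering described above (the 10.4225 prefix, the 06 segment, then a local identifier that is checked against those already in use) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-part layout is the one described in this talk, DOI suffixes in general are opaque strings with no required internal structure, and the `Doi` class, `parse_doi`, `mint_local_id` and the sample identifier `0123456789ABC` are all made up for the example rather than taken from the ARDC service.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Doi:
    prefix: str     # "10." plus registrant code, e.g. "10.4225"
    namespace: str  # institution segment used in the talk's example, e.g. "06"
    local_id: str   # identifier that only has to be unique locally

    def __str__(self) -> str:
        return f"{self.prefix}/{self.namespace}/{self.local_id}"


def parse_doi(text: str) -> Doi:
    """Split a DOI of the assumed form prefix/namespace/local_id."""
    prefix, _, suffix = text.partition("/")
    if not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI: {text!r}")
    namespace, _, local_id = suffix.partition("/")
    if not namespace or not local_id:
        raise ValueError(f"expected prefix/namespace/local_id: {text!r}")
    return Doi(prefix, namespace, local_id)


def mint_local_id(already_used: set, length: int = 13) -> str:
    """Draw random hex identifiers until one is unused, then reserve it."""
    while True:
        candidate = secrets.token_hex((length + 1) // 2)[:length].upper()
        if candidate not in already_used:
            already_used.add(candidate)
            return candidate


if __name__ == "__main__":
    used = {"0123456789ABC"}                  # hypothetical existing identifier
    new_doi = Doi("10.4225", "06", mint_local_id(used))
    print(new_doi)                            # 10.4225/06/ followed by 13 hex characters
    print(parse_doi("10.4225/06/0123456789ABC"))
```

The local check only has to guarantee uniqueness within one institution's namespace; it is the prefix and institution segment in front that lift the result to a globally unique identifier.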
But at the same time, you need to have some governance behind that infrastructure, or on top of that infrastructure, to make sure that the infrastructure is being managed and looked after. I'm not sure how many of you have had experiences with systems coming online and being made available, but then nobody had the resources, and generally it does come down to resources, to maintain that infrastructure and keep it going for as long as it's useful. So we end up with websites that become outdated, or databases that stop working because the software has moved on, but that original database would require too much effort, or more effort than is available, to update and bring into the future, or indeed just the current times. So when it comes to selecting a PID for your data or your research object, it's good to find one where you stand a good chance of it being both globally unique and persistent, so that the DOIs or the handles or the ORCIDs that you create today will continue to work for as long as you need them to work. Now, at the ARDC, we absolutely do recommend particular PIDs for particular kinds of research objects. So, for example, when it comes to PIDs for humans, so a researcher, an author, somebody who contributes to research and generates research outputs, there are a few different identifiers available, but really when it comes down to it, there is one de facto standard, although you could possibly say now it's more than de facto, because many, many organisations around the world have adopted this as their chosen identifier. So we have the ResearcherID and the Author ID, which are both owned and controlled by for-profit corporations, which, look, there's nothing wrong with that. We all use infrastructure that is created by for-profit corporations. But the identifier that the ARDC really does recommend is the ORCID, the Open Researcher and Contributor ID, which is owned and controlled by a member-based organisation.
And the reason why that's an advantage is that, say, a university can say, all right, we're going to ask all of our researchers to create an ORCID, but we're also going to become a member of ORCID to help make sure that that governance persists as long as the ORCIDs need to persist. Who knows, maybe in 20, 50, 100 years' time, there'll be some better way of handling this kind of identification, but for now, for the foreseeable and workable future, any organisation can join ORCID and any organisation can become a part of that governance that guarantees the persistence. Here in Australia, there is an ORCID consortium that is led by the Australian Access Federation. Most Australian universities are members, so I strongly recommend you go and look at the AAF's website and learn more about that ORCID consortium. So, ORCID is the one we recommend out of a bunch of PIDs available, just for one kind of thing involved in research: humans. Well, humans aren't things. So for one of the components of research, there are lots of different options for PIDs, but when it comes down to it, there's this particular one that we recommend. And the same goes for lots of other kinds of components of research. So, for people we've got ORCIDs, but then we can, and possibly should, also identify projects, digital objects, physical objects like samples, and equipment. There are lots of different components of research. And, okay, so this is FAIR Data 101, why are we now talking about PIDs for all of these other different things? So, a PID helps you unambiguously identify a data set, but it's also probably quite useful to unambiguously identify the humans who were involved in creating that data set. Now, I was thinking of putting my own record up, but I discovered I'm the only Matthias Liffers in the world, so I'm possibly a poor example of that.
But by attaching an ORCID to researchers, you can unambiguously identify that person, and you can also create a machine-readable link between a data set and a human, and between humans who have worked on the same data set or the same paper. That machine readability is really useful because, for a machine, a name is really hard to understand. Now, we've also got identifiers for projects, so we can make sure that particular projects are linked with the data and the publications and the humans. That's actually quite useful for the management of research as well. So, the Australian Research Council, which funds most of the research in this country, can have identifiers on projects and know what's coming out of a project that has been made FAIR and available. And then, data sets, physical objects, equipment, I could go on, but I won't, because we only have limited time today. So, what are some of the PIDs that we do recommend and that are available? I've already spoken through that one. Okay, so, for people, we strongly recommend everyone has an ORCID. In fact, that is one of the activities for this module. If you don't already have an ORCID, please go and create one. It takes only a couple of minutes and it's completely free of charge. For projects, there is RAiD, the Research Activity Identifier. The ARDC was instrumental in getting this one off the ground, and in fact it still is. So, go to raid.org.au to learn more about those. For digital objects like data sets and software, we recommend the good old DOI, and the ARDC has a minting service for that. But then, when you're looking at physical objects like samples, there is the IGSN, which is the unfortunately named International Geo Sample Number. It arose when some geologists wanted to be able to unambiguously identify rock samples or mineral samples, or geological samples, should I say, but it has since begun to branch out into other domains as well.
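Part of what makes an ORCID iD easier for a machine to handle than a name is its fixed form: 16 characters, the last of which is a check digit computed with the ISO 7064 MOD 11-2 algorithm. The sketch below, in Python, validates that check digit and then builds a simple index from people to data sets, which is the kind of machine-readable link described above. The function names and the `10.1234/example-...` DOIs are hypothetical, and the two iDs are checksum-valid sample values used purely as test input.

```python
from collections import defaultdict


def orcid_check_digit(first_15_digits: str) -> str:
    """Compute the final character of an ORCID iD (ISO 7064 MOD 11-2)."""
    total = 0
    for ch in first_15_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the length, digits and check character of a hyphenated iD."""
    compact = orcid.replace("-", "").upper()
    body, check = compact[:15], compact[15:]
    if len(compact) != 16 or not (body.isascii() and body.isdigit()):
        return False
    return orcid_check_digit(body) == check


def datasets_by_person(records: dict) -> dict:
    """Invert {dataset DOI: [creator iDs]} into {creator iD: {dataset DOIs}}."""
    index = defaultdict(set)
    for dataset_doi, creators in records.items():
        for orcid in creators:
            if not is_valid_orcid(orcid):
                raise ValueError(f"invalid ORCID iD: {orcid!r}")
            index[orcid].add(dataset_doi)
    return dict(index)


if __name__ == "__main__":
    # Hypothetical records; the DOIs below are placeholders, not real data sets.
    records = {
        "10.1234/example-a": ["0000-0002-1825-0097", "0000-0002-1694-233X"],
        "10.1234/example-b": ["0000-0002-1825-0097"],
    }
    for person, dois in datasets_by_person(records).items():
        print(person, sorted(dois))
```

Two people who share a name still have different iDs, so an index like this stays unambiguous where a lookup by name would not.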
The IGSN is built on particularly stable infrastructure with good governance, which is why many other disciplines are looking into using it too. And then, when we're looking at an emerging area, that is, identifying equipment uniquely, nobody's come up with a standard yet. That is to say, there's no single recommended identifier. Some organisations use handles, some organisations use DOIs. But if you're interested in that kind of thing, there is a group, the Identifiers for Instruments in Australia group, and you're welcome to join them and attend their meetings and chat about identifiers for instruments. Now, I thought I'd show you a couple of examples of, strangely, non-data identifiers, although this does link through to the data. So this, for example, is an IGSN on a sample. That's from a project I worked on a few years ago at Curtin University. You can see the QR code on the sample there. You can pull out your phone and scan that QR code and, touch wood, your phone will bring up that metadata page on the right. So what that IGSN is enabling, by way of this QR code, is being able to pull out a physical sample, and this physical sample has been prepared for use in an instrument, which is why it's embedded in epoxy resin, scan it with your phone or whatever QR code reader you have, and get the metadata record. That record gives you all of that rich metadata, which Liz will be talking more about on Wednesday, to let you know exactly what that sample is, where it came from, who was involved, and also which data sets were collected by the analysis of that sample, which is really quite exciting. Linking everything together makes it much easier to find things. Otherwise, what you'd be faced with is a drawer full of these rounds, as they call them, with some kind of random number scratched on them with a compass or a sharp tip of some kind.
And that would then have to be matched up with somebody's spreadsheet somewhere, which could make reusing samples a real nightmare. When you think about how much it costs to send a human out into the field to collect samples, then bring those samples back and process them and create rounds or mounts or something that can actually be analysed, this would really help in cutting down duplication of effort. Now, another favourite example of mine is some work undertaken by the University of Western Australia, and that is around uniquely identifying equipment in the Centre for Microscopy, Characterisation and Analysis. So the CMCA is a facility full of all sorts of different instruments, largely used in the life sciences. It's lots of imaging instruments or analytical instruments. What we have here is the Bruker Avance III HD NMR spectrometer. I have to admit, I don't actually entirely know what that is or what that does. But we have two different spectrometers here, one that operates at 600 MHz and one that operates at 500 MHz. It could be quite easy to get those mixed up, because there's only one character difference between the two, and that could escape a casual inspection. However, UWA has minted handles for all of the instruments in the CMCA and has therefore provided them with unique identifiers. You can see there, if you were to go to the UWA research repository, you could find those instruments and their metadata records, as well as information on how to use those instruments. The handles are there under the contact information and links, so you'll be able to grab the handle from there. Now, moving on, because I am now running into question time. What's next? After this webinar is finished, my colleague Nicola, who should already have emailed you last week, will be sending out a link to sign up for the community discussions.
Now, these community discussions are a very important part of the FAIR Data 101 course, because they let you connect with your colleagues who are also doing this course and discuss the material, such as the webinars. You can tell them how fine my beard is looking today, or maybe actually talk about something important. There are three different time slots available, and I've given those in Australian Western Standard Time, Central Standard Time and Eastern Standard Time. You'll be given a link where you can sign up for one of those time slots, and you'll be attending the same time slot every fortnight. So that's four community discussions, at the same time on the same day each fortnight. There are limited numbers available, so if you want to be sure to get the time slot that you want, then make sure you open that email message sooner rather than later. And then on Wednesday, my colleague Liz will be presenting part two of findability, talking more about metadata. That's already this Wednesday. Liz will also talk a little bit more about the activities that we'll be sharing with you, and the quiz. So that's it from me. It is now question time. Okay, Liz.
Hi, Matthias.
How are we going with the questions?
Pretty good. We have got a couple at the moment that I'd like to draw your attention to. Francis asked a question about minting DOIs. Does it matter where the DOI is minted, for example, by your uni or by Zenodo? What choices or options do people have?
So, to be honest, when it comes to picking a DOI minting agency, it actually doesn't really matter. It's largely up to your organisation and which service they would prefer to access. So, for example, the ARDC does provide, free of charge, a DOI minting service, and that DOI minting service works through an international organisation called DataCite. By going through DataCite, which has good governance and good infrastructure, we make sure that the DOI persists that way.
However, some organisations might have Figshare for Institutions, and Figshare has its own method of generating DOIs, which coincidentally also happens to be through DataCite, but they are not using ARDC infrastructure at all, and that is absolutely okay. Without getting too deep into how DOIs work, all DOIs are registered centrally by some central DOI infrastructure, and that infrastructure is, it's probably bad to say too big to fail, but so thoroughly important that it's going to keep on going. And even if the agency that you use to mint a DOI stops creating DOIs, so who knows, maybe Figshare will decide to go with a different kind of identifier, unlikely, but maybe, existing DOIs will continue to work because of that central registry that all DOI minting agencies go through.
Great. Thanks, Matthias. There is a question asked by Jenny about something being based in WA, and I believe it might be the IGSN, or it might have been the example you were talking about. But I did see another comment, and this may actually be an answer from Rebecca, that whatever it is you were talking about does indeed work, and it is based in WA. So there's my surmising.
If it's "yes, it works", I think that would be the IGSN QR code, which I said I wasn't sure would work, but it does, which is good, and which means that the infrastructure and the governance are working.
Awesome. Here's another question, from Jean. Is there any guidance on whether to mint DOIs at the collection level, at the item level, or both? Does it depend on the data set and how it's likely to be used?
Well, maybe you can use the community discussion to have this kind of philosophical discussion, but even within the ARDC, we still discuss the idea of a data set versus a data collection. What does that mean? How do you treat them? I mean, you could almost consider it to be kind of like an encyclopedia. So you have an encyclopedia, a multi-volume encyclopedia. Remember those?
Do you catalogue the encyclopedia as one thing? Or do you catalogue the individual volume? Or do you catalogue an individual entry within the encyclopedia? How do you treat that? It really comes down to what you think is the most manageable and what makes the most sense. Is it likely that other people trying to reuse your data or cite your data will cite one particular file, for example, or are they more likely to cite the entire collection as a whole? Based on my personal experience, so data projects I've worked on in the past, we have created metadata records and minted DOIs for an entire collection, but then, for each data set within that collection, we've also created yet another metadata record and yet another DOI. And we've semantically linked them so that they're all related, so that you visit the collection record and you see all the data sets within the collection, and you look at a data set and you see all the other data sets that are within the same collection.
Nice one. I guess you could also add that, as DOIs are often versioned by their minting apparatus or infrastructure, that might weigh into your decision about the level at which you put the identifier. I've got another question, Matthias. I think this is about the IGSN, people asking if you know whether they use PPMS software to manage their research instruments.
Oh, that's possibly more about the instrument handles from CMCA. I'll have to take that question on notice. Unfortunately, I do not know, but I can find out, and we will have a record of all of these questions, so we will be able to follow up without any further action by the person who asked it.
Oh, okay. There's another question here from Sir Senka now. Are exceptions to findable records managed through individual institutions?
Exceptions to findable records.
Well, here we go into another one. I mean, before we even talk about managing exceptions to findability, I would ask, why would you not want something to be findable? Now, we're not suggesting that everything that is findable is also immediately open and available and downloadable. So, first you have that decision, but then it really is up to the institutions, or whoever is undertaking the research, to decide whether their things, their objects, are made findable. However, they might run into policies set by funders, by publishers, or even by their own organisations. So, sorry, I haven't quite answered the question there, but I think this could make an excellent discussion topic in the Slack workspace.
Matthias, I'm just noticing the time, and I believe we have reached the limit of our webinar today. So, I am going to bow out and hand back over to you, and just put a little message out to say thanks to everyone who has managed to join the Slack today while Matthias has been speaking. It's really great to see some introductions there, so keep them coming. Over to you, Matthias.
Great, thanks for that, Liz. So, I possibly haven't been able to answer every question. I'm sorry if I wasn't able to make it to yours, or if I didn't know the answer to your question. As I said, we've got the Slack. Feel free to post your question in the #general channel. Otherwise, we have a record of all the questions, and hopefully we'll be able to address those sooner rather than later. I will sign off now. I'm getting a little peckish even though it's not quite lunchtime for me, and I look forward to seeing you all on Slack, at the webinar on Wednesday with Liz, and/or at a community discussion next week. Thank you very much.