Welcome everybody. Welcome to the first webinar in a series of four webinars about the FAIR data principles. My name is Keith Russell. I work for the Australian National Data Service (ANDS) and I am your host for today. With me I also have Nick Thieberger, who will be speaking later. Just to give you the usual bit of background about ANDS and what's going on: at the Australian National Data Service we work with research organisations around Australia to establish trusted partnerships, provide reliable services and enhance capability across the research sector around research data. We work together with two other NCRIS-funded projects, RDS (Research Data Services) and Nectar, and together we create an aligned set of joint investments to deliver transformation in the research sector. So this webinar is the first in a series of four, and what we want to do in this series is give a bit more background on the FAIR data principles. We've broken them down into the four principles: findable, accessible, interoperable and reusable. The focus for today is a general introduction to the FAIR data principles, and especially a look at the first principle, which is findable. So for the speakers today: I will give a general introduction to the FAIR data principles and a little bit about findable, and then I'm very grateful that Nick Thieberger, the director of PARADISEC, has made time available to talk about how PARADISEC has made their data findable. I think it's a great example that shows what the principles mean in practice, because the FAIR data principles are quite high level and general, and Nick can speak much more clearly to what findable actually looks like and how you can adopt and use it in practice. So to start off, what are the FAIR data principles?
So the FAIR data principles were drafted by FORCE11. FORCE11 is an international community of scholars, librarians, archivists, publishers and research funders that came together organically, starting in 2011 (hence the 11 in FORCE11), and it has been around ever since. What this community has looked at is how to facilitate change towards improved knowledge creation and sharing. As they were working on this in 2015, they came together and said it would be good to have some principles around research data, the sharing of that research data and how you can best do that. So late in 2015 they drafted these four FAIR data principles, and in early 2016 they published an article about them in Nature. From that moment onwards the ball started rolling and these principles started to receive attention and international recognition. There are a number of things to keep in mind when you look at the FAIR data principles, and they are probably the reasons why the principles are attracting so much attention. One thing to note is that they don't just look at making research data human readable but also machine readable, and I think that offers a lot of opportunities for the future: thinking towards a situation in which research data is machine readable, it can be harvested by machines, pulled together, used for big data approaches and for novel approaches to exploring data, finding patterns and creating new knowledge out of that data. Another valuable thing about the FAIR data principles is that they are technology agnostic. If you read them you'll find there is no recommendation to go with one specific technology.
They are formulated in a way that allows different types of technology to be used to solve the challenges. Another thing they've done quite well is to create a set of principles which are discipline independent, so the principles can be adopted across different disciplines in different ways, meeting the needs of each specific discipline. Also, the principles talk not only about the metadata, and not only about the data, but about the two combined, where working on the metadata can enhance the visibility of the data, for example, or its reusability. As you've probably noticed by now, FAIR is an acronym, and it stands for Findable, Accessible, Interoperable and Reusable. The R, reusable, is the one that sometimes causes a bit of confusion: people think it means reproducible, but it's actually reusable, which is a broader concept. We'll talk about each of those principles in more detail in the coming weeks. Before we move into the first one, findable, I have a few general pointers which are worth keeping in mind as we look at the FAIR data principles. One of the questions I sometimes get is: do you want all data to be FAIR? I don't think that is the case, I don't think it is necessary, and I don't think it fits with research practice. If you look at researchers and the process in which they create research data, there are various steps in that process, and in some cases huge volumes of data are created from experiments or come off instruments. These huge volumes of data can't be kept or stored in their original form; they often need to be manipulated, analysed and processed. So these huge volumes of working data are probably not suitable to be made Findable, Accessible, Interoperable and Reusable. Rather, it is the data as it moves through those steps, and the final analysed data, that is more suitable for that purpose.
Researchers sometimes use scratch data to explore different experiments, explore different settings and see how things work. Not all of that data is worth keeping or using right to the end. There are also cases of research with commercial interests, maybe even commercially funded. In those cases there can be arguments why those commercial parties are not interested in having any of that research visible or public to the outside world; they may want to keep it quiet that the research has even taken place. The same happens with national security and defence research. So in those cases it probably does not make a lot of sense to make any of that research data Findable, Accessible, Interoperable and Reusable. One question we sometimes get is: what about data about human subjects, where there are privacy or ethics considerations around the data? Should that data not also be kept hidden or private? There is a distinction here between open data and FAIR data. Open data is about making everything open. FAIR is actually about making data accessible through the appropriate routes, and that doesn't have to be open. So in the case of identifiable data that refers to human subjects, there might well be a very good argument why that data cannot be made openly available, but it can still be made accessible through appropriate routes. In that case it would still be FAIR, because it would still be accessible; it would just not be open. We'll talk more about that next week when we get to accessible. As for what the FAIR data principles are not about, and this is something that sometimes crops up: in copyright law there is talk of fair use and fair dealing. That's lower case, not capitalised, and it is something completely different, not related to the FAIR data principles.
One of the other things I ran into recently is that a number of market research companies have developed their own "fair data" mark, which is about how these companies treat the data they collect as they do their market research. That is also lower case and completely unrelated to the FAIR data principles in capitals. One other thing worth keeping in mind is that FAIR is not an actual standard. Some people say, well, I want to make my data FAIR and I want to make sure it ticks all the boxes exactly. You'll notice as we start talking about the FAIR principles and digging into them in more detail that it's actually not that black and white. It is a set of principles, a set of ideas about how you can approach it, and how you actually approach it in practice will probably depend on the discipline. There is no single standard that will work across all disciplines. Another thing to keep in mind about the FAIR data principles is that if you want to make more data more FAIR, it's not just about the research data itself; it will actually require some work around it. It will require a layer of underlying infrastructure, and that can be human infrastructure or technical infrastructure. That way a researcher does not have to do it all on their own; there will actually be things in place that make it easier for the researcher to make their data FAIR. Things you can think of there are policies around making the data FAIR, and procedures and guidelines that might be in place. It would be great if there are tools, platforms or software in place that actually make it easier for the researcher to make their data FAIR at the end of their workflow.
And finally, it's going to be important to have the skills and skill sets available to researchers, data managers, librarians, e-research analysts and e-research staff, all the different staff members involved in that process, to make it easier to make the data FAIR down the track. One of the questions I get is why these FAIR data principles are coming up now and why they are being adopted so widely. For one thing, they've got an attractive acronym. Another is that they cover quite nicely work that is already being done. If you look at them in more detail you'll find that some of the things covered there are things that organisations around the country have cared about for a while, and cared about more and more. So it's less a completely novel approach and more a bringing together of existing ideas under a nice acronym in a well-packaged form. Other reasons why the principles have proven useful: they are receiving a lot of international recognition, not just national; there is actually quite a lot of useful detail hidden below them; and the fact that they are discipline independent makes them easy to adopt. They are also not as hard a sell as making all data open. The only challenge, and this comes back to the point about the FAIR data principles not being a standard, is that they are hard to measure. It's hard to hold data up to a list and say this data is FAIR and this data is not FAIR at all. It is more of a scale from less FAIR to more FAIR. If you look at where FAIR has been picked up, there are plenty of examples out there. I've just picked a few here, some of them international, some national, some disciplinary.
So for example, in the European Union, the high-level expert group working on the European Open Science Cloud picked up the FAIR data principles and embedded them in their work and their thinking around what the European Open Science Cloud should look like. The Horizon 2020 funding programme of the European Commission has drafted guidelines for data management, and those guidelines also use the FAIR data principles. In the US, the NIH has just set up a Data Commons pilot in which they want to explore what a cloud for sharing research data would look like, and there they're also looking at the FAIR data principles. In the Netherlands, an initiative has been set up called GO FAIR, which is now reaching out to gain more international momentum and more international partners. That's also a very interesting development, in that they've looked at the FAIR principles and also at how you need different elements to support them, including cultural change, training and building infrastructure to make sure that data can be made FAIR easily. In the UK there is currently a project going on around FAIR in practice, taking the FAIR principles and exploring what they mean in different disciplines. The American Geophysical Union has just announced a project, I think the press release only went out yesterday, looking at what it will mean to make data open and FAIR in the Earth and space sciences. And closer to home here in Australia, one thing you might have already heard of is the FAIR Access to Research Outputs policy statement, which was drafted and is now available for endorsement by institutions around the country. The focus there is very much on research outputs in the ARC and NHMRC definition, that is, publications, conference proceedings and all sorts of published materials, and how those materials can also be made FAIR.
So that was a long-winded, more general introduction to the FAIR data principles. The one principle I want to talk about today is the first of those, findable, and in the actual principles findable is broken down into four elements. First, for research data to be findable, the principles say the data and the metadata should be assigned a globally unique and eternally persistent identifier. In practice that means it needs a DOI, a handle, a PURL, or some other identifier which is globally unique and persistent, with an organisation behind it that cares about making sure that the identifier will keep resolving to that data set even when the data set moves. This is where that example of being technology independent comes up: they don't recommend one over the other. Any of those solutions works, as long as the identifier is in place and reliably resolves. The second element says that data should be described with rich metadata. That's great, however they don't specify what rich metadata means, so this is one of those places where it's not black and white whether your data is FAIR or not. What we'd say is: make sure there is enough metadata alongside the data that it can be found, and that it answers the right questions for somebody searching for your data. The third element talks about the metadata and the data being registered and indexed in a searchable resource. There are different ways to tackle this. One way to think about it is having a search interface locally, a local database, some way of making sure that your data collection can be found through a search interface.
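To make those first elements concrete, here is a minimal sketch of a metadata record in simple Dublin Core, the kind of description an aggregator might harvest. All the field values, including the DOI, are invented for illustration; the point is that the persistent identifier is carried inside the metadata itself.

```python
import xml.etree.ElementTree as ET

# Namespace for simple Dublin Core elements.
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

def make_dc_record(title, creator, description, language, identifier):
    """Build a minimal Dublin Core description of a data collection.

    The persistent identifier (here a hypothetical DOI) is captured in
    the metadata, as the findable principle recommends.
    """
    record = ET.Element("metadata")
    for tag, value in [
        ("title", title),
        ("creator", creator),
        ("description", description),
        ("language", language),      # e.g. an ISO 639-3 code
        ("identifier", identifier),  # globally unique and persistent
    ]:
        el = ET.SubElement(record, f"{{{DC}}}{tag}")
        el.text = value
    return record

record = make_dc_record(
    title="Recordings of traditional songs",
    creator="Example, Jane",
    description="Audio recordings with time-aligned transcripts.",
    language="lkn",
    identifier="https://doi.org/10.0000/example-collection",
)
print(ET.tostring(record, encoding="unicode"))
```

A record like this, exposed through a repository's feed, is what makes the collection discoverable to harvesters without the harvester needing to understand anything discipline specific.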
But what we'd also definitely recommend is making sure that the descriptions of the data collections are passed on to aggregators: national aggregators, for example Research Data Australia, but also other, more disciplinary aggregators like TERN. They might go into an international disciplinary aggregator like OLAC, the Open Language Archives Community. And data can also be published in international disciplinary repositories, like PANGAEA for the Earth and environmental sciences or, in the case of astronomy, the International Virtual Observatory Alliance and the systems they have in place. So there are various possible routes to publish your data, but make sure it goes into a place where it can be searched and found, and will be indexed by search engines like Google Scholar. The final point really comes back to the first one: if you're going to have a globally unique and persistent identifier for the data collection, like a DOI, a handle or a PURL, make sure that it is actually captured in the metadata. Okay, so that was a quick overview of findable and the way the principles describe it. Now I thought it would be of interest to show what that means in practice, what it actually means to make data findable. And I'm really grateful that Nick has made some time available to give us a short presentation about PARADISEC and the work that's been done there to make their data findable. I think it's a great example of how, over the course of several years and building up experience, they have slowly but surely made their audio recordings more and more findable, using all the different elements I just described. One thing I also want to add before I hand over to Nick: PARADISEC has also done great work on making their data accessible, interoperable and reusable.
However, today I've asked Nick to focus on the findable side of things, so please keep that in mind during his presentation: there is also a lot of other work PARADISEC has done on those other aspects. So I'd like to hand over to Nick, and Nick can talk about findability in PARADISEC. Great. Thanks very much, Keith. And I'd like to thank ANDS for the support of PARADISEC over the years. So PARADISEC has been running for some time and, as you can see, it's become a significant collection. We have around 31 terabytes of material, most likely more than that because it's increasing almost every day, representing over 1,100 languages, 162,000 files and 7,500 hours of audio. So it's a significant collection, and there's a huge management task involved in that; one of those tasks is making sure that this material is findable by the people we want to find it. We have a catalogue that we've been working on for a number of years. We've built our own; unfortunately we didn't find one off the shelf that we could use. The catalogue allows you to look at material with a geographic point of entry and a faceted search. We have OAI (Open Archives Initiative) and Dublin Core based metadata. We try to be as lightweight as possible with the metadata because, in our experience (we're all researchers: I'm a linguist, my colleague Linda Barwick is a musicologist), people just won't enter metadata if it's too complicated. So we've tried to make it as simple as possible and to make the catalogue do as much of the work for you as possible: using controlled vocabularies, doing predictive data entry and having a minimal number of fields. Here's a screenshot of the catalogue. We have the possibility to make the metadata private. So, as Keith was just saying, FAIR doesn't mean that everything has to be made publicly accessible.
If you're constructing a collection, you can keep all the metadata private and then publish it when you're ready. You can also assign various kinds of access conditions: open, subject to normal conditions, or closed, subject to whatever conditions you want to specify. Because our project is really focused on language materials from small languages, that is, all of the 7,000 other languages out there in the world, we include language identifiers for the subject language and content language of items in the collection. And this is the linchpin that lets us feed a number of different harvesting services that I'll show you in a minute. Our online catalogue lets you specify geographic coordinates, which then also allows you to search using that geographic information. Because of the work we're doing, we have lots of connections into the region, in particular the Pacific, and we are actively seeking collections in the Pacific: collections of analog tapes that need to be digitised. You can see the various agencies there that we've collaborated with, and continue to collaborate with, digitising hundreds of tapes, putting them into the collection and making them accessible. So when we talk about findability, we can talk about the granularity of finding. We can find collections and we can find items, and we should be able to drill down into a collection to find the things we're particularly interested in. We can characterise findability on a scale, if you like, from zero to ten. If we talk about primary research materials that people have in their offices or in their homes, typically the findability of those things is about zero. It may be one if your colleagues know that you've done this work and have these tapes sitting in your office, but a speaker of the language trying to locate recordings you made with their grandparents is not going to be able to find that material.
So, from our point of view in PARADISEC, we infer that these records must exist because we know the research has been done, and we can go looking for them. Then what we can do is add records to our catalogue pointing to analog materials, and we do this in some instances. We also point at websites that we know exist. There are some fine websites that have language materials on them, but websites can be transient, so what we then do is point at the Wayback Machine or Internet Archive entry for them. Here's an example of a text that was produced in the Solomon Islands and put online by Project Canterbury, which is an Anglican online archive, but it's a website; there's no guarantee of longevity. By putting it into our catalogue, we make it available and findable via the search engines that we'll see in a moment. So there we increase its findability to perhaps three out of ten, and we use the language identifier. There you can see the three-letter ISO 639-3 code for the language, in this case lkn. What we've also done is provide images of manuscripts. This is a collection of papers produced by Arthur Capell during his life. He was a professor of linguistics at Sydney University, and when he died he left a huge number of papers, which we then digitised: we just set up a camera and took images of all of these papers. As you can see in the bottom right there, there are a lot of handwritten original manuscripts which are really valuable from a research perspective, but sitting in a box in his executor's house they are completely unfindable. So we put entries into the catalogue, and we put this through the Heritage Data Management System to put an HTML framework around it, and you can then find these items and resolve down to the level of the image. Now, you can't yet get to a transcript of the image, because at the moment all we have are the images.
But one of the next things we do to increase findability is to include transcripts together with recordings. Here's an image from our catalogue: we have time-aligned transcripts of recordings, which allow us to play the recording, and you can imagine, because I won't show it to you, that as the recording plays it scrolls through the transcript. This increases findability significantly: you can resolve down to the level of words and find them in the context of the recording. One of the other things we do is embed some metadata into the header of the WAV files in our collection. We create a Broadcast Wave Format (BWF) file, which is the European standard for archival audio formats, and you can see a little snippet of XML there which is extracted from our catalogue and inserted into the WAV file before it's all sealed up and put into our collection. We use persistent identifiers of various kinds. Because the collection started, as I say, 15 years ago, we have an internal persistent identification system, which is a collection ID followed by an item number. More recently, in the last couple of years, we've put DOIs through the whole collection, so we have DOIs from the level of each file up through items to the collection level. You can see also that we have Zotero and Mendeley integrations, so that also makes things findable: people will cite these items using this form, and they can click and insert them into their Zotero or Mendeley databases. We have an API, and we have two feeds that we produce so people can link into our collections. RIF-CS is at the collection level, and that's what's harvested by Research Data Australia and other services; Trove also harvests that material. The OAI-PMH feed is primarily targeted at the Open Language Archives Community. Linguists have been very good at setting up services based around these language identifiers.
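To give a flavour of what embedding metadata in an audio file looks like, here is a small sketch of writing a WAV file with a BWF "bext" chunk (the field layout follows the EBU Tech 3285 specification). This is not PARADISEC's actual tooling, and the description, originator and catalogue identifier are invented; it just shows that the archival copy of a recording can carry its own description and identifier inside the file.

```python
import struct

def bext_chunk(description, originator, archive_id):
    """Build a minimal BWF version 1 'bext' chunk.

    description/originator are free text; archive_id (a hypothetical
    catalogue identifier) goes into the OriginatorReference field.
    """
    body = struct.pack(
        "<256s32s32s10s8sIIH64s190s",
        description.encode("ascii"),   # Description (256 bytes)
        originator.encode("ascii"),    # Originator (32 bytes)
        archive_id.encode("ascii"),    # OriginatorReference (32 bytes)
        b"2024-01-01",                 # OriginationDate (10 bytes)
        b"00:00:00",                   # OriginationTime (8 bytes)
        0, 0,                          # TimeReference low/high
        1,                             # BWF version
        b"",                           # UMID (64 bytes, zero padded)
        b"",                           # reserved (190 bytes)
    )
    return b"bext" + struct.pack("<I", len(body)) + body

def write_bwf(path, pcm_bytes, description, originator, archive_id,
              channels=1, rate=16000, bits=16):
    """Write a minimal PCM WAV file with the bext chunk embedded."""
    fmt_body = struct.pack("<HHIIHH", 1, channels, rate,
                           rate * channels * bits // 8,
                           channels * bits // 8, bits)
    chunks = (b"fmt " + struct.pack("<I", len(fmt_body)) + fmt_body
              + bext_chunk(description, originator, archive_id)
              + b"data" + struct.pack("<I", len(pcm_bytes)) + pcm_bytes)
    with open(path, "wb") as f:
        f.write(b"RIFF" + struct.pack("<I", 4 + len(chunks)) + b"WAVE" + chunks)

import os, tempfile
tmp = os.path.join(tempfile.gettempdir(), "example_bwf.wav")
write_bwf(tmp, b"\x00\x00" * 8, "Example recording", "PARADISEC", "XX1-001-A")
```

Standard audio tools that understand BWF will read the description back out of the file, so the metadata travels with the recording even when it leaves the catalogue.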
And the OLAC page allows you to look at all the material produced by any one of their 60 member archives for any given language, so it's a fantastic resource for finding information about the world's languages. If we update an item in our catalogue, the nightly OLAC harvest will pick up that update the next day. As you can see, Research Data Australia takes our feed and presents it in interesting ways. So the benefit for us is not only that our material is more findable, but that some of these services present the information in our catalogue in ways that we don't: you can do faceted searches in some of these services, and they link into all kinds of other services and data providers that then allow you to do interesting new searches. There's the Open Language Archives Community page: they have a faceted search on the right and a whole lot of the services they provide advertised on the left. If you're interested in languages at all, it is really the one-stop shop for finding what's in any archive in the world via their harvested system. This is the Virtual Language Observatory, a European service funded by CLARIN. They also take our feed, and you can see that you can search our collection through that service as well. And WorldCat, the international catalogue of all libraries, also takes our feed. So that's the big-picture side of it, the international search engines. On the other side are the people we want to find this material, out in the Pacific, and we've been working very hard to get material available in forms that can be accessed by people in the Pacific. On the top right there's a really interesting little project that was run in Madang, where they took recordings, played them at a local market and asked people in the market to comment on the recordings, to enrich the metadata in that way.
They then sent that to us in a spreadsheet, which we were able to import into our catalogue. At the bottom you can see a speaker of one of the languages who happened into my office in Melbourne, went through the collection and found his grandfather speaking; he was quite amazed by that. So there's an example of how unfindable the material can be, that he had to come into my office to find it, and that's one of our big problems: how to make the material in our catalogue accessible to people who aren't necessarily looking around on the web, because they just don't expect to find material in their language. On the left there's a man working in our office in Sydney. This was an ANDS-funded project to enrich the PNG metadata in our collections, and he's going through, listening to material and adding metadata where he can. So one of the other ways we're promoting the collection is by building a virtual reality project. What you're looking at there is a map of Vanuatu, and each of those shards of light coming up represents a language. There's a little symbol there; you can listen to a snippet of the language, which comes out of the PARADISEC collection, and you can see some information about how much we know about the language: whether there's a grammar, whether there's a lexicon, and how many speakers there are of that particular language. Now, this is generating a lot of publicity, as you can see on the right. There's an article from the Papua New Guinea Post-Courier, and on the bottom right there's an article written about this in Pursuit at the University of Melbourne, and so on. Getting this publicity is important exactly so that people will then go to look in the catalogue and find information, or think about collections they have that need to be digitised. It's an investment of time and effort to build the virtual reality, but it's captured a lot of public attention.
And it's also a research output, in that it is driven by well-formed data in the PARADISEC collection. We've automatically snipped 20 seconds out of audio files and used the naming convention and the metadata in the catalogue to feed this virtual reality display. So ultimately we do want to get this material out to the Pacific, and what's amazing really is that now most people in the Pacific have mobile phones that access the internet. On the right you can see a poster for internet on your phone in Port Vila in Vanuatu, and on the left you can see a church, but above the church there's a mobile phone tower, which is now the way that people are accessing all this kind of information. So we want to make our material findable for people in these remote locations, even in the highlands of Papua New Guinea or the most remote parts of the Pacific. The catalogue is findable to them through various means, including of course Google, but we also need to make the data accessible, interoperable and reusable for them; I'm not going to talk about that now. PARADISEC created a standard metadata set, which means that as data comes in it's described with a light touch, as I say. We apply as much metadata to items as possible, but for some of the legacy material there's just very little metadata and we have to infer what we can. We also rely on people entering that metadata online if they can, or sending information to us; we are always open to enriching the metadata in the collection. The main point of the metadata is that you are able to locate the primary records and have them played to you, or see them, or download them if you have the privileges. So all of that makes the material more accessible and findable, and pushing the metadata out through APIs to our discipline-specific and more general search tools makes it more findable as well.
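The harvesting side of that metadata flow can be sketched briefly. An aggregator like OLAC fetches an archive's OAI-PMH feed and pulls the Dublin Core fields out of each record. The sample response below is invented for illustration, but its envelope follows the OAI-PMH protocol with the oai_dc metadata format; a real harvester would fetch this XML over HTTP from the archive's endpoint.

```python
import xml.etree.ElementTree as ET

# A trimmed-down OAI-PMH ListRecords response; record content is invented,
# but the structure mirrors the protocol's oai_dc format.
SAMPLE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example recording</dc:title>
          <dc:language>lkn</dc:language>
          <dc:identifier>https://example.org/collections/XX1-001</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records(xml_text):
    """Extract (title, language, identifier) triples from an OAI-PMH response."""
    root = ET.fromstring(xml_text)
    out = []
    for rec in root.iterfind(".//oai:record", NS):
        title = rec.findtext(".//dc:title", namespaces=NS)
        lang = rec.findtext(".//dc:language", namespaces=NS)
        ident = rec.findtext(".//dc:identifier", namespaces=NS)
        out.append((title, lang, ident))
    return out

print(list_records(SAMPLE))
```

Because the language field carries a standard identifier, a service can index every harvested archive by language without knowing anything else about each collection, which is exactly what makes the OLAC-style aggregation work.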
We do many things to try and publicise the existence of the collection, including what may seem gimmicky virtual reality or augmented reality, but all of this goes to increasing public knowledge of the collection, which increases its findability and also helps us locate analog data that needs to be digitised. Part of all of this also requires data management training, so that people know how to build their own collections. So we do a lot of training of researchers here in Australia but also in the Pacific, and we have a lot of engagement with community agencies in the Pacific and try to get funding to run digitisation programs with those agencies. So that's our story about findability. Thanks. Thank you, Nick. Thank you for the presentation. I think it's a really interesting example of how you've taken up findable and adopted it in a way which is relevant to the language community and to audio recordings. Okay, so now you can see the resources on findable. There are links here to the DOI and Handle minting services that ANDS provides, a link to a metadata standards directory if you're wondering which metadata standard would be relevant to your discipline, and a link to the national Australian research data discovery service, Research Data Australia. And if you're looking at more international, perhaps disciplinary repositories where you'd want to make your research data findable, have a look at re3data, an international initiative which lists all sorts of different repositories; PARADISEC is in there too. Finally, if you want to have a bit of a crack at research data things: we ran 23 (Research Data) Things last year, and there are a number of exercises there in which you can learn more about what it means to make your research data more findable.
Well, we've picked out three of those which are especially relevant to the findable space, so if you're interested, have a look at those three things and see if there's anything there you might like to try. So finally, I would like to thank you all for your attention, and I'd like to thank the National Collaborative Research Infrastructure Strategy (NCRIS) program, which funds ANDS and makes this all possible.