So welcome to this webinar on FAIR data. It's the first in a series of three, and it's aimed at increasing your awareness of FAIR and providing approaches with which you might be able to make your own data and software more FAIR. The first webinar will focus on the F and A aspects of the FAIR acronym, whilst the second will provide information on the I and R elements, and the third webinar will be a much more hands-on session where we will uncover how the BioDeVL RDC project's data can be made more FAIR.

A quick introduction: I'm Frankie Stevens and I'm going to be your presenter today. Also in the webinar, as mentioned, is my colleague Andy, who kicked us off; Andy is your local project contact from the ARDC. We've also got Martin Schweitzer on the call; Martin has been working with us on this material and will be presenting the next webinar. And we've got Keith Russell from the ARDC, the partnerships program manager, who's joining us as an expert moderator on FAIR. We anticipate that this webinar will run a little under an hour, which no doubt you'll appreciate. So without further ado, I'll get going. We'll start with a little video that sets the scene.

Could you hear the sound on that? No, we can't. Okay. Andy, when we started the Zoom meeting, there's an option to select "share with sound"; did you by any chance click that one? I don't recall that option. Okay. There are subtitles, so we can maybe just go with that. Yes, probably less painful than the voices anyway.

Okay. So the problems encountered by the panda in the previous video would have been far fewer if research was more open. Open research not only benefits individual researchers, but society as a whole. It opens up research to the general public, promotes collaboration, increases transparency and reproducibility, increases the visibility of researchers' work, fosters good scientific practice, and allows existing data to be reanalysed and repurposed. As you can see from this slide, there are a number of benefits to open data and open research as a whole, ranging from the researcher-focused increase in citations and credibility, through to addressing the challenges currently encountered in research reproducibility, and enabling better linkages between academia and industry. At its most basic, it avoids unnecessary duplication of effort and allows data to be repurposed, and data reuse is an efficient way of encouraging new discoveries. But some data could be freely available yet completely unintelligible, so to speak, and this is one of the problems that FAIR data seeks to address.

So what is FAIR? The FAIR principles were designed by a diverse set of stakeholders representing academia, industry, funding agencies and scholarly publishers. These stakeholders all came together to design and jointly endorse a concise and measurable set of principles addressing how best to enable data reuse. The principles aim to bring about a change in modern research communications through the effective use of information technology. The intent is that they may act as a guideline for those wishing to enhance the reusability of their data holdings. The FAIR principles put specific emphasis on enhancing the ability of machines to automatically find and use data, in addition to supporting reuse by individuals.
The principles have received international recognition since they were drafted by that group and put forward by FORCE11 in 2015, and they were published in an article in Scientific Data, a Nature journal, in 2016. So what are they? They are guiding principles for findable, accessible, interoperable and reusable research data; FAIR is obviously an acronym.

There are a few things of interest regarding the principles. They look at making research data reusable not only by humans but also by machines, and this has the advantage of making data harvestable for big data approaches: pulling in large amounts of data and enabling pattern recognition and new, innovative processes for knowledge discovery across large and varied volumes of data. The principles are intentionally technology agnostic. They have been written with any research discipline in mind and formulated in a way that they can be applied across disciplines. And the principles address both the data and the associated metadata, to enable optimal reuse. They're not just about the data: making something FAIR requires underlying infrastructure, and sometimes policies, procedures and guidelines surrounding not just the data but the tools, platforms and software in use. So making something FAIR necessitates a little understanding, and hopefully you're going to get that today.

Some would argue that not all data is suitable to be made open, or even FAIR, in the end. As researchers run experiments with huge volumes of data coming off instruments, the raw data will not always be kept, and for working or scratch data it doesn't always make sense to keep it and make it FAIR. Another example is where the data has been produced in a project jointly funded with business partners, or where there are commercial interests in the data; there might not be an interest in having any part of the data or the research discoverable, unfortunately. The same may apply for national security or defence research. And there may be cases in which the data can't be made open for very valid privacy reasons, for example when it contains information that can identify individuals, or other sensitive information, such as pinpointing the location of threatened species that smugglers may want to get hold of. It's important to note, though, that just because data isn't open, that doesn't mean it can't be FAIR. Data can still be FAIR but be kept under mediated access controls, so it isn't open. In some of these examples it can make sense to make the metadata about the data collection open and describe how a user can get access to the data, and that would still count as FAIR.

So what else is FAIR not? It's not related to the legal terms in use in copyright law for fair use and fair dealing. Neither is it the Fair Data mark, and it's not a standard. It's an acronym.

We're now going to look at each individual principle in more detail, starting with Findable. Findable can be summarised as data and metadata being easy to find by both humans and computers. With these principles, where you see brackets around the word "meta", that means the principle is applicable to both the data and the metadata. So, starting with F1: metadata and data are assigned a globally unique and eternally persistent identifier. An identifier in this case means a link on the internet, for example a URL that resolves to a web page that defines the concept, such as the information on a particular human protein.
And we'll come back to that example link later in the webinar. F1 stipulates two conditions for your identifier. The first is that it must be globally unique, which means that someone else couldn't reuse or reassign the same identifier without, in so doing, referring to your data. You can obtain globally unique identifiers from a registry service that uses algorithms guaranteeing that newly minted identifiers are unique. The second is that it must be persistent. It takes time and money to keep links active on the web, and over time links tend to get broken. Registry services guarantee, to some degree, the resolvability of that link into the future.

F2: metadata (that's data about data, and we'll come to that in a bit) should be generous and extensive. You should include descriptive information about the context, quality, condition and characteristics of the data. Rich metadata allows a computer to automatically accomplish routine and tedious sorting and prioritising tasks that currently demand a lot of attention from researchers.

F3: identifiers and rich metadata descriptions alone will not ensure findability on the internet. Perfectly good data resources may go unused simply because nobody knows they even exist. If the availability of a digital resource, such as a dataset, service or repository, isn't known, then no one, and no machine, can discover it. There are many ways in which digital resources can be made discoverable, including indexing.

F4: metadata and the dataset they describe are usually separate files. The association between a metadata file and the dataset should be made explicit by mentioning the dataset's globally unique and persistent identifier in the metadata. Many repositories will generate globally unique and persistent identifiers for deposited datasets that can be used for this purpose. Here are some links to where you might find more resources on Findable, and we'll be sharing a PDF of these slides with you so that you can follow up on these in your own time if you need more information.

So, moving on to Accessible. A1: data and metadata are retrievable by their identifier using a standardised communications protocol. Both the data and the metadata should be accessible using the identifier, for example a DOI, a handle or a persistent URL (we'll come to all of these later). You should be able to get to the data not only as a human but also as a machine, and examples of such protocols are HTTP and FTP. A1.1: to maximise data reuse, the protocol should be free (zero cost), open (as in open source) and globally implementable, to facilitate data retrieval. It should not be bespoke, home-built and badly documented, and it should not require specialised, expensive software. A1.2 is an often misunderstood part of FAIR data: accessible does not necessarily mean open, but rather that the exact conditions under which the data are accessible are given. So even heavily protected and private data can be FAIR. If FAIR is well implemented, a human can see that the data isn't open, but can clearly see what steps they need to take to get access to it. This could be as simple as being presented with the name, email address and phone number of the custodian of the data. It could also include, for example, a clear description of the ethics approval process they need to go through to get access to the data.
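Pausing here for a moment, here's a minimal sketch of what A1 and A1.2 can look like to a program: retrieving a record by its identifier over a standardised, free, open protocol (HTTPS), asking for a machine-readable representation via content negotiation, and recognising when access is mediated rather than open. This is an illustration only: the DOI below is an invented placeholder on the DataCite test prefix, and while content negotiation of this kind is offered by DOI registration agencies such as DataCite, your repository's details may differ.

```python
# A sketch only: resolve an identifier over HTTPS and ask for
# machine-readable metadata rather than the human landing page.
import requests

doi = "10.5072/example-dataset"  # placeholder DOI for illustration

response = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.datacite.datacite+json"},
    timeout=30,
)

if response.status_code in (401, 403):
    # A1.2: accessible does not mean open. A well-behaved client can
    # recognise mediated access and consult the open metadata record
    # for the access conditions (custodian contact, ethics process...).
    print("Not open; consult the metadata record for access conditions.")
elif response.ok:
    metadata = response.json()
    print(metadata.get("titles"), metadata.get("creators"))
else:
    response.raise_for_status()
```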
The FAIR principles also look at making data reusable by machines: if a machine is looking for the data, the machine should be able to recognise that the data isn't open, and it can then let the human know what steps they need to take to get access to the data. And if the user, be it a human or a machine, has been granted access to the data, then the data should be accessible through an authentication and authorisation procedure.

A2: there are cases in which data have to be destroyed. This is of course not ideal, but it could occur, for example, if consent for use was only for a limited period of time, or if there's been a legal takedown notice that the data provider has had to comply with. If the data is no longer available, then the metadata, at least, must be kept and made available. This allows anybody, or any machine, looking for the data to find out that the data is no longer available.

Here are some links to where you might find more resources on Accessible. You'll notice that Thing 19, near the bottom, mentions APIs, and you may be wondering where APIs fit into the picture; you'd be right to question this. We've opted to further discuss APIs next week when we talk about interoperability, but there are obviously links from APIs to accessibility too. In fact, there's a bit of crossover among the approaches taken for many of the FAIR elements.

Next, we're going to talk about what's needed to make something FAIR, or, in this particular instance, findable and accessible. So, we talked about identifiers. First of all, what is the problem that persistent identifiers are trying to address? Everybody will be familiar with this: you click on a web link that takes you either to a "page not found" error page like this one, or to content that's unrelated to the link you clicked. Both usually happen because a web resource has been moved to another location and you've got the old link. From a research perspective, this means that a scholarly resource which may have been cited can't be found, verified, potentially cited again, and so on. This is the problem that persistent identifiers are here to address. Many data repositories will automatically assign globally unique and persistent identifiers to deposited datasets; identifiers are essential to human-machine interoperation and play a vital role in enabling data sharing. Finally, identifiers will also help others to properly cite the work when they're reusing the data.

A persistent identifier is simply a long-lasting reference to a digital resource. Even if the resource moves location on the web, the persistent identifier is there to make sure that the link always resolves. So if a PID is used as a citation link in research literature, it will always resolve to information about the resource: either a descriptive metadata page, the resource itself, or information about the removal of the resource from the web. It's important to note, though, that PIDs do not guarantee a link will never be broken, but they create a technical and a social framework which helps to guarantee it. PIDs play a key role in the discoverability, accessibility and reproducibility of research. So how do they do this? Well, PIDs play a role in linking scholarly resources, such as publications and data, as well as in tracking the impact of these resources.
They provide social and technical infrastructure to identify a research output over time, and they enable machine readability. They enable research objects to be labelled uniquely, disambiguating one object from another. And they facilitate the linking of research projects and related people and things, so that a person may discover a publication, its related data set, related software, related methods and so on.

So let's look at the handle system as an example of a PID. Most PIDs for research work by separating the identity of a scholarly object from its location on the web. Handles are run by the Corporation for National Research Initiatives, or CNRI, in the US; CNRI is a not-for-profit organisation formed in 1986 to undertake, foster and promote research in the public interest. The handle system is very robust and is widely used internationally among repositories, and importantly it also provides the underlying infrastructure for digital object identifiers, which we'll come to shortly.

So what are the handle characteristics? There's a central handle registry where handle identifiers are recorded. The model is one where you assign one handle per resource. There's a distributed computer system including handle proxy servers, and there's minimal cost, which is usually borne by the handle issuer, such as an institution running a handle proxy server. Handles are unique, global, scalable and reliable. But remember, PIDs are both technical and social infrastructure: if the URL of a resource changes, then the owner must update the URL in the handle system. A handle is mainly made up of a suffix that identifies the local name of the resource, a prefix that identifies the naming authority, and a resolver service such as hdl.handle.net.

So let's look at another example of persistent identifiers: digital object identifiers, or DOIs. As mentioned, these are an implementation of the handle system, and you can see that they have a very similar structure. They originated in the scholarly publishing industry, and DOIs are routinely assigned by publishers to identify journal articles and other published works. There's a great deal of technical and social infrastructure invested in DOIs, and according to some research they're by far the most widely used persistent identifier for research objects, which includes research data. DOIs are applicable to a variety of digital objects in research: publications, data, software, methods, theses and so on. They're governed by the International DOI Foundation, which is another not-for-profit organisation, and DOIs are issued by a DOI registration agency or their agents, and the ARDC is actually one of these. There can be a cost associated with DOIs, but some agencies, such as the ARDC, offer a free service, and with the ARDC you can get a DOI either through manual or machine-to-machine minting. As for handles, DOIs are unique, global, scalable and reliable. They also come with a metadata package.
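Before we get to that metadata package, here's a minimal sketch of the resolution machinery we've just described. The hdl.handle.net proxy exposes a public REST API that returns the record behind a handle, including the URL it currently points to, which is exactly the value the owner must keep up to date when the resource moves. Because DOIs are built on the handle system, a DOI works here too; the handle below (10.1000/1, the DOI Handbook) is used purely as an illustration.

```python
# A sketch of resolving a handle (prefix/suffix) via the hdl.handle.net
# proxy's public REST API. DOIs are handles too, so a DOI can be looked
# up the same way.
import requests

handle = "10.1000/1"  # "10.1000" = naming authority prefix, "1" = local suffix

resp = requests.get(f"https://hdl.handle.net/api/handles/{handle}", timeout=30)
resp.raise_for_status()
record = resp.json()

# The record is a list of typed values; the URL value is the current
# location of the resource, which the owner updates if it ever moves.
for value in record.get("values", []):
    if value.get("type") == "URL":
        print("Currently resolves to:", value["data"]["value"])
```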
Here's an example of a DOI metadata schema, in this case the DataCite metadata schema. This is the set of mandatory metadata that must be registered with the DataCite Metadata Store when minting a DOI for a data set, and the domain-agnostic properties were chosen for their ability to aid in the accurate and consistent identification of data for citation and retrieval purposes. You can see that there are certain properties that are mandatory, then recommended, and then optional. But why do we need all of these metadata attributes, and what is metadata?

The most widespread definition of metadata is that metadata is information about data, or data about data. There is another way to look at it that's different from that rather dull description: Jason Scott sees metadata as a love note. It might be to yourself, but in fact it's a love note to the person, or the machine, that comes after you, where you save someone a great deal of time in finding something by telling them what this thing is. So describing physical and digital objects is what metadata is about. It helps the classification, access and storage of all types of digital assets. It's with metadata that the encoding of knowledge within any data element is possible, and metadata comes in many shapes and flavours, carrying additional information about where a resource was produced, by whom, when it was last accessed, what it's about, and so on.

Some would say that there are three main types of metadata: descriptive, structural and administrative. Descriptive metadata adds information about who created a resource and, most importantly, what the resource is about and what it includes. Structural metadata includes additional information about the way data elements are organised, their relationships and the structures they're in. And administrative metadata provides information about the origin of resources, their type and the access rights.

Metadata elements grouped into sets designed for a specific purpose, for example for a specific domain, are called metadata schemas, and we showed you a minute ago what the DataCite metadata schema looks like. Metadata schemas that are developed and maintained by standards organisations such as ISO, or by organisations that have taken on that responsibility, for example the Dublin Core Metadata Initiative, are called metadata standards. But the problem with standards is that there are so many of them. Across the research disciplines there are thousands of standards, and several thousand databases where resources are kept, and as consumers of these standards and databases it's often very difficult to know which resources are the most relevant for a specific domain. FAIRsharing is an educational and information resource on standards, databases and data policies, and the FAIRsharing team works with and for the community to map the landscape of community-developed standards. So I would recommend having a look at FAIRsharing to discover metadata standards applicable to your domain, but bear in mind that FAIRsharing isn't the only resource for this: in much the same way that there are thousands of standards, there's also more than one place that collates these resources, and we've provided a link on the slide to a nice Digital Curation Centre resource that you can investigate too.
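To make that less abstract, here's roughly what a minimal record against one such schema, the DataCite schema we saw a moment ago, might look like at the point of minting a DOI. This is a sketch only: the field names follow the DataCite schema's mandatory properties, but every value is an invented placeholder.

```python
# A sketch of the mandatory DataCite properties for a data set, as the
# kind of payload registered when minting a DOI. All values are invented.
datacite_record = {
    "identifier": {
        "identifier": "10.5072/example-dataset-1",  # placeholder test DOI
        "identifierType": "DOI",
    },
    "creators": [{"name": "Researcher, Example"}],  # hypothetical creator
    "titles": [{"title": "Example functional genomics data set"}],
    "publisher": "Example University Data Repository",
    "publicationYear": "2019",
    "types": {"resourceTypeGeneral": "Dataset"},
}

# Descriptive, structural and administrative detail is then layered on
# through the schema's recommended and optional properties, for example:
datacite_record["descriptions"] = [{
    "description": "Processed expression values from a hypothetical assay.",
    "descriptionType": "Abstract",
}]
```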
So, being able to describe and identify the data is one thing, but that's less useful if nobody can discover it. We've already spoken about why it's useful to reuse research data: it enables secondary data analysis, and it enables reproducibility, among lots of other good reasons. Those are just a couple of examples of why people go looking for data; here are a few examples of where people go looking. Google, of course: who doesn't hit up Google first? Asking a colleague for data is also very common, and checking out project websites or disciplinary resources is another avenue. But the two we're going to touch on in a bit more detail are the generalist registries Research Data Australia and re3data, and we'll talk about some bio-specific repositories as well.

So, starting with Research Data Australia, or RDA. RDA is an ARDC-operated service which enables you to find, access and reuse data for research. RDA has metadata records of data assets from over 100 Australian research organisations, government agencies and even cultural institutions. Getting your data registered in RDA increases its chance of discovery and therefore reuse, and I'm sure you're all familiar with Research Data Australia, so I will not labour the point here.

re3data.org is a global registry of research data repositories that covers repositories from different disciplines. It presents repositories for the permanent storage and access of data to researchers, funding bodies, publishers and scholarly institutions, and it promotes a culture of sharing, increased access and better visibility of research data. The registry went live in 2012 and is funded by the German Research Foundation.

We also previously touched on FAIRsharing with respect to its metadata standards capabilities, but it also holds information on databases, and this slide presents some links to bio-specific resources. GEO, for example, is a public functional genomics data repository. Array- and sequence-based data are accepted in GEO, and tools are provided to help users query and download experiments and curated gene expression profiles. ArrayExpress, the archive of functional genomics data, stores data from high-throughput functional genomics experiments and provides these data for reuse by the research community. And finally on this slide, we'll mention ELIXIR, which has compiled a list of resources that it recommends for the deposition of experimental data. The scientific community has a shared responsibility to ensure long-term data preservation and accessibility, and these deposition database lists help researchers and others involved in the life sciences to meet it.

So we'll now take a little time to discover how a life science resource can be made more FAIR, and in this example we're looking at UniProt. UniProt is one of the world's largest freely available biological data resources, providing key life science data in the most open and accessible manner to the scientific community. All entries are uniquely identified by a stable URL that provides access to the entry in a variety of formats, including a web page, XML, plain text, RDF and REST services, and this helps achieve findability and accessibility of the resource. Interlinking with more than 150 different databases, every UniProt entry has extensive links into, for example, PubMed, which enables rich citation, and these links are key to the user experience in human- and machine-readable formats.
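As a quick, concrete illustration of those stable URLs, here's a sketch that fetches one entry in several machine-readable formats. The accession P05067 (the human amyloid-beta precursor protein) is used as an example; the rest.uniprot.org endpoint shown is UniProt's current REST interface, and the same pattern applies to the www.uniprot.org entry URLs.

```python
# A sketch: one UniProt entry, one stable identifier, several
# representations selected by file extension, for humans and machines.
import requests

accession = "P05067"  # example accession for illustration

for fmt in ("txt", "xml", "fasta"):
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.{fmt}"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    print(url, "->", resp.headers.get("Content-Type"))
```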
That interlinking helps with interoperability, which we will definitely cover in more detail in the next webinar. The entries contain rich metadata that's both human-readable, through HTML and text formats, and machine-readable, through XML and RDF, and again this helps with the findability of the resource. And whilst we will cover more on interoperability and reuse in our next webinar, all UniProt representations use shared vocabularies and ontologies, such as GO and ECO, which help metadata descriptions for both interoperability and findability. The RDF representation uses the UniProt RDF schema ontology and FALDO, which helps with interoperability and reusability, as we'll cover next week.

So that brings us close to the end of our first FAIR webinar, and we hope we've covered the topics of Findable and Accessible relatively comprehensively. To verify this, we'd like you to do a quick online quiz for us, if that's okay, so that we can potentially modify this training material for future use if need be. So if you could please open up the link that you see on the slide (I think Andy is going to be pasting it into the chat box) and take just a couple of minutes to answer the multiple-choice questions; it's really not onerous, I promise. Once you've hit submit, come back to Zoom, where we'll answer any questions you may have on F and A, and then we'll wrap up the webinar.

By the looks of things, I'm not sure that anybody needs any more time than has been given already, so let me shift to the next slide, which just says "Questions". Does anybody have any questions on the material that we've covered today? Andy, Martin, myself and Keith are raring to answer them.

I want to say thank you; it was presented very well and it was quite clear. That is lovely to hear, thank you.

I have a question and a comment, and as well I want to say that was great, very good; I thought it was very useful. I guess I had a question on resolving. Organisations or projects, for instance the DeVL project, will be copying some reference data sets from databases that will be listed in things like the ELIXIR deposition databases, so things from Ensembl, or from one of the big European databases, and we'll have a local copy in the virtual lab. One of the things we're thinking about is ensuring that we have much better provenance metadata associated with our copy: making sure that it's clear what the identifier was, where we sourced it from, when we downloaded it, and so on. My question is about resolving either to our copy or to the original source; do you have any thoughts around that? If people came to our resource, they'd effectively be accessing the copy, but when we're talking about potential reuse of the data, would it be good to resolve to our copy or to the original?

So my thinking on this is that you would resolve to your copy, and I guess we might explore this a little further in the workshop, because over time your copy might differ from the original copy; it depends on whether you're going to keep the two always in sync or not. There could be the possibility that in the future the two may be out of sync, and then, if people want to reference a particular copy, for reproducibility reasons,
then at least they know that the copy you have is the one that relates to what they've used or reused. So that would be my thinking on that, but I'm happy to hear from Keith.

Yeah, so Jeff, I think for one thing it would be great if, at the other end, they mint a DOI for that data set; then you can just point to that, and that would be the place to point to. The big question and risk is indeed whether that really is a stable reference data set, or whether there's a risk that it would change. If you are taking a subset, or you're taking a version that for some reason might differ from the data set up in Europe, then I'd definitely make a copy, and if it differs then you'd need your own DOI for your own data set. For provenance reasons, and for tracking what's happened to create that data set, it would be good to have a trail that says: okay, this is my data set, this is perhaps the DOI for my local data set, but it's been derived from this data set, which has this DOI. And that's where the real benefit of DOIs comes in: that should always resolve back to the original data set too, because that's the responsibility of the EMBL end, the European institution.

Yeah, no, thanks. I guess we tend not to pick and choose from data sets; we tend to take a specific data object that doesn't change, so I guess that makes life easier. And these things are just like the UniProt protein entry that you showed: they don't necessarily have DOIs, in fact none of them have DOIs, but they have persistent identifiers that have been around for 30 years, and at least have a URL. So yeah, thanks.

And the general principle around all persistent URLs and globally unique persistent identifiers is that if somebody has already minted an identifier for that data set, point to that, rather than minting a new one, if it really is the same data set. Yeah, great, that makes life easier.
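To close out that exchange, here's a sketch of the kind of provenance record the question describes for a local copy of a reference data object: the source's own persistent identifier, where and when the copy was fetched, and a checksum so that the copy can later be verified against (or distinguished from) the original. All field names and values are invented placeholders; this illustrates the idea rather than prescribing a schema.

```python
# A sketch of a provenance record for a locally mirrored reference data
# object. Field names and values are invented placeholders.
import hashlib
from datetime import datetime, timezone

def sha256_of(path: str) -> str:
    """Checksum of the downloaded file, so the copy can be verified later."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

provenance = {
    # The source's own persistent identifier: reuse it, don't mint a new one.
    "source_identifier": "https://identifiers.org/uniprot:P05067",  # placeholder
    "source_url": "https://example.org/reference/release-1/P05067.txt",  # placeholder
    "retrieved_at": datetime.now(timezone.utc).isoformat(),
    "local_path": "/vlab/reference/P05067.txt",  # placeholder
    "sha256": None,  # fill in with sha256_of(local_path) after download
    "derived_from_doi": None,  # the source data set's DOI, if one exists
}
```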