For some quick introductions, as you know, I'm Andy White, and I'll be hosting the first part of today's webinar. Also with me are my colleagues who you've already met, Frankie Stevens and Martin Schweitzer. Today we've also got Tom Honeyman, who will help facilitate as well. Tom is an ANDS consultant in New South Wales who's interested in being involved today because BPA is one of the institutions he works with. And we've also got Min Feng Wu here as well; she's from the ANDS Melbourne office and is seeing how we operate the workshop. We anticipate the workshop will take about an hour and a half, depending on how much interaction we get and how many people end up turning up. There will be breakout sessions and the opportunity for group discussions, which we highly encourage, but just be aware that it's difficult for the technology to work properly if there's more than one person speaking at a time. So keep that in mind and try not to talk over each other. Okay, so what we're going to cover today: a quick review of FAIR; some exemplar records that meet the FAIR criteria; a case study where some activities have been done around FAIR; the ANDS FAIR data self-assessment tool, which has gone live in the last few days and which you'll get to try out; and then the systems involved in the GVL and BPA, looking at tools and data and seeing how we can make all of those things more FAIR. The idea is that at the end of the day we'll come away with an action plan for making FAIR improvements to that software. So the first thing here is a quick video that recaps why FAIR is important, especially when we're talking about data sharing. I'll play that for you now. It doesn't matter that there's no sound, because there are subtitles.
So I've just unshared my screen so that Andy can take over the reins again. So Andy, if you want to start sharing your slides from the one following the video, that'd be awesome. Okay, hopefully that's worked. It has. Excellent. Okay, so my apologies for that; obviously there are some network issues here at Griffith that shut my whole system down. Okay, so let's recap FAIR very quickly, just to reiterate. The F stands for findable: enabling users or others to discover your data through rich metadata, available online as a searchable resource, and using persistent identifiers. The A is for accessible, meaning humans and machines should be able to access the data through clearly defined means using open standards and protocols, such as HTTPS or APIs, and it may or may not be open. The I is for interoperable: data and metadata should be in recognized and standardized formats to allow them to be combined and exchanged, and this includes the use of controlled vocabularies and ontologies. And finally, the R is for reusable: meeting domain standards, with a clear license and a clear record of how, why and by whom the data was created. So you should all be experts in that by now; I would imagine you could almost do the training yourselves. So here's a simple example to illustrate some of these concepts. The Australian National Mooring Network facility has a series of reference station moorings designed to collect time series observations of temperature and salinity in Australian coastal areas. Now, if we were a researcher seeking the data and did a simple Google search, entering "Australian mooring temperature and salinity time series" into Google, we'd get the following results.
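The four criteria in the recap above can be sketched as simple checks over a metadata record. This is a minimal illustration only; the field names (identifier, access_url, license and so on) are assumptions made up for the example, not any standard schema.

```python
# Minimal sketch of checking a metadata record against the four FAIR
# criteria. Field names are illustrative assumptions, not a standard.

def fair_check(record):
    """Return a dict mapping each FAIR letter to True/False."""
    return {
        # Findable: a persistent identifier plus rich, searchable metadata
        "F": bool(record.get("identifier")) and bool(record.get("description")),
        # Accessible: a clearly defined access mechanism over an open protocol
        "A": str(record.get("access_url", "")).startswith("https://"),
        # Interoperable: a recognized format and controlled vocabulary terms
        "I": record.get("format") in {"netCDF", "CSV", "XML", "JSON"}
             and bool(record.get("vocabulary_terms")),
        # Reusable: a clear license and provenance information
        "R": bool(record.get("license")) and bool(record.get("provenance")),
    }

record = {
    "identifier": "https://doi.org/10.1234/example",  # hypothetical DOI
    "description": "Temperature and salinity time series",
    "access_url": "https://example.org/data/mooring.nc",
    "format": "netCDF",
    "vocabulary_terms": ["sea_water_temperature"],
    "license": "CC-BY-4.0",
    "provenance": "Collected by the national mooring network",
}
print(fair_check(record))
```

Real FAIR assessments (like the self-assessment tool used later in this session) are questionnaires rather than code, but the structure is the same: each letter decomposes into concrete, checkable properties of the record.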
Now, if we were to click on the first result, that takes us to Research Data Australia, which is a registry of data sets and which harvests metadata about this collection, including a summary of licensing details, a description of the data, when it was created, related services and organizations, subjects and identifiers. Now, if we click on that "Go to data provider" button there, it takes us to the Integrated Marine Observing System (IMOS) repository, which hosts the metadata, including the details about the instruments and the data. So on this screen there's even more information about the data we're interested in, including citation and reference information and current contact details if you want to know more about the data; keywords; licensing information, which is clearly stated, along with the conditions under which you can access and use the data; geospatial information; how often the data is updated; quality details; constraints around the data; the data standards that are used; and detail around how the data was collected. There's also an explanation of the parameters used to describe the data using standard vocabulary terms, including links to the definitions of those vocabulary terms. This is all available in XML so it can be read by other systems, which enables automated access to all of this. Finally, to access the data, you go to the sister site, the AODN (Australian Ocean Data Network), and manually download the data in a range of standard formats, whichever suits your needs. So overall, it's a very efficient way for anyone to find the data, understand what it's about, and understand its limitations and how it can be reused, whether that be in human or machine formats. So I think that provides a good example of the sort of direction that we're looking at for systems and interoperability and FAIRness around systems. It's probably a good time in the presentation to talk about what others are doing in this space internationally. There's a lot going on and it changes very quickly.
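The point about the metadata being available in XML for other systems can be illustrated with a small sketch. The snippet below is a simplified, made-up record, not the real ISO-standard metadata IMOS publishes, parsed here with Python's standard library.

```python
import xml.etree.ElementTree as ET

# A simplified, made-up metadata snippet in the spirit of the XML record
# the repository exposes (real ISO metadata records are far richer).
xml_metadata = """
<metadata>
  <title>Australian National Mooring Network temperature and salinity time series</title>
  <license>CC-BY-4.0</license>
  <keyword>sea_water_temperature</keyword>
  <keyword>sea_water_salinity</keyword>
</metadata>
"""

root = ET.fromstring(xml_metadata)
title = root.findtext("title")
license_ = root.findtext("license")
keywords = [k.text for k in root.findall("keyword")]

print(title)
print(license_, keywords)
```

This is what "machine readable" means in practice: a downstream system can harvest the record and pull out the title, license and vocabulary terms without a human ever opening the page.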
So I'm only going to touch on a couple of these points. The Dutch Techcentre for Life Sciences, or DTL, is a public-private partnership of more than 50 Dutch life science organizations, so it's quite applicable to the group here. It acts as a network of professionals that jointly improve the Dutch life science research infrastructure, and there's a focus on accessible high-end technologies, FAIR data treatment and expert training. So they're very active in the FAIR data space; you could say that they're some of the leaders in Europe. They're developing tools which together make up what they call the Data FAIRport. Some of the terms they use are a little bit obscure and take a while to get your head around. So they've developed the FAIRifier, with a metadata editor, to create FAIR data; a FAIR Data Point to publish data; a FAIR search engine to find data; and what they're calling ORKA to annotate it. I don't know why it's ORKA and not FAIR ORKA or something. The FAIRifier is an online software tool designed to address the commonly encountered problems and data manipulation tasks in the FAIRification process; there's a new verb for us. So the FAIRifier can speed up the process of FAIRification, especially for larger data sets. It incorporates a metadata editor, and they've also defined a five-layered metadata schema which is able to hold repository metadata, catalog metadata, data set metadata, distribution metadata and data record metadata. So it's quite detailed and extends to very fine granularity. The FAIR Data Point, which is the publishing aspect, allows data owners to expose metadata and data in a FAIR manner, and there's a GUI for human clients and an API for software clients. The search engine covers data made available on FAIR Data Points or compatible data repositories, so it doesn't have to be one of theirs.
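One way to picture that five-layered metadata schema is as nested structures, from the repository level down to an individual record. The field names below are illustrative assumptions for the sketch, not DTL's actual schema.

```python
from dataclasses import dataclass, field

# Sketch of the five metadata layers described above:
# repository -> catalog -> data set -> distribution -> data record.
# Field names are made up for illustration.

@dataclass
class RecordMeta:
    record_id: str

@dataclass
class DistributionMeta:
    media_type: str
    records: list = field(default_factory=list)

@dataclass
class DatasetMeta:
    title: str
    distributions: list = field(default_factory=list)

@dataclass
class CatalogMeta:
    name: str
    datasets: list = field(default_factory=list)

@dataclass
class RepositoryMeta:
    name: str
    catalogs: list = field(default_factory=list)

repo = RepositoryMeta(
    name="Example FAIR Data Point",  # hypothetical installation
    catalogs=[CatalogMeta(
        name="Genomics catalog",
        datasets=[DatasetMeta(
            title="Example dataset",
            distributions=[DistributionMeta(
                media_type="text/csv",
                records=[RecordMeta("rec-001")],
            )],
        )],
    )],
)

# Walking down the layers from repository to an individual record:
rec = repo.catalogs[0].datasets[0].distributions[0].records[0]
print(rec.record_id)
```

The value of the layering is that a harvester can stop at whatever granularity it needs: a registry might only index catalogs and data sets, while an analysis tool drills all the way down to records.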
It indexes these and provides a search interface. And then ORKA is an annotation system which allows human curation of knowledge graphs by offering graph annotation as a service, capturing the provenance of the annotator and the original statement. So altogether the Data FAIRport is an interoperability platform which allows data owners to publish their data and metadata and allows users to search for and access that data. When data owners publish their non-FAIR data sets, the embedded FAIRifier transforms the data into a FAIR data set before its actual publication in the FAIR Data Point. So it can actually undertake some of the difficult work that's involved in the FAIRification process, which is nice. They're still developing those tools; some are available and a lot is still being developed at the moment, so it's a good space to watch. Now, I'm sure many people are aware of ELIXIR, which is an intergovernmental organization which brings together life sciences resources from across Europe. DTL also work within ELIXIR. They bring together resources to form a single infrastructure, and in one of the previous seminars we've talked about FAIRsharing, which is a resource for data and metadata standards as well as interrelated databases and policies. There are currently seven projects underway with them. They're developing infrastructure around interoperability of services, improving discovery of data through standardized metadata, harmonizing best practice around identifiers, using linked data to stitch resources together seamlessly, developing specifications for workflow and tool interoperability, and a portal to promote best practice and standards. So that's another space that's definitely worth keeping an eye on. Now, before we launch into some hands-on goodness, we want to briefly present a case study of the InterMine system. InterMine is an open source software product which integrates biological data sources, making it easy to query and analyze data.
Now, there are a number of instances of the software, each with a different speciality, such as the wheat genome, or HumanMine, which is this one here, with a focus on human genome data. So a few years ago the developers of InterMine found themselves in a similar predicament to this group: they wanted to make the system support the FAIR principles better. So they had a workshop and brainstormed a range of improvements to the software to enable it to support FAIR data. The above is a list of some of the items they came up with, and some of these may be relevant in this context to the GVL or the BPA portals. I'll just rapidly go through some of the ideas they came up with. So, they're after more stable URIs; this is around the identifiers, I guess. They wanted to construct an identifier incorporating class names and IDs that form a primary key for an object. This was a bit of a problem for them, based on their software, rather than using something like a DOI if you like. They wanted to register URIs externally, so their solution was to provide a one-click ability to register installations with external repositories such as BioSharing or the ELIXIR bio.tools registry, and to provide facilities to register top-level mine products with fine-grained registries such as identifiers.org. They already use an ontology within their software, but they thought they needed to extend it so that they could interoperate with non-Sequence Ontology systems. I think one of the key things they recognized is that they're not going to create their own ontology themselves; they're going to see what was available to them and work with that. They also wanted to embed metadata in their web pages, as there was a problem with finding data sets related to particular bio-entities spread out across many different databases.
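The stable-URI idea above, combining a class name and an internal ID into a primary key for an object, can be sketched as follows. The base URL is a made-up example, not a real InterMine endpoint.

```python
# Sketch of constructing a stable URI from a class name and an internal
# ID that together form a primary key for an object. The base URL is a
# hypothetical installation, used only for illustration.

BASE = "https://mine.example.org"

def stable_uri(class_name: str, object_id: int) -> str:
    """Combine class name and ID into a persistent, resolvable URI."""
    return f"{BASE}/{class_name.lower()}/{object_id}"

print(stable_uri("Gene", 12345))
print(stable_uri("Protein", 678))
```

The key property is that the URI is derived only from stable facts about the object, so it survives re-imports and software upgrades, which is exactly where ad hoc, database-row-based URLs tend to break.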
So they looked at Bioschemas as a solution to that, and they wanted to embed basic metadata so search engines can make the biological data on a particular subject more findable. They also want to add metadata to query results: make the provision of existing metadata, column headings for example, consistent across their output formats; add more metadata such as ontology terms and URIs; and present it in the data model for objects and their fields. They also want to make objects available in XML and JSON formats. They're also looking at integrating RDF into the software, which will improve the interoperability with other systems, as well as SPARQL support. And last but not least, better licensing metadata: they were looking at machine-readable formats such as the Creative Commons Rights Expression Language and bioCADDIE's machine-actionable licenses. So they've been working on these for quite some time and making good progress. And for them the focus is on improving the systems that support the FAIRness of the data, as opposed to looking at the data itself. So now we're going to move on to the interactive part of the workshop. So what we want people to do, and how many do we have here now? We've got more. Okay, so we might work on this as one group. So as a group we want people to use the FAIR data assessment tool on either a GVL or a Bioplatforms collection, or one that you've already got. Try using the tool we've got a link to at the bottom there; Frankie's kindly put those links up in the chat window. Have a play around with the tool.
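Embedding basic metadata in web pages, as discussed above, is typically done with schema.org-style JSON-LD, which Bioschemas builds on. Here's a minimal sketch with made-up values; the real Bioschemas profiles specify many more properties.

```python
import json

# Sketch of embedding data set metadata in a web page as schema.org
# JSON-LD, the approach Bioschemas builds on. All values are made up.

dataset_jsonld = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example gene expression dataset",       # hypothetical
    "identifier": "https://doi.org/10.1234/example",  # hypothetical DOI
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["gene expression", "human"],
}

# The script tag a page would embed so search engines can index it:
snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(dataset_jsonld, indent=2)
    + "\n</script>"
)
print(snippet)
```

Dropping a block like this into a page's HTML head is enough for dataset search engines to pick the record up, which is why it's such a popular low-effort findability improvement.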
It's a new tool which has been developed just recently; essentially you'll be one of the first groups outside of the development team to take it for a spin, so we're interested in your feedback. It should be quite self-explanatory, but just see how FAIR you think those data sets you're analysing are. So we'll give people probably about five minutes on that and then we'll come back and have a discussion. Feel free to ask any questions or offer any commentary. So if we were to go into the US version of Galaxy and look at the reference data library for human genome build 19 (hg19), we'd find the current screen. If we drill down to that hg19 entry there, we get very similar metadata to what Jeff highlighted before. So there's very little information there concerning the provenance of the data, and that erodes trust and integrity when you have minimal metadata, especially when you compare it to that AODN example I gave before and the rich metadata they have; theirs isn't perfect, but it's quite impressive.
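As an aside on trust and integrity: one small, concrete measure repositories can take alongside richer provenance metadata is publishing checksums, so users can verify that what they downloaded is what was published. A minimal sketch:

```python
import hashlib

# Publishing a checksum alongside reference data lets consumers verify
# integrity after download. The "reference" bytes here are a stand-in
# for real reference genome content.

def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

reference = b"chr1\tACGTACGT\n"  # stand-in content
digest = sha256_of(reference)
print(digest)

# A consumer re-computes the digest after download and compares it
# against the published value:
assert sha256_of(reference) == digest
```

A checksum doesn't tell you where data came from, but it does guarantee that everyone citing the same digest is analysing byte-identical data, which is a prerequisite for reproducible results.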
So if we took that iGenomes message there, we don't know where this data came from. So if we were to do a search on "iGenomes UCSC hg19 gene annotation", there are a few things that come up. If we select the first result, this goes to the Illumina website and identifies a particular data set with no metadata attached. Or if we went to the second link in those search results, again there's a data set there, but it's impossible to determine where that data originated from, how it was produced, and what parameters were used, and this can have a large impact on the final results of the experiment and the insights. So I think that highlights what Jeff was saying as well. Now, before we go into the next hands-on activity, I just wanted to bring your attention to the project deliverables which have already been earmarked to be completed for the project and are expected to enhance the FAIRness of the systems. So the project managers have done some heavy lifting here. I'll just go through them briefly: development of a prototype API to allow sending of data to a GVL service; development and deployment of a mature API to allow sending of data to a GVL service; a general improved alignment of the BPA data repository with the FAIR data principles; development of a draft data persistence policy; development and publishing of a mature persistence policy; and publishing of descriptions of any API service endpoints in appropriate national and international repositories. Sorry about the background noise; the head honchos have decided to turn up right now of all times. I won't go through the rest of them, but I think the people who are working on the project know about these deliverables and their relation to FAIRness. So for the next activity, we originally were going to break into two groups based around GVL and BPA, and we're going to look at the principles in this checklist, which I sent around in an email prior to the meeting, I think yesterday, and brainstorm what can be done to make the data more FAIR within these systems, and then what modifications to the systems, the processes and the policies might be required. So we're sort of talking pie-in-the-sky type stuff here. I don't think we want to go too far into how these things are going to be achieved or done; we just want to focus on what we can do to improve the systems. Then later on we can have more focused discussions on picking out the ones we might want to actually implement, and we can work with different groups, myself and RDSI and Nectar included, on actually implementing some of those tasks. So we've set up two Google Docs for the two systems, but we might do this here as an entire group; it probably doesn't make sense to break out seeing as we've got such a small group, and we'll capture that information in a Google Doc, which I've sent around to my comrades. So we've got probably about 20 minutes, and we may not need that long, to work through that checklist and to surface some low-hanging-fruit ideas on things we can do to improve the current systems and practices. So how does everyone feel about that?
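To make the prototype-API deliverable a little more concrete, here's a sketch of what a machine-actionable submission to a GVL service might look like. The endpoint path and field names below are entirely hypothetical assumptions; the actual API design is part of the deliverable itself.

```python
import json

# Hypothetical sketch of a submission payload for the "send data to a
# GVL service" deliverable. Endpoint path and field names are made up.

def build_submission(dataset_id: str, url: str, license_id: str) -> dict:
    return {
        "dataset_id": dataset_id,
        "source_url": url,
        "license": license_id,          # machine-readable license identifier
        "target": "/api/v1/datasets",   # hypothetical GVL endpoint
    }

payload = build_submission(
    "bpa-0001",
    "https://example.org/data/bpa-0001.fastq.gz",
    "CC-BY-4.0",
)
print(json.dumps(payload, indent=2))
```

The FAIR-relevant point is that the payload carries its identifier and license with it, so the receiving service can preserve that metadata rather than accepting anonymous, unlicensed files.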
Yeah, that sounds good. Okay, well, do we want to focus on one system and then move on to the other, or do people want to... I think that's good, because the two systems are actually quite different: the BPA data repository is a data repository, whereas Galaxy is primarily an analysis tool. And just on things here like licenses, I could almost comment on every single one of these for both. I know that, for instance, the CCG have now implemented licensing, making that clearer for data items in the BPA data portal. But I have a question on licenses in Galaxy: it's an excellent idea, but if we actually obtain the data from an international repository that doesn't observe the licensing framework, how do we apply a license when there may not actually be a license on the original item, for instance? I have a lot of questions. So now you're getting into the how. Well, I mean, you need to touch on it, I think, but it's a first step. I think it's good to go through and imagine the system in a perfect world, what you would do to it, and then work your way backwards. Some of those things will cost too much money or too much effort, or will be impossible to actually implement, but I don't think it hurts if people have a bit of a vision of what it should look like in the future, so that when these systems are developed going forward, they know to incorporate these things from the outset.