All right, we're going to get started with this session, which focuses on: can we harmonize metadata? To start the session we have Michel Dumontier, who recently moved from Stanford to Maastricht. He has been a co-chair of the W3C Health Care and Life Sciences group and of the Bio2RDF work, which uses semantic web technologies and all kinds of open-source tools. I'll take it from here: we'll have a few questions at the end of each presentation, and then we'll have a discussion with the panel after that.

Great, thank you very much. It's a pleasure to be here, in a community that I have never really been part of. I think the rationale here is quite clear: the work that you are doing is, for the most part, very similar to the work that many other communities are doing. So this topic that I will tell you about, the FAIR principles, is really a global call to arms for us to get together and figure out how to solve the interoperability and reuse problems that we all face. So, I haven't made my own data set; I'm a bioinformatician and a biochemist by training.
I haven't made my own data sets for at least 10 or 15 years; I use other people's data, and I'm curious how many people in this room use other people's data. Yeah. Ten or fifteen years ago, using other people's data was practically unheard of. It was actually very, very difficult, and you'd have to be a close collaborator to use somebody else's data. But now it's much easier to find content on the web, download it, process it, and do something interesting with it.

Just as an example: Purvesh Khatri at Stanford had been working on a project to find gene signatures that would help us understand why transplants are rejected. What he wanted to do was build a meta-analytical framework that would use gene expression data from GEO and try to build a signature of genes strongly related to organ transplant rejection. So he grabbed a bunch of data sets, from heart, from kidney, from liver and lung, and what that gave him, at least initially, is whether the contribution of any one gene to this process is robust across these different tissues. Instead of just looking at heart or liver or lung, putting it all together says: I can find something that is maybe fundamental to all of these different organs. Secondly, once you build such an analytical framework and you build a signature, what he ended up doing was targeting this gene signature with drugs that would counter the gene expression profile of those genes: turning some genes off when they were on, and vice versa. What they found was that the signature correlated with the extent of graft injury and could predict future injury to the graft, and also that mice treated with drugs countering this signature had extended graft survival. So this is really cool; this is nice science. And this is a finding that was never really demonstrated in any of the original studies: only by combining these data could this finding be made.
Okay, but the problem, and I think what brings us into this room, is that a lot of effort is actually needed to get that figure. It's about not only finding the right data sets, but then processing them in a way that you can easily reuse them. This is perhaps best exemplified if you look at the Gene Expression Omnibus and at some of the characteristic fields that are used to describe the tissues of the samples that were analyzed. One of those might be the age of the sample. There are, at least as far as we found, 31 different ways in which age has been specified in samples, and that's not even looking at the values they put in the value field of that key, which can be not only numeric but all kinds of literals. It's just the Wild West: anybody gets to put anything they want to describe their samples. So you can imagine that if you're trying to find a data set that fulfills a particular set of criteria, you're going to be challenged to find all of the data sets, for one; but secondly, because of errors in coding, you're going to find things that are just not correct but have been erroneously introduced. Now, age is just one of about 17,000 different keys that are specified, although we expect that there are maybe only a thousand or twelve hundred real variables in play. You would never be able to find exactly what you're looking for.

So I think part of the problem is that the way we do science is for the paper. We collect data, we do our analysis, we write our narrative, and we publish our paper to tell this story; the data is kind of a byproduct: go look at the supplementary materials. And I think if you treat content in this way, then obviously the quality is going to suffer, right?
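To make this concrete: harmonizing those variant "age" keys usually comes down to a key-mapping table plus free-text value parsing. Here is a minimal sketch of that idea; the key spellings, unit list, and regex are invented for illustration and cover far fewer cases than the real 31 variants:

```python
import re

# Hypothetical spellings of the "age" key; real GEO records use ~31 variants.
AGE_KEYS = {"age", "age (yrs)", "age years", "age at diagnosis", "patient age"}

# Unit words mapped to a years-conversion factor; values are free text
# like "63", "63 yr", or "18 months".
UNIT_FACTORS = {"year": 1.0, "yr": 1.0, "month": 1 / 12, "mo": 1 / 12,
                "week": 1 / 52, "wk": 1 / 52}

def extract_age_years(characteristics):
    """Return the sample age in years as a float, or None if nothing parses."""
    for key, value in characteristics.items():
        # Normalize the key before looking it up in the mapping table.
        if key.strip().lower().replace("_", " ") not in AGE_KEYS:
            continue
        m = re.search(r"(\d+(?:\.\d+)?)\s*(year|yr|month|mo|week|wk)?",
                      str(value).lower())
        if m:
            factor = UNIT_FACTORS[m.group(2) or "year"]
            return float(m.group(1)) * factor
    return None

print(extract_age_years({"Age (yrs)": "63"}))          # 63.0
print(extract_age_years({"patient_age": "18 months"}))  # roughly 1.5
print(extract_age_years({"tissue": "liver"}))           # None
```

In practice the mapping table itself is the curation bottleneck; the code around it is the easy part.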
So I think what we have to do is change the nature of this: put the data first, and then write the narrative around that. Because if you do that, then you can not only use it to do your analysis, but also reproduce somebody else's work, validate your own work with other people's data, and generate new hypotheses, like what Purvesh did in his study. Okay, so if we're ever going to realize the full potential of the content we create, then we have to find ways to reduce the barrier to publish, find, and reuse digital content, but in a responsible manner.

So why does this matter? At least we know that we could enable new science by doing it, but I think it's more substantial than that. John Ioannidis has been studying the problem of reproducibility for some time, and he made this statement over ten years ago: most published research findings are false. That catches a lot of us by surprise, but when people started looking into this and developing reproducibility studies of major landmark studies, we see that 39 out of 100 landmark studies in psychology are reproducible, only 21 percent in pharmacology, and only 11 percent in cancer. So actually the problem of reproducibility is a little more substantial than we might want to admit, and there are lots of reasons for this, everything from the study design, to inadequate descriptions of the components in that study, to the findings themselves not matching up with what was published.

Now, why I think this really matters is because we spend a lot of time and effort trying to develop new therapeutic treatments for people, and I think there's an ethical issue here: are we running clinical trials that will inherently fail? What we're finding is that even though we're accumulating more and more knowledge, and we think we're doing this better and better, our clinical trials are failing more often.
It's not like our rate of clinical trial success is increasing; it's just the opposite. So I think there's an ethical dilemma in the way that we're doing science, specifically in the context of translational science.

Okay, so I think we have to fundamentally rethink how we do discovery science. We obviously need to improve our confidence in any one given result, and I think we can do this in two ways. One is that we can use more data; this helps us build confidence in the individual result that we had. But we can also use multiple lines of evidence, which helps convince us that the effect truly does occur. And I think the grand challenge, at least as far as my research program is concerned, and many others, is: how can we automatically uncover the evidence that would potentially support or dispute a hypothesis, using all available data, tools, and knowledge? That should be our grand challenge; that's in some sense what we're working towards. When we talk about interoperability and we talk about standards, of course it's useful for people to find that content and reuse it, but ultimately, at the end of the day, it's an infrastructure that we need for people and for machines. So we have to build, and we have been building, a social, legal, and technological infrastructure to facilitate discovery and reuse of digital resources, but I think we have to do it a little more seriously than we have in the past.

So why machines? People are obviously super important in this equation, but I'll give you two arguments for machines. First, machines can make sense of a vast amount of information, far beyond what we can, and make personalized and evidence-based predictions or decisions to maximize outcomes. I think we see a lot of these ideas coming out in a variety of different fields. This is one from cardiology, where the idea is to take information from hospital data,
environmental data, biobank data, lifestyle data, and our general knowledge about biology and biochemistry, and to put all of this simultaneously into some kind of feature matrix, from which we will set machines loose to learn patterns that are really too subtle for us to distinguish, and to make predictions for individuals. That makes a lot of sense. The other part, and you can see these discussions ongoing, is the issue of bias. People themselves are biased, and we're seeing that machines are learning from the people. So the question is: can we build predictive machines that are unbiased, or relatively less biased than people? I think that certainly can be the case.

So this really brings us to this idea of FAIR. For machines and for people to make use of other people's content, we have to take some steps to make that content available. FAIR stands for Findable, Accessible, Interoperable, and Reusable, and it's a set of principles that we developed to enhance the value of all digital resources, so they become easier to find and reuse. It's not just for data sets, but also for web services, repositories, software, and publications. I think part of the staying power of this idea of FAIR is that it's been developed and endorsed by a large number of stakeholders: not just the geeks and the informaticians in the room, but also stakeholders like funders, industry partners, and publishers. And so we've seen some really nice uptake: the FAIR principles have been endorsed by the G20, the European Open Science Cloud, Horizon 2020, the NIH, and many other organizations. Even here in Canada, CANARIE, one of the organizations that laid the optical cable from coast to coast, has said: when we give you money, we expect you to follow the FAIR principles.

So what exactly are the FAIR principles?
So we published this paper, and we put in a little box with the four principles and then 15 sub-principles, and it basically comes down to this. First, from a findable perspective, we ask for globally unique, resolvable, and persistent identifiers, and for machine-readable descriptions to support structured search and filtering. This goes back to Purvesh's case, where you're trying to search through GEO data sets and you want to filter those results according to a set of characteristics. You want to do that in the easiest way possible, and you need structured content to help you do that.

From an accessible perspective, we ask that the metadata be accessible well beyond the lifetime of the digital resource. We recognize that some data sets, particularly imaging data, are very large and in some cases are streaming data, and we can't necessarily hold on to all the data. But if you use that content, then you need a description of that content to refer to, and that description we should be able to make available. The second part is that we clearly define access and security protocols. Here the main idea is that a lot of the content that I certainly deal with as a biomedical researcher is patient data, and it's clear that we can't make patient data directly available to people.
We can't make it open. What FAIR asks is: if you have content that is sensitive, then you and your organization need to come up with a procedure by which that content can be made available, and you should specify what that is, so that other people are able to follow it. That might mean you need institutional review board approval, or you need to sign a data usage agreement; there might be a number of steps that you need to follow, and that needs to be clearly articulated.

From the interoperable perspective, we expect extensible, machine-interpretable formats for data and metadata, the use of FAIR vocabularies, and links to other resources. Here there's an interesting dependency: if you say, yes, I use a vocabulary, then the question is, is it FAIR? Is it easy to find? Can we access it? Is it interoperable with other vocabularies? Does it have licensing and other kinds of information that will let me reuse it? So now you get to push back on the people who provide you with the vocabulary and say: are you also being FAIR? And you have the mandate, as part of your own assessment, to say that this is needed. The other part of linking is that we don't create data silos: the content you push out should be interlinked with other content that is already available.

And then finally, from the reusable perspective, you provide clear licensing terms, you provide detailed provenance, and you use community standards. We have been hearing a lot about community standards, and this is the push for you. So the idea, the hypothesis, is that improving the FAIRness of digital resources will increase their quality, their potential, and their ease of reuse. My argument here is not that if you follow these FAIR principles, everybody is going to come knocking at your doorstep and say, this is awesome, I want to use your data; but rather that if there's even one person who wants to use your data, it's easy for them to do
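As a sketch of what a machine-readable description covering these four aspects can look like, here is a minimal dataset record expressed as JSON-LD built in Python. The identifier, accession link, and choice of schema.org terms are all illustrative assumptions, not a normative FAIR serialization:

```python
import json

# Sketch of a machine-readable dataset description touching all four aspects:
# a resolvable identifier (F), access conditions (A), links to other
# resources (I), and a registered license (R).
# Every concrete value below is invented for illustration.
dataset_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "@id": "https://doi.org/10.1234/example-dataset",  # globally unique, persistent
    "name": "Example transplant gene-expression compendium",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "conditionsOfAccess": "De-identified; access requires a signed data use agreement",
    "isBasedOn": ["https://identifiers.org/geo/GSE00000"],  # linked, not a silo
}

print(json.dumps(dataset_metadata, indent=2))
```

Because the record is structured rather than free text, a search engine can filter on `license` or `conditionsOfAccess` directly instead of parsing a landing page.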
that. So it's not a popularity contest; it's really that the barrier is lower. FAIRness, as I see it and as I talk about it, is a quality that reflects the extent to which a digital resource meets or addresses the FAIR principles, as per the expectations of the community.

Okay, so why is this important? We imagine that people will do FAIRness assessments: they will assess how FAIR you are, or what steps you have taken to make your content FAIR, and that will be reflected in search engines. You might see something like this. This is a mock-up for the DANS search engine, mostly for social-science data sets, and they have a little insignia with the F, A, I, and R, and you can see different star levels there. You can imagine that if you were doing a search, and all other things were equal, your search results would show all the data sets that match; but if people had made some extra effort to make sure their data were reusable by that community, then you might choose that data set over others. So this is where we will see a differentiation: if you follow the FAIR principles and act on them, then people will start to choose your content, because it's easier for them to reuse.

So the question is really that communities such as this one have to make clear what your expectations are. What do you expect from your colleagues and your peers? What must they do for it to be easy for you to reuse? FAIR is the thing you can push in front of people and say: we need to do this. We need to set our own standards and we need to follow them. And this is exactly what we see when we look at the citations of this paper: many of them are discussions among communities establishing what tools, infrastructure, processes, and procedures they have in order to meet the FAIR principles. We have also been working on infrastructure to try to do this assessment of FAIRness.
You can look at the principles, but the principles are aspirational: they tell you what should be there, but they don't really tell you how to do it, and there are lots of ways to meet the FAIR principles. So the question is: how do we measure whether or not you've addressed these things? Have you provided a persistent identifier? Have you used a community standard? Do you have a licensing term that everybody understands? So we are trying to develop a framework to do this, and we want to measure it through a set of metrics. Metrics are standards of measurement. In our templates, they have to provide a very clear definition of what's being measured and why one wants to measure it, and you have to describe what a valid result is and how one obtains it, so that it can be reproduced by others.

Recently we published this paper, a design framework and exemplar metrics for FAIRness, and it has a set of 14 universal metrics that cover each of the FAIR sub-principles. They basically demand evidence from you and your community, and some of them may require new actions that you haven't yet taken. Let me give you some of the set. Digital resource providers must provide a web-accessible document; this is typical metadata, and many of you produce metadata as part of a repository where you deposit your data set or your software. This has to be machine-readable metadata. It has to detail your identifier management strategy and your metadata longevity strategy; these go in the data management plan of you, or of the organization or the system you're using to make that content available. And it has to detail any additional authorization procedures that might be required to access the content, so the security issue. You also have to use standards that are developed by the community.
So you can't say, I built a new standard and here's my data in my own format; it really has to have gone through some kind of community process. These we expect to be publicly registered. That includes identifier schemes, so imagine DOIs or URLs and things like that; secure access protocols, like HTTP or HTTPS, for instance; knowledge representation languages, like JSON-LD or XML or whatever it is; licenses that are available out there; different provenance specification languages; and community standards. These are things that you have to develop and that you have to register. And then finally, you have to provide evidence of the ability to find that digital resource in search results, that it provides links to other resources, that those linked resources are FAIR themselves, and that you meet community standards.

So I, and many others, have developed a standard for data set metadata; it's called the HCLS community profile for dataset descriptions. We did this through the World Wide Web Consortium, and it took us about two years. We talked about the different kinds of metadata elements that we thought were important to include to describe a data set in healthcare and the life sciences. There's a beautiful HTML document that gives you examples and so on, that you can follow to create a compatible description. But to be FAIR, I needed to register this standard: even though it was published through the W3C, I added it to the FAIRsharing repository, and that required me to fill in certain kinds of metadata about the standard. So this is where even I had to take an action to be FAIR with respect to that. The second part is that this description,
which was in the HTML document, was easy for people to understand but really hard for us to use to validate whether you met the standard at all. So we used the Shape Expressions (ShEx) constraint language and developed a computable specification, and now you can prepare a document, submit it, and get a validation report that says: here's where you've passed and here's where you're falling short. So we need to be able to certify that you've met the community standard, whatever that is, and that means that the community standard needs to be computationally accessible.

Okay, so we made a first assessment using these metrics. We asked a number of different repositories to fill out a questionnaire and provide us with URLs or yes-or-no answers. We asked Dataverse, Dryad, the Nanopublication network, Zenodo, the Yale system, Figshare, and Wikidata, and they filled in the questionnaire and provided us with URIs, which are basically pointers to documents on the web that comply with these FAIR metrics. The green boxes show where they gave us something and it was exactly what we were looking for; the yellow boxes, where they gave us something but it wasn't quite what we were looking for; and the red boxes, where they didn't give us anything at all, or they gave us something and it was wrong. So you can see there's actually pretty good discrimination: about half of them are green boxes, and about a quarter to a third are red boxes, which is good.
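The computable-specification idea can be illustrated without ShEx itself. Below is a toy validator that produces the same kind of pass/fail report; the field names and rules are invented and far simpler than the real HCLS/ShEx machinery:

```python
# Toy stand-in for profile validation: each rule names a required field and a
# predicate its value must satisfy. The real HCLS profile uses ShEx shapes;
# these four field names and their rules are invented for illustration.
RULES = {
    "identifier": lambda v: isinstance(v, str) and v.startswith("https://"),
    "license":    lambda v: isinstance(v, str) and v.startswith("https://"),
    "title":      lambda v: isinstance(v, str) and len(v) > 0,
    "publisher":  lambda v: isinstance(v, str) and len(v) > 0,
}

def validate(metadata):
    """Return a {field: "pass"/"fail"} report, a tiny validation report."""
    return {field: "pass" if field in metadata and rule(metadata[field])
            else "fail"
            for field, rule in RULES.items()}

report = validate({
    "identifier": "https://doi.org/10.1234/demo",
    "title": "Demo dataset",
    "license": "CC-BY-4.0",  # a human-readable label, not a resolvable URL
})
print(report)
```

The point of making the profile computable is exactly this: "CC-BY-4.0" reads fine to a person, but the machine check fails it until a resolvable, registered license URL is supplied.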
We expect that not everybody will have satisfied all the requirements; otherwise we wouldn't even be having this discussion. There's still work to be done, but it's not so devastating that nothing has been done; there's just enough work for us to build on.

And what's really interesting is that we've gone to Bio-IT World and held hackathons, where we've asked people, in the course of 24 hours, to do a FAIRness assessment of a resource. Here we had cBioPortal, the JAX team had a data set, the Broad had a single-cell data set, and there was the BioAssay resource from the EBI. They did a FAIRness assessment at the start; we told them what the FAIR principles are and what you've got to do, and 24 hours later they had picked some things and improved them. So it also tells you that we can improve the state of the art in a relatively short period of time; maybe not to a hundred percent scoring, but it's doable.

So now we're also building systems to do automated FAIRness assessments. Part of what we learned was that what people thought we were asking for was different from what we were actually asking for, and we need to standardize a little bit what that is. And the question is: can we have machines automatically do FAIRness assessments?
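At its core, such an automated assessment is a set of machine-executable tests run over harvested metadata. A minimal offline sketch, in which the metric names, checks, and license registry are invented simplifications of the exemplar metrics:

```python
# Each FAIR metric becomes a machine-executable test over harvested metadata.
# The three checks and the registry contents below are illustrative
# placeholders, not the actual exemplar metric implementations.
LICENSE_REGISTRY = {
    "https://creativecommons.org/licenses/by/4.0/",
    "https://creativecommons.org/publicdomain/zero/1.0/",
}

METRICS = {
    "persistent identifier": lambda md: str(md.get("identifier", ""))
        .startswith(("https://doi.org/", "https://w3id.org/")),
    "registered license": lambda md: md.get("license") in LICENSE_REGISTRY,
    "machine-readable metadata": lambda md: md.get("metadata_format")
        in {"JSON-LD", "RDF/XML", "Turtle"},
}

def assess(metadata):
    """Run every metric; each check either passes or it doesn't."""
    return {name: check(metadata) for name, check in METRICS.items()}

result = assess({
    "identifier": "https://doi.org/10.1234/demo",
    "license": "CC BY 4.0",  # a label a human would accept; the machine rejects it
    "metadata_format": "JSON-LD",
})
print(result)
```

Note how the human-readable license label fails a check that a person might well have accepted; that is precisely the unforgiving behavior these automated assessments exhibit.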
So we've started to do this. I'm building infrastructure to retrieve content and do FAIRness assessments, but one of the problems we found is that the content people provided us before, which was already iffy, is absolutely not good enough for the machines; even things that we had previously agreed were good enough, the machine says are not good enough. So these assessments are pretty unforgiving, and I think this will be problematic for some time until we figure it out. Nonetheless, we are doing all of this work through the NIH Data Commons pilot phase: building infrastructure to do FAIRness assessments, not just questionnaires and filling out values, but also bringing that up through applications and showing you different kinds of insignias that indicate how FAIR your resources are.

I'd also like to mention that there was an expert group of the European Commission tasked to write a report on turning FAIR data into reality. They have a number of different recommendations, including cross-disciplinary FAIRness, encouraging and incentivizing data reuse, and facilitating automated processing. So again, FAIR tells you what you should achieve, but how do we get there? I think that also includes data science and stewardship skills, and additional training and curriculum development, which we are starting to do in the context of the Global Open FAIR initiative, GO FAIR. GO FAIR is really meant to be a grassroots initiative, focusing on three different things: GO BUILD, which is basically technical infrastructure for doing FAIR work; training networks; and the cultural change that is required to embrace this. There are networks for metabolomics, for training, for rare diseases, and there has been some talk about maybe having something for neuroscience and neuroinformatics, and I think that would be a great place for this discussion to go.
So, in summary, I hope I've convinced you that FAIR now represents a global initiative to enhance the discovery and reuse of all kinds of digital resources. The FAIR concept is maybe maturing faster than we expected, but there are plenty of avenues for you to participate in if you think this is important. And I think it is important, because ultimately you'll end up being assessed based on what you and your communities decide are the standards expected of you. So you should participate; I think there are huge benefits to be had. There are two aspects; we were having this discussion during lunch. For the most part, if you're just a data producer and you're producing content for other people, that's very altruistic, but I think you should be thinking: how do I, as an individual, make use of other people's data, and do I have the skills to do that? There are many young people in this crowd, and you should be making sure that you can capitalize on this emerging phenomenon. Also, machines will eventually automatically process a lot of this content, and that can open up new opportunities for discovery. And finally, it demands a new social, legal, and technological infrastructure that really doesn't exist as a whole. It has parts, and we've heard a lot about that this morning, but we need to put it together so that it makes sense, and so that it's ethical with respect to our expectations. Lots and lots of people are involved in both FAIR and the FAIR metrics; we have lots of support from funding agencies and organizations around the world, which is great, and we expect a lot more developments to occur, hopefully with you. So with that, thank you for your attention. Happy to take any questions.

Hey, thank you for your talk. I think it was a great overview of all of FAIR.
I want to poke the hornet's nest a little bit, though. The neuroinformatics community has worked a lot on standards and on interoperability and understands their importance in general, and I think we also have a sense of the costs and benefits that come with this. Your talk, different from a lot of other ones, had the words "must" and "need to", and, like the search engine page you showed, you said people will look at this and decide whether to use your data or not. To make "must" happen, there are either carrots or sticks; because otherwise it's best practices, and sure, we've embraced best practices as much as possible. What is the sense of the carrots and sticks that make this happen, that make "must" happen?

Yeah, right. I think one of the biggest ones is the commitment of funding agencies, certainly in Europe, with respect to data management plans. In particular, for the Horizon 2020 programs, I think the expectation will be that five percent of your overall budget is allocated to research data management, and in your research data management plan you must clearly address how you are addressing the FAIR principles. So these will be part of review, and they're also part of the funding strategy. This is really to show that this activity of research data management is not a research activity; it's something we expect to be resourced in its own right, and I think the plan is to train data stewards to participate in this. In my opinion, good, responsible data science involves data stewardship, so young people should certainly understand what that means, but there should also be funding allocated so that you can delegate this task to others. So I think that part is quite popular among funding agencies, and that will happen.

But you're right
that the other is the stick part, like journals requiring particular formats or endorsements by communities. But as I was mentioning, I think the most powerful incentive is: can you make use of your data in a different context? Can you do a meta-analysis and learn something that you can't from the single data set that you had? The insight that you could potentially derive from integrating other people's data is probably the most powerful motivator to do it well, because otherwise it won't fit in your pipeline and you won't be able to reuse it. This is what we're doing at Maastricht University: we have a pilot project now to couple research data management with eScience across all six disciplines that we have there, and it's a super interesting exercise. We talk to social scientists and they go, "what data?", because they do qualitative kinds of assessments and things like that; whereas we have others, especially in the neuroinformatics community, who have monstrously large MRI data sets and have been doing this for a long time, and they say: how do I couple that with the electronic health record data in the hospital, along with the data that I'm collecting here with the patient? So I think there are good opportunities. The biggest incentive is discovery, and the second is financial compensation for these activities.

Yeah, so while we take the next question, maybe I can ask Jeff to come up and set up.

Hi, thank you for this talk. That automatic FAIRness assessment system that you presented: is it based on self-report basically?
Is that how it's meant to work?

Yeah, so the first one that we did was a self-report, and then we also did our own assessment, an expert assessment, and then we tried to reconcile the differences between the two by having conversations, to learn whether our questionnaire was clear enough, whether what we were asking was clear enough, or whether they fundamentally didn't have that information at all. The automated assessment, I think, is the leap forward: are you providing that content in a way that is easy for us to find and automatically discover? There, I think we will provide additional guidelines for people to publish this, and this will particularly matter for repositories that take in your data and make it part of a search engine. What other things can they do? They can make things very clear through the submission process, like what license you want to release this under, and present that to these FAIR-assessor tools in a standard way.

But basically, for the system to work, the particular database or data set will have to provide you with metadata, about things like what license they use, and about findability also?

Yeah, that's what I mentioned: there needs to be public registration of those resources that you're pointing to. So if you're using a license that's a standard license, we expect it to be registered in a place where standard licenses are registered. If you use a data format that is a recognized community data format, then it should be in a repository where recognized community data formats are. So it will require the registration of shared resources in order for these FAIR assessments to go on. You can't just build your own thing anymore and not tell the world that you've done it; that's a big difference.

Okay, thanks.
Yeah, we can discuss it after.

Hi, I have a question about the HCLS standard that you mentioned. Since the session is called "can metadata be harmonized": how does that particular standard relate to the other ones that have been mentioned today, like BIDS, NIDM, DCAT, PROV? How is it different, and what kind of harmonizing does it do?

Yeah, exactly. So part of our exercise was really to survey existing vocabularies and see how we could provide a guideline for using those vocabularies for specific metadata needs. We didn't create any new vocabulary; we just made a guide for how to use these existing vocabularies to provide metadata in a computational way. So maybe it's different. Have you ever seen that picture people often show? There are 14 standards, and then somebody makes a new standard to cover all the other standards. Here it was just basically: let's use the 14 standards available, and here's the guideline for you, the user, who doesn't want to sift through 14 standards but really just says, I have this task, how do I get it done? I think we probably need to do more of this, just guiding people to get things done. And then the interoperability thing is something we push back on the developers of these vocabularies and ontologies about, and say: how are you interoperable with the other vocabularies that are out there?
Hi, I was wondering: is there any way, in the FAIR metrics, or maybe the principles, but most likely the metrics, to have some specific criteria for subfields or sub-disciplines? It's great to have general principles that apply across disciplines, but at the same time I can see that specific niches would like to say, for instance: your metadata doesn't have any value unless you share this particular piece of information. And conversely, the fact that it's being picked up by funding agencies is great, but I also can see a risk that if the framework is too rigid, we would end up with normative metrics that may not fit all the use cases.

Yeah, no, I think this is a great question. The way that I see it is that the FAIR principles are broader than any one community standard. The idea, as described by the FAIR principles, is that community standards are part of that. So if you have very specific elements of data or metadata, and that's something that you and your community agree to, that's something you can specify, but it's in addition to what we provide as a core set of expectations. So in no way does this replace existing community standards; it just augments them, in a principled manner, with what we should expect from all communities, wherever they may be. But if you have more specific requirements, then it's up to you to build these community standards: machine-understandable community standards that we can process, so that we can scale with the number of data elements that are out there. And that's where we would start to push back and say: is your community standard FAIR?

Let's thank Michel.