I'm really excited to be here today, so thank you so much for the kind invitation. I think it's wonderful that I get to go first, because what I'm going to talk about has to do with laying the groundwork for making data open and genuinely reusable for all of the artificial intelligence applications we'd like to layer on top of it. Good AI starts with good data, and good data is very hard to come by. So today we're going to talk about making science go faster, better, and together by harmonizing and reusing data. I'm going to give examples from two major projects we've been involved in. One is an ongoing project called the Monarch Initiative, which aims to integrate data from many different resources and organisms in order to support rare disease diagnostics and mechanism discovery. The other is a COVID-related project, given the urgent need to address the pandemic. I hope you'll be interested in one or both of these projects, and for the COVID project especially, we look forward to welcoming everyone to join in and help out; it's an open project. These slides are available at the bit.ly link at the bottom; I'll put the link up again at the end of the talk, and feel free to tweet me if you like.

To get started: I think one of the biggest challenges we have in science today is that data generators are often unaware of how people use data downstream in their different types of applications. A data generator thinks about planning, collecting, documenting, analyzing, and sharing, and has very immediate goals in mind for the data being generated. The data reuser comes at this from a very different perspective: they want to discover data, access it, combine it, reanalyze it or analyze it in new ways, and share the new results. The fundamental premise of the Research Parasite Award, for example, is based on how well downstream data reusers actually reuse existing data and find new insights. We really are doing a lot more science on the data reuse side, and therefore we need to help our data generators do a better job of generating data so that we can actually reuse and recombine it in novel ways. So I want to talk about some of the barriers to doing this today, and hopefully this will inspire some new ways we can all help each other, both on the data generation side and on the data reuse side.

We often hear the term FAIR: findable, accessible, interoperable, and reusable. I like to say that FAIR actually requires a bit more thought in certain areas, and these relate specifically to the R in FAIR. To really make data reusable, you need the data to be traceable, licensed (and approved), and connected. If you do those three things, then you're really enabling the downstream data reuser to find those new discoveries. We're going to walk briefly through the Monarch Initiative project for examples of the barriers, and some of the things we've done to overcome issues in these three areas, and then we'll talk about the COVID project, which has all the same issues but serves as more of an open invitation for participation and a different example of data harmonization challenges.
So first, traceable. We want to make sure that data has proper evidence, provenance, and attribution. Traceability and licensure go hand in hand, and licensing is often convoluted; you can find more information about the TLC framework at this other bit.ly here. One of the things about FAIR is that its goals are very lofty and have created a huge sense of awareness, which is terrific, but it doesn't provide much direction on how to actually make data reusable. So we've aimed to provide some further guidance on these three particular issues.

So: evidence, provenance, and attribution. I'm going to give an example here from our work on harmonizing diseases. There are many different disease terminologies, covering different areas of disease: complex diseases, cancer, infectious diseases, rare diseases, Mendelian diseases, and a whole host of others. There are literally hundreds of terminologies representing different kinds of diseases. For our disease mechanism discovery and diagnostics, we needed a way to combine all of these different disease resources, and we needed a systematic, reproducible, and consistent process and representation across all these different domains, which were modeled with different categories and different target audiences in mind.

If you're familiar with terminologies, you know that many mappings already exist between them: one term in one terminology will be mapped to another term in another ontology. But these mappings often come without any provenance or evidence for how they were created. Why was this term mapped to that one? Was it just because the strings or labels matched, or was some other curation strategy used to create that mapping? And how is that mapping actually encoded, so that we can use it computationally? We can use these mappings, but there are many problems. They're often mutually inconsistent, there are on the order of N-squared sets of pairwise mappings to maintain, and they are not one-to-one equivalences. Just because a term in one terminology has the same label as a term in another does not mean they are the same. There are many examples where siblings are declared the same, or parent-child relations are declared the same, or even flip-flopped. And by looking across the whole set of resources, we've found many different types of problems in the source terminologies themselves.

My colleague Chris Mungall at Lawrence Berkeley National Lab developed an algorithm called k-BOOM, a Bayesian ontology merging algorithm that combines logical and probabilistic inference. It takes all the different disease sources, and there are very many, and finds the most parsimonious graph that represents the whole of the terminological content together. The output is then reviewed by expert curators, who focus on the least parsimonious, least probable collections of equivalencies. Oftentimes we find examples like one in MeSH, where two branches were essentially identical, but one had subsets encoded with Roman numerals and the other with alphanumerics; it was simply an accident that the same content had been implemented twice with two different sets of labels. That got fed back into MeSH and they fixed it, so this process cleans up everything upstream of our integration as well.
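To make that concrete, here is a toy illustration of the kind of mapping quality control just described: mapping records that carry their own evidence, and a check that flags any term declared equivalent to more than one term in the same target terminology. This is a minimal sketch, not Monarch's actual pipeline; the identifiers, evidence labels, and record layout are all made up.

```python
# Toy mapping QC: each record carries its own evidence, and we flag
# subjects declared 'equivalent' to more than one term in the same
# target terminology. Identifiers and evidence labels are illustrative.
from collections import defaultdict

mappings = [
    # (subject term, object term, evidence for the mapping)
    ("SRC:0001", "OMIM:100001", "manual curation"),
    ("SRC:0002", "OMIM:100002", "exact label match"),
    ("SRC:0002", "OMIM:100003", "exact label match"),   # suspicious 1-to-many
]

def ambiguous_equivalences(mappings):
    """A subject 'equivalent' to several terms in one target terminology
    often means a sibling or parent/child pair was mis-declared as an
    exact match."""
    targets = defaultdict(set)
    for subject, obj, _evidence in mappings:
        namespace = obj.split(":", 1)[0]
        targets[(subject, namespace)].add(obj)
    return {key: objs for key, objs in targets.items() if len(objs) > 1}

for (subject, ns), objs in ambiguous_equivalences(mappings).items():
    print(f"{subject} maps to {len(objs)} {ns} terms: {sorted(objs)}")
```

Carrying the evidence along with each mapping is what makes this kind of review possible at all: without it, a curator can't tell an expert assertion from a bare string match.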
This work has resulted in a new terminology called Mondo, which means "world." The reason it's so important is that, at the moment, different countries and different diagnostic tools use different terminologies, and you have the potential to get fewer, or less good, diagnostic options if you're not using the complete picture of human knowledge, especially in rare diseases. So we're really excited to be working with all of the different source terminologies to create Mondo as a collaborative open science initiative to reconcile the world's disease definitions.

It gets a little more complicated than that, because different communities annotate different relationships to define a disease. For example, diseases in ClinVar are defined by their relationships to variants, while disease-to-gene relationships are found in many resources such as Online Mendelian Inheritance in Man (OMIM). Our group creates disease-to-phenotype relationships in the Human Phenotype Ontology (HPO) and tries to understand how best to define diseases based on their phenotypic features. And the list goes on. What we really need, in order to understand the global disease picture, is a way of coalescing these different relationships, and these different ways of defining diseases, into a common structure. This overarching schema is now being used to define the diseases in Mondo and to provide the data from across different resources in an integrated fashion.

So we applied this approach, the little model I showed you on the last slide, together with the Mondo terminology, to ask: how many rare diseases are there? Prior work, dating all the way back to the Orphan Drug Act more than twenty years ago, had identified approximately 7,000 to 7,500 rare diseases. But when we looked across just five sources, NCIT, the Disease Ontology, NIH's GARD (the Genetic and Rare Diseases Information Center), Orphanet, and OMIM, we actually found over 10,000 unique rare disease concepts. What was especially interesting is that only 333 of those 10,000-plus concepts were shared across all five of these disease resources, and many diseases appeared in only a single resource. This really shows you the power of integration, and of having a common framework and structure for doing integration well. It's still very much a work in progress, but we have published this preliminary work here if you're interested. Again, it really shows the need to coordinate across these different resources using a common model for how we define diseases.
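Once each source's terms have been reconciled to shared concepts, the kind of comparison just described reduces to simple set algebra. Here's a minimal sketch; the concept sets are made-up stand-ins, not the real NCIT, Disease Ontology, GARD, Orphanet, or OMIM content.

```python
# Toy version of the five-source comparison: with every source's terms
# reconciled to shared Mondo-style concepts, the counts are set algebra.
sources = {
    "NCIT":     {"M1", "M2", "M3"},
    "DO":       {"M1", "M2", "M4"},
    "GARD":     {"M1", "M2", "M5"},
    "Orphanet": {"M1", "M2", "M6"},
    "OMIM":     {"M1", "M3", "M7"},
}

all_concepts = set().union(*sources.values())        # analogous to the >10,000
in_all_five  = set.intersection(*sources.values())   # analogous to the 333
in_only_one  = {c for c in all_concepts
                if sum(c in terms for terms in sources.values()) == 1}

print(f"{len(all_concepts)} unique concepts, "
      f"{len(in_all_five)} in all five sources, "
      f"{len(in_only_one)} in just one source")
```

The hard part, of course, is everything before the set algebra: the reconciliation of the terms themselves, which is exactly what Mondo provides.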
Okay, so now we move on to licensed (and approved). We use a lot of upstream resources, and you've seen a few already. The license must be clearly stated, comprehensive and non-negotiated, and accessible; it must avoid restrictions on the kinds of reuse and on who can reuse the data; and the data must meet HIPAA or IRB regulatory compliance. Whether data is actually open or closed for reuse is often impossible to know until you try. It's similar to greenwashing; we now have a sort of share-washing. Lots of data resources say "I'm an open resource," but if you actually look at the legality, the licensing, and how you access things, it's often not as open as you think, or as they claim.

Actually figuring that out often comes at significant legal expense, since most licenses are missing, vague, or restrictive. With many resources, we've had years-long conversations trying to get the licensing right so that we can reuse the data. And by reuse, I don't mean simply "I downloaded it to my computer and was able to analyze it"; I mean "I integrated it into another data set and redistributed it." There is a very big difference between those two things. If I want to build a diagnostic tool on top of other people's data, I need to be able to redistribute those data, with attribution and provenance, of course.

Here is an example of some of the data sources we integrate in the Monarch Initiative. They contain all kinds of information about genes, mechanisms, contexts, phenotypes, and diseases, from all different sources and species, using lots of ontologies; we love our ontologies. This feeds a variety of tools, resources, and third-party apps that can then be used for different applications. These are some of the organizations involved in creating the Monarch Initiative, and these are some of the sources we've been ingesting, integrating, and trying to redistribute. You can see there's a pairwise relationship between each source and each institution, recognizing that each resource may also have multiple institutions involved. Every one of those lines is a separate legal interaction required to obtain those data and be able to redistribute them. The burden of this licensing is enormous and untenable.

So, in order to create better community awareness, we created the Reusable Data Project, which has a rubric for evaluating the licenses of data resources for actual reuse and redistribution. The license must be clearly stated. It must be comprehensive and non-negotiated. It must be accessible, so you can actually find it. It must avoid restrictions on the kinds of reuse, and it must avoid restrictions on who may reuse. This platform for evaluating licenses is available at reusabledata.org, and it's all based on YAML files in GitHub. If you're interested in participating, we would be delighted to have your help, especially the data librarians in the audience. It's an open science volunteer initiative, and we've found that it has greatly improved the landscape of licensing, data reusability, and redistribution simply by creating better awareness. You can request that any new data resource be added, or you can curate resources yourself and make a pull request; we'd be really delighted to have contributions.

So here's what we found. This is a little out of date now, and some new sources have been added since this slide was created, but essentially: 43% of the licenses we evaluated had no issue with being clearly stated, 55% had no issue with being comprehensive, 84% had no issue with being accessible, 40% had no issue with avoiding restrictions on the kinds of reuse, and 36% had no issue with avoiding restrictions on who may reuse. If you look at the overall landscape, roughly 50% of the data is effectively inaccessible for reuse, from supposedly open resources, many of which are government funded. So it's not terrible, but it's not great either. In fact, since we started this project, we've seen an increase in the green parts of the chart as people have recognized that they need greater clarity in their work.
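As a sketch of how that rubric can be applied mechanically, here are the five criteria as a simple checklist with a hypothetical resource scored against them. The real Reusable Data Project curates richer YAML records on GitHub (see reusabledata.org); this scoring structure is illustrative only.

```python
# The five-point rubric as a checklist; the scoring structure and the
# example resource are hypothetical, not the project's actual schema.
RUBRIC = (
    "license is clearly stated",
    "license is comprehensive and non-negotiated",
    "license is accessible (you can actually find it)",
    "license avoids restricting the kinds of reuse",
    "license avoids restricting who may reuse",
)

def evaluate(resource, checks):
    """checks: one boolean per rubric criterion, in order."""
    for criterion, ok in zip(RUBRIC, checks):
        print(f"{'PASS' if ok else 'FAIL'}  {criterion}")
    print(f"{resource}: {sum(checks)}/{len(RUBRIC)} criteria met")

# e.g. a resource whose custom license is findable but restrictive:
evaluate("example-resource", [True, False, True, False, False])
```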
So here are the downstream consequences of that. Permissively licensed data is like a universal donor, though it can't take much back from more restrictively licensed data. And restrictively licensed data can only be combined with permissively licensed data, not with most other restrictively licensed data. So you can imagine, and these are just the sources we were able to examine in this particular study, that the more permissively licensed the data, the more fundamentally recombinable and redistributable it will be.

Okay, so moving on to connected. Since we're all familiar with FAIR, I want to say a little about identifiers. Our group has a very strong opinion about how identifiers should be managed, provisioned, and used, and when we think about FAIR, there's a huge dependency on identifier management. If you want to know what is findable and accessible, you need identifiers. If you want to answer "how many," or some more complex interoperability question, you need more metadata than just the identifiers. And if you want to answer still more complex questions, for reusability, to get at actual mechanisms, or in our case disease diagnostics, you need not only the identifiers and the metadata, but also the models and terminologies and their reconciliation, in order to truly integrate the data. And all of that depends fundamentally on the bedrock of identifiers.

Here are just a few examples, and I don't have much time to get into them. Understanding the equivalencies between different records is the simpler task; as you go up the curve of complexity, understanding versioning of content and content evolution, different identifiers for related concepts, and different identifiers for related records, it gets quite complex to manage all the relationships between identifiers across many different systems. Similar to the work we did on the Mondo terminology, documenting the relationships between identifiers in different resources in a robust, versioned, persistent, well-curated manner is critical for downstream data reuse and integration. We are so nerdy about all of this that we coordinated a community-developed paper on best practices for how to design, provision, and reuse persistent identifiers to maximize the utility and impact of life science data. This was led by Julie McMurry in our group and is published; there's also an associated blog post for the lay reader, so if you're interested, it's a great teaching resource as well.
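As one small, concrete piece of that identifier bedrock: resolving compact identifiers (CURIEs) against a shared prefix map, so every system points at the same resolvable IRI for a concept. This is a minimal sketch; the prefix map shown is illustrative and, as the paper discusses, would need to be versioned and curated in practice.

```python
# Expanding CURIEs against a shared prefix map; the map is illustrative.
PREFIX_MAP = {
    "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
    "HP":    "http://purl.obolibrary.org/obo/HP_",
}

def expand(curie):
    """Expand e.g. 'MONDO:0000001' to a full IRI."""
    prefix, local_id = curie.split(":", 1)
    if prefix not in PREFIX_MAP:
        raise ValueError(f"unknown prefix: {prefix!r}")   # never guess silently
    return PREFIX_MAP[prefix] + local_id

print(expand("MONDO:0000001"))
# -> http://purl.obolibrary.org/obo/MONDO_0000001
```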
Okay, so now I'm going to move on to the COVID project, which is called the National COVID Cohort Collaborative, or N3C. It's called that because it's a U.S. national initiative to build a national COVID cohort of patient-level clinical EHR data for use in machine learning and AI applications, so that we can reveal new patterns of the disease from a massively large data set.

The pandemic has highlighted many weak points in our healthcare infrastructure, and many specific needs: algorithms for diagnosis and triage, predictive algorithms, drug discovery, multimodal analytics, interventions that reduce disease severity, and best practices for resource allocation, which is maybe less of an issue at the moment, but with new spikes coming along may well become one again. And there's coordinating research efforts: so many people around the world are trying to work on COVID and understand this new disease, but we are not necessarily maximizing our efficiency, efficacy, or reproducibility. So there's a great need to coordinate. All of these things require the creation of a comprehensive clinical data set, and we need it now.

I was asked to speak a little about data harmonization, its challenges and its benefits, and this project is fundamentally an electronic health record data harmonization challenge. Many research organizations perform federated querying: they can ask a question of a given clinical institution, such as "how is time on the ventilator impacted by a given drug?" The question is sent to partners in the data network that already have their data set up in a standardized clinical data model, and because the data is in a standardized model, each site can compute the answer locally and send the result back, and those results can be combined. So if one site sends back "reduced by one day" and another sends back "reduced by three days," we can perform statistics on those results; in this case we've just taken the average and said, okay, it's reduced by two days. This is what happens when the data stays local.

But we have a huge opportunity to build on this type of work for sites that already have their data in a standardized clinical data model. We can ask much more sophisticated questions, leverage new kinds of algorithms, and do more discovery-oriented analytics, such as "in patients under 60, which factors are most predictive of severe outcomes?" We can look at all the data at once and try to reveal the patterns in it. And we can work collaboratively on these data to build, test, and refine algorithmic classifiers and identify novel associations, which will hopefully reveal better patterns of the disease so we can better care for patients, especially while we wait for that elusive vaccine.
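Here's a toy sketch of the contrast between those two modes: in the federated pattern, each site computes a summary against its local copy of a common data model and only the statistic travels; the enclave model pools the patient-level rows instead. The record fields, the drug name, and all the numbers are illustrative.

```python
# Federated pattern: the question runs locally; only a summary leaves.
def site_mean_vent_change(local_records, drug):
    """Runs at one site: mean change in ventilator days on the given drug."""
    deltas = [r["vent_days_delta"] for r in local_records if r["drug"] == drug]
    return sum(deltas) / len(deltas)

site_a = [{"drug": "drug_x", "vent_days_delta": -2.0},
          {"drug": "drug_x", "vent_days_delta":  0.0},
          {"drug": "other",  "vent_days_delta":  1.0}]
print(site_mean_vent_change(site_a, "drug_x"))   # -1.0; this number leaves, the rows don't

# What the coordinating center receives and combines: one number per site.
site_answers = [-1.0, -3.0]   # site A: reduced 1 day; site B: reduced 3 days
pooled = sum(site_answers) / len(site_answers)
print(f"Pooled estimate: ventilator time reduced by {abs(pooled):.0f} days")

# The centralized (enclave) model instead pools the patient-level rows,
# so questions aren't limited to statistics that can be averaged per site.
```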
So there are major work streams within the National COVID Cohort Collaborative, and again, this is open to everyone. As far as I know, it's going to be the largest limited data set in U.S. history, and probably also the most open sensitive data set in U.S. history. There are some hoops to jump through in order to participate: in terms of legalities, your institution must sign a data use agreement, which I'll get to next, and you have to request access to the data. But the data is fundamentally open to anyone who can validate that they have human subjects training and are studying COVID.

The first work stream is data partnership and governance, where we've partnered with the four most commonly used clinical data models: ACT, TriNetX, PCORnet, and OMOP. Today, actually, the OHDSI community is having their big symposium, which I am now missing. There is a huge partnership with and foundation upon these networks; without them, and without the NIH and NCATS, we would not have been able to create this cohort. Then a common phenotype is defined across all these different sites and all these different models, which is no small task, getting the community to do that and have it be applicable in all those different contexts; again, a harmonization problem. Then the data is pushed to the NIH/NCATS cloud, where a long and lengthy process harmonizes the data to a target model, and it's deposited into a secure enclave where we can perform collaborative analytics. And finally, we have a synthetic clinical data pilot, where we're evaluating the validity and de-identification status of synthetically generated data derived from these limited, sensitive data, so that we can make the data even more widely distributable.

So, the data use agreement is the legal document that institutions sign to give their investigators access. There's a list on our website, and I forgot to put the link here, but well over 100 organizations nationwide have signed data use agreements, so your institution may already have one with the NIH that lets you request access to the data. The research must be COVID-related. There is absolutely no re-identification of individuals or of the data source; we are not identifying the contributing sites. There is no download or capture of raw data out of the enclave, but it is an open platform, open to all researchers. Activity in the enclave is recorded, so we can see what everybody is doing in there, and disclosure of research results from the N3C enclave must be made for public benefit. We also have really terrific analytics, provenance, and contributor attribution tracking, which I'll get to in a minute.

This is just a quick overview of the different data levels and governance, if you're interested in participating. It's interesting, because if we think about the L in TLC, licensing, this is really its regulatory component: without robust governance and regulatory control, this whole program would not have been possible. It is with great expertise from the community, as well as NIH regulatory oversight, that we were able to create this special data set, make it accessible to so many people, and still have the kind of transparency and reproducibility that most of the recent retractions in this space have lacked.

Institutions contribute their clinical data under a data transfer agreement, and the data is pushed to the NIH cloud. There are three different tiers. The limited data set retains HIPAA identifiers: it contains geocodes and dates, which are critical for pandemic analytics, and to access it you must have a project-specific IRB at your institution. The de-identified data set has date shifting, and geocodes are reduced to three-digit ZIP codes. And then there's the synthetic data project that I mentioned.
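Here is a minimal sketch of the two de-identification transforms just mentioned, date shifting and ZIP truncation. The real N3C pipeline is far more involved (for one thing, the date offset must be held constant per patient across all of that patient's records, so intervals are preserved); this only shows the shape of the idea.

```python
# Minimal sketch of date shifting and 3-digit ZIP truncation; a real
# pipeline would reuse one fixed offset per patient across all records.
import random
from datetime import date, timedelta

def deidentify(record, max_shift_days=180):
    shift = timedelta(days=random.randint(-max_shift_days, max_shift_days))
    return {
        "patient_id": record["patient_id"],                  # already a pseudonym
        "diagnosis_date": record["diagnosis_date"] + shift,  # date shifting
        "zip3": record["zip"][:3],                           # 3-digit ZIP code
    }

record = {"patient_id": "P-0001",
          "diagnosis_date": date(2020, 7, 14),
          "zip": "97239"}
print(deidentify(record))
```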
So if you're at an institution, your institution signs a data use agreement, a legal agreement with the NIH. Then data users register in the system and make a data use request, signing the user code of conduct, a data use statement, and an attestation of security and human subjects training. You're then allowed to access the synthetic data or the de-identified data without a further IRB, because your institution has already signed the institutional data use agreement. If you need access to the limited data set, you can request it at that time, or at a later date, with a project-specific IRB.

To date, this is the data that's been ingested: 67 clinical sites have executed their data transfer agreements, and 27 of them have already deposited data. It's really wonderful to see that they come in all flavors of clinical data models: OMOP, TriNetX, ACT, and PCORnet. These data are gathered from each institution, harmonized into the common target model of OMOP 5.3.1, and then pushed into the enclave, where we can do the collaborative analytics.

This is what the enclave actually looks like, along with some current statistics. We currently have 121,000 COVID-positive patients and almost a million total patients in the enclave. These are the 20 sites that have had their data cleaned up and harmonized; it's not perfect, but it's ready for use in analytics. You can see that this results in a very large number of records, for example 661 million lab results and 41 million visits. The cohort contains all ages and a broad distribution of races; we're working really hard to make sure the cohort is representative of all Americans, and that all clinical institutions that want to participate can participate. This quick plot shows an overview of the high-level conditions that have been captured, and you can see we're capturing a wide variety of them. Obviously the data is sensitive, so this is about as much as I can show you without having you in the system.

Now I want to talk a little about the data harmonization itself. As an example, here we have a patient in hospital A, with a patient identifier, a diagnosis code, and a diagnosis date. You'd think this would be a solved problem by now, and frankly it's kind of shocking that it isn't, but it truly isn't. Over in hospital B, we have a different patient identifier, a different code system for diagnoses, and a different representation of the date. These patients both have a COVID diagnosis, but a single query will not capture both of them. So we have to harmonize this diagnosis code with that one, and we have to harmonize the dates and the units, so that all these varied data come together into one representation. Here's part of an example of how that works for one particular variable: the ingested data goes to domain tables for person, visit, and time; code sets are defined using the OHDSI ATLAS tool; and then everything is pushed into our enclave, with quality control visualizations, which I'll show you an example of in a minute.
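To make the hospital A / hospital B example concrete, here's a toy harmonization step: map each site's local diagnosis code to one shared concept, parse each site's date format into a common representation, and convert lab units to a canonical unit. The site names, table layouts, and values are illustrative; U07.1 and 840539006 are the commonly cited ICD-10-CM and SNOMED CT codes for COVID-19, and the 88.42 factor follows from creatinine's molar mass of roughly 113.12 g/mol.

```python
# Toy harmonization of the hospital A / hospital B example.
from datetime import datetime

CODE_MAP = {                                   # site-local code -> shared concept
    ("hospital_A", "U07.1"):     "COVID-19",   # ICD-10-CM style
    ("hospital_B", "840539006"): "COVID-19",   # SNOMED CT style
}
DATE_FORMAT = {"hospital_A": "%m/%d/%Y", "hospital_B": "%Y-%m-%d"}

def harmonize_diagnosis(site, code, date_string):
    return {"concept": CODE_MAP[(site, code)],
            "date": datetime.strptime(date_string, DATE_FORMAT[site]).date()}

a = harmonize_diagnosis("hospital_A", "U07.1", "07/14/2020")
b = harmonize_diagnosis("hospital_B", "840539006", "2020-07-14")
assert a == b   # a single query on the harmonized table now finds both patients

# Unit harmonization for a lab value such as creatinine: convert
# everything to umol/L before pooling across sites.
TO_UMOL_PER_L = {"mg/dL": 88.42, "mmol/L": 1000.0, "umol/L": 1.0}

def normalize_creatinine(value, unit):
    if unit not in TO_UMOL_PER_L:
        raise ValueError(f"missing or unknown unit: {unit!r}")  # flag, never guess
    return value * TO_UMOL_PER_L[unit]

print(normalize_creatinine(1.1, "mg/dL"))     # ~97.3 umol/L
print(normalize_creatinine(0.097, "mmol/L"))  # 97.0 umol/L
```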
So here, for example, is a concept that needs harmonizing: creatinine in urine is included with blood serum levels. Okay, well, we need to know how that should be harmonized across the different sites. And we aren't getting any creatinine from site XXX; why is that? Well, it's because their mass-per-volume measurements are actually mixed in with molarity measurements, and those need to be harmonized. So code sets are created to combine and harmonize the units, and also to gather codes that mean the same thing but carry different lab values or different diagnostic codes, collecting them conceptually into an umbrella set of codes that can then combine those data.

This is what that looks like at a high level for creatinine, and for asthma over on the right. If we look at creatinine measured as mass per volume in arterial blood across the different sites, we end up with very different results at different sites. The whole point of the panel on the left is that the sites are giving us different information with different encodings: some report milligrams per deciliter, some report millimoles per liter, and for some we don't even have the units. So it's very hard to reconcile all the different units and codes used across different sites. On the right is an example of asthma code usage across a number of sites, and you can see, just from the color pattern, that even understanding who has asthma is no small challenge. Building code sets that combine codes similar enough in meaning that a user can simply query on "asthma" is hard, and so domain experts and technical experts, working together, are creating these really fantastic code sets that allow harmonization of data across sites.

So then, moving on to all the data in the system: the system is a beautiful secure enclave that supports many different kinds of analytics. All your favorites, R, Jupyter notebooks, and those types of platforms, have been deployed. We've also been bringing in a number of community-developed tools, such as those from the OHDSI community and the NCATS Translator community, cohort discovery tools, NLP tools that can push structured data into the system, and a synthetic data generation tool from MDClone. And then we have a hashing strategy, which means that we can connect data within the enclave, remember, the data cannot leave, as it's sensitive, to data that exists in other platforms for clinical trials, imaging, and genomics, using an honest broker.

This whole system provides secure, reproducible, transparent, versioned, provenanced, attributed, and shareable analytics on patient EHR data at a scale we've never had in the U.S. before. This is really exciting; nothing like a pandemic to push us toward open science on sensitive patient-level data. And we can do it because the system is so secure and has such excellent provenance. We are able to generate reproducible reports from all the analytics in the system, and hopefully to combat the kinds of analyses that have had to be retracted recently for lack of exactly that transparency.
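For the curious, here's roughly the shape of the hashing idea mentioned above: a keyed hash of normalized identifiers yields the same token for the same person on every platform, so an honest broker can join records without ever seeing raw identifiers. To be clear, this is a guess at the general pattern, not N3C's actual linkage scheme, and every name in it is hypothetical.

```python
# Hypothetical privacy-preserving linkage token; not N3C's actual scheme.
import hashlib
import hmac

LINKAGE_KEY = b"secret-held-by-the-honest-broker"   # hypothetical shared key

def linkage_token(first_name, last_name, birth_date):
    message = f"{first_name.strip().lower()}|{last_name.strip().lower()}|{birth_date}"
    return hmac.new(LINKAGE_KEY, message.encode(), hashlib.sha256).hexdigest()

# The same person hashed on two platforms yields the same token, so the
# records can be joined inside the enclave without re-identification.
print(linkage_token("Jane", "Doe", "1970-01-01"))
print(linkage_token(" jane", "DOE ", "1970-01-01"))   # normalization makes these match
```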
So here is something I'm particularly excited about, and I wanted to highlight it because I know the information science community would be enthusiastic as well. We've been deploying a very simple attribution model in the system: since we have to track every user action anyway, we may as well attribute every user action as well. Here's an example. We have Tiffany Callahan, and you must have an ORCID to be in the system. She's created a mapping file between the Open Biological and Biomedical Ontologies (OBO) and the OMOP clinical data model. This file has a DOI; it's an artifact deposited in Zenodo. And she has a role, drawn from the Contributor Role Ontology: curator of this file. So we have an agent, a contribution, and an artifact. The model is called the Contributor Attribution Model (CAM), and there's a link here if you're interested.

Over on the right is the provenance graph of an analytical workflow. You can see how the different artifacts in the workflow are associated with one another, each connected to the people who did anything to them or contributed them; that information is all captured during onboarding. When the analytical workflow produces a result, a file is emitted listing all the ORCIDs, their roles, and the artifacts, in those agent-contribution-artifact triples. By doing this, we can know everyone who contributed anywhere in the pipeline, whether or not they participated in the final analysis. So if anybody uses Tiffany's OBO-to-OMOP mapping file, she gets attribution for having contributed it, no matter whose workflow it is. And of course, everything is versioned as well.
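Here's a minimal sketch of that agent-contribution-artifact triple as a data structure. The field names are illustrative rather than the actual CAM schema, and the ORCID and DOI are placeholders.

```python
# Minimal sketch of an agent-contribution-artifact triple; field names
# are illustrative, not the actual CAM schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class Attribution:
    agent: str      # a person, identified by ORCID
    role: str       # contribution type, from the Contributor Role Ontology
    artifact: str   # the thing contributed, identified by DOI

contribution = Attribution(
    agent="https://orcid.org/0000-0000-0000-0000",      # placeholder ORCID
    role="curator",                                     # CRO-style role
    artifact="https://doi.org/10.5281/zenodo.0000000",  # placeholder DOI
)

# Any workflow that consumes the artifact can emit this triple with its
# results, so the contributor is credited no matter whose analysis it is.
print(contribution)
```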
So what are some of the kinds of questions people are asking in this system? Hopefully this sparks some conversation about how those foundations, best practices for managing and sharing data for actual reusability, integration, and downstream dissemination, can underpin artificial intelligence applications. Things like: how do we predict which patients will develop acute kidney injury? How effective is convalescent plasma? Determining birth outcomes across COVID severity and intervention, vaginal versus C-section deliveries, and postpartum morbidity and complications. Is there a racial disparity in access and testing? What is the transmission intensity among populations by race and ethnicity, rural versus urban, income, et cetera? There are many, many different kinds of questions. Clinical topics are organized into domain teams, and if you go to our website and are interested in any of these topics, or see one that's missing, we welcome new domain teams, which bring expertise together to answer questions collectively.

And with that, these slides are again available at this bit.ly. I think I'm out of time, so we'll just rush through the acknowledgments. Please go to our project website, covid.cd2h.org, or for regulatory information, the NCATS site, ncats.nih.gov/n3c; we welcome everyone. And just to thank everybody: this is a JAMIA paper that has been in preprint for a very long time, because the journal is having a hard time working out how to publish a paper with more than 200 authors. We wrote it in just ten days, with 200 authors, so there's an enormous number of people helping to make this go, and we welcome as many as are willing to help.

I also wanted to thank, for the Monarch part of the project, all of my colleagues who have worked so hard over many years on all the challenges of making data TLC, and our grant funding at the NIH. So thanks very much. I'll just go back to this link so you can grab the links. And I think, do we have time for one or two questions?

We do, yeah. Thank you so much, that was fantastic. We did actually have one question come in, so I'm going to go ahead and relay it. The question, Melissa, is: is the National COVID Cohort Collaborative only open to those in the U.S., or is it open to a more worldwide audience?

It is open to a worldwide audience, and it's also open to commercial entities and pharma; we worked really hard to make it maximally open. There is only U.S. data in the system, however, so right now we cannot accept data from non-U.S. institutions, but we have made the system accessible to those outside the U.S. There are some restrictions on access to the limited data set from outside the U.S., and on the IRB, but you can go to the NCATS website for answers to specific questions about that. We have done our very best to make it maximally accessible to all people.