All right, so I'm going to introduce our next speaker, Melissa Haendel. She's very close to home: she is, in fact, an associate professor at OHSU and also the Director of Translational Data Science at Oregon State University. She helped launch the Monarch Initiative, which I already saw represented in a few slides here, and she also led the development of the Human Phenotype Ontology, and her talk is going to be about genome-phenotype integration. Hi, everybody. So that was a fantastic introduction to some of the things I'm going to talk about from Bonnie, so thanks so much for setting me up so perfectly. I'm going to talk about phenotype data integration across species for disease diagnosis. The disease diagnosis use case is, I think, pivotally important for really understanding the power of leveraging the biodiversity data that we have in hand, and this is just one way in which we are starting to use it; my hope is that this talk will inspire many, many more ways that we might do it. A subtitle for my talk might be: a gene does not walk into a doctor's office. The point of that is that if we really want to understand how to diagnose a patient, we fundamentally need to know more than just their genomics. So first I want to leave you with a vision, and then we can come back to it in the discussion. We talked a lot about reference genomes and the need to fill many gaps, the need for better annotation and better assemblies. We need a reference phenome. So my vision, and I'll credit Paula Mabee for this one, is that in five years we will have a reference phenome for the whole of the tree of life. And the word "life" got cut off there, thanks to the lovely interoperability between Google Slides and PowerPoint. Okay, we'll see how that goes. So essentially, when we think about how we go about diagnosing a rare disease patient who walks into that doctor's office, the prevailing clinical diagnostic pipelines take the genomic data.
Maybe you do a gene panel or you sequence a few particular genes. Maybe you do a whole exome. Maybe you do a whole genome. Maybe, if you're lucky, your insurance company will actually pay to have your family sequenced. And after that we compare it against all the reference information that we have: the frequency information, and information from our knowledge bases that describe genes that are dysfunctional in known human diseases. But at the end of the day, we're really not leveraging most of the data that we could be using to help prioritize the variants. From those genomic candidates we often end up with thousands and thousands of candidates, which is way too many for any busy clinician to sift through. So we absolutely need better mechanisms to use more kinds of information about the patient, and about our public knowledge coming from many different sources. And not just human sources, but population sources, model organism sources, mechanistic sources. So I wanted to introduce a little bit more about ontologies. We heard a little bit about them from Bonnie, and the Human Phenotype Ontology is also a member of the OBO Foundry ontologies. An ontology is a controlled terminology, but here's a little more about how this works. Essentially, an ontology is a graph where the relationships between the terms, which are represented by the circles here, are logically defined. So it's a computable graph, and that computable graph is connected by its relationships to many other terminologies. Here, in the case of hyposmia, it's actually defined in terms of the Gene Ontology term as a dysfunction in the sensory perception of smell, a term which happens to have 34,000 annotations from 22 different species. What we do is take this terminology and use it to describe the phenotypes of rare disease patients when they come into the clinic.
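To make the idea of a computable graph concrete, here is a minimal sketch (the term names and hierarchy are illustrative, not real HPO or GO structure) of an ontology as a set of is_a edges, with a helper that walks up to all of a term's ancestors, which is the basic operation that makes such a graph queryable:

```python
# Toy ontology: a directed graph of is_a edges (term -> its parents).
# Term names are illustrative stand-ins, not real HPO/GO identifiers.
IS_A = {
    "hyposmia": ["abnormal sense of smell"],
    "anosmia": ["abnormal sense of smell"],
    "abnormal sense of smell": ["abnormal sensory perception"],
    "abnormal sensory perception": ["phenotypic abnormality"],
    "phenotypic abnormality": [],
}

def ancestors(term):
    """Transitive closure over is_a edges: all superclasses of a term."""
    seen = set()
    stack = list(IS_A.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(IS_A.get(parent, []))
    return seen
```

Real ontologies add logical definitions and many relationship types beyond is_a, but even this simple closure operation is what lets a computer recognize that hyposmia and anosmia are both abnormalities of sensory perception.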
This is a largely manual process because EHRs don't actually capture data this way, and that's another area of effort. So what you end up with is a set of nodes: that little patient up there on the left is represented by a set of colored nodes in the graph, and that's their phenotypic profile. And you can see the power of being able to leverage these other data, and I will be talking about that. So this is an example of using that phenotype profile to compare against known diseases. Here we have a gold standard for Wiedemann-Steiner syndrome, and these are two patients who came into a clinic within two weeks of each other. But it's a rare disease, and the clinician did not recognize their phenotypic features. In this case, you can see that the three-year-old girl had slightly different features from the 14-year-old boy, both of which were slightly different from the gold standard canonical phenotype profile for that rare disease. In fact, some of the phenotypes are, quote, opposite: the definition for Wiedemann-Steiner syndrome includes short toes, whereas the 14-year-old boy had long toes. And it turns out that these two patients had different variants in the same gene, but both were diagnosed with this same disease. So this shows the power of a graph-matching algorithm that essentially allows us to fuzzy-match the phenotypic representation of these patients against known human diseases that are defined in the same way. But this is just human data. And the reality is that using this method, we still have very poor outcomes in diagnosing rare disease patients when they come into the clinic. So how can we actually improve the situation? How can we use the data below the genomics line? And this is just scratching the surface with phenotypes from human data.
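The "fuzzy match" idea can be sketched in a few lines: even when two profiles contain different, or even "opposite", leaf terms, they still score as similar because the terms share ontology ancestors. This toy example uses made-up term names and plain Jaccard overlap in place of the real semantic-similarity metrics:

```python
# Toy is_a hierarchy; term names are illustrative, not real HPO terms.
IS_A = {
    "long toe": ["abnormal toe morphology"],
    "short toe": ["abnormal toe morphology"],
    "abnormal toe morphology": ["abnormal digit morphology"],
    "abnormal digit morphology": [],
}

def closure(term):
    """A term plus all of its is_a ancestors."""
    out = {term}
    for parent in IS_A.get(term, []):
        out |= closure(parent)
    return out

def profile_similarity(profile_a, profile_b):
    """Jaccard overlap of the two profiles' ancestor closures."""
    a = set().union(*(closure(t) for t in profile_a))
    b = set().union(*(closure(t) for t in profile_b))
    return len(a & b) / len(a | b)

# "Opposite" phenotypes still partially match through shared ancestors:
score = profile_similarity(["long toe"], ["short toe"])
```

Production tools use information-content-weighted measures rather than plain Jaccard, but the principle is the same: matching happens over the graph, not over literal labels.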
So I want to talk a little bit about some of the earlier work that we've done on using model organism data, but I'm hoping that with the incredible diversity of work that's going on in this room, we can start to understand how to bring the whole tree of life into these types of processes to understand human disease mechanisms. And this problem is not specific to human diagnosis: if we want to understand how to diagnose livestock or crop species, it's exactly the same problem. So all of us in this room have the potential to bring these same kinds of data to bear on those problems. So how do we bring in that model organism data? Over here on the left, we have our 19,000-plus human coding genes, and roughly 4,000 of them have known human diseases associated with them. But if we take the orthologs of those genes in other species, using PANTHER, we can see that more than 15,000 of them have phenotypic features associated with variations in their orthologs in the five most widely used model organisms: fly, worm, yeast, mouse, and zebrafish. If you combine those with what we know from human, we go up to 82% coverage. So if we think about that human use case, where we were doing that phenotype matching with only 18% of what we know to be related to causal mutations in human, we now have access to 82%. That's an enormous amount of phenotypic information that we can now use to inform a human diagnosis. So how do we actually go about doing that? Well, it turns out that revealing this sort of deep homology is really, really hard. And this is a slide... so, who knows what any of these organisms might be? I was trying to get our number of organisms up in our session.
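The mechanics of that coverage expansion can be sketched as a simple lookup: map a human gene to its orthologs (the role PANTHER ortholog calls play here), then gather the phenotypes curated against those orthologs in the model organism databases. The mappings and annotations below are made-up toy data standing in for the real resources:

```python
# Toy data: human gene -> (species, ortholog gene symbol) pairs,
# standing in for PANTHER ortholog calls.
orthologs = {
    "KMT2A": [("mouse", "Kmt2a"), ("zebrafish", "kmt2a")],
}

# Toy data: (species, gene) -> curated phenotypes, standing in for
# model organism database annotations.
model_phenotypes = {
    ("mouse", "Kmt2a"): ["abnormal skeletal morphology"],
    ("zebrafish", "kmt2a"): ["abnormal fin morphology"],
}

def cross_species_phenotypes(human_gene):
    """Phenotypes attached to a human gene's orthologs, keyed by species."""
    found = {}
    for species, gene in orthologs.get(human_gene, []):
        found[(species, gene)] = model_phenotypes.get((species, gene), [])
    return found
```

A human gene with no curated human disease association can still pick up a rich phenotype profile this way, which is exactly the 18%-to-82% jump described above.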
So this is actually expression of Distal-less, and it's labeling what I like to refer to as the very formally defined sticky-outie bits of organisms: there is essentially deep homology, deep mechanistic homology, in how we push protuberances out of the body wall. And if you look across the phylogeny here, you can see that we have a similar function in tetrapod limbs, in ascidian ampullae, in echinoderm tube feet, and in annelid parapodia. So how do we reveal these mechanisms, and thereby understand how those mechanisms become dysfunctional and cause disease? The answer is that we need ontologies. We need computable ways of representing form, function, and dysfunction in order to interpret the genome. We can have all the most wonderful genome assemblies in the world, but if we don't have a good way of representing the anatomical, expression, and phenotypic variation that arises from variation in the genome, we aren't really going to be able to readily interpret the function of those differences throughout evolution, throughout our crop breeding, and in disease. And so this is an example just showing how we have... I'm not sure if my arrow... I don't think we have a pointer, do we? Does that work? Nope. Okay, so I will pretend I can see that far. If you look over on the left in the blue bar, it says form, and there we have the fish swim bladder, which is homologous in its ancestry to the human lung. But in fish, the respiratory function is actually performed by the gills. So we have a divergence of function from the ancestral anatomical homologue. Many people, of course, are familiar with limb homology. The Uberon anatomy ontology represents that anatomical diversity, and it's integrated with the Cell Ontology, which represents the cells that actually live in those anatomical structures and perform many of those functions.
Both of those ontologies are used, as I mentioned on the Human Phenotype Ontology slide, to help define the dysfunctions in the phenotype ontologies. And here I'm representing what we call the uPheno ontology, which is a cross-species effort to align phenotype ontologies across metazoans. The logical definitions for the Gene Ontology are built in much the same way as they are for those phenotype ontologies. So collectively we have this semantic representation, this integration, that allows us to represent form, function, and dysfunction across a diversity of species. The problem here is that most of those different taxonomic groups or domains go about developing these things in a bit of a vacuum. So one of the recommendations that we have is that it takes a lot more effort to clean up those messes and coordinate and integrate after the fact than it does to prospectively coordinate. And it's not just about coordinating a terminology; it's also about coordinating all the relationships between terms, all the different ways in which a gene might be related to an anatomical structure. It's not just one single relationship, right? It could be expressed there, it could have a modifier or a developmental function, et cetera. So we need a way of structuring our work so that it works together. And this is just to show you how that actually plays out. Once we have that integrated semantic structure, we can compare phenotype profiles of diseases and of patients against model and non-model organisms. So here, I will probably screw it up now that I have it. Let's see, does that work? Yes, okay. Over here on the left, we have the phenotype profile of a mouse. These are phenotypes that would have been in that graph for a mouse with this mutation, also called blumini. And in this case, one phenotype is duplex kidney.
If we want to match that against the human phenotype profile that's the gold standard for this human disease, it matches against a phenotype called renal hypoplasia. And it's this collection of phenotypes and their best fuzzy match to the whole profile that's actually important for diagnostics. You can see how the uPheno ontology can relate things that don't have the same labels but are clearly anatomically related: duplex kidney is related to renal hypoplasia via the term abnormal kidney morphology, which a human being can easily recognize, but we need to tell the computer how to do that. So now we get back to some actual genomics. What do we do for the patients? We take the patient's data, so we have a VCF file, and ideally we have the trio from their family. We batter them with a variety of algorithms for determining pathogenicity, frequency, Mendelian inheritance, and a variety of other things, to hopefully come up with a smaller set of candidate variants for what might be responsible for that patient's disease. That's the general standard of care. And what we've been able to do, as part of the Monarch Initiative and the tool we've developed called Exomiser, is take those phenotype profiles described using the Human Phenotype Ontology and compare them using that fuzzy phenotype profile matching algorithm, which is called OwlSim. The profile matching also crosswalks using PANTHER orthology and STRING protein interaction networks, with profile matching currently just against mouse and zebrafish. Using this technique, we can reduce the number of candidates to far fewer. And using this process, we're able to improve the diagnostic efficacy in the Undiagnosed Diseases Program by between 10 and 20%, depending on how you add it up.
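A hedged sketch of what such a prioritization step looks like in spirit (this is not Exomiser's actual API or scoring; the thresholds, field names, and scores are invented): filter candidate variants by population frequency and predicted pathogenicity, then rank the survivors by the gene-level phenotype match score coming out of the cross-species profile comparison:

```python
def prioritize(variants, pheno_score, max_freq=0.001, min_path=0.7):
    """Filter by frequency/pathogenicity, then rank by combined evidence.
    Thresholds and the multiplicative combination are illustrative."""
    kept = [v for v in variants
            if v["frequency"] <= max_freq and v["pathogenicity"] >= min_path]
    for v in kept:
        # Combine variant-level evidence with the gene's phenotype score.
        v["combined"] = v["pathogenicity"] * pheno_score.get(v["gene"], 0.0)
    return sorted(kept, key=lambda v: v["combined"], reverse=True)

# Toy candidate variants, as might survive the trio/inheritance filters.
candidates = [
    {"gene": "SLC2A1", "frequency": 0.0,    "pathogenicity": 0.95},
    {"gene": "GENE_X", "frequency": 0.0005, "pathogenicity": 0.80},
    {"gene": "GENE_Y", "frequency": 0.05,   "pathogenicity": 0.99},  # too common
]
# Toy gene-level phenotype scores, as fuzzy matching against mouse and
# zebrafish models might produce.
ranked = prioritize(candidates, {"SLC2A1": 0.9, "GENE_X": 0.2})
```

The point of the sketch is the shape of the pipeline: variant-level filters shrink the list, and cross-species phenotype evidence decides the final ordering.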
So just by adding those two organisms and this fuzzy matching process, we're really greatly increasing the diagnostic capability. And this is just a tidbit of what's out there: how do we actually bring more of this kind of data together? This is an example where we were able to find a candidate for a patient in the Genomics England program based on a match to a mouse, where we didn't actually have a match to a known human disease. So that mouse now becomes an automatic candidate for being a disease model. And I wanted to tell you a real story about a real patient, because it really hit home the power of this, and I always cry when I do this, so apologies. This is Jessica. Jessica was four when she went through this process. She has a rare condition which causes epilepsy, affects her movement, and involves developmental delay. They did a bunch of standard genetic tests, and basically, using the Exomiser tool and all of these different data integrated together with the phenotype matching processes, we found that there was a de novo deletion in the SLC2A1 gene and were able to diagnose this patient. The wonderful story about this patient is that this particular disease, even though it's a new disease, can be successfully treated with diet. So this child is going to grow up and have a normal life because of model organism data. Okay, so one of the problems that we have in integrating all of this data is that different communities are annotating to different relationships. And this is just the human space, so this problem is orders of magnitude bigger if we think about biodiversity, crop science, et cetera. In this case, we have different groups that are annotating to a disease. For us in Monarch, we annotate diseases to phenotypes so we can do our profile matching.
But from OMIM, we get disease to gene; from ClinVar, we get disease to variant; from the Comparative Toxicogenomics Database, we get disease to environment. The problem is that none of these really use the same model, and really the canonical disease model is all of these things put together. So one of the other big tasks that we have is: how do we help each of these groups continue their excellent curation work, but do so in a way that's born interoperable, in a way where that curation is really part of the bigger puzzle as it's being generated? We've been using a variety of approaches to integrate data from very many sources, some of which are listed over there on the right, in two projects: not only the Monarch Initiative, but also the NCATS-funded Data Translator program, where we're creating large knowledge graphs that allow us to connect the dots, as Bonnie so nicely put it, across the joins between these different data sets. Using these knowledge graphs, we can find candidate diseases for diagnosis, as I showed you earlier, but we can also find underlying mechanisms. We can try to do some drug repurposing. We're identifying environmental exposures that affect different diseases. So this is a really interesting way of thinking collectively about all the data resources that we're generating, and how we can help them be more born interoperable so that we can do this kind of the-sum-is-more-than-the-parts synthetic science. I also wanted to remind everybody how important all of the organisms that we don't often study are, and we got to hear in Carlos's talk about Fauna Bio's wonderful work on the thirteen-lined ground squirrel.
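The knowledge graph idea can be illustrated with a tiny set of typed edges, mimicking how associations curated under different models (disease-phenotype, disease-gene, disease-variant, disease-environment) become one queryable structure once subjects, objects, and relation types are normalized. All identifiers and relation names here are illustrative:

```python
# Toy knowledge graph: (subject, relation, object) triples drawn from
# differently-modeled sources, normalized to one edge format.
edges = [
    ("DISEASE:1", "has_phenotype",   "HP:seizure"),      # Monarch-style
    ("DISEASE:1", "associated_gene", "SLC2A1"),          # OMIM-style
    ("SLC2A1",    "has_variant",     "VAR:c.997C>T"),    # ClinVar-style
    ("DISEASE:1", "exacerbated_by",  "ENV:fasting"),     # CTD-style
]

def neighbors(node, relation=None):
    """Edges out of a node, optionally filtered by relation type."""
    return [(r, o) for s, r, o in edges
            if s == node and (relation is None or r == relation)]

# "Connecting the dots": walk disease -> gene -> variant across sources.
gene = neighbors("DISEASE:1", "associated_gene")[0][1]
variants = neighbors(gene, "has_variant")
```

The two-hop walk at the end is exactly the kind of join across independently curated resources that the integrated graph makes cheap.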
But just to mention a few other excellent ones, the point being that we really want to reveal all the others that are out there so that we can better utilize the genetic and phenotypic underpinnings that come about from their evolution. The dog's retina has an area centralis, which is analogous to the human macula. Aged cats are natural models of Alzheimer's disease. Naked mole rats, we heard already, don't get cancer. My favorite one: armadillos are a natural host of the mycobacterium that causes leprosy, the only organism other than humans for which that's the case. Tree shrew glioblastomas are the most morphologically and genetically similar to human ones. Pond snails are models of inflammation-mediated memory dysfunction, and silkworms are a model for uric acid metabolism. The list goes on and on, but most of it we haven't really revealed. So how do we find all of those different mechanistic models of disease that help reveal the underlying gene functions, and the variation in those genes that causes the dysfunction, which might relate to human disease, but also to animal diseases, plant diseases, or traits that we simply want to breed for to help solve problems like human hunger? The answer lies in understanding our ancestral genome for any given problem statement. For us in Monarch, it's human disease, but for all of you it might be something different; the answer is still the same. We need to understand where ancestral gene functions arose, what those functions were, and how those functions evolved, in order to understand how to best model and investigate the dysfunctions that we either want to help solve or otherwise promote. Somewhere on this tree lies the answer for any given problem, and right now we are having an awfully difficult time revealing those answers.
So, just some takeaways, which unfortunately are not my updated ones, which I have in my Google Slides that were linked at the beginning, so if anybody has the links you're welcome to go back. The challenges, to summarize: one of the problems we have is that curation of information relating gene expression and phenotypes to variants is largely limited to humans. We don't really get a whole lot of information that relates variants to phenotypes from most other species. Phenotype information is largely found in text and tables and is not very computable. Most genotype-to-phenotype information does not actually include any genomic information: we might just say this gene is related to this phenotype, but we don't know anything about the variants or the background mutations that might exist in that particular organism, which of course modify the phenotypic outcomes. Many relevant species, as I mentioned, have no curated databases and never will, so we need to think about how we're going to manage that as we move forward. Most data resources lack standardized data models, ontologies, APIs, and terminologies, so bringing them together is a lot of manual and computational work, more work than it would take if we did it prospectively. So the recommendations are really to build the foundational integrative infrastructure that's essential to comparative genomics. I loved the statement earlier about how we get reviewers more excited about infrastructure, and I tweeted that I think we should have a section in NIH grants, a non-innovation section, where we talk about how we're going to reuse other people's work and contribute to and use standards and so on. Continue to curate genomes and phenomes, improving the integration readiness of the data.
One of the things that I really took away from all of you hardcore genomicists is that you're not thinking about what we're doing downstream, trying to interpret the data and use it for various kinds of applications. How can we bring what we're doing into the same workflow as what you're doing on the genomic side, so that they're done together? Use, improve, and collaborate on the cross-species terminologies. These only work if everybody contributes and everybody uses them, instead of going off and building their own silos; as I mentioned earlier, it's a lot more work to reconcile them after the fact. Again, improve reporting of genomic context, meaning the rest of the genome, not just the variant or gene that you're reporting on. Improve organismal interoperability with case- and population-level clinical data. We do a lot of work on how to represent electronic health record data and patient case-level data with our GA4GH standard coming out soon, called Phenopackets, but we don't really have anything like that for other kinds of organisms, and that's a really big gap that needs to be filled. And then finally, study a broad, highly sampled taxonomic range of reference genomes and reference phenomes. So I especially want to thank two people in the audience. There are many, many contributors to all this work, but Chris Mungall has co-led the Monarch Initiative with me for many years and is here with us today, and Paula Mabee was with me when we built the first anatomy ontology way, way back when. It's really exciting to be able to talk about this work so many years later with you both here. So thanks very much.
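As a closing aside on the Phenopackets standard mentioned in the recommendations: a case-level record in that spirit is just ontology-coded phenotypic features attached to a subject. This is a simplified sketch, not the full GA4GH schema; the field names loosely follow Phenopacket v2 and the HPO term IDs shown are real, but the record itself is invented:

```python
import json

# Simplified, phenopacket-like case record (not the complete schema).
phenopacket = {
    "id": "patient-001",
    "subject": {
        "id": "patient-001",
        # ISO 8601 duration: a four-year-old subject.
        "timeAtLastEncounter": {"age": {"iso8601duration": "P4Y"}},
    },
    "phenotypicFeatures": [
        {"type": {"id": "HP:0001250", "label": "Seizure"}},
        {"type": {"id": "HP:0001263", "label": "Global developmental delay"}},
    ],
}

# Machine-readable exchange format: serialize for sharing between systems.
serialized = json.dumps(phenopacket, indent=2)
```

Because every feature carries an ontology term ID rather than free text, a record like this can feed directly into the graph-based profile matching described earlier.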