So, I'm going to talk today about some work that we've been doing in the context of a consortium called the Monarch Initiative. It's a fairly new consortium led by Peter Robinson, Chris Mungall, and myself, along with other investigators around the world, and the idea is to have clinical, computational, and basic research scientists jointly leading large-scale data integration efforts for genotype-to-phenotype relationships. I did a back-of-the-envelope count the other day: in OMIM right now, there are more than 3,000 Mendelian diseases with no known genetic basis, and in ClinVar there are 66,000 variants with no pathogenicity assigned to them. And those are, of course, just the things that have been recorded in these repositories; we know there are very many more. So we know a lot about the genome, but we still don't know very much about what it does. If you look at this figure, we took all the coding genes in the human genome and asked how many of them have causal mutations for a known disease or phenotype. That number, from OMIM plus ClinVar, is about 20 percent. If we then take the orthologs of every coding gene in the human genome and compare them against the five most widely used model organisms, so rat, fly, worm, mouse, and zebrafish, and ask whether we have any phenotype data known to be caused by mutations in those genes, we go up to 80 percent coverage. So that's an enormous amount of data that's not being well leveraged, hiding in the model organism databases and, mostly, in the literature.
The other thing is that if we look across the different organisms, and this is a very complicated figure, but essentially it shows different data sources over here on the left, then all of the different species that have been integrated into the system we've been building, and across the top the phenotype categories, just a few shown here, so integument, skeletal defects, muscle, we see that we actually learn different phenotypes from different organisms. So in aggregate we get much better coverage, and in fact many of the types of phenotypes we see are only seen in one organism. So we want that really broad coverage. Some organisms are better for looking at certain kinds of phenotypes. We study zebrafish quite extensively for neural crest development, but there are other non-model organisms in there too: opossums are really excellent species for studying craniofacial development, and naked mole rats for cancer. So there's a lot of interest now in including phenotypes in a more structured way from non-model organisms for which we don't actually have model organism databases. The problem, though, is that different communities use different vocabularies to describe their phenotypes. Palmoplantar hyperkeratosis is a clinical term; the same thing might be described by a patient as having thick hand skin, and the same phenotype seen in a mouse is described as ulcerated paws. So it's not a string matching problem to try to associate these phenotypes with one another across species. The challenge, then, is that every source uses its own vocabulary, and even on the human side we have very many different vocabularies for describing phenotypes. The same is true across all of the model and non-model sources as well. So how then can we help machines understand what these phenotype terms mean?
The computer does not know what palmoplantar hyperkeratosis means; it only knows that it's a string of letters. So how do we turn this string of letters used by a clinician into something that is computable across all organisms? The answer is that we have a universal converter box, and I would conjecture that it is a suite of ontologies. I'm not sure if you all have talked about ontologies yet, but that's what I am, an ontologist, so there's going to be some more of that coming. The ontologies can serve as a bridge to help us relate these terminologies across sources and across species. So here is an example, taking that same term. Palmoplantar hyperkeratosis can be described logically: essentially we use logical axioms, which I'll show you in a minute as a graph, to decompose this term into something that we can compute on. In this case we would describe the term as an increased keratinization, where the "increased" quality comes from a standardized vocabulary of qualities, and the keratinization term comes from the Gene Ontology, in combination with a representation of the anatomy from the Uberon anatomy ontology: the stratum corneum layer of the skin, located in an autopod. You can see where I'm going with this, because if I represent that clinical term, palmoplantar hyperkeratosis, using these components, I would also use those same components to decompose the term that was used in the mouse, which was ulcerated paws. Because the logical decomposition of the human term matches exactly the decomposition of the mouse term, I now know that those two terms are equivalent across species, and that's essentially how we logically walk across the species.
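The cross-species equivalence argument above can be sketched in a few lines of code. This is only an illustration of the entity-quality style of decomposition; the term IDs and field names are invented placeholders, not the actual ontology identifiers:

```python
# Sketch of an entity-quality (EQ) decomposition for two species-specific
# phenotype terms. All IDs are illustrative placeholders.

# "Palmoplantar hyperkeratosis" (human) and "ulcerated paws" (mouse)
# decompose into the same quality + process + anatomy components drawn
# from species-neutral ontologies (a quality vocabulary, the Gene
# Ontology, and Uberon).
human_term = {
    "id": "HP:PalmoplantarHyperkeratosis",   # placeholder ID
    "quality": "PATO:increased",             # standardized quality
    "process": "GO:keratinization",          # Gene Ontology process
    "anatomy": ["UBERON:stratum_corneum", "UBERON:autopod"],
}
mouse_term = {
    "id": "MP:UlceratedPaws",                # placeholder ID
    "quality": "PATO:increased",
    "process": "GO:keratinization",
    "anatomy": ["UBERON:stratum_corneum", "UBERON:autopod"],
}

def eq_signature(term):
    """Reduce a term to its species-neutral logical components."""
    return (term["quality"], term["process"], tuple(sorted(term["anatomy"])))

# Identical signatures mean the two terms are cross-species equivalent.
print(eq_signature(human_term) == eq_signature(mouse_term))  # True
```

In the real system the decomposition is expressed as OWL logical axioms and checked by a reasoner rather than by tuple comparison, but the principle is the same: matching decompositions imply equivalent terms.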
So the Human Phenotype Ontology, which was originally developed by Peter but now has quite a large number of people contributing to it in many different ways, is a graph structure that represents a suite of clinical phenotypes. People always ask us, why do we need yet another clinical vocabulary? Don't we have enough yet? The answer is that yes, we do need it, because what most of the clinical vocabularies lack is a representation of the patient's phenotypes that treats the patient as a biological subject, in the same way that we would treat a model organism. We have to think more atomically about the actual phenotypes that we're seeing, not about billing or quality of care, which is what most of the other clinical vocabularies are designed for. That's not to say they can't be used for many other things as well, but they aren't designed to be interoperable in the way I just described. In this graph, we have the term hyposmia, which is defined logically in the way I showed you earlier, in terms of a Gene Ontology term, sensory perception of smell. Here we have deeply set eyes, which is a subtype of abnormality of globe position, which is a subtype of abnormal eye morphology. And similarly over here we have motor neuron atrophy, and you can get a sense that different anatomical systems are represented and that there's some logical hierarchy there. But what's not shown in this graph is that each one of these terms is logically defined, in the same way I showed in the last slide, in terms of these other ontologies on which there exists an enormous amount of data. Over on the Gene Ontology side, that term is related to 34,000 different annotations in 22 species.
So we know a lot about what causes hyposmia based on the underlying logic of the ontology and its relationship to the genetic makeup. The same is true for anatomy, for cell types, and for a variety of chemicals and drugs. This forms the basis of large-scale data integration. Shown here are some of the data sources that we have in Monarch, with the different data types generally categorized into genotype-to-phenotype buckets; we get different types of data from different sources. Each one of these sources uses a different ontology to capture its phenotype data, gene expression data, genotype data, or anatomical data. In some cases there isn't an ontology at all, and we use text mining tools and manual curation in combination to facilitate the use of data that may not otherwise be computationally tractable. And then we build what we call bridging ontologies. We'll start over here, since it's the easiest to conceive of. The Uberon anatomy ontology is essentially an uber anatomy ontology: it subsumes all the other anatomy ontologies and represents anatomy across all metazoans. Because of that logical infrastructure, with about 14,000 terms now in there, we can actually walk across things as diverse as fly to mouse to human anatomically. We've done the same thing for phenotype data, so we have the Human Phenotype Ontology over here, but there is a variety of other things: the mouse phenotype ontology, the vertebrate trait ontology used by the rat database, worms, flies. Similarly for diseases, there are very many different disease vocabularies, and they all suck for their own special reasons and they're all awesome for their own special reasons. We really struggle with the disease vocabularies, because a nosology of disease is very task specific anyway.
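The "walking across species anatomically" idea can be sketched as a small graph computation. The mappings and is-a edges below are invented for illustration; in practice Uberon carries cross-references to the species-specific anatomy ontologies and a much richer axiom set:

```python
# Minimal sketch of how a bridging ontology like Uberon lets us relate
# species-specific anatomy terms. All mappings and edges are invented
# for illustration.

# Species-specific terms mapped onto species-neutral Uberon terms.
xrefs = {
    "MA:forelimb_paw": "UBERON:autopod",           # mouse anatomy
    "FMA:hand": "UBERON:autopod",                  # human anatomy
    "ZFA:pectoral_fin": "UBERON:paired_limb_fin",  # zebrafish anatomy
}

# is-a edges within the bridging ontology (child -> parent).
is_a = {
    "UBERON:autopod": "UBERON:limb",
    "UBERON:limb": "UBERON:paired_limb_fin",
}

def uberon_ancestors(term):
    """Return the term plus all of its is-a ancestors."""
    seen = [term]
    while term in is_a:
        term = is_a[term]
        seen.append(term)
    return set(seen)

def related(a, b):
    """Species-specific terms are related if their Uberon images share an ancestor."""
    return bool(uberon_ancestors(xrefs[a]) & uberon_ancestors(xrefs[b]))

print(related("MA:forelimb_paw", "FMA:hand"))   # True: both map to autopod
print(related("FMA:hand", "ZFA:pectoral_fin"))  # True: shared limb/fin ancestor
```

The payoff of the shared backbone is exactly this: a mouse paw and a human hand, annotated in completely different vocabularies, become comparable through their common Uberon ancestors.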
So in our case, we've aggregated some of the rare disease sources and a few other minor resources, along with MedGen and MeSH, to create our own hierarchy to support the algorithmic types of analyses that we want to do. Finally, I don't really have time to talk about this, but we've also done the same thing for genotype. One of the biggest struggles we have in integrating these data across all these sources is that every source associates phenotypes with a different aspect of the genotype. One source will say this phenotype is related to an allele, another will give you a SNP, and another will give you a full genotype. So how do we aggregate these data when the meaning of what's actually causal differs across sources? This is a little uber genotype ontology that allows us to propagate phenotypes properly. Okay, so we're trying to harmonize all these things. How does this work when we actually put it together? Over here, in the middle, we have a patient with a set of phenotypes. Hopefully you can see them, because I can't; maybe we have microcephaly there at the top. That patient has a set of phenotypes, which we would call a phenotypic profile, or in some cases a phenoprint. The goal, then, is to see what known diseases or what known models have the most closely matching phenotypes. Setting aside for the moment searching for things based on orthology or genomic region, this is really just a phenotype similarity matching problem. And here in this case, we can see that the microcephaly over here at the top matches hypoplasia of the frontal lobes. But we might not have a match for that over here on the mouse.
There are, however, other terms that match from the phenotypic profile to the mouse. And so by doing this, we can actually prioritize variants that might be known for disease D for patient A, or variants that are known for mouse M for patient A, or its ortholog in this case. So that's the idea behind phenotypic matchmaking across species to inform diagnostics. The other thing you win with this is that, as it turns out, there's a clinician over here who is actually phenotyping the patient, or in some cases has published a paper or otherwise provided data relating to disease D. And on this side, there's a person who phenotyped mouse M and has a deep understanding of the phenotypic variation found in the types of genes or the types of disease models that this person might be studying. By doing this phenotypic matchmaking, we can also matchmake clinicians to basic research scientists, because in the end, the person you want phenotyping your potential model organism, the one that might be modeling your rare disease patient, is not necessarily the person who studies the same gene family, but the person who's experienced in the right kinds of phenotype assays, because those things are very specialized. Here's just a quick example of how we've applied this, and there's a link to the paper there. We have a tool that Peter alluded to earlier called Exomiser. It also now works on whole genome sequence, although we haven't tested it on any cohorts yet. And in this paper, we worked with the Undiagnosed Diseases Program to apply these phenotype similarity matching algorithms to help prioritize a variant for this family in the STIM1 gene.
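The phenotype similarity matching problem described above can be sketched with a toy ontology. The real algorithms behind Exomiser use information-content-weighted semantic similarity rather than the plain set overlap shown here, and all the terms below are invented stand-ins, but the shape of the computation is the same: expand each profile through the ontology hierarchy, then score the overlap:

```python
# Toy ontology closure: each term maps to itself plus its ancestors.
# Terms are invented placeholders, not real HPO/MP identifiers.
ancestors = {
    "microcephaly": {"microcephaly", "head_abnormality", "phenotype"},
    "small_hands": {"small_hands", "limb_abnormality", "phenotype"},
    "frontal_hypoplasia": {"frontal_hypoplasia", "head_abnormality", "phenotype"},
    "kinked_tail": {"kinked_tail", "tail_abnormality", "phenotype"},
}

def closure(profile):
    """Union of each term's ancestor set: the profile's ontology closure."""
    out = set()
    for term in profile:
        out |= ancestors[term]
    return out

def similarity(p, q):
    """Jaccard similarity of two profiles' closures (real systems weight by information content)."""
    a, b = closure(p), closure(q)
    return len(a & b) / len(a | b)

patient = ["microcephaly", "small_hands"]
mouse_model = ["frontal_hypoplasia", "kinked_tail"]

# The profiles share no terms directly, but the closure exposes the
# head-abnormality match, so the score is nonzero.
print(round(similarity(patient, mouse_model), 2))  # 0.25
```

Ranking every known disease and model by such a score against the patient's phenoprint is what lets candidate variants for disease D or mouse M be prioritized for patient A.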
In this case, there was no known information in any of the public clinical databases, but we had a really nice match from the MGI source, and in combination with more traditional exome analyses, pathogenicity measures, Mendelian inheritance patterns, and frequency filters, that phenotypic similarity to the mouse let us prioritize this new disease caused by a STIM1 mutation. So getting back to the central dogma of biology: genes plus environment equals phenotypes. Well, we all know it's not really that simple; it actually looks something more like that, right? One of the issues we have in trying to integrate these data is that the standards and the ways we represent genes, environment, and phenotypes are not actually all that great. We're doing better with genes, but we are not doing very well with environment or phenotypes, and we're certainly only scratching the surface in trying to describe all of the crazy lines you see there. The standards for encoding and exchanging data computationally must be up to the challenge of representing all of this. So here we are: we have a variety of formats for exchanging sequence data, but we really don't have a format for exchanging environment or phenotype data. We have vocabularies for describing them, and those are standards, but we don't have a standard exchange format that we can use computationally in any context. And so a group broader than the Monarch Initiative, as part of the Global Alliance and a number of other organizations, has been working on a new phenotype exchange format. And because here we are at the right time and place for this example, I'm going to explain to you what goes into what we fondly refer to as a phenopacket.
The reason we call it a phenopacket, a fond name for the phenotype exchange format, is that it's a packet of phenotype data that I can hand to you, or to you, or to you, no matter what kind of biologist or clinician you are or what kind of context you're in. Here you can see that we have Donald Trump; he's male, and he really likes the canonical JSON format. His phenotype profile is displayed very simply: here we've used the HPO, and he has small hands, and this was described during development, so he's been assigned a congenital onset. This is a traceable author statement, and it was actually made by me. In this way it's really simple, but it's just this simplicity that makes it useful in all the different contexts we might see out there. And there are very many different contexts out there. We want to exchange clinical phenotype data: in that sense it's a proxy for clinical EHR data or other clinical phenotyping data that you can't share broadly because of privacy constraints; these proxies can be shared broadly and, at the end of the day, are computable in all the different contexts where we have these algorithms. But it also works for model organisms, disease vectors, crops, biodiversity, domestic animals, and we can use it for personalized medicine, drug discovery, genetic engineering; the possibilities are endless. It's the same idea as a FASTA file for sequence data, which is used in so very many different contexts.
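A phenopacket along the lines of the example above might look roughly like this. The field names here are an illustrative sketch, not the authoritative schema, and the HPO ID and evidence coding are placeholders; consult the actual exchange-format specification for the real structure:

```python
import json

# Hypothetical minimal phenopacket mirroring the example from the talk.
# Field names and the HPO identifier are illustrative, not authoritative.
phenopacket = {
    "id": "example-phenopacket-1",
    "subject": {"id": "patient-1", "sex": "MALE"},
    "phenotypic_features": [
        {
            "type": {"id": "HP:XXXXXXX", "label": "Small hand"},  # placeholder HPO ID
            "onset": {"label": "Congenital onset"},
            "evidence": {"code": "TAS", "label": "traceable author statement"},
        }
    ],
}

# Serialize to JSON: this is the "packet" that can be handed to any
# consumer, clinical or model-organism, without exposing EHR data.
print(json.dumps(phenopacket, indent=2))
```

The point of keeping the structure this small is exactly what the talk argues: a flat, ontology-coded list of features with onset and evidence travels across every context, the way a FASTA record does for sequence.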
Just a couple more quick slides. We have a model that we've been working on, as I described earlier, for evidence modeling. This is a collaboration with the BRCA Exchange, and ClinGen is also participating. Essentially what we're doing is trying to tease apart the claim, that is, the genotype-to-phenotype association, from the evidence for that claim and where it came from. We're using BRCA as a first test case, but essentially doing this for all of ClinVar and, like I said, in collaboration with ClinGen, trying to build a model where we can tease apart the functional evidence, where we see such diversity in the pathogenicity calls, and then hopefully build something that's a bit more computationally tractable, so that we can combine evidence when it comes from very many different places, which is our big challenge. The last thing is that phenotyping isn't free, so how do you know how much phenotyping you need to do? We struggle with this a lot in the Undiagnosed Diseases Program, where it could take a whole day of phenotyping to create a phenotype profile that's computationally useful. One of the things I would like to suggest, and this is not specific to phenotype data, is that whenever we're annotating any data, we should be using all the data we already have out there to better inform the curation we're doing in any given context. For phenotyping, it's like when you go to Amazon and see "users who shopped for this also liked that"; well, there's an army of ontologists behind the scenes making those relationships to help you figure that out.
And this comes from an example from Star Trek that I unfortunately don't have time to show you, but the idea is that if you're looking for a very rare phenotype, it won't take very many descriptions, whereas if your sets of phenotypes are much more common, or limited to one anatomical system, you're going to need to go a lot deeper. We have a tool and some metrics that take advantage of the data we've aggregated to do that, and this is available via services for other tools as well. So in summary: deep phenotyping within and across species can aid diagnosis, discovery, and translational matchmaking of clinicians and basic scientists. We desperately need this exchange standard to facilitate distributed phenotype data sharing for patients and across species. And this computable evidence model, we hope, will bring better computational tractability to the diversity of functional evidence that we see in all the pathogenicity calls out there. Very many thanks to all of our collaborators and data sources, without whom we would not exist. And a special thanks also to Chris Mungall, who is not here today, for co-leading this initiative with Peter and me. So thanks very much. Great, Mark. So, two quick questions. One is about the model for persistence and sustainability of this extremely valuable resource. The second is that I think you've done an excellent job of showing how you can take all the different source data and map it across ontologies into something that's tractable. To make it to the bedside, of course, there's now another translation step: how do we get these resources to interact with laboratory information systems and electronic health records? And the question around that is, I know we've already got some connections between the ClinGen resource and Monarch, so is the strategy going forward to leverage ClinGen for that step of the translation?
Or do you anticipate that working directly with standards accepted by EHRs, either through an API using FHIR, or perhaps translating resources into HL7 and OpenInfobutton, that is, more direct approaches, would be better? That's a really good question, and we've really just started taking the first steps on how to approach that. The Exomiser tool is used more on the clinical research side for rare disease around the world. So we have it working in those clinical settings, but that's, of course, a very small population, and our data right now are fundamentally best suited to that context. We have a new effort going on now to work more on cancer modeling and cancer phenotype data, but the system isn't really designed for that right now. So it's focused mostly on Mendelian rare diseases at the moment, but our idea is to now take these technologies and try them out on other more complex and common diseases. That's one thing. The second thing is that we have a tool called Patient Archive, which is being used in a number of other countries, not so much here in the U.S. yet. It essentially aggregates some of these functions into one platform: it performs text mining on the clinical notes and spits out a phenotype profile that the clinician can then vet, and then shows some of the analyses, all in the same platform. It's very much a brand new tool, and we'd love to have feedback and testers on it. We have a couple of universities here in the U.S. that are going to be testing some integration of that in the next year.
We've also been talking, since we work a lot in the Global Alliance, where there's a new effort to coordinate some of the genomic APIs with the FHIR standard, about essentially a triangle between the current genomic APIs, FHIR, and phenopackets, because the phenotype part, the phenopackets part, is what's been missing so far from both of those. So there's a new GA4GH working group to bring those three efforts together. And once that's been done, we can think about SMART apps, which I think are a great mechanism for implementing something like this. We've met a couple of times with the Epic folks as well, and they seem keen to do it, but it's not exactly clear where the road is, so I would love advice on that. So Mark wants a follow-up, and then we'll come over here. Right. So beware of grabbiness from Epic; you may find everything disappearing into the maw. I see. Just, you know, buyer beware there. Have you considered testing your phenotyping text miner with eMERGE, since we are heavily invested in phenotyping and have certainly done some text mining? It seems like that would be a real natural opportunity to test across a number of institutions that are very familiar with extracting phenotypes from electronic health record data. Yeah, a number of people have asked me that, and I hear there's a new rare disease effort in eMERGE, but I have yet to actually have those conversations. We would be delighted to help with that effort. I think the rare disease use case is very different from the eMERGE algorithms that exist now. Looking at those algorithms myself before, they were pretty non-applicable to the kind of work we were doing, but now that we're overlapping in this rare disease space, I think there are a lot of opportunities to work together and take advantage of their nice social network and process. So it's a great time to do that. This is really great work.
I'm thinking about one of the potential issues: the subjectiveness of phenotyping information. When you say increased keratinization, etc., what does "increased" mean? And can we move toward something more quantitative? That's difficult to do in the clinic, in part because the technology we use is not very advanced. I go to my clinic with a seamstress tape and a six-inch plastic ruler, which is not new tech, and we have not really exploited opportunities to use new tech both to improve the quality of the information in terms of its quantitation and to speed the process of phenotyping individuals. So I encourage that as well. Thanks. There are a lot of really important points in there. One comes back to the contextual data interrogation part. One of the things I think is especially true in the model organism community, but is also true clinically, is that everybody is expert in particular areas of phenotyping. If you're looking at a patient or a zebrafish, you might be an expert in craniofacial development, so you might not notice some cardiac defects, especially if they're subtle. But if you record those specialized craniofacial defects and the system says, hey, we actually know from this mouse, a completely third organism, that those phenotypes commonly co-occur with this heart phenotype, maybe you should look for this, then that's another way of informing what we should be looking for, so that we can be more effective phenotypers across all of those species. That's one thing. The second thing is the quantitative piece. The use of semantics is really a proxy for quantitative data. We've actually taken the Mouse Phenome Database and worked closely with them to convert all their quantitative data to qualitative semantic annotations. And the way we do it is very brute force.
We just say that anything that's more than plus or minus three standard deviations away from the mean of whatever population was evaluated counts as abnormal, or increased or decreased. What we'd like to do is let the user specify where to draw those lines instead of us doing it. Right now it's set up so that we take a very conservative approach in assigning abnormality, but it's better if the user does it. Part of that also relates to the evidence and provenance piece, because of course there are the reference populations and the reference definitions for the assay, especially for clinical labs, where you potentially have different population values to measure against. You want to know which guidelines were used, what the date was, and the population against which the measurement was evaluated. So I can imagine a system where you might have user-specified slider bars or something like that, where you could tailor the system to take that into account. But it's important to recognize that this type of technology is a proxy for that data. It is not trying to operate natively on the quantitative data, but rather to lift out of those large numbers of quantitative data sources some of the bits of value that might be related across species, to help send the user to the right place to then go look at the data. So did I get all your questions? There were three of them in there. We're running a bit behind, so Jose, Peter, and Callum all had questions, but I'm going to ask you to hold those for the discussion period. And we'll move next to Nancy.
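The brute-force conversion described above, with the user-tunable threshold the speaker suggests, can be sketched as follows. The function name and reference values are invented for illustration; the Mouse Phenome Database pipeline is more involved:

```python
from statistics import mean, stdev

def qualitative_call(value, reference, n_sd=3.0):
    """Map a measurement to a qualitative annotation against a reference population.

    Values more than n_sd standard deviations from the reference mean are
    called "increased" or "decreased"; everything else is "normal".
    n_sd is exposed so the user, not the curator, can draw the lines.
    """
    mu, sd = mean(reference), stdev(reference)
    if value > mu + n_sd * sd:
        return "increased"
    if value < mu - n_sd * sd:
        return "decreased"
    return "normal"

ref = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0]  # invented reference measurements

print(qualitative_call(10.0, ref))  # normal
print(qualitative_call(14.0, ref))  # increased
```

Provenance would hang off the same call: the reference list, the guideline, and the date together identify the population the qualitative annotation was made against.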