Today I'm going to talk about a tool that we developed called MARRVEL. I'm going to explain how we developed MARRVEL and walk through a use case from another study that I've been involved in.

By way of introduction, I have to talk about the Undiagnosed Diseases Network, because that is the reason why we developed this tool. How it works is that patients, who usually have gone through many different doctors and cannot be diagnosed, apply to the coordinating center of the UDN. They are then assigned to one of 12 clinical sites, and the patient and the family are invited to the site. The patient will see every type of doctor they can, and will also be sequenced along with their parents. At that point a small proportion of patients get diagnosed, but the majority do not, and then they have a couple of options. One is to narrow down the list of variants, where they have maybe at most five to ten variants that they think might be causing the disease. They can also send patient samples to the metabolomics core, either blood samples or fibroblasts, to get more information. And they can apply to the Model Organisms Screening Center, which I am involved in. There we do our own screening, because we have to decide on a number of things: first, do we want to accept the case; second, do we have the resources to contribute to the case; and finally, which model organism is appropriate. There are three model organisms in the center: C.
elegans, Drosophila, and zebrafish. From there we are able to say whether or not a specific variant is pathogenic, and we may do some drug screening studies and feed that information back to the clinical sites. In, I guess, about five cases there have actually been changes in clinical care, so it has been quite successful. This is in phase two right now.

Going back to the informatics phase at the Model Organisms Screening Center, I want to go through a case study to show you our challenges. This is a study in collaboration with Philippe Campeau in Montreal. There are three families found to have biallelic variants in OXR1, a gene called oxidation resistance 1. In family 1, the individual is compound heterozygous for one splice site mutation and one stop-gain. In family 2, a consanguineous family, the individual is homozygous for a frameshift. And finally, in family 3, there are three individuals who are homozygous for a splice site mutation. The patients share a group of phenotypes: developmental delay, intellectual disability, speech delay, epilepsy, cerebellar hypoplasia or dysplasia, and hypotonia.

So the question I then face is whether or not I should study this case. Has this gene been known to cause a disease already? What is known there? Are these variants found in the control population? What's known in model organisms, et cetera? To answer that I have to go through all of these databases. I have to go to all of these human genetic databases, I have to go to each of the model organism databases, and I also have to go through others, like expression pattern and interactome databases. That takes a long time, and when you have a hundred or more cases coming in every single year, that is a huge time suck.
So our solution is to aggregate all the data that we find useful in prioritizing cases to study and make it into an easy-to-use website that is publicly available. It's at marrvel.org, and MARRVEL stands for Model organism Aggregated Resources for Rare Variant ExpLoration. We describe it as the Kayak for rare disease research, where you go to one place and we gather the relevant data from all over the place.

It's structured around three big types of databases. One is human genetic resources. One is integration resources, such as homolog prediction, because we need to connect these human genetic resources with model organisms. And of course there are the model organism resources.

This was not done by ourselves, of course. It was done in Zhandong Liu's lab; he's a bioinformatics PI, and there are two core people who helped in this project. We also collaborate closely with the Alliance of Genome Resources. Their goal is to gather all of the individual model organism databases and harmonize that data into one place so that, say, a non-fly expert can also access and understand the data from FlyBase. We have a lot of interests and goals in common, so we have been closely collaborating. Norbert Perrimon's lab has also developed a tool called DIOPT, a tool for homology prediction. They basically took 15 different ortholog prediction tools and made them into a score; let's say 10 out of 15 of these tools predict two genes as homologs. This is the gold standard that we use, and that the Alliance of Genome Resources uses as well. It is the best tool that we know of to predict orthologs.

There are two main ways that people currently use MARRVEL. Clinicians use it when, say, their patients have variants in genes X, Y, and Z and they want to know which ones could be good candidates. And from a model organism researcher's perspective, let's say someone finds an interesting phenotype in a mouse mutant.
Is there human disease relevance? What's known in other model organisms? And so on. So that's why people usually go to MARRVEL.

I want to go through how to use it now. This is what the landing page looks like. You can see that there are two search bars, the first for the human gene symbol and the second for the variant. It takes a nucleotide variant, so if you have a protein variant, meaning the amino acid change, you have to click on the other button. And you might notice that at the very top there's a second tab that says model organism search. So if you don't want to start with a human gene, you can go there to start with a mouse gene, for example.

On the top is a variant from my case study. It reads OXR1; NM is the transcript number; then c. means coding; and nucleotide number 1100, a C, is mutated to a G. That corresponds to a protein change where a serine becomes a stop codon. So here I'm entering that variant.

The first type of data that comes up is from OMIM, Online Mendelian Inheritance in Man. It tells me whether or not the gene has been associated with disease. First it gives a small blurb on the gene function; in this case, it tells me how OXR1 was first identified. In my case, OXR1 has not previously been associated with a phenotype. That's good, because we're interested in disease gene discovery. It will then display pathogenic alleles, if any.

The second group of data that we display is from ExAC and gnomAD. I put "control population" in quotation marks because, although it's selected against severe pediatric disease, note that there are adult-onset diseases in this database, for example schizophrenia and cardiovascular disease. So that's worth noting. gnomAD is currently the largest control population database that we have. It has more than 100,000 exomes, and more whole-genome sequences now as well.
So in my case, it reports that there are no matches in gnomAD, which again is good news for me because the variant is still in the running to be pathogenic. It would also tell you whether people are heterozygous or homozygous for that variant, which is quite important depending on whether you hypothesize the disease to be recessive or dominant. And finally, there are the general statistics of the gene. In other sessions, people already talked about the pLI score. This is similar, but it is specifically the observed-versus-expected ratio of loss-of-function alleles. In my case it's 0.2, which means that the gene is relatively loss-of-function intolerant. So loss of one copy of my gene may or may not cause disease. We take that into consideration because the lower this ratio is, the more likely it is that disruption of this gene causes disease.

We also display a number of pathogenicity prediction tools. There are a lot of these out there. How we picked these seven is that we interviewed medical geneticists and asked them which of these tools they use in clinical practice, and which of these scores they use to determine whether or not something is pathogenic when they report back, let's say, in their clinical reports. CADD and REVEL are ensemble methods. M-CAP uses conservation data, and PolyPhen uses a lot of known mutations as well. GERP and phyloP use a lot of multiple alignment and phylogenetic information.

In my case, the results look like this. You might notice that only three out of the seven are available, and that's because a lot of them are missense-specific and mine is a stop-gain variant. In the second column, you can see the scores are kind of all over the place. That's because these scores don't share a common scale, like zero to a hundred; their scales are very different. That's why we also display the rank score, which was developed by another group.
So the first one, for example, has a rank score of about 0.96, which means that this variant is more pathogenic than 96% of all predicted variants. You can see that all three of these tools predict my variant to be pretty pathogenic.

The next group of data that we display is from Geno2MP. It is a collection of variants found in patients and family members from the University of Washington Center for Mendelian Genomics. The unique part of this database is that you actually have some phenotype information, and they present it in the form of HPO profiles, meaning Human Phenotype Ontology. The drawback of this database is that the numbers are quite limited. In my case, there were not any matches. But in some cases you actually get lucky: there are loss-of-function variants, and there are phenotypes that match your patient's, and that can really move your study along and confirm your suspicions.

Briefly, I want to talk about three more human genetic databases that we also display data from. ClinVar is where a lot of researchers and clinicians submit their variants, and they may or may not annotate them as likely benign or likely pathogenic. DGV and DECIPHER are copy number variation databases, where you can answer questions such as: are there control individuals with a loss in copy number of your gene of interest? In my case, there are quite a few people walking around with loss of one copy of OXR1. So that means it's less likely that haploinsufficiency could cause a disease.

The next group of information is from the model organisms. What we did was go to each one of these databases for mice, rats, zebrafish, flies, worms, and yeast, and a lot of that is now under the Alliance of Genome Resources. We display ortholog candidates, tissue expression data, and gene ontology, where we only display the annotations that have direct experimental evidence, because there are a lot that are predicted or orthology-based.
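The gene ontology filter just described, keeping only annotations backed by direct experimental evidence, can be illustrated with a small sketch. It assumes the filter is based on the standard GO experimental evidence codes (EXP, IDA, IPI, IMP, IGI, IEP); which codes MARRVEL itself keeps is not stated in the talk, and the gene name, terms, and records below are invented for illustration.

```python
# Standard GO evidence codes for annotations supported by direct experiments.
# Assumption: this is the set a "direct experimental evidence" filter would use.
EXPERIMENTAL_CODES = frozenset({"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"})


def experimental_only(annotations):
    """Keep only annotations whose evidence code is experimental."""
    return [a for a in annotations if a["evidence"] in EXPERIMENTAL_CODES]


# Invented example records (hypothetical gene and terms, not MARRVEL output).
annotations = [
    {"gene": "geneX", "term": "lysosome organization", "evidence": "IMP"},
    {"gene": "geneX", "term": "mitochondrion", "evidence": "IEA"},  # electronic
    {"gene": "geneX", "term": "protein binding", "evidence": "IPI"},
    {"gene": "geneX", "term": "nucleus", "evidence": "ISO"},  # orthology-based
]

kept = experimental_only(annotations)
for a in kept:
    print(a["term"], a["evidence"])
```

Here the electronically inferred (IEA) and orthology-inferred (ISO) records are dropped, which mirrors the reason given in the talk: predicted or orthology-based annotations would otherwise be circular when the goal is to compare a human gene against what was actually observed in a model organism.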
And we provide some very useful links as well. Here is a screenshot of what that looks like. On the top, you can choose whether or not you want to display only the best ortholog. In the first column there's the homolog, then the DIOPT score, then the expression pattern, and then the three columns of gene ontology. We add some links that we think are really useful for people doing this. First is Monarch. They really focus on phenotype mapping between all of these model organisms, so if there is a match, you can also predict what type of disease your gene might cause. And of course IMPC. Right now we only have a button link, but we are developing ways to display the data from IMPC as well. I'll talk about that a little bit later.

For human expression data, if you click "show all," you can see the entire bar graph. I'm interested in whether or not my gene is expressed in the nervous system, since my patients have a lot of nervous system defects. The gold part is all nervous tissue. So that's good news for me: the gene is expressed in the nervous system.

If you scroll down, you can see there's some zebrafish information and fly information. Here I see that mtd, or mustard, is the fly homolog of my gene of interest, and it's also expressed in the head, eye, brain, and these nervous tissues. At this point, you can also go back to the model organism search tab and put in that specific model organism gene; in my case, it's mtd. If you click search, what comes up is all the human ortholog candidates. Here I see that both OXR1 and NCOA7 are pretty highly predicted to be orthologs. I can hypothesize that they're paralogs, and that NCOA7 may or may not give me more information on what this gene does.
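The idea behind a DIOPT-style score, counting how many ortholog prediction tools agree on a gene pair, and the "best ortholog" toggle can be sketched as a simple vote count. This is a simplification under stated assumptions, not DIOPT's actual implementation: the tool names, gene names, and predictions below are hypothetical, and every tool is given an equal vote.

```python
from collections import Counter


def vote_scores(predictions):
    """Count, for each candidate ortholog, how many tools predict it.

    predictions maps a tool name to the set of genes that tool predicts
    as orthologs of the query gene.
    """
    votes = Counter()
    for genes in predictions.values():
        votes.update(set(genes))  # each tool contributes at most one vote per gene
    return votes


def best_orthologs(votes):
    """Return every candidate that ties for the highest vote count."""
    if not votes:
        return []
    top = max(votes.values())
    return sorted(gene for gene, count in votes.items() if count == top)


# Hypothetical predictions for one model organism gene (invented data).
predictions = {
    "tool_a": {"geneA", "geneB"},
    "tool_b": {"geneA", "geneB"},
    "tool_c": {"geneA"},
    "tool_d": {"geneB", "geneC"},
}

votes = vote_scores(predictions)
print(votes["geneA"], votes["geneB"], votes["geneC"])
print(best_orthologs(votes))
```

In this invented example two candidates tie for the top score, which is the same situation as the case study, where a single fly gene returns two highly scored human candidates that are likely paralogs of each other.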
So going back to the results page, the next group of information we show is protein domains and multi-species alignment. It shows a very simple list of the protein domains, in this case LysM and TLDc. We also show a multiple protein alignment across organisms. From that I can see that there's good conservation in the domains but less conservation in the other regions. This is what the alignment in a domain looks like; the domains are highlighted in magenta. If you scroll up a bit to where there's less conservation, you can see that this is outside of a domain region. There's also a function at the bottom to select the amino acid of interest, and it will highlight it in teal for you. This is especially useful for us because we do a lot of missense analysis.

So in summary, we got a lot of good information from there, and now I can make a more informed decision on whether or not to proceed with this project, what kind of experiments I should do, and what model I should do them in.

This tool currently has more than 1,500 users every month, with about 78% return users. A lot of different databases have asked us whether we would include their data on our platform, and there are also some downstream tools that would like to incorporate us into their workflows. So with that encouragement, we have quite a number of future development goals.

The first is what we call MARRVEL 2.0. As Colin mentioned earlier today, we were recently funded to incorporate a lot of Common Fund data sets. The two main ones that we are focusing on are IMPC and Pharos, or IDG. Model organism phenotypic information is currently not displayed on MARRVEL, and we're working on that. The second one is approved drugs. Let's say I pursue this gene of interest. I want to know if there are any approved drugs that target my gene, so I can readily test those drugs after I make my model.
We're also going to include more human genetic databases and cross-species comparison of expression patterns, as well as batch query: people sometimes want to query a number of variants at a time, or a whole list of genes at a time, and we will allow for that.

This is a mock-up of how we would want to display the phenotypes. We only show a yes or no, a green box or not, but if you click on the green box you then get a list of the phenotypes. So we're working on that to integrate the IMPC data. For Pharos we are specifically interested in the approved drug section. For example, this gene was also a UDN gene, and it would have been very useful if we had immediately known that there is an approved drug targeting this gene. We should test whether that drug improves our model organism's phenotype and feed that information directly back to the clinician, and they would actually be able to try it if the patient is convinced by the data and everyone's comfortable with it.

So going back here, we recognize that this is still a very manual process even though we gathered all this data, and it does require a lot of background knowledge to use. What we want to do is develop an AI-based program for variant prioritization. Another thing we noticed when we interviewed medical geneticists and asked them how they prioritize variants is that everyone has a very different set of logic and a different number of tools that they use. So we combined their logic and the tools that they used and put them all in one place, because we said, why not make sure that everyone takes advantage of every single tool available out there? What we came up with was a flow chart that looks like this, which we call an expert-guided decision induction system. I want to zoom in on one of its six modules to show you a little bit of what that means.
For example, module five is a very gene-centric module, and it really answers the question: how likely is disruption of this gene to cause a disease? First, we evaluate the spatiotemporal expression pattern; we want to check whether or not your gene is expressed in the tissue that is affected. Secondly, we want to know which other genes are co-expressed with your gene of interest, and what protein-protein or genetic interactions involve your gene of interest. From there, are there any known disease-causing genes that interact with this gene? If there are, then this gene immediately moves higher up in our suspicion that it is also a disease-causing gene. At this point, we only have a conclusion of higher likelihood or lower likelihood, and the big question is how we weigh these components. This is where the machine learning part comes in. Right now we're working with the Undiagnosed Diseases Network and Baylor Genetics to get some labeled data, to figure out how we can accurately weigh all of these different components, and to see if we can make the prioritization of variants much more automatic, not in a way that it's a black box, but in a way that people can accept because it follows the logic of experts.

So going back to the case study, I want to walk you through what happens after all of this. This is a study that will be published online in a few weeks. What is known about oxidation resistance 1? It was first discovered because its overexpression suppresses reactive oxygen species-induced mutations in E. coli. What's also known is that the mouse knockout of Oxr1 causes early death and cerebellar defects. As I mentioned already, mustard is the fly homolog of these two human genes, OXR1 and NCOA7. And OXR1 has been reported in connection with a lot of different phenotypes.
All of these have been reported, but there are really no known molecular mechanisms that tie everything together and explain all of these phenotypes. So now that we have some patients, can we, by studying these patients, understand this gene's molecular function?

The first thing to do is to look at the mutations themselves. Here I map the four mutations onto the protein diagram. As you can see, there's a stop... I don't know if you can see my arrow. No? Okay. But there's a possibility that a very small protein is being made. So I took two commercial antibodies, and we were able to obtain fibroblasts from two of the affected individuals. On the western blot, there are three controls, individual 3.4, who is an unaffected sibling, and two affected individuals. You can see that I could detect no protein at all in the affected individuals, meaning that these variants are severe loss-of-function, if not null, alleles.

Now that we know this, we can model it in the fly. We created null mutations in the fly. The first thing I did was check for oxidative stress, because that's what the gene was found to be involved in, and the effect was actually very, very mild. I'm not showing the data for time's sake. What we did find is that when I specifically knocked down this gene in the fly neurons, there was an early death phenotype where most of the flies were dead by two weeks. When we did H&E sectioning of the adult fly brain, we saw massive neuronal loss, and we actually called this a cheesy brain.

So then the question is: okay, it's not oxidative stress, so what could it be? What we found was that there is a lot of lysosomal accumulation in the fly brain. This is an acridine orange assay, where a very specific excitation and emission spectrum reports lysosomes. And then the question is: now I know that my fly gene mustard probably plays a role in lysosomes. Does this also inform the human gene function?
What I tried to do is rescue my fly mutant with human cDNA. Here's a diagram of my rescue experiment. The first is my mutant. The second is a genomic rescue of the fly mutant. The third is a fly cDNA rescue of the mutant. And the last two are human cDNAs. OXR1 and NCOA7, surprisingly, rescue my fly mutants very well. What this tells me is that OXR1, NCOA7, and my fly gene have molecular mechanisms in common.

I already talked about OXR1, but what does NCOA7 do? It turns out that loss of Ncoa7 in mice is homozygous viable, but the mice do have an increased urine pH and a decreased abundance of V-ATPase subunits in the kidney. Overexpression of NCOA7 promotes vesicle acidification, lysosomal protease activity, and lysosomal degradation. And interestingly, in large-scale proteomic studies, I found that both NCOA7 and OXR1 interact with the lysosomal vacuolar-type H+-ATPase, which is responsible for the acidification of lysosomes. So now we really have a clue to the molecular mechanisms that underlie all of these genes.

We went back to the human fibroblasts, and we also found an increased accumulation of lysosomes: there's an increased LysoTracker signal. When we did TEM on the human fibroblasts, we saw that not only is there an elevation in lysosomes, there's also a very abnormal structure to them, where it looks like there is undigested material in the lysosomes, meaning that it's likely a content degradation defect rather than a biogenesis defect.

So in summary, we found that loss-of-function variants in OXR1 cause a novel human disease. Abnormal lysosomal accumulation is found in both patient cells and the fly brains. And we found that both human genes, OXR1 and NCOA7, can rescue the fly mutant. So here we're really showing a novel molecular mechanism of OXR1 and its involvement in lysosome biology. And here are my acknowledgments.
So the patients are homozygous loss of function, and the western blot was on those patients, so they're nulls?

Yeah, it's from their fibroblasts.

It could be synthetic lethal to make the combination, right. Are there any resources for iPS cells?

People have used iPS cells to model diseases, yeah. So people have been trying to convert fibroblasts to iPSCs, but it's a very long and expensive process from what I know. And there are newer processes that convert fibroblasts directly to neurons, which is relevant for a lot of these genetic diseases, but I think it's not as common yet, yeah.

You think that's something that should be worked towards?

Yeah.

So that we could get differentiation, look at differentiated function in the...

Oh, for sure. I mean, I was only able to study the fibroblasts, for example, but definitely if they were differentiated into neurons, we'd be able to do many more experiments, or have more accurate phenotyping that would be more relevant to our patients. And for drug screening too: treating fibroblasts is nowhere near treating neuronal cells, yeah.

Lovely talk. It could be fun to go through all the sequence data that's now been generated for about 200 strains of mice and about 100 strains of rat, fish out all of the common missense variants, run them through MARRVEL, and add that to your armamentarium of... They will generally be weak alleles and compatible with survival, but it would nonetheless be quite interesting.

Yeah. I believe someone at MGI is already starting to annotate the alleles, or specific variants, and that will take a while for them to do. Another group is actually doing something in primates, non-human primates. They also sequenced the natural population, and that could help inform human genetics as well. Yeah.
How does this relate to Matchmaker? I mean, do people go to Matchmaker for a different purpose, I guess, just to see if there's another patient out there? If they go there, they don't find this other information about paralogs or model organisms. Why aren't model organisms in Matchmaker?

Sorry?

Why wouldn't model organism data get into Matchmaker, or is it a very limited sort of scope in terms of...

Yeah. Actually, model organism researchers can use Matchmaker. We simply don't put in that we have specific variants. But it's not ideal, because not many model organism researchers go into Matchmaker. The other problem is that a lot of the model organisms still have issues with making point mutations. In the fly, we are able to make transgenics quite quickly, like within three months, so we can test those variants. So I would think that flies would be a screening tool, and then of course we work with the Baylor KOMP to actually generate the mouse model later on. So perhaps there are two layers of model organisms: first screening the variant function, and then the disease modeling.

Right. Yeah. And Monarch, which you mentioned, does connect to Matchmaker. Sorry, it's Carolyn. So I think if you enter... That wouldn't give a model organism person an entry in, but if the human person enters in, they do get information back from Monarch. So there is some connection in that way.

Yeah. So we were just trying to figure out other venues to get the data out there. The genius idea yesterday was to do one-page case study publications for every knockout strain, so that they would get into a journal that would host these case studies, and that would put them into PubMed. So the traffic in PubMed that searches a gene name would get a hit on the KOMP data, IMPC, there.

Yeah. Just to follow up. So Monarch does have a mechanism for that. It's not widely taken up. We were talking about this at the IMGC meeting.
And what the mouse community is doing is just blasting Matchmaker with genes. In fact, Pilar talked yesterday about essential genes, and she and Damian Smedley hooked up with a couple of researchers on genes we predicted would be disease genes. And sure enough, they made contact, and they're working them up as disease genes. So we can do better than that, but Matchmaker has crossed a certain threshold with the clinical geneticists. And so we're really kind of trying to hit that.