Hi, I'm Lisa Bastarache, and I'm a data scientist at Vanderbilt University Medical Center. I'm really happy to speak to you here today because I see a lot of the people in this room as sort of clients of mine. I'm hoping that by doing data science on the EHR I can provide interesting methods and opportunities for you all to put into the clinic, and it's just an exciting space to work in.

So before I start talking about a method that we've developed recently to use phenotypes in the EHR to identify Mendelian disease gene variants, I want to give a brief overview of the way I, as a data scientist, see what Mendelian disease looks like in the clinic. If you imagine a clinical population, there's a certain percentage of patients who actually are affected by a Mendelian disease, and among those, some are diagnosed and others are undiagnosed.

Among diagnosed patients you'll see ones like this. This is a patient who's diagnosed with cystic fibrosis, really has the classical manifestations of the disease, and has been found to have two copies of delta F508.

But then you also have patients in a clinical population that look like this one. This patient was born with some pretty severe problems with hypoglycemia, and was later found to have failure to thrive, a large liver, and other features that caused clinicians to think that he may be suffering from a genetic disorder.
However, typical clinical tests didn't reveal any specific diagnosis. He was later enrolled in the Undiagnosed Diseases Network, at which point he received a whole exome sequence, and that too didn't turn up any known pathogenic variants, though it did provide a number of interesting candidate variants that may underlie his disease state. So this patient sort of exemplifies that it's really... I think my... oh, my slide isn't cut off. It's really variant knowledge that's limiting the diagnosis of this individual. And I think that's true for a portion of people who have Mendelian disease and are not diagnosed: it's because we don't know enough about rare human genetic variants and their phenotypic implications to really utilize them and apply them to patients.

But then among undiagnosed patients you also have ones like these. Patient number three and patient number four both have lung phenotypes, including bronchiectasis, but also sinusitis. However, they're adults, and they haven't been diagnosed with a genetic disease. If you were to genotype these individuals, you would find that one of them had atypical cystic fibrosis, while the other had two copies of a known pathogenic variant in a primary ciliary dyskinesia gene. These patients have an atypical presentation because they're adults.
They don't immediately prompt a clinician to start doing genetic screening. And yet if that genetic screening were done, they could be properly diagnosed and in some cases treated differently.

So, on to a description of the method. The method I'm going to talk about is called the phenotype risk score, and the basic idea is that it leverages patterns of Mendelian diseases to identify patients using EHR phenotypes.

This is a clinical description of cystic fibrosis from OMIM. Like a lot of Mendelian diseases, cystic fibrosis is characterized by phenotypes that affect a number of different organ systems. OMIM is a fantastic resource, and it includes thousands of these types of clinical descriptions. They were initially written just in regular free text, but several years ago the people who made the Human Phenotype Ontology mapped all of the clinical descriptions from OMIM into HPO terms. That means that for any given Mendelian disease that's described in OMIM, you can get a set of associated Human Phenotype Ontology terms.

What we did was map all the HPO terms that we could to something called phecodes. Phecodes are claims data, or ICD billing codes, that have been collapsed together into meaningful clinical entities that can be very easily extracted from the EHR. What this series of mappings enabled us to do is describe any given Mendelian disease in OMIM in terms of phenotypes that are very easy to extract from the EHR.

Now, we've spent years doing validation work on these ICD-based phenotypes, and we've done a lot of replication studies with common variants, replicating known associations in the GWAS catalog. So they have been demonstrated to capture some amount of phenotypic variability in a population. And the real benefit is that they're highly portable. Even though EHR systems are all different and heterogeneous, most of them require the people who use their services to pay, and ICDs are a way of
getting that payment to happen. So they're basically ubiquitous as well. And we get a really broad picture of phenotypes by using phecodes: we can get about 1,500 phenotypes out of a population and, like I said, basically instantaneously. It's not a heavy computational load.

What we did with these phecodes, in addition, to form the phenotype risk score is we weighted them by the log inverse prevalence in a large cohort. That simply means that we tried to weight features that are rare and unusual, like bronchiectasis, more heavily than a phenotype like asthma, which is fairly common. To apply this risk score to an individual, you look at the features, or phecodes, that they have, and you sum up the weights to get their score. If their score is high, it means that they're a decent or good match to the clinical description as it's presented in OMIM. If their score is zero, it means they have no clinical features that overlap with the disease.

So the first thing that we did with the phenotype risk score is we tried to make it pass, you know, a very basic test of function: could it distinguish between individuals who are clinically diagnosed with cystic fibrosis and those who are not? And the answer is yes. If you take a group of patients who are diagnosed with cystic fibrosis, their phenotype risk score is about two and a half standard deviations away from what you would expect in a normal, healthy population. So this demonstrates that you can differentiate a group of individuals with a diagnosis for a disease using only the features of that disease.

We applied this to six different Mendelian diseases that were chosen by clinicians and that were common enough that we could have enough exemplars to work with, and it worked quite well in all cases. The one exception, actually, was PKU.
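
[Editor's illustration] The scoring described in this part of the talk (weight each phecode by the log of its inverse prevalence in a cohort, then sum the weights of the disease features a patient has) can be sketched in a few lines of Python. This is a minimal sketch, not the speaker's implementation: the function names, the toy cohort, the feature labels, and the choice of natural logarithm are all assumptions made for illustration.

```python
import math

def phecode_weights(cohort):
    """Weight each phecode by log(1 / prevalence) in the cohort.

    `cohort` maps a patient ID to the set of phecodes in that patient's record.
    The natural log is an assumption; the talk only says "log inverse prevalence".
    """
    n_patients = len(cohort)
    counts = {}
    for codes in cohort.values():
        for code in codes:
            counts[code] = counts.get(code, 0) + 1
    return {code: math.log(n_patients / k) for code, k in counts.items()}

def phenotype_risk_score(patient_codes, disease_codes, weights):
    """Sum the weights of the disease's phecodes that the patient has."""
    return sum(weights.get(code, 0.0) for code in disease_codes & patient_codes)

# Toy cohort with made-up labels standing in for real phecodes.
cohort = {
    "p1": {"asthma", "bronchiectasis", "sinusitis"},
    "p2": {"asthma"},
    "p3": {"asthma", "sinusitis"},
    "p4": set(),
}
# Hypothetical disease profile, not the actual OMIM-derived one.
cf_profile = {"bronchiectasis", "sinusitis", "asthma"}

weights = phecode_weights(cohort)
scores = {
    pid: phenotype_risk_score(codes, cf_profile, weights)
    for pid, codes in cohort.items()
}
```

In this toy cohort the rare feature (bronchiectasis, one patient in four) gets a larger weight than the common one (asthma, three in four), and a patient with none of the profile's features scores zero, which matches the behavior described in the talk.
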
We hadn't thought of it before we selected PKU, but that's a disease that's on the newborn screening panel. If a baby is diagnosed with PKU and has proper dietary control, they don't experience any of the clinical symptoms that are described in OMIM, like intellectual disability and seizures. So we accidentally sort of recapitulated the really wonderful good that newborn screening tests do, and how important it is to get a relevant diagnosis to patients at the right time, because these patients in general look exactly the same as healthy controls.

We didn't develop the phenotype risk score so that we could reproduce the diagnoses that clinicians already made. We developed this method in order to address the question of rare variants and ask about their phenotypic impact.

Several years ago a cohort in BioVU, which is the EHR-linked biobank at Vanderbilt, was genotyped on a platform called the Exome BeadChip. The Exome BeadChip was an unusual chip at the time. It was designed as sort of an intermediate experiment between GWAS and whole exome sequencing, which enabled researchers to look at rare variants that hadn't been explored at scale or in GWAS very often, but without the extreme expense of whole exome sequencing, which five years ago was quite a bit.

So we had about 30,000 individuals on this Exome BeadChip platform, and once you QC it and filter down to rare variants that are in coding regions in a European population, we had about 60,000 rare variants to look at. Using phecodes we could ascertain, like I said, about 1,500 phenotypes. But the unfortunate thing is that if we were going to serially test every rare variant against every phenotype, we'd be doing about 90 million tests. That's basically a non-starter.
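
[Editor's illustration] The arithmetic behind the 90 million figure, and what it would do to a significance threshold, can be worked through directly. The variant and phenotype counts come from the talk; the 0.05 error rate and the Bonferroni-style correction are assumptions added here for illustration, since the talk does not say which correction would be used.

```python
n_variants = 60_000    # rare coding variants after QC (figure from the talk)
n_phenotypes = 1_500   # phecode-based phenotypes (figure from the talk)
alpha = 0.05           # assumed family-wise error rate, not stated in the talk

# Testing every variant against every phenotype.
n_tests = n_variants * n_phenotypes

# Bonferroni-style per-test threshold (an assumed correction method).
per_test_threshold = alpha / n_tests

print(f"{n_tests:,} tests")
print(f"per-test p-value threshold: {per_test_threshold:.1e}")
```

Under these assumptions each individual test would need a p-value below roughly 6 in 10 billion, which is very hard to reach for a variant carried by only a handful of people.
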
I mean, with that kind of correction you're not going to be able to discover anything new, and you're going to be testing things that are simply not interesting, like a variant in CFTR with foot pain. So in general you need to be parsimonious, or, you know, a little bit more frugal, with the number of tests that you perform, especially when you're dealing with rare variants, which are hard to study because by definition you have very little information associated with any individual variant.

So what we did is we leveraged the amazing resource that is OMIM to form hypotheses that we could test, so we could scale down the search space. We did that by creating a general hypothesis: if a variant in a particular gene is linked to a phenotypic pattern, then other variants in that same gene will produce a similar pattern. By doing this we took our 60,000 variants and filtered them down to 13,000 that occurred in Mendelian disease-causing genes, and then we further filtered down to the diseases that were amenable to the phenotype risk score method. In order to get a good profile using the phenotype risk score, you need to have at least a few features. We chose at least three features in order to be profilable, which I'll admit was somewhat arbitrary. There are a lot of Mendelian diseases that have only one feature associated with them, like diseases that are characterized by long QT interval, and we can't create profiles for those. So we had 6,644 variants that we looked at overall.

When we applied them to the exome chip data, we found a number of significant associations. I'm not going to discuss the specifics of these, but if you're interested in these results, we put out a paper in, I think, April or March about the phenotype risk score, and you can read about all of the stuff that we did there. We did some replication work, which went very well.
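
[Editor's illustration] The hypothesis-driven filtering described in this part of the talk (keep only variants in Mendelian disease genes, and only diseases whose profile has at least three features) might look roughly like the sketch below. The function names, data layout, and toy entries are hypothetical; only the three-feature threshold comes from the talk.

```python
MIN_FEATURES = 3  # threshold quoted in the talk, described there as somewhat arbitrary

def profilable(profiles, min_features=MIN_FEATURES):
    """Keep diseases whose profile has at least `min_features` distinct phecodes."""
    return {
        disease: codes
        for disease, codes in profiles.items()
        if len(set(codes)) >= min_features
    }

def planned_tests(variants, gene_to_diseases, profiles):
    """Pair each variant with the profilable diseases linked to its gene.

    Each (variant, disease) pair is one hypothesis: do carriers of the variant
    score higher on that disease's phenotype risk score than non-carriers?
    """
    usable = profilable(profiles)
    return [
        (variant_id, disease)
        for variant_id, gene in variants
        for disease in gene_to_diseases.get(gene, [])
        if disease in usable
    ]

# Toy inputs with illustrative labels (not real phecodes or variant IDs).
profiles = {
    "cystic fibrosis": {"bronchiectasis", "sinusitis", "pancreatitis"},
    "long QT syndrome": {"long QT interval"},  # one feature: not profilable
}
gene_to_diseases = {"CFTR": ["cystic fibrosis"], "KCNQ1": ["long QT syndrome"]}
variants = [("var1", "CFTR"), ("var2", "KCNQ1"), ("var3", "GENE_X")]

tests = planned_tests(variants, gene_to_diseases, profiles)
```

Here only the CFTR variant survives: the long QT profile has a single feature, and the third variant is in a gene with no linked Mendelian disease. Restricting each variant to the diseases of its own gene is what shrinks the search space from tens of millions of tests to thousands.
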
We did some targeted whole exome sequencing and some wet lab work as well. But by generating this data, we were able to actually add to what is known about these variants from a population-based cohort. And the type of information that a phenotype risk score can generate actually fits, I think, really nicely into the ACMG guidelines on finding correlation between phenotype and genotype, so that information generated using this method may be able to be put into context with other types of information to help interpret variants.

The second application I want to mention is work that we do with the Undiagnosed Diseases Network. For those who are not familiar, the Undiagnosed Diseases Network enrolls patients who clinicians believe may have a Mendelian disease but who haven't been diagnosed. Patients who are enrolled and accepted into the program are phenotyped using Human Phenotype Ontology terms, which is excellent for us because we can really rapidly create a phenotype profile that is based on the proband. What we do then is get a list of candidate variants from the UDN physicians as well, and when we have genotype information available on the exome chip, which occurs roughly 15% of the time, we're able to ask about the people in BioVU, our biobank, who have the same genotype as the proband: Do they look similar to the proband? Are they enriched for the features of the proband?
So we produce a score that looks like this. Most of the time, the useful information that we can provide to the UDN is that there's usually a handful of variants that are very unlikely to underlie the proband's disease, because we have an abundance of healthy-looking individuals in BioVU with those variants, or if they're not healthy, they at least are not suffering from a similar-looking disease. But occasionally we get good candidates as well, and two of those candidates have ultimately been found to underlie the proband's disease.

Finally, I want to touch on the idea of finding undiagnosed patients. When we applied this method to variant interpretation, we started with genotype information. We split up the population, looked at people who had these rare genetic variants, and then asked if they were different from the rest of the population. But the success of that variant interpretation process made us think: wouldn't it be wonderful if we could kind of reverse this, start with an entire clinical population, and find individuals who are likely to be impacted by a Mendelian disease? I find this to be a very exciting idea because, although I don't know how many people are undiagnosed and impacted by a Mendelian disease, I think any clinician agrees that the knowledge of pathogenic variants that we have right now isn't perfectly and uniformly applied across the clinical population, and that these patients do exist.

I'm just going to kind of speak philosophically about this, because I don't have data yet to address how well this kind of method works or what is actually needed in order to make it work. But anybody who's going to try to find undiagnosed patients is going to find themselves in what I call the valley of improbability. If you take any random patient and test them for a random Mendelian disease, your prior probability of getting a positive test is very, very, very low. And so in order
to climb out of that low valley of improbability, you need to have information that's going to set that patient apart from the rest of the population. And the thing is that for any one disease there are a lot of pathways to get there. Somebody could be in a really terrible traffic accident and be paralyzed from the neck down, and it's difficult to actually contextualize the phenotypes that you extract from the EHR. But we're working on ways of combining the phenotype risk score with other types of EHR-based resources to do just that.

The one thing I will say is that, based on the research that we've done, if you're going to be looking for people who could be diagnosed with a Mendelian disease but are not, I think it's really important to think about leveraging all of the knowledge that's been generated so far about these Mendelian genes instead of doing a completely agnostic approach.

In general, I have a lot of questions that I'm interested in, and that I'm learning a lot about as I'm here at this conference, about the utility of this type of method. Where would it be most helpful to apply? What are the problems that exist right now, and the barriers to clinical implementation, that this could help with? I also think there are some really interesting questions that we should address about exactly where these undiagnosed patients are and what diseases are most likely to be undiagnosed in the clinic. For that I don't really have any answers, but I'm interested in talking with anybody who wants to discuss it.

I'd like to acknowledge my colleagues, who are uniformly really excellent. And that's it. Thanks so much.

Any quick questions? Okay, we will continue that in the discussion. So next up is Mark Williams: implementation in a learning health care system.