Great, so I'm going to change direction a little bit here. I'm going to talk about experiences from the eMERGE Network, which you heard about from Teri and Mark, and also some experiences at Vanderbilt, really focusing on getting these kinds of data out of the medical record and how I think we can get high-quality phenotypes. This shows a map of the eMERGE Network and the 10 sites currently in it. The goals of the eMERGE Network when it was founded in 2007, with five sites at that time, were to use EMR-derived data to identify high-quality, validated phenotypes and then perform GWAS on them; each site did 3,000 to 4,000 patients for a particular phenotype. We then started realizing we could pool these data across sites and deploy phenotype algorithms across sites to collectively explore new phenotypes, and I'll talk about some of those in particular. We have since moved on to actually implementing genetic data into EHRs for genomic medicine, with both pediatric and adult sites involved. So there's a process to defining a phenotype. I really want to emphasize that we're trying to use the whole electronic medical record to do this, not just billing codes and things like that. We start with the phenotype of interest, work with local clinical experts, and define an algorithm that captures what we think defines a case and a control. Controls really need algorithms too; for a pharmacogenomic phenotype, for example, you'd want controls with the target drug exposure but without the adverse reaction. Then you evaluate that algorithm: physician experts review the putative cases and controls using the EHR and decide whether they really are cases or controls, and if the precision, or positive predictive value, is not sufficiently good, you revise the algorithm and review another random set of cases or controls until you get it right.
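The validate-and-revise loop just described could be sketched, in hypothetical Python, roughly like this; the function names, sample size, and PPV target are all illustrative assumptions, not the actual eMERGE tooling.

```python
# Hypothetical sketch of the validate-and-revise loop: run the draft
# algorithm, have physician experts review a random sample of flagged
# cases, and check whether the positive predictive value (PPV) clears
# a target before deploying more widely. Names/thresholds are illustrative.
import random

def ppv(reviewed):
    """PPV = expert-confirmed cases / all algorithm-flagged cases reviewed."""
    confirmed = sum(1 for r in reviewed if r["expert_says_case"])
    return confirmed / len(reviewed) if reviewed else 0.0

def passes_validation(patients, classify, expert_review,
                      target_ppv=0.95, sample_size=50):
    """Sample flagged cases for chart review; if this returns False,
    the algorithm gets revised and the review repeats."""
    flagged = [p for p in patients if classify(p) == "case"]
    sampled = random.sample(flagged, min(sample_size, len(flagged)))
    reviewed = [{"expert_says_case": expert_review(p)} for p in sampled]
    return ppv(reviewed) >= target_ppv
```

The `classify` and `expert_review` callables stand in for the site-specific algorithm and the manual chart review, respectively.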
Now, this is the model we usually use for common phenotypes, which is most of what we've done in eMERGE, but we've also looked at some rare ones. With rare phenotypes, you may not want to push the positive predictive value to perfection, because you don't want to throw away any cases, but you can still look at multiple classes of information to make your review set valid. In eMERGE, once we get the precision good enough, we deploy the algorithm at a given site, validate it at a few other sites, then deploy it across the entire network, get our cases and controls, combine them with our extant genetic data, and do our GWAS. That part can actually happen pretty quickly once you figure out who you're looking for. Our general rule is that there are four classes of elements to combine. Billing codes, ICD-9s and CPTs, end up being the floor: necessary but not sufficient to define a phenotype. As we've all heard, they're imperfect, but they're useful. On top of that, we layer notes, data from pathology notes, clinical notes, dermatology notes, things like that. We do natural language processing on them to identify people who had a disease versus didn't, and to distinguish family medical history from the patient's own history, and we can do that with pretty good validity. Then medication data and exposures become important; we've talked about using things like location, burn units and so on, and you can do that as well. And then labs and test results become important. You combine these classes of data, usually with Boolean logic, but you can also use more sophisticated methods like machine learning, regression models, and scoring algorithms to define your true cases. This just shows one example. We did rheumatoid arthritis early on at Vanderbilt, and there were about 10,000 people we had genetic data on at the time.
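As a concrete illustration of combining the four evidence classes with Boolean logic, here is a hypothetical sketch for a rheumatoid-arthritis-style phenotype; the code lists, thresholds, lab flag, and record layout are all assumptions for illustration, not an actual eMERGE algorithm.

```python
# Illustrative Boolean combination of the four evidence classes: billing
# codes, NLP on notes, medications, and labs. All code lists, thresholds,
# and field names are hypothetical.
RA_ICD9 = {"714.0", "714.1"}                  # assumed RA billing codes
RA_MEDS = {"methotrexate", "etanercept"}      # assumed disease-specific drugs

def classify(patient):
    """Return 'case', 'control', or 'uncertain' for one patient record."""
    code_hits = sum(1 for c in patient["icd9"] if c in RA_ICD9)
    # The NLP flag counts only if the mention is about the patient,
    # not family medical history.
    nlp_positive = patient["nlp_mention"] and not patient["family_history_only"]
    on_meds = bool(RA_MEDS & set(patient["meds"]))
    lab_positive = patient.get("rf_positive", False)  # e.g. rheumatoid factor
    if code_hits >= 2 and nlp_positive and (on_meds or lab_positive):
        return "case"
    if code_hits == 0 and not patient["nlp_mention"] and not on_meds:
        return "control"
    return "uncertain"   # send to manual review, or exclude
```

The three-way output mirrors the talk's point: confident cases, confident controls, and a middle group that is reviewed or discarded.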
And I just want to emphasize that there was a group of people the algorithm really worked for, a group of people we really knew were controls, and then people in the middle. For the people in the middle, you review the ones you think are interesting, throw away the ones that don't have enough information to be either a case or a control, and then do your analysis. We looked at five different diseases in this process, and 21 SNPs that had been reproducibly associated with these diseases. The red represents the published odds ratios at the time, and the blue diamonds represent what we found in our study. We were underpowered for most of these analyses; at the end of the day we were only adequately powered for one analysis, but we replicated eight or nine of these, depending on which genetic model you used. Importantly, they're on the right side of the odds-ratio-equals-one line. We feel this shows that we can replicate known findings using the EHR. Now I'm going to tell you another story from eMERGE. After we had each done our individual GWAS studies, and you can see the phenotypes there: dementia, cataracts, peripheral arterial disease, diabetes, and normal cardiac conduction, we pooled that data and said, let's investigate a new phenotype and reuse this data. That phenotype was autoimmune hypothyroidism. We developed an algorithm at one site and deployed it at the other sites. The algorithm performed really well at four of the five sites and okay at another site, and then we did the GWAS. We discovered that a thyroid transcription factor, FOXE1, was associated with autoimmune hypothyroidism. This was replicated in another population, and subsequent GWASs have found the same result. So this is what that algorithm looks like, just to show you what one of these looks like: we have medications, we have billing codes, we have lab values.
We exclude certain things, secondary causes of hypothyroidism; we looked at timing, so it had to be outside a window of pregnancy and contrast exposure; and we put all of this together to get our algorithm. This slide summarizes the different phenotypes we've done across eMERGE and Vanderbilt. The ones in bold had significant GWAS results, or significant results in that small group of candidate-gene pharmacogenetic studies I mentioned at the bottom. Many of these also produced new findings. Overall, we've looked at more than 40 phenotypes, and we've also contributed to a lot of large studies. I want to highlight a few that deal with rarer events: heparin-induced thrombocytopenia, drug-induced liver injury, and warfarin-related bleeding events. For these, we developed algorithms where we actually didn't get perfect positive predictive values; we had to go in and do some manual curation of the data, but at the end of the day we were able to get case sets we were happy with. So both pharmacogenetic and disease phenotypes have worked. These are results from two of these GWASs with significant findings. With ACE-inhibitor cough, we simply did natural language processing of allergies in clinical notes, doctors reporting that people had cough on ACE inhibitors. We looked at all the different ACE inhibitors and that sort of thing; that was a fully automated analysis, and we had lots of cases and controls. Heparin-induced thrombocytopenia was a tougher phenotype. We looked at lab results, we looked at NLP, and then we manually adjudicated the possible records to identify the true cases and controls, and found a signal there too. Both of these were replicated. If you look at all the phenotypes we've done so far, collected in a website we call PheKB, you can see their performance here. The positive predictive value is generally really good; you see the median in red there.
But there are some that are lower, and I mentioned this before: drug-induced liver injury happens rarely enough that you don't want to throw away any possible cases, and that shows in these results. At that 30% PPV, you have to review three cases for every one real case, instead of 100. Looking at SJS and TEN, we've done some preliminary work. It was alluded to earlier that the ICD-9 code system was revised in 2008 to specify specific codes for SJS and TEN; before that, you just had a nonspecific erythema multiforme class that would have been used. In our analysis, we also looked at keywords for SJS and TEN, including misspellings. After review, and I tried to find the denominator for this but don't actually have it, we reviewed several hundred cases, and at the end of the day we felt like 72 were people the physicians really thought had SJS, TEN, or EM. Seventeen of those were treated at Vanderbilt in our burn units, and nine were actually marked as EM as opposed to SJS/TEN. Seventeen had good descriptions of the case but weren't actually treated at Vanderbilt, so you could ask how much you want to count those. And then there were 38 with more fleeting mentions, where you could tell the physicians believed they had the diagnosis, but it's unclear; you can't go back to the biopsy results and that sort of thing. Thirty-five of these 72 had a drug identified, with 20 different drugs, sometimes multiple drugs considered as possibilities; the most I saw in one patient was three mentioned as possible. And they're all ones you would expect. A lot of times we have path reports, but not universally. An interesting problem is that dermatopathology reports are sometimes in PDFs, which in our system requires extra effort to go get. And in recent years, pictures are there for a lot of the patients as well, so that would provide another form of evidence you could use.
Those also require special processing to get into our system. Of note, at Vanderbilt we now have about twice as many patients with DNA, so we could expect to find more cases than this if we repeated it. This is study data provided by Bob Davis, which I've summarized here, looking at ICD-9 codes and their accuracy across the HMO Research Network, about 8 million patients. They found about 50,000 people with one of these groups of ICD-9 codes. If you look at the old ICD-9 codes from before 2008, the positive predictive value isn't very good unless you're hospitalized for more than 14 days; if you're hospitalized that long, you get a 77% PPV. If you look at the specific codes with a hospitalization of three days or more, you're looking at a PPV of 57 to 92%, and you can see the number of cases there. And the nonspecific rash codes, outside of the erythema multiforme code and its siblings in the ICD-9 system, really didn't work very well at all. Across the whole network, they estimate you'd have between 1,000 and 2,500 cases out of that 8 million. Interestingly, that ends up being pretty similar to the incidence rate we found at Vanderbilt, about 3 to 5 per 10,000. Of course, these represent referral centers, so you'd expect overrepresentation at these centers. Another example to talk through is DRESS. Searching the EMR for "dress" is not very useful, as we can all imagine, and the syndrome is also under-recognized by physicians, but we can come at it the other way. We can look at the drug exposures we're interested in. We can look for the presence of a rash, eosinophilia, and other laboratory abnormalities. We can look for fevers, because most of us have vital signs. We can look for lymphadenopathy in physical exam sections. And we can of course look for target end-organ damage as well, and model that and validate it.
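A multi-signal DRESS screen like the one just outlined could be sketched as a simple signal count over one patient record; the drug list, lab thresholds, and scoring cutoff here are illustrative assumptions, not a validated algorithm.

```python
# Hypothetical sketch of a multi-signal DRESS screen, combining the
# evidence classes mentioned above: drug exposure, rash (NLP), labs,
# fever from vitals, lymphadenopathy (NLP), and end-organ damage.
# All thresholds and the drug list are illustrative only.
SUSPECT_DRUGS = {"carbamazepine", "allopurinol", "phenytoin"}  # assumed list

def dress_score(record):
    """Count how many DRESS-associated signals appear in one record."""
    signals = [
        bool(SUSPECT_DRUGS & set(record["meds"])),     # drug exposure
        record["rash_mentioned"],                      # NLP on notes
        record["eosinophils_pct"] > 10.0,              # lab abnormality
        record["max_temp_c"] >= 38.5,                  # fever from vitals
        record["lymphadenopathy_mentioned"],           # NLP on physical exam
        record["alt_u_l"] > 100,                       # liver (end-organ) damage
    ]
    return sum(signals)

def flag_for_review(record, threshold=4):
    """Records with enough concurrent signals go to manual adjudication."""
    return dress_score(record) >= threshold
```

In practice the flagged records would feed the same expert chart-review step used for the other phenotypes.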
So I think you can go at things like this and develop more complicated models to find cases. We haven't done DRESS, but I think we could with time and input. Some strengths: the EMR is rich and longitudinal, and it's collected prospectively, so you have the potential to find fatal diseases as they accrue in a population. It's also where you're actually doing the discovery, so you have the possibility of a closed-loop process. And you get all the expensive testing essentially for free in the EMR, though you don't get it on everyone as you might desire. Challenges: developing algorithms takes time, and you definitely need local expertise. These phenotypes are rare and the drug exposure is important, so you want lots of people. The key data is often in PDFs, at least at our site, which requires special work to get and isn't in every research repository. I mentioned that the causative drugs can be hard to find; I think similar data were presented earlier, though I may have that wrong, and we found that here too. Treatments, obviously, are variable across our sites. And we have to do special things to deal with names: eponyms can be suppressed in our system, and that's true for other research repositories as well. Once we find these cases, we could look at comorbid disease, follow them for mortality and other outcomes, evaluate treatments, and then find controls; I think that would be a strength, because we do have very large populations. Now, switching directions a little bit: a discussion came up about the commonality of testing, and we've looked at exposures to pharmacogenomic medications outside of SJS/TEN, so I went and modeled that for SJS/TEN.
So this is what we found in 50,000 patients who get sort of routine care at Vanderbilt, looking for exposure to one of the 57 medications that were on the FDA list of pharmacogenomic medications at the time. The key thing is at the top left: overall, about 65% of people get at least one of those 57 medications with a pharmacogenomic story within five years. But about half that number are also getting two, and about 16% actually get four medications in five years. So if you get one, you're more likely to get another, and if you think about it, if you actually genotyped these people and got the full HLA type, you could reuse it. When I looked at Vanderbilt data for adult patients with at least three years of contact, and looked at, I think it was five medicines, allopurinol, lamotrigine, phenytoin, carbamazepine, and abacavir, all ones with specific HLA types identified, in 98,000 patients, 12% of this population took at least one of these medications. Now, we didn't see as many occurrences of one medicine being followed by a second, but in this case 6% took more than one of these five medications. And interestingly, if you think about it, if we were able to implement abacavir, I'm sorry, if we were able to implement Bactrim as a target, virtually all these patients are either allergic to or exposed to that medication in our EMR. So if we discover more associations, there's a rich population to which we could apply them if we had full HLA typing. Beyond eMERGE, there are other sites with EMR-linked data: Kaiser, the Million Veteran Program, and international groups like the UK Biobank, BioBank Japan, and RECAN.
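The exposure-counting analysis just described could be sketched like this; the drug list matches the five HLA-associated medications mentioned, but the data layout and window logic are assumptions, not the actual Vanderbilt query.

```python
# Illustrative sketch of the exposure count above: for each patient,
# count distinct medications from an HLA-associated list taken within
# a time window, then compute the fraction of the population exposed.
# Data layout is an assumption for illustration.
from datetime import date

HLA_DRUGS = {"allopurinol", "lamotrigine", "phenytoin",
             "carbamazepine", "abacavir"}

def exposure_counts(prescriptions, start, end):
    """Map patient_id -> number of distinct listed drugs in [start, end].
    prescriptions is an iterable of (patient_id, drug_name, date) tuples."""
    counts = {}
    for pid, drug, when in prescriptions:
        if drug in HLA_DRUGS and start <= when <= end:
            counts.setdefault(pid, set()).add(drug)
    return {pid: len(drugs) for pid, drugs in counts.items()}

def fraction_with_at_least(counts, n, population_size):
    """Fraction of the whole population with >= n distinct listed exposures."""
    return sum(1 for c in counts.values() if c >= n) / population_size
```

Running this over a full prescription table would reproduce the kind of ">= 1 drug" and "> 1 drug" fractions quoted in the talk.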
And just within the US programs, we have more than a million people available in these resources with DNA linked to electronic medical records, 350,000 or more of whom have existing GWAS data, and we routinely do these studies in eMERGE. So I think these would provide a platform of the size you would need to do these kinds of studies, though it's not trivial, and you would have to get full access to the records to do it. I think we've shown that you can do these kinds of investigations. So that's all I had. That was great, Josh. A couple of comments. One is just related to the FDA talk previously: the HMORN, which you demonstrated, is actually part of Sentinel. So it seems like an opportunity, in the prioritization within Sentinel, since we have something that seems to work reasonably well in the HMORN data warehouse, to maybe put that forward as a project. The question I have about the phenotyping: you didn't mention anything about ophthalmology. It may be because we know ophthalmology notes are way worse than dermatology pathology reports, because of ophthalmologists' tendency to draw pictures in them, which are really hard to parse with NLP. But it strikes me that as we think about the longitudinal nature of the data, we may be able to look at encounters with the different specialists involved in the sequelae of Stevens-Johnson syndrome and TEN, to derive a pattern of consultation that might be informative for identifying individuals. So it may not be so much what the ophthalmologist said in the note, but whether the person has recurring visits to the ophthalmologist, particularly related to corneal abnormalities; that might be a trigger. Have you looked at that?
So in eMERGE, we have looked at a couple of different ophthalmology phenotypes, and I feel like we can actually do fairly well with those, even when we can't get the ophthalmology notes themselves, which can be scribbled, handwritten documents that are hard to read. Interestingly, one of the sites actually did optical character recognition on top of the handwritten notes and was able to extract some interesting metrics. That's not universally easy to do, but the diagnoses you get by saying, okay, they're seeing an ophthalmologist, using the billing codes from the ophthalmologist, and it's happening multiple times, that actually works really well; and clearly, by the time they have a procedure specific to a certain diagnosis, that also works very well. So even in that realm, I think we've found you can get ophthalmology data and extract meaningful information from it. Howard? One of the things Andrea mentioned was the side effects or adverse events that happen six months down the road. How does your system handle these temporally distant phenotypes? Because as she was describing that, it resonated, but no one's really looking at it, because often the patients aren't at our center anymore. I think that's actually a real strength of these cohorts, because we can find when they had SJS/TEN and then look forward for different outcomes and see when they develop. Since a lot of these records are prospective back to the 2000s, even for patients who ended up dying from SJS/TEN a year or two later from some complication, you could go back, and I think you could get that kind of data. All these resources have time-stamped data you could go in and mine, and a lot of our algorithms do look at that. A lot of the pharmacogenetic algorithms I showed you include things like an exposure, an outcome, a second outcome, all temporally sequenced. We do pretty well with that kind of thing, but it takes an investment of time to go in and mine and build.
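The temporal sequencing of exposure and outcome described here can be sketched with a simple window check over time-stamped events; the window lengths and event layout are illustrative assumptions.

```python
# Hypothetical sketch of temporal sequencing: require an outcome event
# to fall within a follow-up window after a drug exposure, using the
# time stamps available in these resources. Windows are illustrative.
from datetime import date

def outcome_after_exposure(exposures, outcomes, min_days=1, max_days=180):
    """Return True if any outcome date falls min_days..max_days after
    any exposure date; otherwise False."""
    for e in exposures:
        for o in outcomes:
            delta = (o - e).days
            if min_days <= delta <= max_days:
                return True
    return False
```

Chaining several such checks (exposure, first outcome, second outcome) gives the kind of temporally sequenced algorithm the talk mentions.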
But it's an important outcome that I think is worth diving into. I wondered, Josh, if a good number of your patients are actually referred to you from other hospitals, how well are you able to follow them within your system once you discharge them? So that would be the key problem with that; thanks for mentioning it. Some systems are going to have better capture than others, clearly, because they're more of a network, and that might be one of the strengths of something like the HMORN if they went deeper. At Vanderbilt, we have quite a bit of fragmentation, so patients may have follow-ups for specific subspecialties but not as comprehensive a record at our institution as you might get at Geisinger, for instance. It's interesting: in the hypothyroidism study, the one outlier was a site whose population didn't have as comprehensive care as the other sites tended to have, and that led to a lower positive predictive value. It used to be a strength that the HMO Research Network had very good follow-up, but it turns out, when we did a study to look at that, we had about 90 percent follow-up after one year, and it goes down by about 10 percent every year; it seems to get asymptotic at around that point, but at five years you really only have about 50 to 60 percent of people still under observation. And it's remarkably unswayed, not really altered, by whether or not they have comorbidities. You would think the sicker people would stay in the system, but it's just the nature of the healthcare system these days; people are just moving around. I wonder, Bob, if you might take a minute for our international visitors and just describe very briefly what the HMO Research Network is. Even what an HMO is? No. I can't. Okay, thank you.
So the HMO Research Network is a conglomeration of research-oriented health maintenance organizations that, in essence, for lack of a better word, agree to share standardized data in collaborative scientific projects, with a combined population, depending on who participates, of somewhere between 13 and 16 million people, I think, at any given time. They've spent a lot of time developing common data models, so that things like what Josh showed in the eMERGE data set, where you come up with a standardized algorithm to identify people with, say, stroke or Stevens-Johnson syndrome or heart attacks or various other conditions, can be assessed and used for various epidemiologic and health services studies. The one thing I'll add to that, and correct me if I'm wrong, is that I think this phenotype is going to break any common data model, because of the amount of data you have to go in and get at each site. Even from one Epic installation to another Epic installation, you're going to have to have some local expertise, because this is the kind of phenotype where you have to dig deep, like we found with drug-induced liver injury and some of these other thornier, temporally related, complex drug-exposure kinds of things. I think you're going to have to do some digging, but I think it's very doable, and doable with accuracy. Okay. If there are no further questions, we can go on to our third speaker in this session, Dr. Wimon, sorry, it's a long name, Wimon Suwankesawong. Oh, thank you. You should introduce yourself. No, I won't make you do that, but at any rate, Wimon comes to us from, she's the head of the Health Product Vigilance Center at the Thai Food and Drug Administration.