Okay. So the first group is electronic phenotyping for genomic research, and George is leading us off. And Josh or Marilyn, you're the timekeepers, right? For your group. Okay. All right. Very good.

Good morning. I'll be speaking for the phenotyping workgroup, and we did review our slides together. Thank you, Rex, for that excellent introduction, and Michael covered a lot on machine learning and phenotyping, which is also useful. Our job is to go straight into the questions that were asked: how can eMERGE improve upon the current labor-intensive phenotyping, toward fully automated phenotyping methods, to increase efficiency and validity across EHRs?

First, I'm going to set one thing aside: the sharing part, because the tail end of phenotyping is about sharing. The context is that we are academic medical centers. Each has a clinical data warehouse, each has already done its abstractions from the EHR, and each has, say, a diagnosis table, where each of us has picked a different column or row or name for the same thing. So the low-hanging fruit is: since we've already done that hard work, let's just put it in a common format so that we can share our phenotypes and cut two years off the phenotyping effort. We picked the OMOP common data model because it's mature, at eight years old, and shared by other efforts, including All of Us. As I said, we're putting our current data into it.

Just a comment, and we had a conversation about this, Sandy actually alluded to it: there's a lot of work to take your data and put it into any standard format. The figure of 4,000 lab tests is representative of what Sean and I also found. We both have lab tests coded in LOINC for autoimmune disease, the serology you do around it, and our LOINC codes are almost non-overlapping because we've given them different names. So there's a curation process you have to go through; the data model doesn't cure it. You need knowledge and skills to put the data into a common form. That's what Michael was referring to: whatever the model underneath is doesn't matter. What you can't do is avoid the work of putting the data into a common format. OMOP happens to have a data schema and 43 terminologies to map from, but there's still a lot of work getting from your local vocabulary codes to, say, LOINC, and even to a common LOINC. (A small sketch of that mapping step follows below.)

So now we get to the front end of the eMERGE phenotype generation process. I was asked not to review all the work, but I still want to show on a slide that a lot of research has gone on in the eMERGE network on how you carry out a phenotype. What I've done on the next slide is extract the lessons from that work.

First, the challenge of billing codes: the healthcare process affects what we record. An electronic health record is not a research database. It's good data, yes, there is noise in it, but it's meant for a different purpose. That's why we have phenotyping: going from the raw data to an intermediate state before we do the research. Next, the importance of NLP: we found it was really essential in many phenotypes. Sometimes there just is no structured code for the thing we're looking for.
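As a concrete illustration of that curation step, here is a minimal sketch in Python of translating locally coded lab results into a shared LOINC code before loading them into an OMOP-style measurement table. The local codes, the mapping entries, and the column names are hypothetical placeholders, not actual site codes; in practice, the mapping table itself is the product of the manual curation just described.

    # Hypothetical curated local-code -> LOINC map. Two sites may code the same
    # serology under different local names, so each site maintains its own map.
    LOCAL_TO_LOINC = {
        "LAB_ANA_TITER": "0000-1",   # placeholder LOINC for an ANA titer
        "ANA_SCREEN_V2": "0000-1",   # same assay, different local name
        "RF_QUANT":      "0000-2",   # placeholder LOINC for rheumatoid factor
    }

    def to_omop_rows(local_rows):
        """Translate locally coded lab rows into OMOP-style measurement records."""
        for row in local_rows:
            loinc = LOCAL_TO_LOINC.get(row["local_code"])
            if loinc is None:
                continue  # unmapped codes go back to the curation queue
            yield {
                "person_id": row["patient_id"],
                "measurement_source_value": row["local_code"],  # keep the local code
                "loinc_code": loinc,
                "value_as_number": row["value"],
                "measurement_datetime": row["drawn_at"],
            }

The point is the same one made above: the schema is the easy part, and the LOCAL_TO_LOINC table is where the real work lives.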
The complexity of our phenotype definitions: we learned that they tend not to be a simple "use a code" or "use a code and a med." You saw some examples in the previous talks. What we're seeing so far is that we can get improvement from tools and from reusing definitions, but mostly what we're doing today is slogging it out. As we go to the next definition, it's a lot of hard iterative work; it's always iterative to improve the definition, and the tools do help.

Lastly, I want to point out that there are different goals. Knowledge discovery via GWAS is mainly about picking cohorts, so that's about getting a high positive predictive value. As you move into other types of studies, or to delivering results to patients, then sensitivity also matters: we want to reach everyone for whom this knowledge is relevant. So there are two different kinds of phenotyping going on, and it's important to keep track of them.

The main thing I'm going to go into is machine learning, but since that's the second question, I'll skip over it for now and talk about a slightly different topic: high-fidelity phenotyping. One other thing: what we're showing today is not what the future RFA should be. We're just showing the context, what's current research today. If we're picking what's going to go on for the next five years, we should know what's state of the art today.

So, high-fidelity phenotyping, and I'll go into some examples in a moment. First, encoding the degree and severity of a condition. Mostly what we've done is decide a binary yes or no: does the patient belong in a cohort? I think for both research needs and patient needs, we're going to have to have different probabilities, severities, and degrees; you saw some examples in Mike's talk. Our chronic kidney disease phenotype in the network has five levels, the different stages (a small sketch of that kind of staging follows below). So we're moving to more graded phenotypes.

Time is especially important. We're using time, but I think there's more we can do, on both halves of that. First, using time to decide, even if the decision is binary. Second, expressing the timing of something, the chronicity, and also accounting for the biases the healthcare record produces. To give you a concrete example, people are sampled in the health record when they're ill. You don't get a creatinine at three in the morning unless the person is sick. So if you look at creatinines drawn at 3 AM, it looks like the population gets sicker at 3 AM. They're not getting sicker; you're just not sampling healthy patients at 3 AM. That's the kind of thing we have to account for as we do more sophisticated phenotyping.

Phenotypes may be not only non-binary but not even categorical; they may actually be continuous states. There are examples in breast cancer and in diabetes of modeling patients along a continuum from one disease state to another. And we should use the physiologic information sitting there in the literature to help us do better phenotyping. Don't be purely empirical; try to incorporate some of that knowledge to build phenotypes that are more resilient against the noise. I don't have it here, it's on a different slide, but it shows that given physiologic model knowledge, you can learn things from sparse, noisy data sets.
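To make the graded-phenotype idea concrete, here is a minimal sketch in Python that assigns one of the five CKD stages from an eGFR value using the standard KDIGO thresholds, plus an illustrative chronicity check. The 90-day persistence rule is a stand-in for the real algorithm's chronicity criteria, and the function names are my own; the network's actual CKD definition is more involved.

    from datetime import timedelta

    def ckd_stage(egfr):
        """Map an eGFR (mL/min/1.73 m^2) to CKD stage 1-5 (KDIGO thresholds)."""
        if egfr >= 90: return 1
        if egfr >= 60: return 2
        if egfr >= 30: return 3
        if egfr >= 15: return 4
        return 5

    def chronic_ckd_stage(readings):
        """readings: list of (date, egfr) pairs sorted by date.
        Return a stage only if reduced function (eGFR < 60) persists for at
        least 90 days. Stages 1-2 require other evidence of kidney damage,
        so an eGFR-only rule like this can only assign stages 3-5."""
        low = [(d, v) for d, v in readings if v < 60]
        if len(low) >= 2 and (low[-1][0] - low[0][0]) >= timedelta(days=90):
            return ckd_stage(low[-1][1])
        return None  # not chronic, or insufficient evidence: no stage assigned

Returning None rather than "no CKD" matters here: an absence of qualifying labs is not evidence of health, which is the same sampling-bias point as the 3 AM creatinine example.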
And you can actually learn things like kidney function, insulin resistance, and insulin degradation rates; all of those you can estimate. Those are things we could be doing in eMERGE with, say, our diabetes studies. I guess this overlaps the next question, but: learning latent states, using modern mechanisms of deep learning. I'll go into that a little more in the next question, so I'll just state it here. And I guess I've been talking the whole time about accommodating the bias in healthcare records. So these are things we could be doing in the future in phenotyping.

How might machine learning and other advanced computational tools be used to improve electronic phenotyping in the eMERGE network? First, natural language processing has turned out to be an important tool, and there are two directions we need: breadth and depth. And note the last item, the one with the exclamation point: sharing natural language processing. It takes a long time to share a phenotype across the network when each site uses a different NLP approach. We do try to share our code across the network, but because the reports going into that code differ, it creates a lot of work. In addition, there's work on modern approaches to natural language processing, which many of the groups have been using.

Machine learning for eMERGE research: I'm going to show you some examples the group sent me on the following slide, so I'll come back to that. Some other things that aren't so much on the slides: if you're going to do machine learning, as you saw in the previous presentations, you need a training set, and that's the main limitation; we have to sit there and label. We already need, say, 100 or 200 cases just to test the accuracy of our phenotypes, so we're already doing manual curation. But once you do machine learning, you tend to need an even bigger training set, and that becomes onerous. There are advanced methods to help with that, such as Sontag at MIT's anchors, or Shaw's noisy training sets, which focus which cases you should curate. In other words: here's what we know about the algorithm so far; tell me which cases would be most useful for an expert to look at, instead of reviewing a broad swath of, say, a thousand cases. Similarly, active learning, a related topic, accomplishes basically the same thing, reducing the training-set labor (a small sketch follows below). And then I mentioned deep learning, for example the work being done at Mount Sinai on diabetes. The other two I already mentioned.

We've had some success here at Harvard. We achieved over 90% PPV on a purely learned rheumatoid arthritis (RA) phenotype, and that was based on just 300 training cases. So that's getting the number down, and that's a pretty good performance rate. It takes a long time for a doctor to review 300 cases, but nevertheless that's still a relatively low number, and it achieved good performance. Vanderbilt has carried out similar work, also achieving a PPV of, what was it, 0.95 or 0.96. So we are having success across the network in using machine learning.

This then affects a lot of other things you don't realize. The Arden syntax has been around for decades now for encoding medical knowledge, but it was focused on rules that we write, and that's what most of the phenotypes we've produced so far are.
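Here is a minimal sketch in Python of one round of that case-selection idea: pool-based active learning with uncertainty sampling, using scikit-learn's logistic regression. The feature matrix X, the index arrays, and the batch size are assumptions for illustration; this is one generic approach, not the specific anchor or noisy-label methods named above.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def next_cases_to_curate(X, labeled_idx, labels, batch_size=10):
        """Fit on the cases curated so far, then return the unlabeled cases
        whose predicted probability is closest to 0.5, i.e., the cases the
        current model is least sure about, where expert review helps most.
        Assumes labels contains both positive and negative examples."""
        model = LogisticRegression(max_iter=1000)
        model.fit(X[labeled_idx], labels)
        unlabeled = np.setdiff1d(np.arange(X.shape[0]), labeled_idx)
        p = model.predict_proba(X[unlabeled])[:, 1]
        order = np.argsort(np.abs(p - 0.5))   # most uncertain first
        return unlabeled[order[:batch_size]]  # send these to the chart reviewer

In use, you would loop: have the expert label the returned cases, add them to labeled_idx, and refit, stopping once accuracy on a held-out validation set is acceptable. That is how a 300-case labeling budget can go further than 300 randomly chosen charts.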
As we move to more machine learning, a phenotype is really a matrix of coefficients that apply to your feature set, and sharing it means something different. It's not so much tweaking rules anymore; it may mean creating a small training set to tweak the coefficients, or doing no further training at all. But when you share the thing, it's basically a list of real-valued numbers. (A small sketch of what that could look like appears at the end of this section.)

There's further work, this is Mayo, on syntactic interoperability related to natural language processing and phenotyping. I'm just going to go through the tools quickly so that we have time for questions and further discussion. The Phenotype Execution and Modeling Architecture (PheMA) is producing sophisticated tools that help us build phenotypes, especially if they're built as traditional rules; but machine learning can also be a module that creates coefficients. It's been extended to the data models we're using, like OMOP and i2b2, and it incorporates natural language processing and machine learning. And it's been funded for the next cycle, so it's still relevant to us going forward.

There's further work, this is Cincinnati and Boston Children's, on combining NLP and machine learning: using standard vocabularies, SNOMED CT and RxNorm, and incorporating a tool a lot of us use, cTAKES; when we try to share NLP using a common tool, we often go to cTAKES. There are also examples of the eMERGE network using advanced machine learning methods, this one pulled from Marshfield, and of active machine learning, which I mentioned earlier, again from Marshfield, reducing the workload on human reviewers.

How can eMERGE assess phenotype portability across diverse patient populations and diverse healthcare settings: academic and county hospitals, community clinics, national healthcare systems? We could design specific eMERGE experiments to do this, and as I read the question, a bunch of ideas came to mind. Then reality set in, and I realized how many phenotypes we've yet to do in our own department just to keep up with the workload (and there's Peggy). So it's hard to do new experiments here, but I think there is a lot we can do within our existing infrastructure. We heard a talk about All of Us, and we're using the same OMOP data model between eMERGE and All of Us, which makes sharing things a little easier. All of Us is getting up to speed, as you saw, but I think that program will let us test what we do here in a very diverse environment. We're also sharing with the OHDSI collaboration, which already has 400 million patient records. It doesn't have the genotypes, but for any questions that are purely phenotype-related, we can run them on that network today, and there's no reason not to go forward there.

So that's what we prepared from the phenotyping workgroup. I'm sorry, are we going straight to Ken, or answering any questions? Okay, good. If there are clarifying questions, we can take those, but we'll have the discussion after Ken.

You're very clear. This is Terry. I might just mention, since these screens are not working: if you want to get on the WebEx, and if you can't see, it's quite nice on the WebEx, so I would suggest that.
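Picking up the earlier point that a shared machine-learned phenotype is basically a list of real-valued numbers, here is a minimal sketch in Python of exporting a fitted logistic-regression phenotype and reapplying it at another site. The JSON layout, feature codes, and function names are illustrative assumptions, not a network standard.

    import json
    import math

    def export_phenotype(model, feature_codes, path):
        """Serialize a fitted scikit-learn logistic model as (code, weight) pairs."""
        payload = {
            "intercept": float(model.intercept_[0]),
            "weights": dict(zip(feature_codes, map(float, model.coef_[0]))),
        }
        with open(path, "w") as f:
            json.dump(payload, f, indent=2)

    def apply_phenotype(path, patient_features):
        """Score one patient from the shared coefficients alone.
        patient_features: dict mapping feature code -> value; missing features
        are treated as zero, which is itself a modeling assumption."""
        with open(path) as f:
            p = json.load(f)
        z = p["intercept"] + sum(w * patient_features.get(code, 0.0)
                                 for code, w in p["weights"].items())
        return 1.0 / (1.0 + math.exp(-z))  # probability of having the phenotype

The receiving site can use the weights as-is, or, as noted above, curate a small local training set and refit to adjust them.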