Our other important piece of context, one that All of Us and, to some extent I think, eMERGE have shared with, learned from, and contributed to, is the vanguard of large-scale cohort sequencing: the VA's MVP. And so we have the wizard of MVP, Mike Gaziano, here to give us an update on that project.

Thanks, Dan. It's a pleasure to be here, and it's been a pleasure to exchange with eMERGE, to get an update on All of Us, and to exchange best practices in this space. We also agree that EHR data is really very valuable. We had our coming-out party at ASHG: we did a two-hour symposium on the first night and presented 22 abstracts. The basic story for the early MVP data is that the standard epidemiology and the genetic epidemiology work similarly to what others have seen in more traditional cohorts or consortium cohorts. So I'm going to breeze through a number of slides on where we are in general with MVP, where we're going with complex phenotyping, and then a little bit about providing access.

If you remember, we collect health and lifestyle information from the participants and a blood specimen, we access medical records forward and backward in time, and we can re-contact the participants. Now, we don't get the kind of extensive data that All of Us gets, but we do get some questionnaire data on most of our participants, though not all, and we've got a big web contract to expand those capabilities, because we think our population is not nearly as web-savvy as the All of Us cohort. This slide is a little scrunched up given the formatting, but we're at 65 main sites and 60 satellites around the country.

And here's where we are: we've actually recruited 612,000 people, and we genotype everybody. We'll be releasing the second tranche of data, about 500K genotypes, on the Affymetrix platform, similar to what the UK Biobank is using. We've got contracts out for 45,000 whole genomes and are on our way to 100K; we think we have funding this coming year, fingers crossed, and the subsequent year to get us to 100K whole genomes. And then we're in discussions with a potential collaborator for a very large number of whole exomes. We've released a contract with Metabolon to pilot metabolomics in several thousand individuals, we're in discussions with proteomics leaders, and we've got a microbiome pilot underway; again, these are going to be in hundreds of individuals.

We have 20 teams accessing the data, and the big constraint there is our computing environment. It's all still behind the firewall, but we will be moving: we go live with the Department of Energy on November 17. On that date our VA investigators, starting with the core team and then more investigators, will be able to access our data that has been stood up at Oak Ridge National Laboratory. I was there last Tuesday and got to see Titan, the country's biggest computer, and then Summit, where the IBM racks will be for what will hopefully be the world's biggest computer. And I mentioned that we've presented a number of abstracts; we'll present another 10 at the AHA, so about 50 abstracts are out. The science is moving while recruitment is still going on.

This is the current state of our omics backbone, which is the Affymetrix array. We've imputed to phase 3 of 1000 Genomes, but we hope to be able to impute to reference panels with heavier African-American representation, including our own, in the coming months. And this is the basic model, similar to what Stephanie described.
We're keeping our data in our space: we have a model where the data comes into our landing zone and library, and then we provide it to investigators as they come into the space. Right now that's in GenISIS, on a Pittsburgh cluster, and unfortunately it's behind the VA firewall, so you have to be either a paid or a non-paid VA employee to actually get into our computing environment. But that's changing soon: we will be in the Department of Energy space, which will provide broader access, and we're in the process of creating an arrangement for a cloud-based data commons, which will provide even broader access.

So here's our data universe. We talked about the molecular data. Above the equator is what we collect under consent; below the equator is the administrative data. I'll just give you a flavor of the health system: it's the largest healthcare system in the country. About half of the 21 million veterans have used the system at some time, with 8 to 9 million users in a single year. Data flows to the corporate data warehouse from four regional data warehouses, and that warehouse is now up to 24 million users over the last 20 years, with 8 billion labs and 3 billion notes. That presents some unique challenges for us as we enter into the library and curation of the data.

We have to get a lot of permissions, even from within the VA; there are various data landlords. The corporate data warehouse holds the bulk of the electronic medical record. The only thing that doesn't come forward from many of the hospitals is the actual images; they stay local, although you can access them, either bringing them forward if you have funding and a place to put the image data, or even potentially doing analytics locally. But there are a number of other registries, actually about 120 different registries within the system (an HIV registry, a diabetes registry, a registry of all cardiovascular procedures), each with different data landlords. And then we have other data sets: DoD, CMS. We've worked our way through getting general permissions for the top of the list, but certainly not all of them.

Now, the data is structured for only a small fraction; most of the data is actually quite messy. So we've developed three cores. The phenomics data core wrangles the data and is doing the library creation: not only the complex phenotypes, but the extensive library that I'll show you on the next slide. A second team does simple structured-data curation. And the third core does complex phenotyping: we've engaged Zak Kohane's group, with Kat Liao and Tianxi Cai, moving toward scalable natural language processing and artificial intelligence that I'll talk about in a bit.

So this is the general library. Teri Manolio and I were actually over in the UK talking about how many phenotypes we have. We think that if you add up all the numbers (the procedure codes, the ICD codes, the lab values, the medication codes, which actually get complex), it's going to be hundreds of thousands of variables, whether we call them phenotypes or not. We've begun to library them, and I'm going to walk through a few examples of how we work through that data set.

For laboratory adjudication, we use OMOP. There is an OMOP overlay to the VA, but I would caution against using OMOP as the architecture for the library. OMOP probably works better in EHRs that have a less diverse input of data, say a single health system, and a shorter duration.
Over 20 years, though, the lab values are one example of the problem. We have 8 billion labs and 400,000 terms that mark those 8 billion labs, and we have a detailed structure for how we move our way through them. We're now working through the top 150 labs in the system. By way of example, serum albumin: there are 4,141 terms in the database with "albumin" in them, and we have to get through a cleaning, because if you just took the OMOP terms you would end up with an awful lot of noise in with the signal; only 644 of those terms are actually serum albumin. So we do a culling based on names where a qualifier tells us it's not serum albumin: it's urine albumin, or CSF albumin, or peritoneal fluid, et cetera. Then we have to get to the candidates, and there we have two clinicians reviewing the mean, the median, and the number of rows. Each checks a box, and the two clinicians must agree, to get to the 644. It doesn't take that long, but we would like to automate that process.

Medication adjudication is actually quite complex as well, though the PBM has done some of that for us. I'm not going to give a detailed specific example, but we have this complete for a number of the common drug classes. There are eight or nine different forms of erythromycin, for example: eye ointment, IV, oral, topical, all of which end up under the same term "erythromycin" but have to be parsed out a little more carefully than that. And this slide shows the basic underpinnings of our medication adjudication process.

In the interest of time, I want to get to two more complex examples. As we curate the data, you have to understand the timing of the data draw. This is work by Jason Vassy showing that if you want a value such as blood pressure keyed on a single point in time, for instance enrollment, you capture about 90% of people if you take values within 60 days. But remember, there's a little bit of bias, whether it's labs or blood pressure levels: sicker people are at the hospital more often and have more of these values. So you have to open the window if you want a complete look at the entire population. And this slide shows both the two-sided and one-sided look at the distribution of blood pressures keyed on a single point in time.

For the smoking phenotype, we went through a number of health-factor terms. Each hospital is mandated to collect health factors, but there's no single structure, so we found 1,600 different terms that say something about smoking across the different hospitals. We had to bin those into 11 categories, and then we used those to create this algorithm. What you can see is that we keyed on high specificity for current users and high specificity for never-users, and we compared this against the one gold standard we have, our single-point-in-time questionnaire. Now we're going back to try to get an area under the curve for smoking, and for smoking-cessation time periods. We've identified a dozen different smoking algorithms within the EHR, but none of them had been validated; this one has now been validated against several hundred thousand questionnaires.

For stroke, we created gold standards for a training set and a validation set, and then we've begun to use neural networks. This is the first example where we've used some artificial intelligence to try to find the best algorithm. This slide gives you some comparisons; this is our case-control definition.
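To make the lab-adjudication step concrete, here is a minimal sketch of that two-stage narrowing: a rule-based cull on specimen qualifiers, followed by a dual-reviewer agreement gate. The term list, the exclusion patterns, and the reviewer callables are all hypothetical stand-ins, not MVP's actual rules.

```python
import re

# Hypothetical raw lab terms containing "albumin"; in MVP there are 4,141
# such terms in the warehouse, of which only 644 are truly serum albumin.
raw_terms = [
    "ALBUMIN, SERUM",
    "ALBUMIN (URINE)",
    "CSF ALBUMIN",
    "ALBUMIN, PERITONEAL FLUID",
    "MICROALBUMIN, 24HR URINE",
]

# Stage 1: cull on name qualifiers that mark a non-serum specimen.
EXCLUDE = re.compile(r"urine|csf|peritoneal|pleural|micro", re.IGNORECASE)

candidates = [t for t in raw_terms if not EXCLUDE.search(t)]

# Stage 2: two clinicians independently review summary statistics for each
# candidate (mean, median, row count) and check a box; a term survives only
# if both agree. The reviewer callables here are placeholders.
def adjudicate(terms, reviewer_a, reviewer_b):
    return [t for t in terms if reviewer_a(t) and reviewer_b(t)]

accepted = adjudicate(candidates, lambda t: True, lambda t: True)
print(len(raw_terms), "raw ->", len(candidates), "candidates ->",
      len(accepted), "accepted")
```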
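In the same hedged spirit, a sketch of the erythromycin problem: one ingredient name spanning many dose forms that must be separated before analysis. The product strings and route map are illustrative, not the PBM's actual vocabulary.

```python
# Illustrative pharmacy product strings: one ingredient, many routes/forms.
products = [
    "ERYTHROMYCIN 250MG TAB",
    "ERYTHROMYCIN OPHTH OINT 0.5%",
    "ERYTHROMYCIN LACTOBIONATE 500MG INJ",
    "ERYTHROMYCIN TOPICAL GEL 2%",
]

# Token prefixes mapped to route/form buckets (hypothetical).
ROUTE_MAP = {
    "TAB": "oral", "CAP": "oral", "SUSP": "oral",
    "OPHTH": "ophthalmic", "INJ": "intravenous",
    "TOPICAL": "topical", "OINT": "topical", "GEL": "topical",
}

def classify_route(product: str) -> str:
    """Assign a route bucket from the first recognized token."""
    for token in product.upper().split():
        for prefix, route in ROUTE_MAP.items():
            if token.startswith(prefix):
                return route
    return "unknown"

for p in products:
    print(p, "->", classify_route(p))
```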
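The timing-window point also lends itself to a quick sketch: compute, for several window widths around enrollment, what fraction of participants have at least one blood-pressure reading. The simulated dates below are stand-ins; the roughly-90%-within-60-days figure comes from Jason Vassy's analysis of the real data, not from this toy.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Simulated enrollment dates and three BP-reading dates per participant.
enroll = pd.Timestamp("2015-01-01") + pd.to_timedelta(
    rng.integers(0, 365, n), unit="D")
records = pd.DataFrame({
    "pid": np.repeat(np.arange(n), 3),
    "enroll": np.repeat(enroll.values, 3),
})
records["bp_date"] = records["enroll"] + pd.to_timedelta(
    rng.integers(-400, 400, 3 * n), unit="D")

# Two-sided window: fraction with any reading within +/- `window` days.
for window in (7, 30, 60, 180):
    hit = (records["bp_date"] - records["enroll"]).abs() <= pd.Timedelta(days=window)
    covered = records.loc[hit, "pid"].nunique() / n
    print(f"within {window} days either side: {covered:.0%} covered")
```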
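And a sketch of the smoking approach: collapse free-text health-factor terms into a small set of categories, then make status calls tuned for specificity. The patterns, the three categories shown, and the decision rules are hypothetical; MVP's actual 11 bins and validated algorithm are not reproduced here.

```python
# Hypothetical patterns for collapsing ~1,600 raw health-factor terms into
# categories (only 3 of the 11 bins are sketched).
CATEGORY_PATTERNS = {
    "current": ("CURRENT SMOKER", "SMOKES DAILY", "TOBACCO USE: YES"),
    "former":  ("FORMER SMOKER", "EX-SMOKER", "QUIT TOBACCO"),
    "never":   ("NEVER SMOKER", "TOBACCO USE: NEVER"),
}

def categorize(health_factor):
    """Map one raw health-factor string to a category bin."""
    hf = health_factor.upper()
    for category, patterns in CATEGORY_PATTERNS.items():
        if any(p in hf for p in patterns):
            return category
    return "unknown"

def smoking_status(factors):
    """Specificity-first calls: 'current' needs an explicit current-use
    factor; 'never' needs a never-use factor with nothing contradicting it;
    everything else stays indeterminate."""
    cats = {categorize(f) for f in factors}
    if "current" in cats:
        return "current"
    if "never" in cats and "former" not in cats:
        return "never"
    return "indeterminate"

print(smoking_status(["HF: CURRENT SMOKER, counseled"]))  # -> current
print(smoking_status(["HF: TOBACCO USE: NEVER"]))         # -> never
```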
Where we're really going is assigning a probability of caseness and non-caseness to every individual and then setting the threshold according to the needs of the investigator. This is what that looks like: here the probability of caseness is very high, here non-caseness is very high, here is an intermediate possible case, and then definite cases with some overlap. You can see where the thresholds are set, and you can end up with very clean numbers of individuals with high caseness. Assigning a probabilistic caseness gives you the opportunity, first, to shift the thresholds, and second, to perhaps do some of the modeling not with caseness as a discrete variable but as a probabilistic variable. And there's some suggestion from our collaborators in Zak Kohane's group that that improves the modeling.

PTSD is one where the codes work very well. I'm not going to go into the detail, but there is a three-tier process: we start with an intuitive algorithm, we find cases and non-cases and validate them, then we build a second algorithm, and then we go through the validation process where we assign the probabilities of caseness. We found that the tier-one algorithm actually worked quite well; we didn't get massive improvement when we went to a higher-tier algorithm with more variables in the model. This slide gives you some of the data. Again, it shows how you can use the data to define caseness: you can key on the number of cases you want and the sensitivity and specificity. This is what was settled on, so we end up with 16,000 cases of PTSD in the first 350,000 individuals, and 43,000 definite non-cases. You could end up with fewer cases and higher sensitivity and specificity, or go in the other direction, and model it whichever way you like.

That's the process we go through in a manual way. But we need to move to the next step, which is scalable natural language processing, to be deployed in our Department of Energy space, and automated feature extraction, to get to the ability to create an automated pipeline. Briefly, the pipeline, and this is Kat Liao and Tianxi Cai's work, goes like this: out of the rubbish bin, you cast a wide net and come up with all possible cases. You end up with a validated training set; this is the time-consuming part, and they're working on ways to speed up this step by presenting clinicians the data, or by finding cases like the ones the clinicians originally deemed cases. Then we create a training set and a validation set, and this is being deployed as an automated pipeline, in what we call a semi-automated way. We don't let the artificial-intelligence machine just rummage over the rubbish bin looking for correlations; it's constrained by certain variables and parameters. But we do allow a lot more latitude, and what the machine allows us to do is decide which variables we need to spend time cleaning, and it may be the case that we don't have to clean very much.

This is my last slide, just the movement toward the idea of big data. Where is this happening right now? It's going to happen in the Department of Energy space; as I said, we go live November 17. The next step is to take the library that I described, port it to a cloud-based data commons, port the genotype data, and then make the access much more broadly available. You will no longer have to be a VA employee to touch the data.
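The threshold-shifting idea can be sketched in a few lines: given per-participant case probabilities from a phenotyping model, an investigator dichotomizes at whatever cut points suit the study, or skips the dichotomy and carries the probability into the model. The simulated probabilities and cut points below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for model-derived probabilities of caseness, one per participant.
p_case = rng.beta(0.5, 0.5, size=350_000)

def define_cohort(p, case_cut=0.95, control_cut=0.05):
    """Definite cases above case_cut, definite non-cases below control_cut;
    the intermediate zone is excluded (or analyzed probabilistically)."""
    return p >= case_cut, p <= control_cut

cases, controls = define_cohort(p_case)
print(f"{cases.sum():,} definite cases, {controls.sum():,} definite non-cases, "
      f"{(~cases & ~controls).sum():,} intermediate")

# The probabilistic alternative: use p_case itself as the outcome rather
# than a 0/1 label, which is the modeling refinement mentioned above.
```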
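And a compressed sketch of the semi-automated pipeline's final step, under stated assumptions: a clinician-labeled training set, a constrained (L1-penalized) classifier rather than a free rummage over all features, and probability scores for the whole wide-net pool. The simulated matrices stand in for curated code and NLP features; none of this is the actual MVP or Kohane-lab implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# Stand-ins: 500 chart-reviewed patients with 40 curated features
# (code counts, NLP concept counts), plus a large unlabeled wide-net pool.
X_reviewed = rng.poisson(1.0, size=(500, 40)).astype(float)
y_reviewed = rng.integers(0, 2, size=500)
X_pool = rng.poisson(1.0, size=(50_000, 40)).astype(float)

# The L1 penalty constrains the machine to a sparse set of informative
# variables, which also tells us which features are worth cleaning carefully.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
auc = cross_val_score(model, X_reviewed, y_reviewed,
                      scoring="roc_auc", cv=5).mean()
print(f"cross-validated AUC on the labeled set: {auc:.2f}")

model.fit(X_reviewed, y_reviewed)
# Probability of caseness for every individual in the pool, feeding the
# threshold-setting step sketched above.
p_case = model.predict_proba(X_pool)[:, 1]
```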
I'll stop there and answer any questions. It's been a pleasure to be here and to learn more about the field of phenomic medicine.

I think we have time for one question, and it goes to Lucila. I have a consolation statement: transforming data from any system is always hard, and we've seen that difficulty as well. However, transforming to OMOP has, for us, been more rewarding than the multiple other common data models we have been using, the reason being that there are tools associated with it. We would also encourage disseminating to everyone the quality-control algorithms that you used at the VA, because they were very helpful; Mary Whooley from UCSF introduced them to us. That was more of a comment than a question.

I think we have to look at many ontologies, and some of them were designed around a single system: i2b2, OMOP. I think the utility is there as a potential starting point. When we did an experiment with PTSD, we actually started with OMOP for the PTSD cases, and we found about half as many cases as when we cast our net more broadly, not restricted by OMOP. So I think you have to be careful: OMOP defines rough bins, which can be a very valuable entry point at the beginning, but I don't think it should be viewed as a replacement for the complex curation of phenotypes.

Yeah, and I think it will also evolve as these various projects define other needs, data that is not yet in an accommodated format. So my real question was actually for NIH. I see great value in having All of Us and the MVP programs here, but what about the other programs, like the CTSAs and the new BD2K programs? Is there an intent to have this all converge into a data commons?

There's always the hope that that will happen, Lucila; as you know, making it happen practically is a real challenge. The CTSAs are actually represented around the table in eMERGE; almost every eMERGE site has a CTSA. Working with them to reach commonalities has been a little more challenging than we would have liked. I'm not sure about BD2K; maybe Eric could comment, since its future doesn't seem to be well known yet. But I think what you're saying is that we need to keep in mind the broad variety of efforts out there that it would be really great to collaborate with and share models with. Would you agree that using the FAIR data standards and OMOP is a step in that direction?

So Sharon and I have decided to invoke the chairs' prerogative: having finished our contextual presentations, we think this is actually a good time for a break. We're going to take our 20-minute break and plan to reconvene at 10:15, and then we will launch into our panel presentations, which will be much more focused, deeper slices. That will also give attendees a chance to ask the questions we didn't have time for in the open session. I'd like to thank all of our presenters for the great foundational context for the topics that will follow after our break at 10:15.