Okay, so taking a step to the other side of the association equation, let's consider phenotype and exposure harmonization and standardization. Just a little background: genetic data has tended to have some level of harmonization or standardization built in. There's been a relatively limited number of platforms, a limited number of GWAS panels, exome chip, et cetera, which to some degree limits heterogeneity across the data. And the variables tend to have standardized names: rs numbers, chromosomal positions. Phenotype and exposure data, however, are individual to each study. Contributions to these differences include differences in questionnaires and data collection forms, variable names, measurement units for quantitative variables, and definitions for qualitative ones. Biomarker assays differ across studies and over time within studies. And some studies began many years ago, so there are era effects, et cetera, just from the data having been collected over a large time span. So I believe our overall goal is to maximize the sample sizes of the phenotype and environmental exposure data for samples that have existing genetic data, to increase statistical power to detect association; also to facilitate identification of variables that are needed by investigators utilizing data in dbGaP, and to reduce duplication of data harmonization efforts. Toward that, a first step will be to maximize the utility of the existing phenotype and exposure data, and usually the step for that is performing harmonization of a panel of phenotypes. I think there's probably a finite number of phenotypes and exposures that we could identify. It's somewhat large, but not so large that it's out of the realm of possibility to produce some set of phenotypes that we want to go and harmonize across all studies.
And then ensure that all potential existing phenotypes and exposures in the various studies are actually incorporated into dbGaP when possible. There are phenotypes, et cetera, in the studies that haven't actually been submitted to dbGaP but could be if some effort were made. Also, obtain new phenotypes and exposures on existing study participants with genetic data, which means going back to the existing participants and collecting additional data. And then, for new projects, encourage the use of standardized phenotype and exposure measurements. This would apply going forward for future studies, but that effort would really pay off, both in reducing the need to harmonize phenotypes in the future and in being able to identify people with specific phenotypes whom you can then go and genotype for studies. So, phenotypes in dbGaP: there are often many variables for a given phenotype in a given study when a basic search is done. There are lots of reasons for this: multiple visits within a study, sub-cohorts within a study (for example, Framingham), and different definitions, where there will be multiple variables for different definitions. The variables and definitions may have different keywords to indicate them. So if you do a very simple search, for example if you're looking for hypertension, some of the variables may be under high blood pressure, some may be shortened or just partial words, et cetera. It's not always a simple search. And then there are varying levels of documentation submitted to dbGaP. Documentation tends to be submitted with the variables, but it can vary widely across studies and, again, within a study. Ancillary study variables tend not to be as well documented as original study variables. And when there isn't enough documentation to make informed decisions, the additional documentation is not always readily available.
So, just as an example: for some of the NHLBI HeartGO cohorts, utilizing the CARe data to go through phenotype harmonization, you start out with a total of 55,000 variables in the CARe data sets. Broken down by study, you can see there's variability in how many variables there are, with Framingham at over 20,000. As part of that process, as part of ESP, we've worked to create a set of harmonized phenotype and exposure data across approximately 140 variables, harmonized to the degree possible for these cohorts as well as the Women's Health Initiative. We came up with composite names for them, such as BMI at baseline and current smoker at baseline, and there is documentation that maps them to the original study variables. This is so that investigators don't each have to pore over multiple variable names. The process of phenotype harmonization is multi-step and iterative. Usually in consortium efforts it's started by a point person or a working group, who works with phenotype-specific working groups or project teams as well as experts on the disease or trait. Then there's usually a first pass where you scan through all the variables, so you start out with the 55,000 and try to whittle that down: for each phenotypic category, you collect all the variables related to your category of interest. Then you go back to the working group and the experts and try to hone in on a common definition that you can reach across all the studies and that will address the scientific question of interest. Throughout the process there are lots of things you have to look into and consider, including sample size, measurement units, distribution of the trait, assay information, what visit it came from, et cetera.
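The mapping from study-specific variables to composite names like "BMI at baseline" can be sketched in code. This is a minimal illustration of the idea, not the actual CARe or ESP mapping tables: the study names, variable names, and unit conversions below are all hypothetical.

```python
# Sketch of mapping study-specific variables onto harmonized composite
# variables. All study names, source variable names, and conversions here
# are made up for illustration; they are not real CARe/ESP mappings.

# Each entry: harmonized name -> {study: (source variable, unit conversion)}
HARMONIZATION_MAP = {
    "bmi_baseline": {
        "study_a": ("BMI01", lambda x: x),             # already kg/m^2
        "study_b": ("bodymassindex_v1", lambda x: x),  # already kg/m^2
    },
    "height_baseline_cm": {
        "study_a": ("HGT01", lambda x: x * 2.54),      # recorded in inches
        "study_b": ("height_v1", lambda x: x),         # recorded in cm
    },
}

def harmonize(records, study, harmonized_name):
    """Pull one harmonized variable out of a study's raw records.

    records: list of dicts keyed by the study's own variable names.
    Returns values in the harmonized unit (None where missing).
    """
    source_var, convert = HARMONIZATION_MAP[harmonized_name][study]
    return [convert(r[source_var]) if r.get(source_var) is not None else None
            for r in records]

# study_a records heights in inches; harmonized output is centimeters.
rows = [{"HGT01": 70.0}, {"HGT01": None}]
print(harmonize(rows, "study_a", "height_baseline_cm"))
```

The documentation that maps composite names back to original study variables plays the role of `HARMONIZATION_MAP` here; in practice each entry also carries visit, sub-cohort, and assay notes.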
So you probably can't really see this, but I just very quickly wanted to give people an idea of what you're faced with when you start to do these variable searches. This is asthma in Framingham, and this is actually only two of the three sub-cohorts, so there's a lot more, and probably even more within the study itself. Somebody going to dbGaP may just want to say, I want to look at asthma. They can look at this, and it's not straightforward how to incorporate it, pick which variables to use, or even decide which variables belong to which sub-cohorts, et cetera. And this is the listing for hypertension, across several different studies, just to show that you would have to search under hypertension, high blood pressure, HTN, et cetera. And then it's not always clear what definition was used for each one. Some of them will say definition 4 and 5, and how do you find out what definitions 4 and 5 are? Again, some of the documentation is there; sometimes you have to search for it; sometimes you need to go back to the original study websites. It can be a lot of work, and a lot of people just don't know where to start. And I wanted to mention medications. Medications are a whole new level, because I think medication-by-variant interactions are going to be of interest. This was just looking for hypertension medication status. Some of the variables are overall composite measures of any hypertension medication; some are listed out more specifically. The medications change over time, so it can depend on the visit. And some of the studies actually have databases where they use codes. I didn't list those variables here, but that's the way to get to the medication information in some of the studies.
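A first-pass variable scan like the hypertension search just described usually relies on a synonym list, since the same phenotype hides under "hypertension", "high blood pressure", "HTN", and partial words. A minimal sketch of that kind of search follows; the variable catalog and synonym patterns are invented for illustration, not real dbGaP records.

```python
import re

# Hypothetical variable catalog: (study, variable name, description).
# Illustrative rows only, not actual dbGaP entries.
CATALOG = [
    ("study_a", "HTN5", "Hypertension status, definition 5, exam 2"),
    ("study_a", "SBP1", "Systolic blood pressure, exam 1"),
    ("study_b", "highbp", "High blood pressure ever diagnosed"),
    ("study_b", "smoke1", "Current smoker at baseline"),
]

# Synonym patterns to catch the different naming conventions for one phenotype.
HYPERTENSION_TERMS = [
    r"\bhypertension\b",
    r"\bhtn\b",
    r"high\s*blood\s*pressure",
    r"highbp",
]

def find_candidates(catalog, patterns):
    """Return (study, variable) rows whose name or description matches
    any of the synonym patterns, case-insensitively."""
    regexes = [re.compile(p, re.IGNORECASE) for p in patterns]
    hits = []
    for study, name, desc in catalog:
        text = name + " " + desc
        if any(rx.search(text) for rx in regexes):
            hits.append((study, name))
    return hits

print(find_candidates(CATALOG, HYPERTENSION_TERMS))
# Matches HTN5 and highbp, but not SBP1 (plain blood pressure) or smoke1.
```

Note that this only produces candidates; deciding which candidates share a definition still requires the documentation and the working-group review described above.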
Challenges of retrospective harmonization include that it's time consuming, obviously, and that there will always be differing levels of ability to harmonize across the studies; some things you really can't harmonize very well. There are inconsistent measurement units and definitions, sometimes even within a study across visits, and we've also run into changing units within a single variable. So you really have to look at the distributions. And sometimes there's not enough documentation to figure things out. In the process I've been working on with several others, there's a lot of going back to representatives of the cohorts and asking questions, and sometimes asking multiple rounds of questions. The impact of medications is also a big challenge: for lipids measured in people 20 years ago versus lipids measured very recently, you may adjust for lipid medication status, but the kind of adjustment you would apply for a recent measurement, which probably reflects a modern statin, versus one from years ago could be very different. Another issue is that data submitted to dbGaP is often limited to the primary study variables. What we've seen is that a number of times there may be additional phenotypes or exposures that were measured in ancillary studies, but the investigators aren't aware that the genetic data is in dbGaP, and there's no mandate, so they don't have to submit it. But a lot of times, if you go and ask them, they will. Similarly with additional visits: it's a matter of making sure that additional visit information is incorporated as soon as possible. And then, as a set of recommendations that I think most people would agree with: there's a need to develop a panel of some number of harmonized phenotypes that have common variable names, common units of measurement, and common definitions.
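The distribution checks mentioned above, in particular catching a unit switch inside a single variable, can be partly automated with a crude heuristic when the two units differ by a known constant factor (for example, total cholesterol in mg/dL versus mmol/L, factor roughly 38.67). The function and thresholds below are an illustrative sketch, not a validated method; real harmonization would inspect per-visit distributions rather than single values.

```python
MGDL_PER_MMOLL = 38.67  # approximate mg/dL per mmol/L for total cholesterol

def normalize_cholesterol(values, plausible_mgdl=(80.0, 500.0)):
    """Flag and convert values that look like mmol/L inside an mg/dL variable.

    Heuristic sketch: anything an order of magnitude below the plausible
    mg/dL range is assumed to be mmol/L and converted. The thresholds are
    invented for illustration.
    Returns (converted values, indices that were flagged and converted).
    """
    low, _high = plausible_mgdl
    out, flagged = [], []
    for i, v in enumerate(values):
        if v is None:
            out.append(None)
        elif v < low / 10:  # e.g. 5.2 is a typical mmol/L value, not mg/dL
            out.append(v * MGDL_PER_MMOLL)
            flagged.append(i)
        else:
            out.append(v)
    return out, flagged

# One mmol/L value mixed into an otherwise mg/dL variable.
converted, flagged = normalize_cholesterol([210.0, 5.2, 190.0, None])
print(converted, flagged)
```

Flagged indices would then go back to the cohort representatives for confirmation, since a silent auto-conversion is exactly the kind of undocumented change this process is trying to eliminate.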
There should also be some revisiting of the documentation that's currently in dbGaP: making sure every single variable has a definition, every single variable has visit or sub-cohort information very specifically, and also flags or notes for special issues. There are some variables in studies that are put into dbGaP, and then when we go to analyze and investigate them, the study will say, oh, you don't want to use that, that's not a good variable. But there's no note there to tell people that. So one thing we feel would be beneficial is to identify a point person or committee who could work within dbGaP to respond to questions from studies about the process of going back and providing additional information, and/or to help standardize variables. That person needs to ensure that the studies are providing information in the same way. Even when we ask for standardized measures in consortium efforts, you'll think you've got every detail listed in your documentation, but you'll still get differences back. So there needs to be a process where, when people have questions or something comes back in a slightly different way, you can follow up on it. And it seems critical to obtain input from phenotype experts and from representatives of the cohorts in order to define composite or standardized variables. Another level, of course, is gaining additional phenotypes from existing studies. I think I actually mentioned this already, in terms of going to ancillary studies or sub-studies of the main study that may not have submitted data, or additional visits. The obvious pros are that it's relatively cheap and fast, because the data already exists; it just takes going and asking and looking into it. The cons are that a lot of these variables are only on subsets of the total sample, and obviously they're not standardized across studies.
Prospective collection of new phenotypes on existing studies also definitely has a place. Before that's done, ideally, you would get a standardized set of phenotypes defined, so a big pro here is that you can get input on the panel before you actually collect the data. The cons are that it requires an additional visit or visits, which in turn require resources, as well as consideration of the burden on participants, so there are ELSI issues involved there. In general, there have been a number of efforts that have addressed harmonization and standardization issues, and when possible I think we should leverage them. Efforts include a number of consortia, including GENEVA, CARe, CHARGE, ESP, and others. There's NHLBI's Pheno Finder, an effort where phenotype harmonization tools are being developed, as well as existing variable standardization efforts, including the PhenX Toolkit. And I've heard about some really interesting work being done by NIA, where they're developing a panel of standardized measures, a set of measures they want to identify that could be collected in an additional one-hour visit, so you could revisit existing cohorts and get a panel of measures in a relatively limited amount of time. So this is open for discussion.

Yes. As a point of interest, how many person-hours did it take to go from the 55,000 terms? Basically, I haven't slept for the last three years. So, three person-years. Well, no, there were many more people. No, I'm actually looking for... So when we start talking about doing this across all of dbGaP, how big a team is it? How many volunteers are needed to do that? What I was showing is Framingham, which, no offense to Chris and others from Framingham, is about a worst-case scenario. It gets much better for a lot of studies. You can see the Armageddon of harmonization. Yeah. But I do think it's relatively feasible.
I think if you have interaction with the cohorts and there's an active effort to do this... I think there needs to be some support for the cohorts, because when I go to them, it takes time for them to provide the information, and they know the variables far better than I ever will. Again, it's going to take some level of resources to provide that effort, but I do think it's doable to come up with some panel, maybe 100 to 200 measures, and harmonize them across dbGaP. I think it's reasonable. Yes. So I'm just wondering, in trying to consider various kinds of studies: you're trying to harmonize essentially cohort studies, where each study began with a single protocol that was discussed and planned. If one goes into other kinds of data, for example medical record data, do you have any sense of how well they can be harmonized? I know everybody believes in the magical quality of EMRs, but... Except for anyone who's actually ever read one. My focus has been on these cohort and case-control studies. I haven't worked as much with the medical record data, but that's a great question: do we feel there's a payoff there for the resources, and we need to consider... But you'd have to phenotype, meaning... Yeah. ...phenotype individuals for a while to get to the kind of research questions. Right. David? So I was struck by two things. One is a question. Can we assume, since CARe invested many years in doing this for a set of 50,000 people, and now ESP has, and you've led both of these and done a spectacular job, that for at least those 50,000 people there is a file now? That's been done? I think that's largely true. It is now pretty much collected, and again, a lot of these efforts were duplicated; CARe has done a lot of this, and a lot of that went into CARe. So that's sort of...
So one point is just that the nation has invested a lot of money, between CARe and ESP, in doing this, and so hopefully we have a good foundation that could be uploaded immediately. Yeah, it's just a matter... What I think was missing is just documentation that can be gathered together, and that's what I've been trying to put together. The other thing I noticed, and I realize this is not how the genomics culture and community works, but I'm going to be provocative, like Eric said he was going to be provocative: what you described is an extremely labor-intensive and meeting-intensive project, with a lot of input and discussion. Aren't there any software solutions? To be honest, as we've been involved, not in these cohorts but in others, there's a set of moves that are routinely made, and for some other studies we are actually trying to automate them. Yeah. Not that you wouldn't need input, not that you wouldn't need consultation, but there's some amount of this that's like everyone calling their genotypes alone and then not putting them together. And again, I'm not saying we should standardize as a nation; that's a committee process. But couldn't there be efficient software, so that if there were a team of some number of people for some number of years, they wouldn't be doing it in the same handcrafted, meeting-driven way, but might be more efficient? Right. I mean, there is the Pheno Finder effort, where the goal is actually to develop software to do this. I'm actually involved in one of the projects. And it's interesting: they're using smart searches, machine learning, et cetera, to try to automate some of this. It is very interesting, but just from my perspective, I don't think you're ever going to...
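The kind of "smart search" automation being discussed can be approximated with ordinary fuzzy string matching over variable descriptions. A minimal sketch using Python's standard-library difflib follows; the studies, variable names, and cutoff are hypothetical, and real tools would use richer features than raw string similarity.

```python
from difflib import SequenceMatcher

# Hypothetical variable descriptions from two studies (not real data).
STUDY_A = {
    "HTN5": "hypertension definition 5",
    "BMI01": "body mass index exam 1",
}
STUDY_B = {
    "highbp": "high blood pressure diagnosed",
    "bodymass": "body mass index visit 1",
}

def best_matches(a_vars, b_vars, cutoff=0.6):
    """For each study-A variable, find the most similar study-B description.

    Returns (a_name, b_name, score) triples for matches above the cutoff;
    variables with no sufficiently similar counterpart are dropped, which
    is where human review has to take over.
    """
    pairs = []
    for a_name, a_desc in a_vars.items():
        scored = [(SequenceMatcher(None, a_desc, b_desc).ratio(), b_name)
                  for b_name, b_desc in b_vars.items()]
        score, b_name = max(scored)
        if score >= cutoff:
            pairs.append((a_name, b_name, round(score, 2)))
    return pairs

print(best_matches(STUDY_A, STUDY_B))
```

String similarity pairs up "body mass index exam 1" with "body mass index visit 1", but it cannot tell that "hypertension definition 5" and "high blood pressure diagnosed" refer to related phenotypes with different wording and possibly different definitions, which is the point made above: the automated pass narrows the search, and investigators still have to confirm the mapping.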
I feel like, at least for now, there still needs to be input from the investigators, because the information you can get automatically just isn't all there right now. Yes? If I could maybe speak to both Aravinda's and David's questions: in terms of electronic medical records, there are electronic algorithms that can be developed to do this and be tested, and it takes about one year, with maybe two people, in groups, assessed by predictive value against a gold standard of clinician review. Having said that, that same process could conceivably be applied to some of these cohort data, and I think that hasn't been tried yet in a project like eMERGE. One of the challenges is that you end up with a bunch of people you're really quite confident are cases, a bunch you're quite confident are controls, and a lot in between that you can't classify. And that's not what you want in cohort studies. Yeah. Yes? I'm curious, and maybe I missed it: do you talk to the P3G guys in this DataSHaPER world, the epidemiologists who hang out together in that kind of zone? Because it's a very similar task they've set themselves, harmonizing across observational studies. I haven't. You haven't come across these guys? I haven't, but it sounds like I need to. People need to. They've taken on precisely the same task, as it were, across epidemiological groups. PhenX is talking to them, Ewan. I was just going to say, the PhenX Toolkit, which has about 350 standard measures in an online resource, is more for moving forward, adding standard measures to new studies. But we've been working with P3G for some time and have mapped the PhenX measures to the P3G variables, and you're right, P3G is used more for harmonizing data in existing studies and biobanks. Yes? Is there a plan to deposit all these mappings back into dbGaP so everyone can just get them, rather than reproducing them?
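The three-way outcome just described for EMR algorithms (confident cases, confident controls, and an unclassifiable middle) is the natural output of a rule-based phenotyping algorithm. Here is a toy sketch of that structure; the field names, thresholds, and rules are invented for illustration and are not the actual eMERGE algorithms.

```python
def classify_diabetes(record):
    """Toy case/control/unclassified rule, loosely in the style of
    rule-based EMR phenotyping algorithms. All field names and cutoffs
    below are hypothetical, not a validated algorithm.
    """
    dx = record.get("diagnosis_codes", 0)    # count of diabetes codes
    rx = record.get("on_diabetes_med", False)
    a1c = record.get("max_hba1c")            # percent; may be missing
    if dx >= 2 and (rx or (a1c is not None and a1c >= 6.5)):
        return "case"
    if dx == 0 and not rx and a1c is not None and a1c < 5.7:
        return "control"
    return "unclassified"  # the in-between group cohort studies can't use

people = [
    {"diagnosis_codes": 3, "on_diabetes_med": True, "max_hba1c": 8.1},
    {"diagnosis_codes": 0, "on_diabetes_med": False, "max_hba1c": 5.4},
    {"diagnosis_codes": 1, "on_diabetes_med": False, "max_hba1c": 6.0},
]
print([classify_diabetes(p) for p in people])
# ['case', 'control', 'unclassified']
```

The "unclassified" bucket is exactly the problem raised above: a cohort study needs a phenotype value for every participant, whereas a case-control design can simply drop the uncertain middle.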
So that's largely what I've been trying to do: collect the efforts that the people in the cohorts have made in their discussions in these other consortia and in the ones we're working on now, and compile it all together with the goal of depositing it. I need to work with people at dbGaP on figuring out how to deposit it, or in what manner; it's not there now. Well, we'd be happy to work with you the instant you're ready to give it back. Okay, great. I would like to talk to somebody about the best way to get this information in. Yeah, we've done quite a bit of work with the PhenX group, trying to do some remapping back onto dbGaP, and with anyone else like yourself who's done it. There is a way to put it back in, to make it visible to people, and to credit you for it. Great. It's just that the number of people who have actually done this is small. Yeah. Great. Yes. So I think what would be very important is to make sure that, going forward, we have harmonized standards that we can use very easily, right? To have a software solution, to reflect back to that earlier question, where you have all these variables and you can pick and choose those variables when you design your study, would be a really good step forward. Mapping what we already have from the past is not that easy to solve, but maybe going forward this group can... Yeah, I spoke a little more on harmonization because that seemed to be more of the focus today, but standardization moving forward is absolutely critical. It's going to make things much more feasible as we move forward. Yes. Given the fact that the primary users of electronic medical records really seem to be the payers, is there any activity among the payers to try to demand more uniform phenotyping? I'm just wondering if that would be a good coalition...
Yeah, I'm not familiar with that, but maybe somebody else knows. Yes. Yeah, I was thinking along the same lines, because if you're incorporating EHR data in the future as well as cohort studies, have you thought of using OMOP or similar models for this mapping, or are you creating another one? So, again, I'm not familiar with that, and maybe, at least in the efforts we've done in ESP, CARe, et cetera, it's something we've missed and duplicated. There needs to be a way to make sure people are aware of this. I'll have to find out what that is. Yes. So I was just going to make a quick comment in response to the payers question. Yeah. The answer is that the payers are not requesting this for the most part. In the eMERGE network we actually have done some work to try to harmonize the language used in electronic medical record systems for cross-institutional studies, and one of the things we came across was that the same phenotype was being specified in different ways at different institutions, mainly because of business processes. And this does not affect the payers' interests, because from their perspective the phenotypes were sufficiently related that they were still paying for the same thing. So they're not really trying to force us to standardize, because they recognize that business processes differ from one institution to another. Interesting. Eric, I knew you had a comment before. I want an opinion from a cohort. No, no, I didn't have a comment. Okay. Thank you.