Thank you very much. We have touched on a lot of the important issues in design and selection, so when I can, I'd like to just refer to some of our experiences that might bear on those. I might skip over some, and I hope that will leave a little bit more time for discussion at the end. I try not to be completely bound by my experiences, which largely come out of working in the NCI Cancer Cohort Consortium, but it's only fair to let you know that I'm certainly informed by them. Genome-wide association studies essentially brought epidemiologists together in a way that I'm very grateful for. It made us step up to the plate. We combined cohorts to create what you might think of as a synthetic cohort to attack that problem early, not too long ago, and then realized it was especially important to combine cohort studies, that is, studies with prospectively collected data, to look at lethal diseases, because if you tackle them after the fact you lose a lot of the interesting cases. It's not just true for genomics; it's true for other important things, particularly where you really want your biobank to have things collected in advance. Vitamin D was a good example, and then it applies more generally. We've also worked on problems like obesity, where you don't really need a biobank. We're most recently turning our attention to making consortia focused on minority populations, and our very most recent effort is in adding on survival, second cancers, and treatment efficacy. So this is a pretty typical example of cohorts getting together, because in order to do a genome-wide association study of any power you need to. Afterwards, and this is long past discovery, with 6,000 cases, and because the cases were nested in prospective cohort studies, we could ask the question whether we were actually getting closer to personalized medicine.
This is from the same consortium, so now the synthetic cohort includes everybody that was in this one, plus another seven cohorts, including the Women's Health Study that Julie described. An interesting thing here is that, just as I think Lynn was mentioning, we've known a lot about blood group, but for some reason it took GWAS to bring us back to it and realize that it was ABO blood group B that was playing a major role, and also to discover a number of other variants. Because this study was nested in cohorts, we could then go to a couple of the cohorts that didn't have DNA but had asked doctors or nurses 18 years ago, what's your blood group? So within the same overall structure of cohorts, we could get an independent replication of what we found through the genome-wide association study, because lo and behold, the very people who said they were blood group B, where they were not tested, had exactly the same relative risk, which was two. This is not sequencing, but the situation is similar in the sense that before we did a pooled study looking at serologic measures of vitamin D, the literature on these cancers, one of which is non-Hodgkin lymphoma, on which I've done some work, was a mess. We were wasting time, we were chasing false leads, and it wasn't inexpensive to do. So by getting 10 cohorts together and looking at seven fairly rare cancers, I would say the important discovery here was the dog that didn't bark. This was to show that we could stop. And that's an important value I think we have. I'm afraid that if we underpower any of our sequencing work, we could risk false positives, not so much that I think we'd drive bad clinical decisions, but they might waste our time too. So I think we have to embrace the reality that the sample sizes are going to have to be big. And this one is, again, an extension of that same synthetic cohort, but now we're out to 19 cohorts.
And here the point is that we were able to address something that individual cohorts really can't, but when you put them together, we could put some controversy to rest. This is the one where I said the harmonizing didn't kill you; the harmonizing wasn't hard. If you want to know more about our particular consortium, the NCI Cohort Consortium, you can go to our website. But I think we are typical; that is, there are other cohort consortia. Our cohorts are heavily, but not exclusively, funded by the Cancer Institute. Many of them look at many other outcomes. Ours are large because cancer is an uncommon event, so by definition, cohorts that started out their life intending to look at cancer tend to be large. We've touched on this in the last day, and I don't think I know the answers. But I think that if cohorts were coming to me now and saying, please, please, please, let the sequencing be done in my study or from my cohort, these are the questions we would ask. We would say, why do you think that for this disease there really are some novel genetic influences to be found? Are you in a situation where there's an environmental component that's so terribly important that it's likely that sequencing is going to reveal more of the hidden heritability in an important way? Are you going to find a biologic intermediate? Are you going to improve our ability to predict risk, or to connect one disease to another, as was elegantly shown in the eMERGE phenotype scan? I think these are the things that we're looking for from the cohorts that we would be willing to make the sequencing investment in. The thing is, everybody who would make the case would be able to say they'll find novel influences, and they will. But the hard question is, will they be many, or will they be important? So I think, but I'm not sure, that this will go the way it did for GWAS, where it turned out that heritability was a pretty good predictor of which cancers, which I know best, would have a lot of GWAS hits.
I think it's probably still true that if there's a lot of heritability and we don't know where it is, that makes it a good case for sequencing. I'm not positive, but I think it would help me. And then I'm not sure what I'd think if you said, well, mine's really important to do sequencing in because I know there's a lot of heritability, and I also know that obesity and tobacco are big effects. I'm not sure if that would make me more or less likely, but I think it's worth at least inquiring about. We've had sort of a mixed record from GWAS and cancer in that respect. And then of course, not just for cancer, but for many brain conditions, for many things, if we could be simultaneously sequencing the relevant tissue, I think that would make any opportunity look more attractive. So what about genes and environment? If we're going to do this, we have to not just ask the questions about the environment; we have to have a decent range of exposures. We have to have enough people who smoke heavily, enough people who are heavy. For some of the interesting environmental questions, you have to have a specialty cohort. If you want to be the person who found that there's one variant which nicely describes who's at risk of getting a second cancer after radiation for Hodgkin lymphoma in childhood, you really have to be in that cohort. You're not going to do this in a general cohort. I've said rare variants there, but it could be any variants. It could be the locus that the variants lead us to. I mean that to be a little bit fuzzy, and the little dotted lines are there to say that presumably what we find from the study is what the connection is. So let's think about two outcomes, cardiovascular disease and endometrial cancer. We know that BMI is a predictor of both, and tobacco is a predictor of cardiovascular disease but not of endometrial cancer. And we now have a lot of multimarker panels ongoing, because we suspect a role for inflammation and immunity in both.
So I think you're a more attractive candidate if you say, I'm in a setting where my rare variants might illuminate this sort of biological mechanism. But again, this is a reach. This is a hope. We've seen some examples, a beautiful one where it was really clear, and a lot of it is just, I hope it will be so. But if it is so, and we do want to go to places where you can look at, say, inflammation markers in the blood to see if this is what you found, then again I think you have to be in a setting where a biobank existed to collect stuff in advance. This is a sort of stylized prediction curve. I made a very bent one there. I thought that maybe in heart disease it exists. It certainly exists in HPV and cervical cancer, but not in a lot of other settings. Some of the work we have done shows, not utterly surprisingly, that by adding discoveries from GWAS we bend that prediction curve some. We're still not out, for most cancers anyway, into an actionable zone. But I think a study would be an attractive candidate for sequencing if it said, some of what I'm going to be able to do is possibly really bend that curve. So we're looking for a cohort which is perfect. It's modern. It's got sturdy consent. It's got a terrific biobank, easy access, great informatics. But somehow it also has to be mature, meaning a lot of end points actually have occurred. We really do have to have the kinds of occurrences of disease that we need to look at. We need it to be large, and we need it to be rich. This is a three-layer cake, and the bottom layer is that we have lots of data on behavior, lots of information on the environment, and lots of information on phenotypes, and now we've sequenced everybody. Wouldn't that be perfect? It doesn't exist. What is realistic is what we have heard, which is that there are many, many good cohorts, and I think you could probably even say consortia, many settings in which we have a very great deal of information on a lot of people, maybe more than a million.
And then on a subset of those, a lot of useful information like serial samples, and on a subset of those, later information that we got from questionnaires. Among the newer cohorts, Vanderbilt runs the Southern Community Cohort Study, which was recruited out of clinics and has a substantial African American population. They routinely bring people back. The Black Women's Study routinely brings people back. There was a question about how compliant people are with coming back, how committed and attached to their study. Very, very compliant. Many of the cohorts that don't have the luxury of being nested in a clinical care setting, or don't have the ancillary data that they need, are now matching to the Medicare file, so every hospitalization that occurs can be known. So I think that's what most studies actually look like. They have a lot on a lot of people, and then little bits and pieces. So is that top layer sequencing? Yeah. So all the different cohorts have different things. They look very different one from the other, but for a fraction of the cost, we are able to put them together, save a lot of time, and do things in months, not years. Thanks to genomics, the cohorts finally have gotten in the habit of sharing data with one another and harmonizing, and we know that it works. Okay. So what might we be looking at? We might be looking at a subset that we select for sequencing from, say, I've got six colors here, so maybe there are six cohorts, rather different. Some might be nested in a care setting. Some might be in a general population. They all should be used in the base. One of the mistakes we've made in epidemiology is not to use all the data we have. You can estimate main effects using all of the data you have on all of the people. Don't throw it away. Increasingly, I see people use cohorts as if they were just freezers. They aren't. Just because you, sorry, two minutes, I'm almost done.
Just because it is a wonderful freezer, you should never throw away the data you have, and you shouldn't degrade it. I do appreciate that harmonization is important, but I think for many of these purposes it can be done. Okay. What about that top layer there that we select for sequencing? What might it look like? Who might win our contest? Well, I would hope that at least some substantial proportion of what we would choose to sequence really would be from the entire set, the entire underlying population. I think we shouldn't just do that. I don't think it would be fun to sequence and only look at things that are measured in medical records. I think we should pick two diseases. I don't know for sure what they should be. We can argue about it. But I think we need to make sure that we're actually in the business of discovering genetic causes of diseases. I do think extreme phenotyping has a place. Take the obese participants: I took a quick look at what we have in the consortium, and we have a thousand deaths in men and 2,000 deaths in women who have a BMI between 40 and 50. Now, not all of those people are genotyped and not all of them have DNA. But that's enough to be able to make really the world's best estimates of the effect of class 3 obesity on many outcomes, and a subset of those people might be suitable for certain purposes. And then I think we also do have the ability to select participants with a family history of the conditions we're especially interested in. So it's not a piece of cake, but it's tractable. It's quite doable. It's not ideal. It's the thing to do, I would say, in the shorter run. Okay, so there is a trade-off between the mature cohorts and the other opportunities. I think it's wonderful to work in something like an HMO setting. I think a lot can be done. But I also think that, as Julie said, we have sunk a lot into doing some very large cohorts with an awful lot of data. If we can use them, we should.
A single cohort is always easier, but the advantage of multiple cohorts, and I think Terry's written about this extensively too, is that then you're in the great situation where you discover that it's ABO blood group B that is the risk factor. And yes, only a few of the cohorts asked about it, but they did. Suppose you had discovered something else. The other cohorts asked. So by actually using all of the data, you increase very much the amount of phenotypic data that's available to bring to bear on the problem. It just won't be available on absolutely everybody in the cohort. Of the many designs, I do think we should stay with a case-cohort, because we don't know exactly what the things are that we might want to link. And it's nice to have some ordinary people that we have been sure to sequence. I think that we've learned so much today, but I think we still are basically having to take our best guess on what the actual value of the sequencing will be. So thank you very much, and I would welcome your comments and questions.
Thank you. In addition to family history, do you have information on family members of any of the people in the cohorts?
Yes. Remember that we're just one of many consortia, and for most of the diseases there also are consortia exquisitely detailed and devoted to families. Putting that aside, it's very variable. These are altogether 43 cohorts, 20 that are extensively used. It varies from cohort to cohort. Some have a big family component, some don't. Eric?
A couple of comments. First, I really like that you mentioned the case-cohort design. I'm a big fan. I think it fits very nicely into the setting. And then second, your slide, not that I'm fixated on food, had a little piece of chocolate cake on the bottom, and I want to make sure I understand what it looks like it's implying.
I don't know that I know how to make it come back.
I think I can remember. It said something about a sample from each of the cohorts and these two case groups, X and Y.
So are you suggesting we create an amalgam cohort random sample, or nearly random?
Yes.
And then plus we have a nested set of cases enriched for a couple of diseases, and we will debate what they are.
Yeah, let's just imagine, let's just play. We know that we will want some of the cohorts that were built and sized for cardiovascular work. They are smaller than those that were sized for cancer work, by and large, but they're very rich and we want them. So just imagine they put in, I don't know, Rancho Bernardo, Framingham. We put in a very richly annotated cohort. That needs to be in what we might select. And then we might put in one that's nested in medical records. So the synthetic cohort has cohorts, as your little model indicated, of different ages, but it also could have cohorts that have different depth and value. So maybe there will be, let's say, five cohorts together that are going to turn out to be this synthetic cohort we create. By knowing which questions we think may be most attractive, we might say we really want at least one of those cohorts to be able to go after cardiovascular disease with some data they already have lying around. That would be a possibility. And then we would be sure that a fraction of those are sequenced, chosen at random, regardless of what the outcomes were. Did I answer the question?
Yes. Would the nested cases, the enriched cases, need to come from the same sampling framework as the original cohort random sample? You mentioned Dan, for example, so let's use him.
Okay, that's a lot of fun, he's not here.
So would we bring in cases from the Vanderbilt system, even though our cohort random sample came from all of these studies outside? Are we creating a substructure problem that Lynn and I guess Peter were talking about earlier?
No.
Evan and then Thomas.
Yeah, Evan. This is just a follow-up to the previous question about information about availability of family members.
And also in regard to a comment made in the morning about the potential for integrating family studies and population studies. It does seem that perhaps one thing to be paying attention to in thinking about design would be the potential for integrating that way, for example, using particularly interesting phenotypes or genotypes identified in the population studies as probands for family studies in which you could then get additional information. That might also have implications for power and cost calculations as well, as maybe some of those hybrid designs could be a lot more efficient, not requiring such an enormous N, and potentially lower in cost. So that may be something to be thinking about. So both efficient in terms of sample but also in terms of genetic analysis too.
So one of the cohorts I mentioned was the NIEHS Sister Study. Those sisters are enrolled and engaged in studies. I think there are three other consortia, sorry, three other cohorts, that are in the NCI consortium. And I imagine there are a great many in other consortia who have made that their special focus. This is analogous to Julie saying that her cohort made a special focus on hormones. There are cohorts that had a special focus on knowing everybody in the family and finding them and getting, as one of their questionnaires, very detailed family history. But that's not the norm. That's not what most cohorts do. Eric, did you want to follow up on that idea of maybe comparing a case group to what wasn't actually the underlying one? My sense is we should use cohorts as cohorts. We should vary the selection. We might take a disproportionate number of people who are, say, obese. But I think we should treat the cohorts as the underlying population from which we are sampling.
Thomas.
And Evan read my mind.
I wanted to add something about the importance of including families, or at least the rest of the relatives, particularly if you're looking at the importance of novel variants for complex disorders.
Chris and Julie.
I just wanted to comment that, from a practical perspective, the Exome Sequencing Project that's been mentioned, and the CHARGE consortium, have done sequencing through the ARRA funding with exactly the design you're describing. So there's some practical experience, and I think it's worked very well. It's actually had the added benefit of further increasing the collaboration amongst the cohorts, which has been a great thing. But there are challenges with sample management, with DNA amounts, with quality, with phenotype harmonization, all that we talked about earlier, and those all have to be thought about in advance, and also selection of subjects for the sequencing who have the most phenotypes. For example, the Women's Health Initiative has a very large cohort with not many phenotypes but a smaller cohort with lots of phenotypes. So that was the source for this random cohort. So I think it works, and it's something that definitely should be considered, and I just wanted to mention there is that experience.
Judy.
So I don't know if this is beyond the scope of what's being considered, but in the sequencing projects, what about incorporating sequencing of the intestinal microbiome, especially in the context of looking at issues like disease prevention, as well as obesity?
So that's another thing that happens when they get the cohorts together. This is the harmony part I like: I think there are three cohorts within the consortium who are actually going to do a joint microbiome project for just that reason.
Terry.
I don't think you mentioned, at least not on your slide, ancestral diversity in your population.
So OK. Yeah.
So one reason why, actually on slide number one, we have pulled together the three cohorts that are all rather new and that have substantial African American populations is precisely to bring up that fraction in what's covered. We have also taken a subset of all of the Asian Americans within the BMI analysis, which turns out to be a really substantial number, and done an analysis there for basically zero cost. So there are cohort consortia that are forming around the need to up the diversity in what's generally covered. But no one, I believe, would think that there would be much possibility of actually taking a sort of random sample of the population, and neither is it needed. If you look at where we are purposely oversampling for African Americans in some of the newer cohorts and look across the board, and I'm going to have to put a footnote on this for sequencing, because I have no idea if it would be true for very small ancestral populations, but for broad continental populations within the United States, I think we have, among all of the cohorts, lots of coverage of the population groups in the United States. Is that getting at what you want, or do you mean something else?
It is to some degree. I guess that I am nervous about the rare variants being really very, very population specific. African Americans did tend to come from one area of Africa, but it still is a pretty variable area of Africa. So I'm not sure how best to address that other than to say that, gee, we ought to think really hard about being sure that the diversity that we include is interpretable, so that we've got adequate reference populations that we can refer back to as needed.
That's where I turn to my friends in genomics and say, here's a problem for you.