So, Steve Chanock? All right. I hope the slides will go live; if they're not, I'll just read off of them. I believe my charge was to listen to this first very, very interesting hour and forty minutes, quickly alter some slides that I had, and raise questions that had come out of the discussion and the terrific presentations. And hopefully we're going to go live, or... all right, good. The first thing I wanted to say was: we have to think about what we can do and what we want to do, and those are two very different things. The next couple of slides hopefully will give us the ability to begin to do what I think Eric put in his first slide, which is develop the guidance for NIH for deciding who, when, and why to sequence. At this point we really have four major elements, sort of hearing in the discussion here: the choice of technology, when to plug in, when perhaps to re-plug in; the well-phenotyped studies; and then, I think, Peter raised a very important question: how do we actually analyze and call the variants?
That's a very unstable world for a good part of the genome, and so the availability, and having these spaces where we can continually revisit and improve things in the iterative nature, I think, of genome sequencing and analysis, is critical; and then, of course, the availability with adequate consent.

So if we go from the general to the very specific considerations in thinking about mapping diseases and traits, let me throw up a first set of questions for the group to think about tonight and then tomorrow, starting with a pilot, or pilots. Here we know that there are very rich sample sets where we have multiple outcomes, and we can certainly come up with compelling scientific questions. But it's the third and fourth bullets that I think are important: the economy of scale, and developing a process. These things are going on in other studies, as Peter showed; you know, some thousands of individuals being sequenced elsewhere. But if there's going to be a large-scale, a much larger-scale, effort towards a million sequenced genomes and a million well-phenotyped individuals, we have to think about the economy of scale, how we develop that process, how we feed into that particular process, and then the validation thereof, because error rate is a real issue.

And it really gets at this question, moving to the larger studies, of thresholds. As we all know around this table, the history of genetics is thresholds: just what can we eke over, statistically or in enough members of a family, to find something? The question here, in thinking about this: do we have to throw the Hail Mary pass, boldly, and go with a few phenotypes to a much larger number, to be able to work backwards and hold out a subset of those individuals and ask with what real accuracy or precision we will be able to discover what we're interested in? Power calculations are one thing, but the difficulty of calling, and looking at errors and phenotype heterogeneity and phenotype errors, really raises this, I think, very important question of thresholds, and we would think of that in terms of different effects and frequencies.

Then certainly the pipelines, the kinds of pipelines that have already been talked about today: genetic pipelines are absolutely crucial in determining which particular technology to use and how we would analyze those results. And the problems in calling are suggested down here: if we just take the genotype error for an Illumina OmniExpress chip, you know, 0.01 percent, and we apply that to the genome, as our best tool, so to speak, for genome sequence calling, we're still looking at 300,000 errors. Now, most of those errors are going to be in places we don't really care about, but some of them may be in places that we care about, and I think this is an important PR issue that we face in talking with the public, if indeed, you know, people are used to tolerating only a very small error with respect to clinical judgment. So I think that's a very important issue that we have to think about, and then, of course, the posting of that information.

So let me talk about the strategic use of the technologies. We've already had a very nice talk from Rick about where we are with respect to exomes and whole genomes, and the question I would put on the table is: for some of the studies that we start with, we would want the availability, at least, to consider revisiting those samples with supplementary technologies as we may see, whether it's PacBio or somebody else who has a way to fill in the rest of the exome or the rest of the genome, so that at some time we can get closer to saying we've looked at 98, 99, or 100 percent of the genome as opposed
to 85 percent. And I think this is going to be very important in constructing and understanding the models we have, particularly in cancer, when we look at all the GC-rich regions that we can't sequence very well, and the amplifications and deletions that take place; there are certainly, you know, small hints that in many of those places of the genome there are very rich relationships with cancer.

The second is targeting high-profile regions and thinking about follow-up confirmation, how we would do that. And then the new genomic spaces, as we think of the question of detectable mosaicism with sequential testing of individuals, and seeing that the genome is not quite as stable as we thought, necessarily, over time. Certainly epigenomics: in some diseases, like retinoblastoma in childhood, it's very interesting that it's really the epigenomic change that's been identified, and not many somatic mutations, that is very important in understanding that disease. RNA analysis certainly comes on board. And then, I think, right now we're just about to have a huge crop of fabulous papers from ENCODE; there are 18 that are coming out in Genome Research and a couple in Nature, and they're really a spectacular sort of convergence of understanding what the space of the genome really looks like. I think out of this we're going to see places and things that we may want to target and think about in looking at and understanding phenotypes better. I at least want to put that on the table as something coming downstream, in terms of being able to use that information to better look at the three dimensions of the genome, as opposed to the current maps, the physical map, that we think of in more like one or two dimensions.

Then the analytic questions: some versus many in a particular data set; depth versus breadth is another way of looking at that. But I would like to put on the table, at least for discussion, the possibility that we choose a couple of phenotypes that we would be sure are common to everybody that's getting sequenced. Again, coming back to this threshold issue: if we had, somewhere, you know, 500,000 people sequenced, and we had a BMI and some other piece of information, we would then, I think, be able to much better begin to really dissect what we really mean by genomic architecture, and exceed by an order of magnitude what we would really need to identify different types of variants with different effects and the like. I think that's something to think about, sort of superimposing over the thought of how we are going to prioritize. There are some embedded questions here that will be of tremendous methodologic and analytic value to the community.

Certainly the development of pipelines with reference sequences, which Rick raised, is just so important, and seeing even that dynamic change does have real implications for what we call and how we call it. Certainly the cost of analysis; and central versus distributed: I think we all would like to be distributed, but there is a role for a certain amount of centralization, to be sure that we develop the right pipelines. And then the sequence data is really hard to test agnostically, and that's the value of the laboratory correlations: as we get to rarer and rarer variants, they become more and more important. When you look at exomes and you see 18,000 variants you've never seen before and you start categorizing them, to really interpret them we're going to have to use other information, and that's the annotation of results.
It's really important.

So let me just say, I hope that long term we can think about a goal: that this whole discussion would move us towards a million well-sequenced genomes in really a million well-phenotyped subjects, which would be able to converge at some point in the not-too-distant future. Certainly this is not going to be the only contributor to that, but we have to think big in terms of being able to do that, with the ideas of recontacting and improving our lifestyle and environmental exposure data.

And we really live with two paradoxes that I want to end with. One is that we are trying to infer individual insights, per person, from large studies, and we know we need larger and larger studies to get at rarer and rarer events. The other is that these denser data sets, as cool as they are, are analyzed by fewer and fewer people; they just have more and more difficulties and challenges in a lot of the analytic pipelines. And so, you know, I fully believe that we have to think about democratizing the analysis, but it is a real challenge: how many people really can go into the cloud and do the things that really need to be done? So there really is a gradient there, and I think there's an educational part of this project that needs to bring more and more people into that world and give them the facility to be able to look, and create, in different ways.

Let me end by just saying: I'm here in Wilson Hall. It's named for a different Wilson, but there was a great president of the United States who made a very important comment on collective intelligence. He said, "I not only use all of the brains I have, but all I can borrow." To me, this is a call for the collaborative nature of working together to solve these problems, and it's certainly fun to see everyone around this table here. And now the time is really to roll up our sleeves, as it were, and get at it. Okay. Great. Thanks.
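Two of the quantitative points in the talk above can be made concrete with a short back-of-envelope sketch. The 0.01 percent per-genotype error rate and the roughly 3-billion-base genome come from the talk itself; the two-proportion power formula is a standard textbook normal approximation (not a method discussed at the meeting), and the allele frequencies in the example are purely hypothetical.

```python
# Back-of-envelope numbers behind two points in the talk above.
# The 0.01% error rate and ~3 Gb genome are the figures quoted;
# everything else here is an illustrative assumption.

from math import sqrt
from statistics import NormalDist

# 1) A chip-like 0.01% per-site genotype error rate, applied
#    naively across ~3 billion bases of genome.
GENOME_SIZE = 3_000_000_000
ERROR_RATE = 0.0001  # 0.01%
expected_errors = GENOME_SIZE * ERROR_RATE
print(f"expected miscalled sites: {expected_errors:,.0f}")  # prints 300,000

# 2) A crude sample-size threshold: the standard two-proportion
#    normal approximation for detecting an allele-frequency
#    difference between cases and controls at genome-wide
#    significance (alpha = 5e-8).
def n_per_group(p1, p2, alpha=5e-8, power=0.8):
    """Cases (= controls) needed to distinguish allele
    frequencies p1 (cases) vs p2 (controls)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    pbar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * pbar * (1 - pbar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return num / (p1 - p2) ** 2

# Hypothetical example: 20% allele frequency in controls, 23% in cases.
print(f"~{n_per_group(0.23, 0.20):,.0f} cases (and as many controls)")
```

Under these assumptions the threshold question above becomes one of where such curves cross the cohort sizes actually available; note the formula ignores genotyping error and phenotype misclassification, both of which inflate the required sample size.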
So I think we want to have discussion in the last five minutes or so amongst everyone around the room. Stephen did raise an interesting issue, and you're welcome to come back and join us; you make me nervous when you're standing up there. About denser and denser data sets that can be analyzed by fewer and fewer people: that's a really great point. So how can we address that in a reasonably effective way? Is that appropriate? You know, do we want to make data sets that are so complex that they will give us the answers we want but only very few people can handle them? Or are there other ways of kind of skinning this cat?

Hi, Mike Knowles. Charlie, I'd like for you to think out loud for just a few minutes in terms of your concepts of new genomic spaces, because it seems like these new genomic spaces that you talked about, epigenetics and, you know, ENCODE data and those kinds of things, may well be, you know, vitally important in moving to the next step from plain old sequence. So what I'm trying to do is ask you to think out loud about how you envision taking lots of genomic sequence and addressing those particular areas. I have a specific example, which I'll talk about tomorrow, but how do you think about that in the broader context?
I think about it in terms of what we were talking about at the very beginning here: the phrase "the biology of the genome" I think has several different connotations. As we move from a physical map (you know, we had genetic maps, and then we created physical maps that gave us an opportunity to sort of see the position of places, and now we line those up and use various fancy tests to statistically link them with outcomes), we still don't have that third dimension of time and of space. And, you know, whether it's some of the interesting epigenetic observations in diabetes or, you know, in cancer, certainly those should be put on the table. I have to say, I've had the privilege of being able to review 18 papers from ENCODE at the same time and write this perspective, and it's really absolutely fascinating to see and read all these things and think of the genome in three dimensions as opposed to two dimensions, the way in which we use it and mark it by one or two dimensions. So I'm not sure where this is going to go, but to me we have to think about that, and at some point come back to the question of what CTCF occupancy really means in time. And this is where the ability to look at genomes more than once comes in, whether it's, you know, detectable clonal mosaicism, which we know certainly is occurring with aging, or whether there are other functional things that we're going to be able to start to analyze at large or small scale. I don't know on what time scale we'll be able to do that, but I think we're going to learn an awful lot, because the genome is not flat. It's three-dimensional, or really four-dimensional, and we currently look at it as two dimensions; we try to make a third dimension, but we only do so well with associative and linkage analysis. So, you know, the question is: are there going to be new ways of organizing, or wanting to hone in on, particular things, methylation patterns or whatever, in the context of inflammatory bowel disease, in someone who's got active disease versus early disease? I mean, to me these are really, really fascinating questions, and the sequencing in the cohorts gives you the opportunity to visit some people more than once.

Sequencing has recently been called a disruptive technology, and I think Steve's point that these data sets can be analyzed by fewer and fewer people brings home the point that we are going to have to think about moving to new ways of storing and analyzing data. Eric raised the point that clouds are one way to think about that, but we're going to have to be much more creative to make sure that people remain as engaged in the analysis of sequence data as I think they would like to be. This is something where biologists need to be able to access the information, not just genomicists, not just statistical geneticists. We have to find ways of serving results of the studies, serving up the data, and making computational space available to people.

Just to add one thing to that: I mean, I think the scary thing is the tidal wave of non-academic interest in sequencing and genotyping, with 23andMe, with people getting their personal genomes. You know, it's not very long before googling your genes may be not a catchphrase but a real challenge, and as academics, you know, the challenge is how we stay in the middle of that and not become, you know, peripheral to the interpretation and use of genome data.

I just wanted to say something on this issue of sort of fewer and fewer people being able to deal with the heavy lifting inside the analysis of these data sets.
I'm actually not concerned about this problem. Most biologists are going to need access of the type I think Nancy was mainly emphasizing. The field's going to become more specialized and more stratified; all sciences do this as they become more mature. There were never very many people who actually knew how to do whole-genome assembly, or large-scale annotation of genomes, but there are an immense number of people who use the results of that kind of calculation, and that's what we're going to see here. You never want to centralize anything more than you have to, but some aspects of this will be centralized, and others will simply be largely dominated by a subspecialty within our field that really concentrates on them. One person can do a lot sitting at a computer terminal if he or she really knows what to do, so it doesn't all have to be centralized.

Just following up on Nancy's comment: one thing that was discussed at the aggregation meeting a couple of weeks ago was making results more widely available from these analyses, and I think that's one thing that might be added to the checklist of Steve's outputs to be considered. I mean really making widespread results available.
We've talked about how some of the large GWAS studies don't really make all of their results available, including the negative results, which actually are very important results to have available. So that's one thing. The other point I wanted to make is that the phenome, not just the genome, is very important for these types of studies. We learned from the exome sequencing program funded by NHLBI that when you aggregate results or data from, let's say, seven, ten, or more cohorts, the harmonization of the phenotypes is a critical step. That was sort of not really paid much attention to, but it turned out to be a time-limiting and rate-limiting step, and while it's very unglamorous, it's actually terribly important in terms of making those types of data widely available. So I would just add that to the list as well.

All right. Well, thank you, everyone. This has been a great sort of evening snack of brain food to get people engaged for tomorrow. Unfortunately, I'm going to give you homework, though. You know, we have these two main charges: thinking about the questions that can be addressed, and then thinking about criteria for identifying cohorts. I think, just because this is the first day of class, let's focus on the first one. If people could email me between now and 6 a.m. ... I'm serious. I'm serious. So if everybody could email me one or two questions that you think are important, you know, looking forward, that you would like to see addressed through large-scale sequencing of collective sample sets, I will try to put those together for tomorrow morning and develop a theme of several questions, just so, you know, we will have something that can seed the discussion. Well, I'm sure we'll shoot it down and edit it throughout the day, but I do think it's important we return to our basic charge. So please email me.
It's Eric dot Boerwinkle at uth.tmc.edu.

Okay, what's the question, or what's the assignment? The assignment is: go back to the charge. One of the two charges is to identify questions that can be addressed, or should be addressed, through large-scale sequencing in large cohort studies, to help design it. These are sort of use-case scenarios, if you will: in designing something, what will it be used for? What kinds of questions do we anticipate that we'll try to ask of this sample set and phenotype set?

Scientific questions? Absolutely, yeah, scientific questions. And I think if we have a list of broad, visionary scientific questions, that's going to help us think about how we should design it and what it should look like. Okay, so please email me, and I'll try to synthesize.

Yeah, and I would just note that Eric was copied on the note that you all got that transmitted the June 5th and 6th workshop report, so you have his email address. That's no excuse if you couldn't write it down that