attending to that time, and we'll be happy to start taking questions. Anything that folks can think of, although we may reserve the right to defer some questions to the afternoon session if it's something we have programmed in. Yes, sir, on the right.

Hi, Dr. Biesecker. I'm Jonathan Epstein, with NICHD. Like you, we face this problem with dbSNP, where it's hard to trust it, and it's hard to dig deep enough to know whether you can trust it or not. Now, you at NISC have your own genomes, your own exomes, and I've heard that you might release some of those data for other people to use for quality control and SNP filtering, and I just wondered what the status of that project was.

Yes, we're in the process of depositing those data into dbGaP, the Database of Genotypes and Phenotypes, so they will be publicly available quite soon. We can't transfer the entire result set to people for collaborative use, but on a reasonably sized basis, if people have inquiries about the frequency of variants in particular genes or smaller sets of genes, we're happy to pull those out and provide summary statistics on variant detection in that cohort. Are you using NISC for your sequencing? No, we're not. Okay, so there is this issue of platform compatibility. Ideally, you want a control sample set that was generated under conditions and processes as similar to yours as possible, so this would be less than ideal, but it is certainly available to you and to other practitioners in the audience or online. If you have requests, please do let us know, and we're happy to share those data for control purposes. Do you want to say something, Jim?

I'll just add a little to what you said. Yes, it will all be deposited into dbGaP, but we're also submitting just the variants that we have discovered in ClinSeq, releasing the frequency data on those into dbSNP as well, as something like a VCF-type file that you can work with, so it will be available that way. Eventually it would also come out through dbGaP as frequency information for all the samples. Thank you.

I have a question for Dr. Mullikin. I'm Bojik Husser from the Clinical Center. You mentioned depth of coverage, let's say 60 times or 100 times, and then you mentioned MPG as the measure of quality, and you went pretty fast through your talk, so I didn't really get that. Say, one year into the future, what is the good measure of quality? I'm not an expert in genetics, I work with data, with clinical data, but this will be increasingly important. My other question is, I understand there's always some error in reading the bases, so a somewhat naive question: if I take the same DNA and send it to your lab in January and send it again in December, what's the likelihood I will get the exact same data set back? And I don't mean those 15 terabytes of reads; I just want the cleansed, aligned, final result.

Yes, that's a very good question. In fact, it is a challenge for projects with a very long duration that the technologies we used a year ago we don't even have access to now. For example, some of the chemistries used on the Illumina machine become outdated and you have to move to the next one, so you're going to have different sensitivities as things move forward.
Now Illumina, as the example here, and I'm sure it happens with other technologies as well, is always trying to improve things so that they get better balance between high-GC and low-GC regions of the genome. So to answer your question, if you sent a sample in January of this year and then sent one again now, how similar would the results be? They won't be exactly the same, because everything has changed underneath. It may be a different capture kit, so you'll have better coverage and new sites with variants that were never interrogated before. Other things change too; it's a rapidly changing field. I could go on about other things that could change, but the direction we're moving in is that samples sent at a later time should have a better overall set than the earlier ones, because things have been improved upon. I hope that answers your question.

I might just add: remember that when you're sequencing a genome, either by shotgun or exome, there is an inherent randomness to that process. Where those reads come from, and how much coverage you have at any given position, is a function of probability. So I would say the chance of getting exactly the same result, with all of the same calls and all of the same MPG scores, from two runs of the same genome is zero. You should take that into account, and I think it's a factor in your decision about how many samples you want to sequence for your project, because there is a variation in the assay that's intrinsic to multiplex sampling of a genome, and it is going to vary from one experiment to the next. You have to decide how much money you want to spend, how many samples you want to sequence, and how deep you want to go to make those determinations.

Well, hopefully the old data has already yielded results, in which case it's enormously valuable because it's already published. Now, if it's something that did not yield a result, then you may want to go back and try it again or think of a new experiment. Maybe it's a translocation rather than a small change in the genome, something you can't pick up with exome capture. So you just need to rethink your approach for samples that have failed in the past. There may be later presenters talking about what you can do for those families, for example, that may not have succeeded in the past.
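As a rough illustration of the randomness described above, where coverage at any given position is a draw around the mean depth, here is a minimal sketch using the standard Poisson approximation of shotgun coverage; the mean depths and the 8-read calling threshold are illustrative assumptions, not values from the talk.

```python
# Minimal sketch: under a Poisson model of shotgun coverage, the depth at any
# given base is a random draw around the mean, so some positions are poorly
# covered in one run but not in another. Thresholds here are illustrative.
from math import exp, factorial

def prob_depth_at_least(k, mean_depth):
    """P(depth >= k) at a single position, assuming Poisson-distributed coverage."""
    return 1.0 - sum(exp(-mean_depth) * mean_depth**i / factorial(i) for i in range(k))

for mean_depth in (30, 60, 100):
    p = prob_depth_at_least(8, mean_depth)  # e.g. require at least 8 reads to call a genotype
    print(f"mean coverage {mean_depth}x: P(>= 8 reads at a base) = {p:.6f}")
```

Even at high mean depth, some small fraction of positions falls below any fixed threshold, which is one reason two runs of the same DNA never yield identical call sets.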
Hi, Anelia Horvath from NICHD. I have no experience with the Illumina, but I've been using SOLiD from Applied Biosystems for the last year, and it seems like one of the biggest challenges we face is that when we apply two different analysis pipelines to the same data set, we never get more than 70% overlap in the results. That is really an illustration of how many false negatives we have, and it addresses the question of sensitivity. And it doesn't come down to capture, incomplete capture, or a missing gene or something like that, because every time we look in the IGV files we are able to see the mutations there; they're just not caught by one pipeline or the other. We were considering approaches like applying different pipelines and taking the union of all of them. So I was just wondering if you see something similar, and how you handle it?

We haven't seen anything that severe. We have investigated different aligners, and when we tried the BWA aligner versus the DIAC CM one that we've been using for a long time, we saw a number of differences, maybe more like a 90% overlap, so not the low 70% that you have. But we decided it was best just to stay with the system we had been using all along rather than switch and maybe throw things off; we thought what we were actually doing was better than switching to a different aligner. So you were testing the aligners, not the post-alignment calling steps? Because we are working from the same alignments and getting different calls after that. Yeah, that's what we observed. That sounds very challenging, and I don't have experience with the SOLiD data analysis.

I would just follow up. One thing that we did do, which gives you an idea about variability, is the comparison of shotgun whole genome and exome sequencing from the same DNA sample. I don't think anybody talked about that; was that mentioned? We didn't talk about it. Jamie has done some comparisons. Yeah, do you want to just summarize very briefly what you saw on one versus the other? In this particular test, we did whole genome sequencing and then exome sequencing and used the same analysis tools to compare, and that is sort of a different question. In this case the concurrence, the agreement, was extraordinarily high, I think with a discordance of about 1 in 10,000 to 20,000. So applying the same analysis methods to different sequencing methods, in our hands, seemed to give high agreement. However, that's different from the question of different analysis methods applied to the same data. Some of the different methods do use different assumptions and different dials and knobs to tweak different parameters, and that can make a big difference. That's a challenge.
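For readers who want to quantify this kind of overlap themselves, here is a minimal sketch of one way to compare two call sets from the same sample, keyed on chromosome, position, and alleles; the file names and the very simple VCF parsing are illustrative assumptions, not the pipeline described by the panel.

```python
# Minimal sketch: concordance between two variant call sets produced from the
# same sample (e.g. two pipelines, or whole-genome vs. exome calls). Variants
# are keyed on chrom/pos/ref/alt; file names and the naive VCF parsing are
# illustrative assumptions.
def load_calls(vcf_path):
    calls = set()
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip header lines
            chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
            calls.add((chrom, pos, ref, alt))
    return calls

a = load_calls("pipeline_A.vcf")
b = load_calls("pipeline_B.vcf")
shared = a & b
print(f"A only: {len(a - b)}  B only: {len(b - a)}  shared: {len(shared)}")
print(f"overlap (Jaccard): {len(shared) / len(a | b):.1%}")
```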
Max? Max Muenke, NHGRI. Les, I was very intrigued by the examples that you gave for the different inheritance patterns. Of course, it was very instructive to go through them one by one and to think about the families that we have in the lab and what we might want to do. I was even more intrigued by what you said about other colleagues sharing their data showing that, in essence, it was really only about 30 to 40% that led to success. So that's really the bottom line, and I wonder if you could expand on that a little: what were the pitfalls, or what were people guessing? If you don't have a solution, it's hard to guess. But the other question that I have is, what is technology and what is biology? And by biology I mean, what would you expect to be found in exons, and what would you expect to find in non-exonic sequences?

Yes, so I think the ceiling ought to be probably in the 80 to 85% range. Our experience, and you've done as much of this as we have in positional cloning, is that when you have a trait you are certain is Mendelian, you have mapped it to a locus, you have a family, and you have sequenced the genes, you know that 10 to 15% of the variants are not assayable by PCR and the 3100, or by exon interrogation, because they are control elements or deep splicing elements. So the biology pushes out a good number of them. Copy number variants, as I mentioned, are not readily assayable by exome sequencing, although some efforts are being made. And as well, all of the foibles that I listed: not recognizing the genes, not completely targeting them, the capture itself. In that wide distribution Jim showed, did you notice that on the left side it slipped upwards at zero? There is a decent amount of the exome that has zero coverage when you do an exome experiment, and that's an issue. And as well, the graph that Jim showed with the tail-off of alignment for longer indels: a six base pair indel, which we know we've seen in Mendelian disorders, is very difficult to find. So there are plenty of reasons for that. And then I think the other consideration is that some of the families selected for these projects just really weren't good candidates. Maybe they're not as Mendelian as we think they are. Maybe they're teratogenic, for all we know; environmental causes we are not going to find with the GAII.

I think what you were also alluding to is, if you're just going after the exome, how much is going on in the rest of the genome? We're only interrogating 2% of the genome with good coverage, as Les just talked about. But is there something else going on in the intergenic regions? Some diseases could be caused by an insertion of an Alu element into an intron, or a deletion that causes skipping of an exon. We wouldn't necessarily pick that up by exome sequencing. There are various classes like that: if we aren't interrogating it, we won't see it. And you may not see it well with whole genome sequencing either, if it's a recent Alu repeat; you won't get coverage there anyway, because your reads won't align nicely even if you did a whole genome. So there are challenges all around. Exome is a good first step, as I pointed out in my initial statement.

Just one more comment on that: a negative result certainly isn't completely useless. I don't know if Dr. Adams is going to talk about an example later on, but there was a case where no variant was identified, yet careful examination of what was and was not covered suggested follow-up experiments with Sanger sequencing that then did uncover the correct variant. So knowing what you have not interrogated has value in itself too.

My question is about the exome capture kits you mentioned. One of them is the Illumina TruSeq capture, the 62 megabase kit or something like that. Is it possible to capture the DNA using the Illumina 62 megabase kit and then do the next-gen sequencing on a platform other than Illumina, or is it platform specific? I'm not sure about that. I'm sure there could be a way to do it, because it is just DNA probes that are hybridizing; it doesn't necessarily have to have the adapters of the Illumina kit around it to pull that out. Jamie, you might have a sense of that. I'm actually not sure either. As Jim alluded, it should be possible. I believe they make the whole protocol available online, so have a look at that and then contact them, because it should definitely be possible to sequence on any platform, provided your library was prepared appropriately for that platform before the actual capture.
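As a small illustration of the point above about the zero-coverage portion of the exome, and of how knowing what was not covered can guide follow-up Sanger sequencing, here is a minimal sketch that flags poorly covered capture targets; the per-target depth file format and the 10x threshold are illustrative assumptions.

```python
# Minimal sketch: flag exome capture targets with little or no coverage, so that
# regions that were never really interrogated can be considered for follow-up
# (e.g. Sanger sequencing). Assumes a tab-delimited table of per-target mean
# depth (chrom, start, end, mean_depth); the file format and the threshold are
# illustrative assumptions.
MIN_DEPTH = 10.0

def poorly_covered_targets(depth_table):
    with open(depth_table) as fh:
        for line in fh:
            chrom, start, end, mean_depth = line.rstrip("\n").split("\t")
            if float(mean_depth) < MIN_DEPTH:
                yield chrom, int(start), int(end), float(mean_depth)

for chrom, start, end, depth in poorly_covered_targets("per_target_depth.txt"):
    print(f"{chrom}:{start}-{end}\tmean depth {depth:.1f}")
```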
Steve? Steve Marx, NIDDK. Today we're dealing with a very powerful, well-resourced methodology, and it's all from the NIH, so I wonder a little about other centers. First of all, is there redundancy with other centers, for example in China? And second of all, are there areas of major disagreement that we're not hearing about?

I'm sure there are. This technology is so new that a lot of the things we are doing are what my informatics and computational people politely call heuristics; I sometimes call it making things up as I go along, seat of the pants. We're trying lots of different things, and I hope I gave a flavor of that, and I think Yardena's presentation was the same. Some of the things we're trying probably really aren't all that clever, which is probably also an answer to Max's question: the variants are actually there, and our analytic algorithms and our filtering are not allowing us to see them, so we have to work on that. One of the things I've been intrigued with, which we've talked about a little, is the unaligned fraction. We are all focusing here on what aligns to the genome, which we can look at and compare to the reference sequence, but there is a fair amount of sequence generated by these instruments that sits in the computers as the unaligned fraction. One of the potential reasons it's unaligned is that there is a substantial genomic variation preventing that alignment from occurring, and that that variation is what is causing the disease in the patient you're studying.

What about the question of controversy? Are there people who say you are doing it all wrong and it should be done their way? I haven't been hearing a lot of that. I just met with a group from BGI last week, and we were talking about all of the analysis methods that both groups are using. We're both faced with the same challenges and excited about the same possibilities, so nothing came up in that meeting that was any disagreement at that stage. Just to add to that, it's nice when other groups use different methods and find the exact same genes and the exact same mutations; that in itself is a replication and a good control that things are being done right by different groups. I think it's also worth saying that we're not here to represent all approaches and all viewpoints on these questions. What the four of us each described is how we have chosen to solve the problems we have been faced with. These are one or a couple of approaches, and there are more; other people are using different captures, different sequencing instruments, different aligners, different base callers, and different filtering strategies. There's a universe of variables out there, and I'm sure there are some that would work better than ours, but ours are working fairly well, and you are free to take from us what you think would work for you and ignore what doesn't.
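Following up on the unaligned fraction mentioned above, here is a minimal sketch of how one might measure it from a BAM file using pysam; the file name is an illustrative assumption, and this is not the panel's own tooling.

```python
# Minimal sketch: estimate the unaligned fraction of a sequencing run from a BAM
# file, since reads that fail to align may conceal exactly the kind of structural
# variation discussed above. Requires pysam; the file name is an illustrative
# assumption.
import pysam

mapped = unmapped = 0
with pysam.AlignmentFile("sample.bam", "rb") as bam:
    for read in bam.fetch(until_eof=True):  # until_eof=True also yields unmapped reads
        if read.is_secondary or read.is_supplementary:
            continue  # count each read once
        if read.is_unmapped:
            unmapped += 1
        else:
            mapped += 1

total = mapped + unmapped
print(f"unaligned fraction: {unmapped / total:.2%} ({unmapped:,} of {total:,} reads)")
```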
So I have a question for Dr. Biesecker. You presented an interesting research agenda focusing on all those 2,500 diseases, including maybe the functional side of it, but basically getting exomes from families and affecteds, which was an interesting term. You also mentioned an interesting example of what we can learn from hereditary osteoporosis, which has an impact on general osteoporosis. So let's say we eventually work through all of them: once we have used up all the monogenic diseases, what's the next step, or is that the end? What about the polygenic diseases; is there any hope of going after these in a similar fashion?

I think the other whole domain that you didn't include in your question is the somatic question, which Yardena's talk exemplified, so I think we have a huge amount of work in front of us. And I was careful to try to use the phrase "recognized clinical entities," because I don't for a second think that those sources of knowledge represent all clinical entities. As you well know, there was a nice paper published, and I can't remember whether it was exome or whole genome, the Medical College of Wisconsin case, where next-gen sequencing was used as a clinical diagnostic tool because the patient had a very atypical presentation of an autoimmune inflammatory bowel disease that was not clinically recognized by the clinicians. They sequenced it, found a high-penetrance variant in that gene, and then in retrospect they could say, oh yes, this is an unusual presentation of a disease we in fact know and recognize. So I think this technology will diffuse out and be used for all dimensions of analysis, both in the research lab and in the clinic, and I think essentially forever in the cancer somatic sequencing world.

Another application, as Les was alluding to, is in the clinical domain, in cancer for example. You can imagine that these discoveries will be applied at some point in the clinic; drugs are already being given based on these mutations, and resistance develops after a drug is given. To find out what the resistance mechanism is, these technologies could again be used in an unbiased manner, and so this keeps rolling over; it will be applied again and again in the future.

Oh, there's plenty to do. The Mendelian disorders are really just the low-hanging fruit of the heritability of human disease, and we and our colleagues internationally are focusing on Mendelian diseases because it is a more tractable problem mathematically and statistically for analyses of these data sets. It is only a matter of time before this starts crossing over into the multifactorial, polygenic world, but that will need much larger control sets and case-control type study designs to extract that kind of information. And as you are all well aware, for those of you who take care of Mendelian disorders, determining the primary genetic variant that causes a Mendelian trait is far from explaining all of the variation of that disorder in an individual. There are modifiers that affect those traits, and those data are potentially extractable from exome and genome analysis as well. We have plenty of work to do; I'm not worried about us running out of projects, at least in my lifetime, so I think it will be fun. Yeah, it's a start, exactly.

Hi, this is Rinki from NEI, and I have a question about the exome data you presented. You said there is a need for a different variant caller because of hemizygous status. I did one XLRP family, and what I'm seeing is a lot of heterozygous variants; I sequenced three affected males, and there are very few of the homozygous calls that I would expect, and I was using the standard variant caller you would use for autosomes. I was not even aware of this until I heard it from you, so can you elaborate on that, and what variant caller should I be using?

So it's very important to know what part of the genome you're analyzing, and if it is a male sample and you're looking at the X or the Y chromosome, you do need to handle those carefully.
If you have the wrong information somewhere in your pipeline, say it thinks it's a female sample when it's really a male, and you call heterozygotes on X, you're going to generate false positives, because the model is going to try to find something that is potentially heterozygous in the data and will call things that are clearly wrong, since those regions should be homozygous unless something else strange is going on in the genome, which can happen. The caller that we're using is the MPG package; it's available online through NHGRI's research web pages, so you can download the program. It's called bam2mpg, and it has a flag you can set to say, I want to call this region as a single-copy, hemizygous region, and you can give the coordinates. So if it's a male sample, you can just give the non-pseudoautosomal regions on X and Y.

And also, the capture on the X chromosome was lower compared to the other autosomes. Do you see that also in your samples? My average capture was around 90 to 92%, but on the X chromosome, in the region where I was looking for my gene of interest, I saw at least 30 to 40% of genes not being covered. Could it be because of the X chromosome or something like that? Was this the Agilent X chromosome kit? No, I used the whole exome kit, because I was doing the X chromosome as well as some autosomes, so I just ordered one kit. Oh, you did the whole exome? Yes. And only looked at the X? Yes. For one family? Yes. Maybe it was just the depth of sequence. Another possibility is that if you analyze the data with a heterozygous caller, the X chromosome will only be covered half as much, because in a male there is only one copy of X for every two copies of an autosome. So you're going to have lower coverage, and you won't have enough coverage to call the heterozygotes that you don't want to look at anyway. No, but I was looking in a genome browser, like IGV, at the capture and the depth of coverage, and I saw a lot of exons, because I went exon by exon; this was already a positionally mapped family, so I knew where I was expecting my mutation to be, and when I looked gene by gene I saw less coverage compared to the autosomes. Well, you only have one copy of X for every two of the autosomes; that could be the answer. No, I should be trying that thing you mentioned. Thank you.
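To illustrate the kind of check being described, here is a minimal sketch that flags heterozygous calls on the non-pseudoautosomal X in a male sample, which are likely false positives. This is not bam2mpg itself; the PAR coordinates are the commonly cited GRCh37 values and should be verified for your reference build, and the VCF layout assumed is a standard single-sample file.

```python
# Minimal sketch: in a male sample, heterozygous calls on the non-pseudoautosomal
# X are suspect and likely false positives, as discussed above. This is a post-hoc
# check, not the bam2mpg flag mentioned by the panel. PAR coordinates below are
# commonly cited GRCh37 values (verify for your build); the VCF parsing assumes a
# standard single-sample file with GT as the first FORMAT field.
PAR_X = [(60001, 2699520), (154931044, 155260560)]  # assumed GRCh37 PAR1 / PAR2

def suspicious_het_calls(vcf_path):
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, pos = fields[0], int(fields[1])
            genotype = fields[9].split(":")[0]  # first sample's GT
            on_x = chrom in ("X", "chrX")
            in_par = any(start <= pos <= end for start, end in PAR_X)
            if on_x and not in_par and genotype in ("0/1", "0|1", "1|0"):
                yield chrom, pos, genotype

for chrom, pos, gt in suspicious_het_calls("male_sample.vcf"):
    print(f"heterozygous call on non-PAR X: {chrom}:{pos} GT={gt}")
```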
I think another thing this points out, since we're at early days, is that the X in males is not the only part of the genome where people have a single copy. We know there is copy number variation throughout the genome, and one thing we all want to be able to do is recognize a point in the genome where we are deleted for one allele and have a single copy of the other. If we could recognize those and implement single-allele base calling when they're in trans with a deletion, that would add to our power, and those are things we're just not doing as well yet. And the converse: when you have three copies of an allele, what is your caller going to do? It's not going to call those bases right, and you're going to make mistakes. So that's a refinement that needs to be implemented as things move forward. And I'm sure Yardena is quite interested in this: if you could just figure out the copy number across your tumor sample, then you could correctly call each location, and that's something we would like to get to at some point, but it's a more involved way to process tumor samples.

I'm sorry, I may not have fully understood the previous conversation, but I have a more specific question about my project. I was looking at a region of homozygosity where I was expecting all homozygous variants, and I have found quite a few heterozygous variants. I was thinking it could simply be false positives, or platform error, or something like that, but is it possible that there is actually a copy number variation, and that's why I am finding those heterozygous variants? Because I am not using... So if it's homozygous, theoretically all the variants in that region... How is it homozygous? Because I have a well-defined linkage region for a recessive... So then it's identity by descent in that region. You could be using a heterozygous caller there, I mean a caller for diploid samples, but if you know it's really identity by descent, the only things that will be real are somatic changes or very recent changes between the two copies. So again, you should see a huge reduction in heterozygosity; did you see that in that region? Well, I am actually seeing more heterozygosity than expected. We should talk about this afterwards, yeah.

If there are no other urgent questions, I think we should break for lunch. We look forward to having people come back at the start time, and we also welcome and encourage the web viewers to rejoin us at that time. Thank you all very much.