So, today I'm going to talk to you about a more lab-oriented project here, where we're trying to do mass validation of variants identified in whole-genome or exome sequencing, hopefully in a single lane of sequencing, and possibly even indexing the samples so you can do multiple pairs in one lane. So just to clarify, by validation I mean confirming that the mutation is really present in the sample, as well as, once you know that it is, asking whether it is present in a larger population of clinical samples, which are often only available in FFPE form, which presents its own challenges. So, as I said, most genomic projects start with a large number of potential variants that you want to confirm, and then you proceed through multiple stages where you narrow down your list; ultimately you want to be going into your clinical population, and in the end you possibly end up with a clinical diagnostic. So the two methods: the first, which we termed OS-Seq, is applicable in the early stages of the process, mostly on flash-frozen, high-quality DNA, and the second method, single-strand circularization, is another targeting method, and it has the advantage of being applicable to DNA that is not in double-stranded form and may be partially degraded. So in the first method, instead of doing the capture in solution, then manipulating the captured material, adding adapters and whatnot, then creating the sequencing library and going on to the flow cell of an Illumina sequencer, what we do is modify the lawn of the flow cell.
And what you see here is, okay, the first step: here is the flow cell of an Illumina sequencer, this is the lawn, and we float in a population of capture probes. The green part is a 40-base sequence that's homologous to genomic sequence — there could be hundreds, or we're using thousands or even tens of thousands of these green ones — and there's a portion that's common, which pairs to the lawn. First step, we extend from this position, and now you've modified the lawn, because you float away that template and you have a flow cell that contains, sticking up there, thousands of different capture probes. Second step, we now float in genomic DNA that has been ligated to an adapter — the second adapter, which is going to bring the second sequencing primer for the Illumina sequencing — and you do another extension. So the black portion is the genomic DNA; you do an extension, float away the template again, and now you've got your genomic DNA between the appropriate adapters that are suitable for bridge PCR, and then you can do a read one, read two — you can do paired sequencing. When you do that, read two always reads the capture sequence, so that's actually useful for binning the mate-pair reads and doing, for example, assemblies afterwards, but the read ones are staggered like this: they basically start wherever the breakpoint was on your fragmented DNA in the beginning, and you capture at a very high depth a region that's 500 to 1,000 bases downstream from your capture probe. You also capture it in a perfectly strand-specific way, which can be useful for certain applications such as structural variant validation.
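Since read two always reads the capture-probe sequence, the binning step described above can be sketched in a few lines. This is a hedged illustration, not the lab's actual pipeline: the probe set, read pairs, and exact prefix matching are simplifications (real data would come from FASTQ files and tolerate sequencing errors with near-perfect matching).

```python
# Bin paired-end reads by which capture probe appears in read 2.
# Probe sequences and reads below are toy data for illustration only.

def bin_by_probe(read_pairs, probes):
    """Group (read1, read2) pairs by the 40-base probe that read 2 matches."""
    bins = {name: [] for name in probes}
    unassigned = []
    for r1, r2 in read_pairs:
        for name, probe_seq in probes.items():
            if r2.startswith(probe_seq):
                bins[name].append(r1)  # read 1 carries the staggered genomic end
                break
        else:
            unassigned.append((r1, r2))
    return bins, unassigned

probes = {"KRAS_probe_1": "ACGT" * 10}  # toy 40-mer
pairs = [("TTTTGGGG", "ACGT" * 10 + "CC"), ("AAAACCCC", "GGGGGGGG")]
bins, rest = bin_by_probe(pairs, probes)
```

Reads whose read two matches no probe would go to an "unassigned" pile for inspection.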
Now if you want double-stranded coverage, which of course you do, you can place a capture probe upstream and downstream of the position that you're targeting, and then you get the reads here: the purple distribution is for one of the probes, the blue is for the other, and the sum of the two is here. So this is what it looks like; this example is the KRAS gene. This is an exome capture — you see where the exons are, down there, you see high depth at the exons, and you see a little bit of unspecific capture, which can happen when you capture one strand and another strand pairs to it. In our method here, it's totally clean, so you don't have any unspecific capture and you have a very high depth on the exon, and even here you can immediately see the heterozygote in this IGV plot. On a much bigger exon, such as APC exon 15 there — this is exome capture up there — you get plenty of data, but sometimes the depth drops; there are even regions where it drops almost to zero, while here in our OS-Seq method the depth never drops under 100, and here, of course, we have to put in plenty of capture probes because this exon is so big. Uniformity of capture: this blue line was from the first experiments we were doing, where we used column-synthesized oligos to make the capture probes, and that's the highest quality — the curve is very flat. We then started doing microarray versions of the capture probes; initially they were not as even, and now we've improved it, which is the green. So in the blue we have 400 capture probes, but in the green, because we synthesized them on a microarray, we now have 20,000 of them, and they're almost as good as column-synthesized, and of course much cheaper.
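The uniformity-of-capture curves just described can be computed simply: for each depth cutoff, take the fraction of capture probes that reached at least that depth. A minimal sketch, with made-up per-probe depths standing in for real data:

```python
# Sketch of the "uniformity of capture" curve: for each depth cutoff,
# the fraction of capture probes covered at or above that depth.
# The per-probe depths below are hypothetical numbers for illustration.

def evenness_curve(depths, cutoffs):
    n = len(depths)
    return {c: sum(d >= c for d in depths) / n for c in cutoffs}

per_probe_depth = [500, 320, 150, 90, 12, 0]   # toy data
curve = evenness_curve(per_probe_depth, [10, 100])
# curve[10] is the fraction of probes at depth >= 10
```

A flatter curve (fraction staying near 1 as the cutoff rises) corresponds to the more even, column-synthesized probe sets.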
So this method, which basically works on fresh DNA from flash-frozen tissue, has quite a few advantages. For us in the lab it's a very efficient workflow, a lot easier than exome capture; it has low sample requirements, so we can start with sub-microgram amounts of DNA; and we have high sensitivity and specificity because we basically have very high depth on the regions we are targeting. One application obviously is validation, but you could also use it for discovery if you have a long list of candidate genes and you want to do mutation discovery in there. So the second method, which as I said has the advantage that it can start with DNA that's not double-stranded — so single-stranded DNA, or FFPE material, which is mostly single-stranded — we mix this, now in solution, with a population of capture probes. The colored boxes here are 20-base regions that are homologous to the ends of the amplicon you're trying to target. These capture probes mediate a circularization event, and because this is random DNA you're going to have a tail on either side; in the mix of enzymes that we put in this circularization reaction, we have two enzymes that can degrade both of these tails, 5' to 3' and 3' to 5', and a ligase in there closes the circle. So we end up with a population of circles — hundreds or thousands of circles — that can be re-amplified with a single pair of primers, which is this black part here. For a pilot demonstration that this was working, we picked 628 exons from a previous experiment, which totals 123 kb worth of DNA, and we looked at matched samples of normal FFPE tissue versus fresh — so no tumor here. Theoretically we should get the exact same result.
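The probe design implied above — two 20-base arms complementary to the ends of the target amplicon, joined by a common backbone carrying the universal PCR primer sites — can be sketched as below. The arm orientation and the placeholder backbone are assumptions for illustration, not the published design.

```python
# Hedged sketch of a circularization capture probe: two 20-base arms
# complementary to the target's ends, joined by a common backbone.
# "NNNN" stands in for the real backbone (universal primer sites).

COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse-complement a DNA sequence."""
    return seq.translate(COMP)[::-1]

def design_probe(target, backbone="NNNN"):
    left_arm = revcomp(target[:20])    # pairs with the target's 5' end
    right_arm = revcomp(target[-20:])  # pairs with the target's 3' end
    return right_arm + backbone + left_arm

target = "A" * 20 + "CGCG" + "T" * 20  # toy target region
probe = design_probe(target)
```

Because every probe shares the same backbone, the whole population of circles can be re-amplified with one primer pair, as described in the talk.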
We're doing this because we want to see, number one, what's our efficiency of capture from FFPE versus fresh DNA, and number two, are we seeing false positive calls in the FFPE material, because it's known that FFPE can have DNA damage that would result in false positive calls. So we're going to look at yield as well as specificity of detection. Here is the same sort of evenness plot you saw before. The method is not quite as even as the previous one, but of course because it works on FFPE it's of high interest. You see that even on the fresh DNA, which is the blue, there's about 5 to 10% of the regions that are not captured at a depth of 10, which is our minimum for genotyping — but this is per probe. Obviously we can use more than one probe per region, and then that would go higher. The important thing here is that the red curve is quite close to the blue; it drops a little earlier. So there's maybe 5%, on top of the 5 to 10%, of the regions that are captured from the fresh that are not captured in the FFPE. Those are going to be false negatives, obviously. Now for the specificity: so this is the same DNA from the fresh and the FFPE, right? If I plot the percent non-reference base in the FFPE DNA versus the fresh, you get a bunch of positions here where you have a high variant fraction in both. Those are the true heterozygotes. For the vast majority of the points right here, thousands on top of each other, there's no variant; it's reference. There are a few positions here colored black, as opposed to the other ones, because they have a higher percent variant, but they typically have a very high strand bias. So those don't result in false positive calls. And there's a handful here where on both strands we see a variant base, and only in the FFPE and not in the fresh. Those would result in false positive calls.
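The strand-bias filter described above can be sketched as follows: a non-reference signal seen on only one strand is flagged as likely FFPE damage and not called, while a real variant must show up on both strands. The counts and the 10% threshold are toy assumptions, not the talk's actual calling parameters.

```python
# Sketch of the FFPE artifact filter: require the variant base on both
# strands before calling. Counts and the min_frac cutoff are made up.

def classify(fwd_ref, fwd_alt, rev_ref, rev_alt, min_frac=0.1):
    f = fwd_alt / (fwd_ref + fwd_alt) if fwd_ref + fwd_alt else 0.0
    r = rev_alt / (rev_ref + rev_alt) if rev_ref + rev_alt else 0.0
    if f >= min_frac and r >= min_frac:
        return "variant"        # non-reference base on both strands
    if f >= min_frac or r >= min_frac:
        return "strand_biased"  # likely DNA damage; not called
    return "reference"

calls = [
    classify(50, 50, 55, 45),   # true heterozygote
    classify(60, 40, 100, 0),   # one-strand signal: filtered
    classify(99, 1, 98, 2),     # reference position
]
```

Positions passing this filter in FFPE but not in matched fresh DNA are the residual false positives discussed next.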
But it's not very high: we find it's about 1 per 10 to 15 kb. If you were sequencing genes, that would be an error per 5 to 10 genes. That's reasonable. And the sensitivity is that about 85% of the heterozygotes we detect in fresh DNA, we also see in FFPE. Again, that's per probe, right? So when we put multiple probes per position, that number goes up. The classes of artifacts observed: there is a portion that's transitions, probably due to deamination, G to A and C to T, and there are some transversions. But if you look at it, basically we have a consensus: all we see is G or C going to A or T, and not at a very high rate, about 1 per 10 kb. So we applied this to a whole-genome project we had going in the lab. It's a gastric genome; we sequenced the entire genome of the normal, the tumor, and the metastasis, and we did whole genome as well as exome. Out of this analysis there were 386 variants, including SNPs and indels as well as structural variants. Most of them are coding, but some are outside and quite far from genes, because they're breakpoints of structural variants. So we devised a pool of capture probes to apply one method as well as the other. From the fresh-frozen tissue, we're going to do OS-Seq capture and sequence it on the GAII or HiSeq, and we're going to confirm from FFPE material — we have both fresh and FFPE material for the metastasis — that the conclusion is the same, applying the other method. And actually, we're sequencing that material on the MiSeq, these new sequencers from Illumina that cost a fraction of the other ones; also, the runtime, instead of a week, is more like a day, so that's very useful for this type of application. So the whole thing sort of makes sense. And it basically works: we could validate almost all those positions, and we got the same results from both.
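The two error-rate figures quoted above are consistent under a simple assumption. Taking one false positive per 10-15 kb and a typical coding sequence of roughly 1.5 kb per gene (an assumed average, not from the talk), you land at about one error per 8 genes, in line with the quoted 5-10:

```python
# Back-of-the-envelope check of the quoted rates. The 1.5 kb average
# coding length per gene is an assumption for illustration.

fp_per_kb = 1 / 12.5          # midpoint of "1 per 10-15 kb"
coding_kb_per_gene = 1.5      # assumed average CDS length
genes_per_error = 1 / (fp_per_kb * coding_kb_per_gene)  # ~8.3 genes
```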
And so here I'm only showing you an IGV plot of a particular exon where there was a two-base deletion in the metastasis that was not present in the tumor, not present in the normal, and I'm blowing it up here. So you see it here. Of course, you only see a few reads in this view, but if you look at the coverage, it's close to 1,000 in all three. So here we have very high confidence that we have this deletion in the metastasis that's not present in the tumor. Because of the high depth, obviously, even if the tumor is not pure — which is the case in our case — even if only 20 or 30% of the cells are tumor, you can still detect the variant here; you have high sensitivity. Now we go to confirming in the FFPE material of the metastasis. And you see it here. So this is the ovarian metastasis, the FFPE, versus the normal that's fresh frozen, and the tumor, and the deletion is here. The reason this profile looks so different from before is that here we have a population of amplicons, and because the MiSeq can do 150-base sequencing per end, we can sequence the entire amplicon on both strands, coming from either way. So this is end sequencing; that's why all the breakpoints stack up on top of each other. Another thing I should mention is that this reaction was 4-plex on the MiSeq, and we got way more data than we needed, so I think we could be doing 16-plex easily in a single run. We've made the design of our oligos public on this website, oligogenome.stanford.edu, and it was published recently in Nucleic Acids Research. In conclusion, I would say that we're moving towards trying to do validation of whole-genome data in a single lane of sequencing, and we are offering two different methods. One is mostly applicable early in the process, on high-quality DNA, and is the most even and gives the highest yield; the other method is really good for follow-up studies in clinical samples.
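The purity argument above reduces to simple arithmetic: a heterozygous somatic variant in a 20%-pure tumor sits at about 10% allele fraction, so at a depth near 1,000 you expect on the order of 100 supporting reads. The sketch below is that expectation only, not the talk's actual calling model:

```python
# Why ~1000x depth gives high sensitivity despite impure tumors:
# expected number of variant-supporting reads at a given depth,
# tumor purity, and zygosity. A simple expectation, for illustration.

def expected_variant_reads(depth, purity, het=True):
    allele_fraction = purity * (0.5 if het else 1.0)
    return depth * allele_fraction

reads = expected_variant_reads(1000, 0.2)  # roughly 100 supporting reads
```

Even at 20% purity, ~100 expected variant reads out of 1,000 is far above any sensible calling threshold, which is why the deletion is detectable in all three samples.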
Just acknowledgments here. Jason Buenrostro and Sam Myllykangas are the ones who developed the OS-Seq method, and Hua Shu developed the single-strand circularization method. And our funding: NCI, NHGRI, the Doris Duke Foundation, and Howard Hughes. Thank you very much. Questions? I had one, and it relates to the requirement for high-quality DNA for OS-Seq. Is that a function of the length? No, it's because we basically have to add the second adapter. The constant portion of what we flow into the capture contains one of the adapters and one of the sequencing primers of the standard Illumina setup, and the other one is attached by ligation — you know, A-tailing and ligation — to the DNA. And so the DNA needs to be double-stranded. If it's not, then you've got to start repairing, start doing A-tailing; you introduce biases and all of that. I mean, it kind of works, but it doesn't work as well, because you have to turn the DNA into completely double-stranded form, and that's where all the biases come up. Whereas with the other method, you have a strand-specific capture and you never have to turn it into double-stranded form. One over there, Matthew. Great talk. I'm curious about the implications of your OS-Seq capture method for alignment to the genome, and the advantages and disadvantages of starting with a defined sequence at one end. As you say, there are advantages and disadvantages. One advantage is that you can bin by perfect or near-perfect matches of the read-two capture sequence and then do assemblies, which would be nice if you suspect a structural rearrangement downstream of that. One complication, for example, is that you don't have the mate-pair information. And for de-duplication, now one end is fixed and only the other one is variable, so you are more prone to bottlenecking artifacts, which need to be solved by other methods.
And we have some ideas of sort of random tagging to basically weed out the PCR duplicates. But that would be one disadvantage, yeah. And the last question over here. Are you considering circularization also for RNA, for cDNA analysis? Well, there are all sorts of ideas that came to mind listening to the previous talks. There are plenty of applications for this, right? So I'm presenting it as: we can target hundreds or thousands of regions out of a complex mixture. But maybe we could apply it to, say, cDNA material — do it on RNA and then look at alternative splicing, for example. You're right. Okay, thank you very much.
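As a footnote to the de-duplication answer above: the "random tagging" idea — attaching a short random tag (a unique molecular identifier) before amplification, then collapsing reads that share the same tag and capture probe — could be sketched as below. The record format is hypothetical, and real UMI handling would also tolerate sequencing errors in the tag.

```python
# Sketch of UMI-style de-duplication for reads with one fixed end:
# reads sharing the same (capture probe, random tag) pair are treated
# as PCR duplicates and collapsed to one. Toy record format.

def dedup(records):
    """records: iterable of (probe_id, umi, read_seq); keep one per (probe, umi)."""
    seen = set()
    unique = []
    for probe, umi, seq in records:
        key = (probe, umi)
        if key not in seen:
            seen.add(key)
            unique.append((probe, umi, seq))
    return unique

recs = [("p1", "AACG", "r1"), ("p1", "AACG", "r1_dup"), ("p1", "GGTA", "r2")]
unique = dedup(recs)  # the duplicate of r1 is collapsed
```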