All right, so my job this morning is just to give a broad, sweeping overview, which means I have lots of slides, so we'll go quickly. These are the mandatory slides. You want to say anything to these? All right, mandatory slides going by fast. My disclaimer: I'll be talking about lots of companies. I don't have any equity in them, so if I say good things about them, I mean it; there's no financial gain to me. All right. So genomics has been around for a long time, and it's been fueled largely by sequencing. Over the last 20 years there was an evolution, starting up here with the good old days of running your gels radioactively and then scoring them by hand, as you see the person doing there. Then some gel readers came out that would read your gel for you; they never really worked. But then along came automated fluorescent sequencing. These were still gel-based, but the data were collected for you. You had to track the gels, and when we got up to 96 lanes, that was actually an arduous task. And then the big development was capillary sequencers, which was the real milestone, where each capillary is a single lane of sequence. So not only did throughput go up, but it took away the need for tracking the gels, et cetera. That's how the human genome was done: we were able to pack rooms full of these things, and you could crank out enough data to do the human genome. But what we're really talking about here, and the reason for these courses, is that since 2005, so the last six and a half, seven years, there's been a huge revolution in sequencing, with many new sequencers coming onto the market. And as you'll see, and as you know, the amount of data being produced is just staggering. So some quick advantages of next-gen: I just said that the amount of data that can be produced off one machine is just ridiculous, as you'll see, and you're going to have to learn how to deal with that. But there's no subcloning, so libraries of fragments for doing your reads on are produced in bulk. For the human genome, we had to actually prep every single clone we sequenced; that was a very, very rate-limiting step. As you'll see, you can readily adapt it to many, many different applications. And there's also been a huge reduction in cost, which opened up the possibility of doing many more experiments. So the first one on the market came out in 2005. This was the 454 from Roche, and this is the paper describing it. I'm going to go through a little bit of the technologies, just so everyone's up to speed; there's a wide range of expertise in the room, so we'll start basic and pick up pace as we go. Almost all of them, except perhaps some of the newer ones, start by taking the genome you want to sequence, fragmenting it, and then putting on adapters that are specific to the type of platform you're using. Then, in the 454's case, you have to amplify the signal. So they use a bead and an emulsion PCR: droplets in an oil-based emulsion, many, many droplets, millions of little droplets. In the ideal world, each droplet has a bead, which has oligos on it complementary to these adapters, and by PCR within that droplet you essentially coat the entire bead with copies of that fragment. In the 454, these beads are then dropped onto what they call the PicoTiterPlate, which is a bundle of fused glass fibers etched with acid to make these little wells.
So the beads go in there. You pack in the enzymes you need, with more beads, just by centrifuging, and this packs it in there and keeps the beads in place. The detection is through a luciferase cascade, and some light is released. This is an electron micrograph of what it looks like. The kind of data that came out was very new to us. We were used to looking at sequence traces; this came out in what we called flowgrams. We called them that because each base is flowed one at a time, unlike the Sanger sequencing we'd been doing, where all four bases are present. So each one comes one at a time, and if there are multiple bases, you get more signal. In this case, you can see roughly this is the signal for one base, and this is many Ts in a row, so you get a larger signal. The problem is it wasn't quite linear, and we'll talk more about that in a minute. The next one on the market was Solexa, which got bought by Illumina; a good deal for them, as it turned out. They got rid of the beads, and what they do is amplify on a solid surface. So here's the same idea: you've got your little fragment, and it all goes onto a slide surface, and then through a bridging PCR, back and forth, you build little clusters of the same molecule copied many, many times. Then that's sequenced by extension. You add all four bases at once, but they have a block on them, so only one base is added at a time, and then it's fluorescently imaged, and you can just follow the sequence. The output here was really in bases, or signal intensities, but it's very similar, so we could readily grasp this compared to the older types of data we were used to. This is the inside of one of the old instruments; you can tell it's Illumina because it's blue, but this is a very early instrument. And you can see it's really a microscope slide; I have some visual aids, it's really a microscope slide here. And this, as you can see, is just a microscope: this is an objective, robotics to move it around, and this is liquid handling to put the reagents in. I'll pass some of these around if you've never seen one. This is one of the older-style Illumina flow cells: just a microscope slide with eight little channels on it and a couple of holes at each end to pump the reagents through. I'm not going to talk very much, and we won't talk very much at all, about the SOLiD. The SOLiD came out around 2007 as well. It actually worked; we had some of them. It was completely different, though. It still used emulsion PCR and covered the bead with multiple copies to boost the signal, but the sequencing used a ligase rather than a polymerase. I won't go through the whole process, since we're not going to analyze that data. It had some advantages and disadvantages, and it turned out it couldn't keep up in throughput. They came out with the 5500, which is a newer version; this is a flow cell from that, and you can see it's much larger, a similar thing to that slide that's going around. It still produces maybe 90 gigs of data on a good day, which is nothing compared to the HiSeqs, which we'll be talking about and where most data comes from today, which produce around 700 gigs in a run. So I'll talk a little bit about what some people call G3, third generation; some people just call them next-next, or just next. Next-generation sequencing covers everything.
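As a minimal sketch of that flow-space base calling, for the 454 and, as we'll see, the Ion Torrent too: the instrument flows one base at a time and the signal is roughly proportional to the homopolymer length, so calling bases amounts to rounding a noisy, not-quite-linear signal to an integer, which is exactly where the homopolymer errors come from. The flow order and signal values below are made up for illustration.

```python
# Toy flow-space base caller: one base is "flowed" at a time and the signal is
# roughly proportional to the homopolymer length at that position. Rounding a
# noisy, not-quite-linear signal is exactly where the miscalls creep in.

FLOW_ORDER = "TACG"  # repeated cyclically; real instruments use their own order

def call_bases(flow_signals):
    """flow_signals: list of normalized intensities, one per flow."""
    seq = []
    for i, signal in enumerate(flow_signals):
        base = FLOW_ORDER[i % len(FLOW_ORDER)]
        n = int(round(signal))   # estimated homopolymer length
        seq.append(base * n)     # zero signal means that base is not present
    return "".join(seq)

# A clean 1-mer reads fine, but a 7-mer whose signal compresses to ~6.4
# gets called as six Ts -- the classic homopolymer error.
print(call_bases([1.05, 0.10, 0.97, 2.1]))  # -> "TCGG"
print(call_bases([6.40, 0.90, 0.05, 1.1]))  # -> "TTTTTTAG"
```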
Many of these third-generation machines are coming out as single-molecule sequencers; not all, but many of them. There's easier sample prep, as we'll talk about, especially for one that's coming out, less sequence bias, and longer reads, perhaps. But one of the things you'll see is that they have a much higher error rate; I'll be talking about that towards the end. One of the first to come out, I'm not sure what year this was, 2008 I think, was the Helicos BioSciences HeliScope. It weighed a ton, a big monster, and it had a compute cluster as well that wasn't in the box they showed; that was actually a second 2,000-pound unit. It was really the first single-molecule sequencer, so hats off to them for that. It had 50 lanes and could do about one to two million reads in each, but they were very short, 25 bases. It was the same idea where the slide is covered with oligos: you anneal your DNA on and then you extend. It had an error rate that at the time was horrendous, but you'll see some of the ones now are even worse. It just didn't produce enough data, really, and was hard to work with. I think some people still have them; the company still exists and you can still buy reagents, but it's not in great use. Along came, in 2008 as well, maybe 2009, the PacBio. This is single-molecule real-time sequencing, so another single-molecule machine; we have one of these. The way this works is you have a thin membrane here with little tiny holes in it, on a glass surface. These are seven zeptoliters in volume; a zeptoliter is 10 to the minus 21 liters. The way it was described to me by the inventor is that if you look at your microwave oven, on the front there's that little panel with all the holes in it so you can see your food without cooking your face while you're watching it, which is a good thing. That's because the microwaves are too large to fit through those holes. Same thing here: the hole is so small that the laser light doesn't really penetrate through it; it only lights up this bottom part here. And at the bottom of this well is a polymerase. So it's a really cool machine, because what you're doing is actually watching a polymerase incorporate nucleotides while it makes DNA; it's really a cool device. What's happening in the mixture is you've got all the nucleotides, labeled with the fluors on the phosphate, floating around in this space, going in and out of the area that's being lit up by the laser, and that's what this background chatter is here. When one's incorporated, when the polymerase finds the one that matches and incorporates it, that takes on the order of milliseconds, and so there's a signal increase while that nucleotide hangs around. When it incorporates, it cleaves off the phosphates; the labeled diphosphate drifts away, the signal goes back to baseline, and then the next one's incorporated. This is a cartoon; this isn't what the data looked like, this is what they would hope it would look like. But it's really kind of cool, and you can look at around 70,000 polymerases making DNA. We'll talk more about that one later. Then the Ion Torrent came out; this is another method, so we're going through the different methods of detection here. It's been purchased by Life Technologies now; Life Tech also owns the SOLiD. Reads are right now 100 base pairs or so, and they believe they can get up to 400 base pairs. Short run time.
That was one of the new things in the sequencing world: instead of a 10-day run or a five-day run (actually the 454 is fairly fast, but not much data), it's a three-hour run time. The chips come in different flavors. These companies are being a little more conservative with their specs these days: they spec this one at 10 megs, and we usually get about 20 on ours; the one spec'd at 100, we get 200. So they're spec'ing a little low. The 318 we don't have much experience with; they say a gig, maybe we'll get two. But how does it work? It's a solid-state sequencer; this is a close-up of the chip. They keep saying they're leveraging the semiconductor industry, so they should be able to make them cheap, although they don't seem to lower their price. But anyway, this is what it looks like, and you'll see it coming around. There's a port where you can inject your little beads, this again uses little beads, and those beads fall into these little holes. Each one of these holes, just think of it as a pH meter; so this is a whole array of a million pH meters, and this is just the circuitry underneath. I always draw these cartoons as if it's single-molecule, but it's not; it is amplified on a bead. The way it works is that when a nucleotide is incorporated, a hydrogen ion is released, so what it's measuring is the change in pH as the hydrogen ion is released. This is another instrument where they flow one base at a time; that's how they know what's being incorporated. So they flow T, and if you see signal at that pH meter, then a T was incorporated. And again, if you flow T, there's nothing to stop more than one T coming in. Here, in this case, two Ts came in, and they show the signal being nice and linear. But it's not really that linear, and this is the problem. This is some data from an Ion Torrent, and you can see that in regions where there are homopolymer repeats, it has trouble deciding how many there are. This is its biggest weakness: homopolymer repeats. For straight substitutions it actually does very, very well; in fact, it's slightly better than the MiSeq, which I'll talk about in a second. So the MiSeq is the baby HiSeq, and the HiSeq is the big Illumina machine that does 700 gigs; the MiSeq is a smaller version. Again, they spec it around one gig; we get around two. It's got an upgrade coming later in the year that should take it up to around seven gigs, and maybe we'll get even more than that. The nice thing about it is the fast turnaround. It does on-board cluster generation: to make those clusters for the HiSeq or the Illumina Genome Analyzer that I showed you, there's a separate instrument where you put the slide in and flow the DNA in to make the little clusters, whereas this does it right on board. It also has on-board analysis, which we don't use, but if you want, it's sort of a complete sequencing setup in a box: you put your reagents in, you still have to make your library, you put your library in, it makes the clusters, it runs, and it'll do the analysis, in about 30 hours. It depends on your read length, but that's for a full-length run, and it does do a long read length, 150 base pairs. The HiSeqs we typically run at about two by 100; you can run them a little longer, but the MiSeq does two by 150 quite well. And just to show you some of the data: this is its big brother, the HiSeq, which is sort of the industry standard. We got our first one in September of 2011.
And to test it out, because whenever we get an instrument we always have to kick the tires on it, we took a library from an exome sequencing project, where we had captured the exome using the Agilent SureSelect system, which is about 50 megs of target. We'd done it on a HiSeq with 101-base reads. These are just some of the metrics. We got lots of reads, and the nice thing was the reads per start point. If your library's good, you're randomly sampling the genome and hopefully not saturating it, so for every point in the genome you should have a read starting, but you don't want them piling up; you don't want 50 reads starting at the same point. 825x coverage. And on that 50-meg target, 8x coverage, so more than eight reads at any one point, was 93% of it, which is not bad. So we took that same library, since it's the same chemistry, and sequenced it on the MiSeq, just to show you the MiSeq data. The first thing you can look at is just the insert size. It's the same library, so you expect that to be the same, and this is showing that the on-board cluster generation doesn't have any bias toward small fragments. So that was good. We actually got more reads mapping; the green part here is what was on target for the exome we captured. We got more, and that's just because they're longer reads. Indels by cycle: I showed you the data on the Ion Torrent and said that homopolymer repeats were its weakness, and those get indicated as indels. You can see there are very few indels per cycle on the MiSeq, so it does better, especially on homopolymer repeats, than the Torrent. And the quality, this is the interesting part. This is the Q30 value, so that's one error in a thousand. You can see that on the HiSeq, as the cycles progress, the quality starts going down. That's just the nature of the sequencing: you've got this cluster of molecules being extended, and some of them start getting out of phase; some don't get extended, and some actually get more than one base extended on them. As they get out of phase, they create noise, and so the quality goes down the longer the read. You can see it starts dropping off around cycle 90. And this is the same library sequenced on a MiSeq, and you can see that even at 150 bases it's above Q30 and only starts falling off on the second read, which is the opposite strand. So the quality is quite good. On the horizon is the Ion Proton; if you've got lots of money, you can get early access to one now. It's the big brother of the giant pH meter. They're promising a short run time. The first version that comes out, maybe 12 gigs of data; again, they're usually conservative, so maybe we'll get more. We haven't actually ordered one of these. And then by the end of the year, I think they're saying they'll have another version, another chip, from which you'll get 90 to 100 gigs. These guys said in February that by the end of the year you'll be able to do the $1,000 genome that everyone's talking about. I'm not sure; we'll see. Also, as soon as that was announced, Illumina must have had a press release ready; they keep firing shots back and forth, right? They announced the HiSeq 2500, which is an upgrade to the HiSeq 2000, or you can just buy it as a new instrument. It has an interesting characteristic in that you can run it in different modes.
You can run it in the current high-throughput mode, which is 100-base reads, 600 gigs; you should get around 700 gigs. That takes 10 to 11 days to run. But you can also run it in a rapid mode, more like the MiSeq, where you can do the two by 150 in around 27 to 40 hours, and you do the cluster generation on board as well. You can generate a whole genome in a day, essentially. If you're running in the high-throughput mode, you have to use the cBot, which is the separate device that prepares the flow cells for you. On your slides, I have an error; can you cross that out? That is wrong. But it does cost more: it's a 15% increase in cost for 120 gigs of data, so they charge a little premium for that fast turnaround, or that's what they're planning, anyway. Another one on the horizon is Oxford Nanopore. They gave a talk in February at the AGBT meeting and presented some data. There are different types of pores. There can be solid-state pores, which are probably coming but are difficult to sense with, where you just have a tiny little hole in a substrate like this, the DNA goes through, you've got some sensors, and as the DNA goes through you can measure a change in conductance. Their current version uses an actual biological pore. The DNA goes through, and as it goes through, the bases that are sticking out block the pore and you get a change in conductance. It's actually reading about three nucleotides at a time, and depending on the bulk, essentially, of those bases, you get a different change in conductance. Their error rate right now is around what they announced, 4%. The problem is that some of these three-nucleotide combinations look the same, so they keep mutating this pore to try to get more and more discrimination. I think they said they want to get the error rate down to 2% before they launch the whole thing. But it's a pretty interesting concept, an interesting system. You can load these things up; there are just racks of these sequencers. This is the large one, the GridION, at one, maybe one and a half gigs per hour, and you can just collect data for several days. The first version, they're saying, will have 2,000 of the little nanopores. And you just put your DNA in; there's no library preparation here. They put fragments of DNA in, they go through the holes, and you just keep collecting data until you get the coverage you want. And you can scale this like you can scale computing, racking up as many of these as you want. But the other cool thing is what they call the MinION, which is essentially a USB key; a fat USB key, but a USB key, which you could plug into virtually any laptop, more or less. It has 500 of those nanopores, you can get about a gig, and it lasts about a day; so you just keep collecting data until you have your gig. So that's pretty cool. Like I said, right now it's about 96% accuracy, or a 4% error rate. The other cool thing is you can get very long reads. I didn't mention that the PacBio can get around 3 kb reads right now, but here you should be able to get very, very long reads; there's no real limit, in that the fragment of DNA going through that pore could be very long, hundreds of kb potentially. They've shown data where they sequenced lambda, which is about 48 kb, in a single read going through. So if they can solve this accuracy problem, they'll be able to sequence very, very long fragments, which would be quite cool and would solve a lot of our problems. How much is that?
It's not on the market yet. I know what they're going to charge, but I can't say; it's actually pretty reasonable, and the price will come down as they get mass-produced. But if it all works, it's pretty cool. I envision some day where I can walk up to a podium, stick one in the side of my computer, add some DNA, give a talk, and at the end show the sequence that was generated during the talk. That'd be kind of cool. But it's not on the market yet; they'll probably do some early access this year, I'm sure, but we'll see. The other thing I should mention, the difference between this and the PacBio, and we'll talk more about this a little later, is that the PacBio's errors are somewhat random, and I'll get into why later, but they're pretty random. So with more coverage, you can get over that error rate. Whereas with these guys, the errors, because, just backing up here, you're measuring these three bases at a time, and if you have problems reading those three one time, the next time you read them you'll still have problems. So even with coverage, you'll still have an error at that spot. But they'll keep mutating this pore, and as they get more discrimination, they should eventually be able to discriminate methylated bases, et cetera. And the PacBio can distinguish methylated bases, more or less. All right, so the world has changed from the days when the genome was done, when it was pretty much one size fits all. One vendor was dominant in the market, and we all had the same machines, 3730s, just rooms full of them, and it was linearly scalable: you could just add more of them as long as you could feed them. But now there's a huge range of instruments to choose from, with a couple of new approaches like the GridION coming. They all have different capacities, they all have different run times, and it really depends on what you want to do which instrument is best for you. So there's a lot more decision-making rather than just going and buying what's on the market. If you want to do large scale, then you probably want one of these; if you want to do things that are fast and clinical, then you're looking at these sorts of instruments, like this guy. So really, I think it's a tough decision. People ask me all the time which one they should buy, and really the first question I have is: what do you want to do? Because it really is a tough decision. And to give you an idea of the scale: during the human genome, I was at WashU, co-director of the Genome Center doing the Human Genome Project, that's me there, and then Baylor College of Medicine, I'm somewhere in there, where I went for four years. At the height of the Sanger sequencing, we had rooms full of these things; I think between these two centers we probably had almost 200 3730s running. In a month we could do about 10 million reads, or about five billion bases. And just to give you an idea, a HiSeq will produce about 360 times that much data in a month, just a single HiSeq. So you can see the reason for this course: you have to deal with all this data. And literally, by the end of the year, you'll be able to sequence a genome in a day, whereas it took us seven years. The other thing is cost. Costs have been dropping dramatically. When next-gen started in 2005, it was still about $10 million to sequence a genome, still a bargain in those days, and it's been dropping every year. This year, at least with the Proton, they're promising the $1,000 genome. We'll see if they make it.
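As a quick back-of-the-envelope check of that throughput comparison, using only the numbers quoted here (about five billion bases a month from those Sanger floors, and roughly 700 gigabases per 10-to-11-day HiSeq run):

```python
# Sanger era, the two big centers combined, as quoted above
sanger_bases_per_month = 5e9            # ~10 million reads x ~500 bases

# One HiSeq, using the figures quoted elsewhere in the talk
hiseq_run_gigabases = 700               # ~700 Gb per run
hiseq_run_days = 11                     # 10- to 11-day run
hiseq_bases_per_month = hiseq_run_gigabases * 1e9 * (30 / hiseq_run_days)

print(f"One HiSeq per month: ~{hiseq_bases_per_month / 1e12:.1f} Tb")
print(f"Fold over the Sanger floors: ~{hiseq_bases_per_month / sanger_bases_per_month:.0f}x")
# ~1.9 Tb a month, roughly 380x -- consistent with the "about 360 times" above.
```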
But right now, back on cost, it's probably about $5,000 to sequence a genome. I want to stress that that's reagents only. Whenever you hear these things, the $1,000 genome, or they talk about the $100 genome, they're really only talking about reagents, just the raw reagents. There are all these other costs, a big one being informatics, what you'll be doing here; that cost is huge. So maybe it's $5,000 to sequence a genome now, but I would say it's probably more like $15,000 to $20,000 by the time you've done the entire genome and analyzed it. There's maintenance on the equipment: HiSeq maintenance is around $70,000 a year, so you have to amortize that over the runs. There's the cost of getting the samples; there are all sorts of those costs as well. So don't be fooled. There's also less pressure on these companies to reduce their price now, not only because of the competition, but because at $5,000 people are sequencing anyway, so why go to a thousand, right? So it'll go down, but I think it's starting to bottom out. All right, on to the applications. Whatever was done in genomics by any means has pretty much been ported over to next-gen. That doesn't mean it's the way you have to do it, or even the best choice sometimes, but all the things we did before, especially with microarrays, have next-gen applications. This is historical: 2008 saw the first cancer genome, which is relevant to this course. This was an AML, so a blood tumor where they could get lots of it and get it quite pure, done at WashU. I'll go through these numbers because they're pretty typical for when you sequence a genome. This was short-read technology, and it actually cost probably $1.5 million in those days, so a long way from the $5,000, but it was the first cancer genome. Here's what you find when you sequence any genome and compare to the reference: you'll find somewhere around 2.5 million variants. They were interested in the somatic variants, so they sequenced some skin, took that as the normal, and filtered those out. You can compare to dbSNP, you can compare to 1000 Genomes, and you get rid of, on average, the high 90s, 96% of things that have been seen before; you're assuming the somatic ones are going to be rarer and are what you're interested in. They got rid of some more by comparing to the few genomes that had been done at the time, and they also focused only on novel SNVs. Eventually what they were interested in was coding variants, since that's all we could really understand. They got rid of synonymous ones, figuring they probably weren't important; that's not necessarily so. And they got it down to a handful that they could validate. Remember, this was 2008. So they validated those, and most of them did not validate; the SNV calling was quite poor. It's gotten better, as you'll see. They got down to eight validated SNVs, so they had an 84% false positive rate in that first experiment. Still, it was pretty good. These are relatively simple tumors, and they found eight acquired mutations, some of which were known drivers, like FLT3. Structural variants we'll be talking about tomorrow, so we won't talk much about them here, but you can detect structural variants with these data. This, again in 2008, was sort of the proof of principle. You'll probably see a lot of Circos plots this week.
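As a schematic sketch of that filtering cascade: start from every variant against the reference, subtract what's in the matched normal, drop anything already in dbSNP or 1000 Genomes, and keep the novel non-synonymous coding SNVs for validation. The field names and data structure here are invented for illustration; this is the shape of the filter, not the actual WashU pipeline.

```python
# Each candidate variant is a dict with illustrative fields, for example:
# {"pos": ("chr13", 28_600_000), "gene": "FLT3", "coding": True,
#  "synonymous": False, "in_dbsnp": False, "in_1000g": False}

def somatic_filter(tumor_variants, normal_variants):
    """Reduce ~2.5 million raw variant calls against the reference down to a
    short list of candidate somatic, novel, non-synonymous coding SNVs."""
    normal_positions = {v["pos"] for v in normal_variants}

    candidates = []
    for v in tumor_variants:
        if v["pos"] in normal_positions:    # germline: also seen in the matched normal (skin)
            continue
        if v["in_dbsnp"] or v["in_1000g"]:  # known population variant, assumed not somatic
            continue
        if not v["coding"]:                 # in 2008, coding regions were all we could interpret
            continue
        if v["synonymous"]:                 # dropped then, though not necessarily unimportant
            continue
        candidates.append(v)
    return candidates                       # this short list is what goes on to validation
```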
In a Circos plot, the chromosomes are represented around the outside, and then you can plot various things inside. These lines here connect where translocations have occurred, and this was just a proof of principle. The next thing was that the more and more we look at the normal genome, if there is such a thing, the more structural variants we see. In fact, insertions, deletions, and things like that occupy more bases than SNVs do. So the variation in the human genome is not only driven by SNPs, as we first thought when we saw the genome. We thought everyone was 99.99% the same, but actually we're probably only 99.95% the same. That doesn't sound like very much, but in a big genome it's a lot of bases. So there's a lot more variation between normal humans than we expected. Transcriptome analysis as well; so moving from DNA to RNA, obviously. Microarrays were the first thing, and the thing about microarrays is you're only looking at what you put down; you have to decide what you're going to put on the microarray. Now they can hold a lot, so you can actually put down quite a bit, but on the microarray you're only interrogating exactly what you put down, so you'd only be able to detect, for example here, exons two and four. SAGE came out; that was sequencing-based, but it only grabbed the ends. With TaqMan assays you can design anything, but they're kind of arduous. Sequencing, of course, can cover the entire thing. The other advantage is paired-end reads; we'll talk more and more about those, but you can sequence from both ends of the molecule, so you can connect between exons and understand how these things are spliced together. The old way was you'd label two samples with different colors and do it on a microarray; now you can just grab the RNA and sequence it. The nice thing is it's a digital output, so basically the more data you collect, the more counts you get, and it has a much greater dynamic range than microarrays; it doesn't saturate the same way. You can saturate the library, though. Prepping them isn't too bad. Let me put this slide up because we're talking about microRNAs: you can pull out the fraction of small stuff, 18 to 30 nucleotides, and process that. I don't have any slides on it, but I should mention that there are newer kits now that actually give you stranded information, so the reads you generate on the sequencer match the strand of the DNA the transcript came from. That's really good for teasing apart overlapping transcripts. Not everyone's doing that now, but more and more people are turning toward stranded sequencing. Microarrays will give you a certain amount of information on expression, but from sequence you can get, I'd say readily, though it's still difficult, not only the transcript profile but differential splicing, differential allele expression, and RNA editing if you have the matched genome sequence. There were some papers that came out saying RNA editing is rampant in the genome, and there have been lots of papers disputing that. I forget what the numbers were; Francis, you remember, there were something like 10,000 RNA edits claimed per transcriptome, but most people think they were mostly looking at errors. There's still quite a bit of RNA editing going on, though. I've already mentioned expression, and then there's the small RNA example I mentioned. This came out in 2007.
This was a paper that represented a ton of work: a microRNA expression atlas across various tissues. They had done 330,000 small RNA sequences on old-fashioned sequencers, made 250 libraries by traditional cloning and sequencing, and done about 1,300 clones from each library, across 26 different organs. It was a lot of work. They had seen about 700 microRNAs, and in the breast cancer cell line MCF7 they'd seen about 100 of them. So this was the state of the art in 2007. We decided we would try this on one of our earlier sequencers; this was probably the end of 2007 or early 2008. We did a single run; it was the very first attempt we'd ever made on microRNAs. In their paper, for MCF7 they did 795 reads and found about 100 microRNAs. In our very first run we did 4.6 million reads and found 213 of them. If you look at some of the differences, there were some they found that we didn't find; those were very low abundance, and so you have to question whether they're real or not. This is their plot of the frequencies; I had to actually scale this differently, but it just gives you an idea of the sheer magnitude. And just to make sure we were actually seeing the right thing here, we mapped the start points of these reads to the genome; the height of this peak is the frequency of reads mapping there. You turn on the annotation track in the browser, Francis will show you some examples of the browser today, and you can see there is an annotated microRNA at that position. There's a major and a minor fraction, so it's a major and a minor; that all made sense. The interesting thing is that you can see they don't all start at the same point. In fact, the one that's annotated in the databases, this one right here, isn't even the major form; this seems to be the major peak, at least in our data set. A lot of people have seen this, and I don't know if it's been resolved yet whether that's just biological noise or whether it's actually biologically significant. Epigenomics, you'll hear much more on that. Again, the traditional readout was microarrays, but you can cross-link DNA to proteins like histones, pull those down, then release the DNA and sequence it, so you can map things all over the genome. I'm not going to say much about methylation, but there are ways of actually looking at methylation across the genome. If you're pulling down for ChIP-seq, for example, it just maps quite nicely. The biggest problem is you have to figure out what is significant, but these are clearly significant; we can do that from the experiment anyway. And you can see that there's a peak of methylation here that's in one of the conditions and not in the other, so nice data. All right, so all that sequencing power shifts the bottlenecks. As we can do more and more sequence, we have to prep more and more samples, perhaps, and of course, as you'll find out, the analysis becomes more and more difficult as we generate more and more data. The old way of doing it, if you wanted to sequence a region, was just good old-fashioned PCR. You can sequence a whole region by having overlapping PCR fragments. We got good at doing amplicons that matched the read length, or you could actually do five or 10 kb amplicons and shear them up. If you wanted to do a gene, you just took all the exons, designed primers, and sequenced that.
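A minimal sketch of that tiling idea, laying out overlapping amplicons across a region; the sizes and overlap below are arbitrary choices for illustration, and real primer design also has to worry about melting temperatures, uniqueness, and so on.

```python
def tile_region(start, end, amplicon_len=500, overlap=100):
    """Return (start, end) coordinates of overlapping amplicons covering [start, end)."""
    step = amplicon_len - overlap
    amplicons = []
    pos = start
    while pos < end:
        amplicons.append((pos, min(pos + amplicon_len, end)))
        pos += step
    return amplicons

# Tile a 3.2 kb region with 500 bp amplicons overlapping by 100 bp.
for i, (s, e) in enumerate(tile_region(10_000, 13_200), start=1):
    print(f"amplicon {i}: {s}-{e} ({e - s} bp)")
```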
So it was pretty straightforward, but it wasn't scalable. As you started scaling up and up into the hundreds of thousands, and we did, we scaled it up not with next-gen but with Sanger at Baylor, we were doing half a million PCRs a month and a million sequencing reads, one in each direction. That took a team of about eight people and a lot of robotics. You could sequence all of that now in one run on a next-gen machine, so clearly we had to scale up. So, capturing regions. I keep this slide in here; this is old, but in 1991 a colleague of mine and I isolated cDNAs specifically from chromosome 5, and that was by biotinylating clones that represented the entire chromosome and then fishing them out of a cDNA library. So really, people feel they invented something new here, but for the most part they just pulled in old technology. The main difference is that in this, what some people call genome partitioning or hybrid selection, the technology can now synthesize enough biotinylated oligos to pull these regions down. It started out on a solid surface support, where this sequence here represents the sequence you want to capture. You have your sheared DNA, your library, and you anneal that, wash off everything that doesn't stick, and elute off your captured region. It worked quite well. This is just one region we did, I think it's 120 kilobases of region, whatever it says on there; oh, 600 kb, there it is. You can see you got pretty good coverage across the region; these bald spots are because they were repeat sequences and we weren't able to design a unique probe there. The coverage wasn't bad, but it very clearly mirrors the GC content of the probes. This is fairly old; you still see this sort of thing today, but it's getting a little better. And if you're doing exome capture, so this is Agilent SureSelect mapped to the genome, you can see it's quite specific, that you get good capture. If you set your threshold, you get very good capture just on the exons and not in between. The other thing is multiplex PCR. If you've ever tried multiplexing PCR, you can have 10 sets of primers that each work really beautifully, but you put them all in one tube and you get a zillion different junk products; it takes a long time to optimize. So RainDance came out with a method using little droplets. This is sort of an emulsion PCR, only microfluidics produces the droplets when you need them. You synthesize the oligos, you put them in little droplets, and then you can mix them all together; this is then a library of primers. You can then take that and your DNA, and on their little device you can merge the droplets: right around here, a little pulse of electricity merges the two droplets. You now, hopefully, have a piece of DNA and one of these primer pairs in the droplet. It just spits them out into a tube and then you do PCR on that. It's like having millions of PCR tubes, one per droplet, but they don't interfere with each other because they're separated, and so you get pretty good amplification. It started out quite expensive, so we didn't use it much, but we're looking at it again as prices come down. They now have, and this is fairly new, a hotspot panel. If you're just interested in cancer genes, it's 42 genes, 71 kb in total, with around 200-base amplicons. If you're working with FFPE samples, it's important to keep the amplicons short.
And it works quite well out of FFPE. We recently did a run on that, and you can see the coverage here: at 100x coverage, you're in the high 90s for the percentage of the target that's covered, and you do want to go deep on this. So it works quite well, and it's quite reproducible; these are different samples. Another one is the HaloPlex. This has been bought, everyone gets bought, this one by Agilent, but it's still marketed as HaloPlex. It's a little different: you take the DNA and do a multi-enzyme restriction digest of it, and then you put in an adapter which has overhangs matching the part you want to capture. The one limitation is that it's a restriction enzyme-based process, and because of that, depending on where the restriction sites are, you can get different coverage. But it works quite well; they use about eight different enzymes to get around that. Then you amplify the part you want and sequence it. Just to give you an idea of the scale, we were looking at 19 genes, which we'll talk about later, 61 kb of target, and we did it with this. On the MiSeq, you can very readily put 10 samples together; you could probably do more, but we wanted deep coverage. We get about 1.9 gigs of raw data, and 97% of the mapped reads were on target. So it worked quite well, and it's a fairly simple process. It takes a reasonable amount of input DNA: I think their protocol asks for about a microgram, but we ran it with 100 nanograms and it works just fine. This was their design. Because it's restriction-based, and they're limited to the restriction sites, and they designed this based on a HiSeq run, which is 100 base pairs, this was the coverage they felt they could get. The advantage of the MiSeq, with its longer reads, is that we could actually read further into those restriction fragments, so we got over 99% coverage of the target. It worked quite well. Here's another HaloPlex run, a custom design we had them do. Lots of good coverage, very reproducible. This one line here is actually mouse; we'll talk about that in a minute, but that was mouse. We wanted to see, this is what it should look like, that we didn't capture mouse product. The Ion Torrent has AmpliSeq, which is their version of this. It's multiplex PCR with some tricks; it's all pretty much in one tube. I haven't made any effort here to normalize the scale, because I want you to see the difference: the blue one is FFPE and the red one is DNA from blood. This is all 800 amplicons they've got in there, plotted by their coverage. You can see there's a lot of variability, but it's reproducible, so it's obviously the efficiency of the primers. They have a new version now, which is better, that flattens this out. But this is the type of varying coverage you have to deal with in any of these multiplex methods. All right, I'm going to move on here. Challenges in cancer genome sequencing. You start with the normal genome and end up with the cancer genome, and a lot of things happen in between. You get the drivers, which essentially give a growth advantage, but more and more things accumulate. So what you end up with in the end is not a single population, but a mixture of clones varying both in copy number and in point mutations. In a cancer genome project, typically, because of this, we need large numbers.
The basic premise is: you take an individual with the cancer, you sequence the tumor, you sequence their blood, you take the difference between them to get a bunch of somatic variants, you collect enough people, you map the variants onto pathways, and hopefully you find out what the drivers behind this cancer are. The International Cancer Genome Consortium had a meeting here in Toronto in October 2007, with 22 countries and about 120 participants. The idea was to ask the question: could the world start working together instead of competing continuously on cancer genomics? The typical thing for a meeting like that is that the answer is yes but then nothing happens; this one actually got some traction. The reason for it is not only that the scope is huge and it's an important set of diseases to conquer worldwide, but what got me excited was the standardization. If we could standardize how we present the data and how we measure the quality of the data, we could compare across cancer types quite readily. There are a lot of projects out there, and you can download the data from one and download the data from another, but it's really hard to compare those data sets; they're very, very different. So there was a paper that came out to describe it. The goal is to do 50 tumor types or subtypes, and to do 500 tumors of each, so 500 tumors of each subtype plus their controls; this is like doing 50,000 human genome projects. So it's a huge scale, with lots of countries involved. This is from earlier in the year, all the projects and all the countries; I'll point out us here, doing pancreatic. Australia is also doing pancreatic, and we work very closely with them. This is actually out of date, but at the time of this slide there were 18,000 tumors committed to be sequenced, so huge amounts of data are coming. There's a web portal for the ICGC, which is also where you can get the data. I don't know if Francis will cover more of this, so I'll just give you the URL. And then, to talk about us and our projects a little bit: we're a translational cancer research institute with an annual budget of about $160 million. This is the building up the street; it's under construction and is about that high, I think, right now. This is where we are currently, on three floors of this building. There are about 300 people in here, but the entire program covers about 1,500 people. The OICR acts a little bit like a funding agency, so we support other programs externally, including cancer stem cells, for example. I'm going to talk a little bit about cancer genomics, bioinformatics, and the high-impact clinical trials and how we interact, but we have everything from prevention all the way through to clinical trials. This is our platform: we have 10 HiSeqs, two MiSeqs, a PGM, and a PacBio, and a fairly significant compute infrastructure, but it's never enough. This is always full; these are always maxed out. In that new building we're hopefully going to get some more, and I guarantee this will always be full and these will always be maxed out; you never can have too much. So one of our main targets is pancreatic cancer. It's a pretty dismal disease, with a five-year survival rate of 2%. It's only one in 15 new cases, so it's not the most common, but because of the survival rate it accounts for 6% of cancer deaths. One of the reasons is that when people arrive at the clinic, it's very hard to diagnose; they sometimes don't have very specific symptoms. Only 15% are surgically resectable, and even they usually succumb within two years.
Most people, when they show up, already have metastatic disease and have a very short survival time. So if we're going to start a project: like I said, a normal ICGC project would be 500 samples. We've scaled back, it's actually 375, but that's because Australia is also collecting, and so between us we'll have over 700. Because of the consent requirements for the ICGC, we couldn't really go back to the old samples; we had to collect new ones, and so we teamed up with various centers to help us collect them. Sample acquisition in many of these projects is one of the rate-limiting steps. A typical project is: we get germline DNA, usually blood, sometimes adjacent tissue, and the tumor; we measure the various things we've talked about, do some validation, magic happens like you're going to learn to do here, and hopefully we find some pathways. So, some of the issues with primary tumors. The very first one I showed you was a leukemia; you can get lots of pure tumor in leukemias, so they're great to work with. We, unfortunately, picked one that's hard to work with. Pancreatic tumors range in cellularity from 20 to 80%, and most of them are below 30%, so 70% of the sample is not tumor when we get it. And the heterogeneity I talked about compounds the problem. I won't go through all these numbers because they're easy to work out on your own, but even if there are no copy number changes and the genome is still 2N, many of the tumors we have are only 20% tumor, and so for a heterozygous somatic mutation, the signal, the fraction of reads carrying it, is only 10%, right? So you run into a problem very quickly. This is actually a pancreatic tumor; you can see the tumor part and lots of stroma. One of the goals of the ICGC is to have a 95% verification rate, a goal which I think is actually too high; I don't think that's even possible. Most tumor types are around one somatic mutation per megabase, so we'd have to have an error rate of less than 0.05 false calls per megabase to achieve that. And as you'll see as we go through this week, that's going to be very difficult to achieve; it's a false positive, false negative trade-off. We're around 85%, and I'm pretty happy with that; I don't think we'll achieve 95%. We'll talk tomorrow about how one of the main errors we have is misalignment; even the aligners still make mistakes, and we'll talk about that tomorrow. The other big source of error is just coverage. If you're looking for somatic variants, a somatic variant is in the tumor and not in the normal, and if you fail to call a variant in the normal, because of insufficient coverage or for whatever reason, then it looks somatic. We're getting better at this; we have to make sure we have good coverage. There are many ways you can try to get around these problems. Going back to that pancreatic tumor, you can see this region here has more tumor in it than, say, here, so you can actually core it out. That works, but in pancreatic cancer, unfortunately, most of the tumors are pretty uniform; no matter where you core, you get more or less the same. We get a 10 or 15% increase, which is good but not good enough. So we also have a component in our project for doing xenografts: part of the tissue is implanted in mice, and the tumor grows up there. From that we can also try to make cell lines for the same analysis. So why are we using the xenografts? One reason is the low cellularity.
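To make that arithmetic explicit: assuming a diploid tumor with no copy number change and normal cells contributing only the reference allele, a clonal heterozygous somatic mutation sits on one of two alleles, so its expected variant allele fraction is purity times one half; the binomial part just asks how many supporting reads you would then expect to see at a given depth. The numbers below are illustrative.

```python
from math import comb

def expected_vaf(purity, variant_copies=1, total_copies=2):
    """Expected variant allele fraction of a clonal somatic mutation, assuming
    the contaminating normal contributes only reference alleles and the tumor
    has no copy number change at that locus (both simplifications)."""
    return purity * variant_copies / total_copies

def prob_at_least_k_variant_reads(depth, vaf, k):
    """Binomial probability of seeing at least k variant-supporting reads."""
    return sum(comb(depth, n) * vaf**n * (1 - vaf)**(depth - n)
               for n in range(k, depth + 1))

# 20% tumor content: expected VAF of a heterozygous somatic mutation
print(expected_vaf(0.20))                                # 0.10, i.e. ~10% of reads
# At 30x depth, the chance of seeing even three supporting reads at that VAF
print(round(prob_at_least_k_variant_reads(30, 0.10, 3), 2))
```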
So xenografts are great if you have this low-cellularity problem. The idea was that we'd grow out more tumor, so increase the amount of material we have to work with. They're also great pre-clinical models: if we sequence that tumor and find a pathway that is critical to it, we can actually target it in that mouse. So we have a bank of these, and we can target different pathways; at the OICR we have two groups looking at that sort of thing, so we're trying to generate reagents for them as well. And it just shows it works: this is that same tumor implanted in the mouse, and you can see the xenograft has a lot more tumor content in it. Yeah, that can happen, but they're reasonably true to the original tumor. The frequencies will change, though; I don't know if I have slides on that, but the allele frequencies will definitely change. For some samples, though, it's the only way to work with them. Just from a detection standpoint, almost universally, pancreatic tumors have a KRAS mutation, so you can look for that. And this is just showing that in the xenografts and the cell lines we detect the KRAS mutation universally, whereas in the primary tumors, especially the low-cellularity samples, we couldn't detect it in some of them. That's with the pipelines we use to detect them. If you go back and look, for example, there were two reads out of 38 here that supported the KRAS mutation that was seen in the xenograft. Here's three out of 227; this one is incredibly low cellularity, about 3% for this sample, et cetera. So it's very difficult to call in those primary tumors unless you know what you're looking for; the xenografts are helpful there. One of the things you can do with the xenograft is sequence it, find all the variants, and then go back to the primary tumor with very deep sequencing to make sure they didn't arise in the xenograft or reflect just a sub-population. I'll talk about the heterogeneity in a second. How long do I have? Until 10:30, all right? All right, no problem; we've got lots of time. So, some of the disadvantages of the xenografts: as we just heard, you are clearly working with a tumor that went through a mouse. It's like when we sequenced the human genome, we were looking at the human genome from a bacterial view: every clone that was grown came up in a bacterium, and some things didn't grow in bacteria, so we had gaps because of that. We did some very deep sequencing on the early ones, very, very deep, but the percent of sequence aligning to human was relatively low. Usually we're in the high 90s, but for this one here, only 30% of the sequence we generated actually aligned to the human genome. If you measure the amount of mouse that's in there, either by qPCR or estimated from the sequence, you can see that the amount of mouse varies, and this one, in fact, was 71% mouse. Pancreatic tumors don't like to grow as a little ball of tumor cells; they like to grow interdigitated with the surrounding stroma. So although there was a lot more tumor there, it had recruited a lot of mouse tissue, and there's also infiltration going on, so we still end up with a lot of mouse DNA. So did we really just trade one problem for another? Instead of a human stroma problem, we have a mouse stroma problem. And why is that important? You'd think you could just filter it out informatically, and some of it you can, as you'll see. But here is a series of reads aligned to the genome.
It's hard to see at the back there, but there are the Gs here, the variant. It's quite clean sequence; you can see a few sequencing errors, but you would clearly call that as a T-to-G variant, and it would be a somatic variant, since it's not in the normal. If you take that region, the hundred or so bases surrounding that variant, and you BLAST it back against the human genome and the mouse genome, the only difference you see in that hundred-base stretch is that one base. So it looks like a beautiful somatic variant, but it's actually mouse reads aligned to the human genome, with the only difference being that one base. It looks and smells like a somatic variant, but it's not. So what did we have to do? Again, I found myself sequencing mouse. We had to sequence the mouse genomes, which is relatively easy now compared to when we did it the first time, so we were able to very quickly generate high coverage. The two groups we're working with use different background strains for their mice, so we had to sequence each of those, and there are actually two more variants we're going to have to sequence. But it's easy to do. If you look at what aligns, around 1% or so of the mouse reads will align to the human genome with the pipeline that we run. They disproportionately, as you'd expect, align to the exome: the exome is only about 1% of the genome, but about 25% of these reads that align actually align to the exome. And if you just take the mouse reads, align them to the human genome, run them through our pipeline and call variants, you get a lot of variants called, and again, disproportionately in the exome. Here's another Circos plot. These are the human chromosomes; this inner ring shows all the SNPs called from the xenograft. After sequencing the mouse genome, if we remove all the SNPs we found by doing that, you can see that most of the ones still here are actually mouse, and we still see some mouse ones that we can't remove. You'd think it would be really easy informatically, but it's not, and that's again back to the alignment issues we'll talk about more tomorrow. You can actually use the xenografts to enrich as well: you can either pull out the mouse cells or you can select for the human cells. We don't like doing the latter, because it assumes that the human cancer cells will express the markers from their tissue of origin, which they may not; it's been shown they don't always, and so you might lose some. But the mouse background should be the same, so you should be able to deplete the mouse cells, and it works. The problem is that it works on frozen tissue but works better on fresh, and you have to dissociate the tissue and then incubate it with these antibodies, so this thing has been at 37 degrees for a couple of hours by the time it's all done. I think the RNA expression is going to be out the window. You're probably wondering, why not just do laser capture microdissection? We are starting to do that now; we just got a new system in. You don't get a lot of material from LCM, though, so we've spent a good part of the last year working on protocols that use less and less input material while still getting comprehensive coverage. We're just at the point where we can start marrying those low-input protocols with the laser capture.
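As a minimal sketch of the informatic filter implied here: align each xenograft read against both the human and the mouse references, and only keep reads that match human strictly better, since a mouse read that happens to align to human can masquerade as a clean somatic SNV. A real pipeline would use the scores from the actual alignments (for example the AS tags in the two BAM files); the toy scorer below is just a stand-in to make the sketch self-contained.

```python
def alignment_score(read, reference):
    """Toy scorer: best ungapped match count of the read anywhere in the reference.
    A real pipeline would take the aligner's own score (e.g. the AS tag in a BAM)."""
    best = 0
    for i in range(len(reference) - len(read) + 1):
        matches = sum(r == b for r, b in zip(read, reference[i:i + len(read)]))
        best = max(best, matches)
    return best

def split_xenograft_reads(reads, human_ref, mouse_ref):
    """Keep reads that match human strictly better than mouse; discard the rest.
    Ties go to the discard pile: a dubious read is worse than a lost one."""
    human_only, mouse_or_ambiguous = [], []
    for read in reads:
        if alignment_score(read, human_ref) > alignment_score(read, mouse_ref):
            human_only.append(read)
        else:
            mouse_or_ambiguous.append(read)
    return human_only, mouse_or_ambiguous

# A read identical to mouse but one base off from human would otherwise look like
# a clean "somatic" SNV; here it gets binned with the mouse reads instead.
human_ref = "ACGTTTGACCTGA"
mouse_ref = "ACGTTCGACCTGA"
print(split_xenograft_reads(["TTTGACC", "TTCGACC"], human_ref, mouse_ref))
```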
So you can go in, use the microscope, mark it up, and actually pull out the tumor part or pull out the stroma; we're very interested in tumor-stroma interactions. But you don't get a lot of material for doing RNA-seq or genomic sequencing. I think we're almost there, though; within a few months we'll be able to do that. It's really important for us in pancreatic, because we don't sequence anything below 20% cellularity; those early ones I showed you were ones we just tried, but we don't really sequence anything less than 20%, and at least half, if not more than half, of our samples are below 20%. All right, so, making a library. We routinely make the genomic library and do whole genome sequencing on 100 nanograms; that works well. We've done it on 10; you can do 10, but the representation biases in the genome start to pick up. You can do 50 probably pretty well, but we routinely do 100. The protocols call for about a microgram of input, and we started out three years ago needing 10 micrograms, so it's come a long way. It's still going to be hard to get even 100 nanograms of material, but if we can get down to 10 nanograms, then it's quite doable. That's about the only way we're going to be able to rescue about half our samples; otherwise we're going to have to collect twice as many. We're also interested in whether the low-cellularity tumors are different from the high-cellularity ones. We're already biasing our sample set because we're currently only looking at things that are surgically resected, that 15%, and then we're biasing further because we're only looking at things over 20% cellularity. So there's no guarantee that what we find will be applicable to the rest of pancreatic cancer. I won't get into a lot of detail, but this is typical for any of these projects: these are the first 71 samples we looked at. There are 100 genes that are mutated in three or more of the samples, and then there are 1,000 more genes that are mutated in one or two of the samples; it's this long tail that you see in all of the cancer projects these days. There are some main drivers; there's KRAS. These are a lot of primaries, and it should actually be 95% KRAS, so obviously we missed some. The known players are here, these ones are interesting, and then there's this huge tail. We'll be talking about pathway analysis, and that helps sort that out. They're more homogeneous, and if you did look at different parts, you might find different things. There was a paper not too long ago, in prostate samples, where they biopsied different regions and got very different genomic information out of each. But that's more work. I put this up for validation. These are Sanger traces; some people may never have seen one, I don't know. This is what we did the human genome with. Here you can clearly see that there are two alleles. In the old days, when we found something, we'd have to go in and check it like this. So this, for example, would be the tumor and this the normal; you can clearly see there's a variant here, an A and a T here and only an A in the normal. So that's just to remind me about validation. Validation now becomes more and more problematic, because you start finding more and more things: you sequence a tumor and you find several thousand candidates, and you have to validate them.
One thing we do do is run a SNP array on everything, and then we look at the concordance: if you do a SNP array and then sequence, you can look in the sequence and check that you see the SNPs the array picked up. The arrays aren't perfect, so you should plateau at around 99% or more of them validating. Here we're close to 98, 99%, and that tells you that your sequence depth is good and your pipeline's working well. This is a select set that we picked out to validate. We tried a Sanger-based approach, and then, using the PacBio, which I'll talk about more in a minute, you can see we could validate more of them. And this is just that whole idea of detection on a Sanger trace: once you get below about 20% cellularity, this little bump here is usually not enough to confirm it. This is a pretty good trace here, but you can see in this example the background bumps are around 10%, so you couldn't be sure whether the call is real or not. And this is some data from a more recent validation. The pipelines are always evolving; this is our old informatics pipeline, and this is our new one. This is the cellularity of the samples, estimated from KRAS sequencing, and you can see the ones we work with are typically 30 to 40%, with many below 20%. Early on we sequenced those anyway, just to see how we would do, and you can see the validation rates are quite poor, although some of them aren't bad: here's one at 16% cellularity with 96% validation. I took this one out and recalculated the number because we knew there was a problem with it. So about 85% is where we sit on validation rate, which isn't bad. It's not the 95% they keep saying we have to get to, but I don't think you can do much better than that. So, heterogeneity; we've talked a little about it. As a tumor initiates and proliferates, you get different clonal varieties, and it's exacerbated by therapy: you put a drug in the patient and you kill off some of the clones, but some survive or adapt, and when you get a recurrence, it's the resistant clones emerging. So you're selecting a sub-population, or potentially a new population with new mutations, but usually it's a sub-population. There's a paper I've included in your handout, which came out recently from Walter and colleagues at Wash U. Again, this is leukemias; they make nice examples because they're clean and you don't have to deal with all the background normal. They show clearly that if you cluster the allele frequencies — this is in myelodysplastic syndrome, the early disorder the patient had, and then in the secondary AML that followed — and you plot the mutations and their frequencies, you start seeing clusters. So this is one sub-population of clones, another, another, another that have grown up. And they do some nice figures here. Apparently they have one person there who's very good at Photoshop; unfortunately there isn't a program to generate these, someone actually draws them. They're trying to figure out the clonal evolution, and this is one interpretation of it: the yellow cluster here is the original clone, which accounts for a great percentage — here it's 74% of the clones, bracketed by this yellow.
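Here is a minimal sketch of that SNP-array concordance check, assuming both the array genotypes and the sequencing genotypes have been reduced to simple position-to-genotype maps. The positions and calls are made up; in practice the concordance should sit near 98-99%.

```python
# A minimal concordance check between SNP-array genotypes and genotypes
# called from the sequencing pipeline at the same sites. Toy data.

def concordance(array_calls, seq_calls):
    """Fraction of shared array sites where sequencing agrees."""
    shared = [pos for pos in array_calls if pos in seq_calls]
    if not shared:
        return 0.0
    agree = sum(1 for pos in shared if array_calls[pos] == seq_calls[pos])
    return agree / len(shared)

array_calls = {1000: "A/G", 2000: "C/C", 3000: "T/T", 4000: "G/G"}
seq_calls   = {1000: "A/G", 2000: "C/C", 3000: "T/C", 4000: "G/G"}
# 75% on this toy example; a healthy run should be close to 99%.
print(f"concordance: {concordance(array_calls, seq_calls):.1%}")
```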
But it started accumulating mutations and breaking into more and more sub-types, and some of those sub-types die off, either through treatment or because other ones outgrow them. So, we've talked a bit about this, but the trend from 2005 to 2010 was bigger and cheaper: more reads at less cost, longer reads, et cetera. There was kind of a war going on. Illumina, I think, was clearly the winner with the HiSeq; SOLiD went through many versions. So the machines were continuously changing, and the data were changing. But around 2011 and this year, more machines have come out, and even within these companies they're developing new machines with more moderate throughput and faster run times. Some examples are the Ion Torrent and the Illumina MiSeq, and potentially the Oxford Nanopore when it comes out. There are still heavy guns being developed, especially the HiSeq 2500 coming out, but these smaller machines are obviously targeting a niche for clinical applications. So we'll talk a little about clinical applications. It changes the workflow. On the research side, with tumor/normal pairs, we don't care if it takes a couple of months; in general there are continuous things flowing through the pipeline. But on the clinical side, you do care about time: you want to be able to do that sequencing in, certainly, less than a week, as you'll see. When I came to Toronto, there were lots of questions I had for the clinicians, and the clinicians had lots of questions for me. This is a collaboration between Genome Technologies, the bioinformatics group headed by Lincoln Stein, and High-Impact Clinical Trials headed by Janet Dancey. Another point about the OICR: I showed you that nice artist's illustration of the building, and this one's under construction; at this point it was just a hole in the ground. But look at our location. We're very close to our funding agency, which is the Ontario government, which is good and bad. The University of Toronto is across the way here. But we're also adjacent to the major hospitals in Toronto, and the one in particular I'll talk about, Princess Margaret Hospital, is part of the University Health Network, which is Toronto General, Princess Margaret, and Toronto Western way over here. This is an oncology hospital, a cancer hospital only, and we've been working very closely with them on a clinical application for genomics. So the questions were these. Is the technology ready? When I first came here in 2007 and started talking about it, they asked me that question, and the real answer was no: the turnaround was too long, and the effort it would take just wasn't worth starting the project. Next: do we have enough targeted agents? That was my question to them. Is it really worth doing — if we did targeted sequencing of genes, would it make a difference in the therapeutic outcome? This is just a partial list of some of the genes that are targeted, some of the known mutations that are targetable, and what tumors they're in. But the concept was really that, for example, BRAF mutations are quite prevalent in melanomas, and there is a BRAF inhibitor; the idea was, if we see BRAF mutations in other cancers, can we also use that BRAF inhibitor with success?
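As a toy illustration of that matching concept, here is a hedged Python sketch mapping detected mutations to targeted agents across tumor types, with a caveat list for known escape routes like the colon-cancer case discussed next. The mappings are invented placeholders, not a clinical rule set.

```python
# Toy sketch: match a detected mutation to a targeted agent regardless of
# tumor type, flagging known tumor-type-specific caveats. Illustrative only.

AGENTS = {("BRAF", "V600E"): "BRAF inhibitor"}
CAVEATS = {("BRAF", "V600E", "colon"):
           "EGFR escape route -- consider combined EGFR blockade"}

def suggest(gene, mutation, tumor_type):
    agent = AGENTS.get((gene, mutation))
    if agent is None:
        return "no targeted agent on record"
    note = CAVEATS.get((gene, mutation, tumor_type))
    return f"{agent} [{note}]" if note else agent

print(suggest("BRAF", "V600E", "melanoma"))  # plain BRAF inhibitor
print(suggest("BRAF", "V600E", "colon"))     # caveat flagged
```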
This is both a good and a bad example, in that in colon cancer with a BRAF mutation, if you hit it with a BRAF inhibitor, you actually don't kill the cells, because they can escape the pathway: they express EGFR, and the melanomas do not. But if you also block EGFR, then you can kill them. So by understanding the molecular landscape of the tumor, you can actually target the therapy, and this is the idea behind personalized medicine in cancer. A big question was: can we use FFPE, formalin-fixed, paraffin-embedded tissue, instead of fresh? Because that's the currency of pathology. We did get some fresh biopsies collected and didn't see much difference in our ability to analyze them. There was a pathologist in here, right? Pathologists love to destroy DNA. They take a biopsy, and the first thing they do, and I understand why, is drop it into formalin. Formalin is not kind to DNA and RNA. Then they put it in paraffin so they can take a slice, stain it, and look at it, and they get the morphology they can make a diagnosis on, and that's rooted in a good 130 years of pathology. Obviously the diagnosis is key. We had hoped we could also collect fresh tumor biopsies, so we did an experiment. Usually they were pulling three biopsies: the first one goes into formalin; the second usually went into formalin too, in case the first one didn't work or didn't contain tumor; and sometimes they'd save the third one. We actually found that we got better results out of the FFPE than the fresh, and I think that's partly because I'm not sure how long the fresh sat around before it got properly frozen. But it was clear to us that FFPE is the currency of pathology, so we had to work with it. We finally said, forget the fresh, don't bother trying to get it for us, we'll learn to work with this, and that's what we've been doing. Here's the key thing: what are the differences between primary and metastatic sites? I'll touch on that a little. In this pilot phase we were biopsying metastatic tumors and treating the patient based on that, but the primary tumor is usually available in some FFPE block, and biopsying the metastasis is expensive. So is there much difference between the two — could you just sequence the primary and treat based on that? You'll see what came out of that. So, back to the PacBio. We started this quite a while ago, and you'll see that the PacBio may not be the clinical instrument of the future — I'd say it isn't a clinical instrument — but we had it, and the thing about it is that it's a fast-turnaround instrument: it takes about an hour to generate sequence, and the library prep is quite easy. We've run quite a few SMRT cells — that little thing that went around the room is called a SMRT cell. Right now I think we're actually over about 1,200, and most of them we've done for the clinical sequencing part of it, just for that fast turnaround. Just to give you an idea of what happens when you get a new instrument in: we got the second commercial instrument, an early-access machine that came in July 2010. It got installed, the specs were met, and we started working with it, but we continuously worked to improve it, and these are the various things that happened. For example, they came out with a new enzyme — they're always trying new enzymes — and we got an increase in read length and throughput.
Then they did an instrument upgrade, and we got a decrease in read length and a decrease in throughput. More tweaks finally got that working. We improved the quality of our DNA, and things kept going up. Another instrument upgrade, and this one actually worked; we got better output. So you're continuously at it — this is over about a year, with another chemistry upgrade in there. There's a lot of pain in getting a new instrument in, but it's working quite well for us now, and we're ready to start sequencing. This is a typical run metric that we get. The accuracy — I should give them the benefit and say accuracy rather than error rate — is 86%, which means the error rate is 14%. That sounds horrendous; if you're doing clinical sequencing, how could that ever work? I'll show you how we did it. The read length is around 3 kb, so it is a long-read instrument, if you're interested in long reads — though you don't get a lot of them. The number of reads is about 70,000, so it's a very moderate-throughput machine. If you think about circular consensus, which I'll explain in a second: if you want to sequence something around 600 base pairs, you can think of these like Sanger reads, which are sort of the gold standard, and you can get 70,000 of these Sanger-like reads for about 300 bucks. So depending on your project — for microbes and such — it's actually quite good. Circular consensus. What we decided to do very early on — this was the only really fast-turnaround instrument we had; this was before the MiSeq and the Ion Torrent — was this: if we amplify our targets of interest as PCR products, we can then put on a hairpin adapter, which essentially makes a closed circle of single-stranded DNA. Remember, in the slide I showed you on the PacBio, the polymerase sits at the bottom of the well, and it can read around that template multiple times. So if we keep the insert short — and because we're working with FFPE we have to keep our amplicons short anyway, since the DNA is fragmented, so our amplicons are less than 300 base pairs — the polymerase can go around multiple times. Remember, it's got 3 kb reads, so on average it's going to go around 10 times or so. It comes off in your computer as one single 3 kb read, but then you can clip out the passes and build a consensus. The polymerase reads through the forward strand, through the adapter, then the reverse strand, through the adapter, and into the next pass; by looking for those adapters, you can clip it up and build a consensus, and when you do that, you get a much more accurate sequence, because, remember, I said the errors are pretty random. This is just our pipeline; we won't go through it in detail, but we had to build our own. And this is what the data look like. If you take all the reads that come out, even things that didn't go around more than once — because there's a distribution of read lengths — this is what the raw data look like. Here's a variant; you can see it, but you can also see this background. These little purple squiggles are insertions, so there are extra bases in there; these are deletions, and these are substitutions. Those are all errors in the sequence. If you restrict yourself to reads with three or more, or five or more, passes, it cleans right up. You can still see that the dominant error is insertions.
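Here is a toy illustration of that circular-consensus idea: split the raw polymerase read at the hairpin adapter into subreads, flip alternate passes back onto one strand, and majority-vote per position. The adapter sequence is hypothetical, and real subreads differ in length because of insertions and deletions, so they need a proper alignment first; here they are equal-length toy strings.

```python
# Toy circular-consensus sketch: split a raw read at the hairpin adapter,
# orient all passes to the same strand, and majority-vote per column.
from collections import Counter

ADAPTER = "ATCTCTCTC"  # hypothetical hairpin adapter sequence

def revcomp(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def subreads_from_raw(raw_read):
    """Split a raw polymerase read at the adapter; flip alternate passes
    so all subreads are on the same strand. (Real data needs alignment.)"""
    pieces = [p for p in raw_read.split(ADAPTER) if p]
    return [p if i % 2 == 0 else revcomp(p) for i, p in enumerate(pieces)]

def consensus(subreads):
    """Majority vote per column (assumes subreads already line up)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*subreads))

raw = "ACGTACGTAC" + ADAPTER + revcomp("ACGAACGTAC") + ADAPTER + "ACGTACGTTC"
subs = subreads_from_raw(raw)
print(subs)             # three passes over the same insert, with errors
print(consensus(subs))  # the random errors are voted out: ACGTACGTAC
```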
And that's partly a homopolymer issue: where there are multiple Ts, it thought there were four instead of three, that sort of thing. There are still some deletions, which are usually dark bases: a nucleotide was incorporated that did not have a fluor on it, so there's no signal and it looks like a deletion. But you can clearly see the variant you want to call. Where to set the threshold is something you have to decide for your experiment. What you can do is plot the variants across your entire amplicon. Here there's a C, so you don't see it only once; it shows up as a big peak, and then you have to set a threshold. We're pretty comfortable calling with these data down to around 5%; that's where we set our threshold. If we see a variant in 5% of the reads or more, we'll call it, and that turns out to be real. We could probably go down to around 2 or 3%, but that's when you start getting more and more noise. So empirically you have to take some knowns — I think I have that coming up — and decide where that threshold is. [Question from the audience: for that variant, why isn't it at 50%?] Because some of these reads are normal. It's a tumor, and the variant is heterozygous, so it should be 50-50, but it's not, partly because of cellularity: the sample we get is not all tumor, there's some normal adjacent tissue, and that decreases the fraction of variant reads. And there are probably a few reads where an error happened to match the reference, but that would be rare. [And below your threshold you ignore it?] Yes — we set our cutoff at 5%, so anything below 5% we ignore. [It could also be subclonal?] Yes, it may be a sub-population: somewhere else you may see a variant at 50-50, and this one is only at 10%. So there are at least three different reasons, and they're all important; you have to think about them all the time. Now, I'll get into this more, but we're doing clinical sequencing in a lab that is not a CLIA lab; we're a research lab. So our result would not be used in the clinical report, because we're not allowed to do that; the CLIA lab has to verify it. They like to use Sanger reads to verify — that's the gold standard in their opinion. The problem is, as I told you, when a variant is below about 20%, a Sanger read probably can't confirm it: you know it's there, so you see the bump and say, yeah, it's there, but you see other bumps around it that also look like 10 or 20%. I think 15 to 20% is the limit at which you can look at a Sanger trace and be confident it's real. So if a variant were at 5%, they couldn't verify it with a Sanger read, but there are other methods they can use. The verification is very important. So we decided it was time to start a clinical trial. We used the PacBio, as I said, because it was the only fast-turnaround instrument we had. We targeted just 19 genes, and I'll say why in a second, aiming for somewhere around 80 patients; we actually did about 90. These patients had advanced, recurrent, or metastatic disease, so these people had already been treated, as you'll see, and the cancer had come back, multiple times in fact.
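Here is a minimal sketch of that thresholding step: for each amplicon position, compute the fraction of consensus reads supporting a non-reference base and call a variant only when it clears the empirically chosen cutoff. The 5% cutoff matches the talk; the reference, depths, and pileup are toy data.

```python
# Sketch of VAF thresholding over an amplicon: call a variant only where
# the non-reference allele fraction clears an empirical cutoff. Toy data.

MIN_VAF = 0.05    # set empirically from known-positive controls
MIN_DEPTH = 100   # don't call on thin coverage

def call_variants(reference, pileup):
    """pileup maps position -> list of observed bases across CCS reads."""
    calls = []
    for pos, bases in pileup.items():
        depth = len(bases)
        if depth < MIN_DEPTH:
            continue
        for alt in set(bases) - {reference[pos]}:
            vaf = bases.count(alt) / depth
            if vaf >= MIN_VAF:
                calls.append((pos, reference[pos], alt, round(vaf, 3)))
    return calls

reference = "ACGT" * 50
pileup = {10: ["G"] * 120 + ["T"] * 14,   # ~10% T -> called
          11: ["T"] * 130 + ["C"] * 4}    # ~3% C -> below threshold, noise
print(call_variants(reference, pileup))   # [(10, 'G', 'T', 0.104)]
```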
They have to be a candidate for a clinical trial. At Princess Margaret, across the street from us, there are many, many clinical trials going on, and the idea is that they've gotten to a point with this patient where they're going to put them on a clinical trial, and we're hoping to generate some information that will help point them to the appropriate trial and increase the odds that they'll respond to it. And of course, the patient has to be interested and give us informed consent. The CLIA lab — a CAP/CLIA lab — is in Toronto General Hospital, part of UHN, and does the work for Princess Margaret. About a year and a half earlier, it had put in a Sequenom instrument, a mass-spec-driven genotyping device, and there is a panel, OncoCarta version 1, that screens 238 mutations in 19 oncogenes. It's a genotyping device, so it looks at specific mutations: like a microarray, if you don't look for something, you don't see it. The idea was that they would run this, we would do the sequencing, and we'd see whether we found new things, and also validate the sequencing against it. The very first thing we did was take 30 samples that they had characterized. They had characterized them by Sanger sequencing, so they knew these mutations were there, and they'd also run them on the OncoCarta. This was the PacBio circular consensus, and we found all but one of them — I think that's been moved, it's here; we missed this one. It turned out that our amplicon there just wasn't amplifying well, so we fixed that, and then we were able to detect them all. That was the first step: does the PacBio detect the knowns? This is the pipeline we set up. The patient gets consented, and a new biopsy has to be collected; most of these, as you'll see, required a radiologist. It then went to the pathology department to confirm the diagnosis and assess that there was tumor there; they marked it up and showed us where the tumor was, because we're no good at that. It went over to the CLIA lab, which would scrape off the part that's tumor — macro-dissect it — and extract the DNA. They would send us the DNA and we would sequence it; they would run the Sequenom; and then, as you'll see, we generated a genomics report, which goes to the clinician and hopefully gets used to treat the patient. The goal was to do all of this in three weeks, and it usually took a week just to get to this point, so obviously the sequencing has to be fast; that was one of the key reasons we needed a fast-turnaround instrument. Once we had the sequence, we met weekly as a panel of about a dozen people, made up of genomicists, bioinformaticians, and clinicians. We had to have a quorum of at least six, three of whom had to be clinicians, to make any decisions. I'll talk about this pipeline a little later, but on this side we'd look at the somatic mutations, decide what needed to be verified in the CLIA lab, and then a final report would be made to the oncologist based on what we felt was actionable and reportable. The big question is how to interpret and report the data; a lot of the work at the very beginning was just setting up the logistics, and one of those things was the tracking system.
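One way to picture the sequencing-versus-genotyping comparison is the sketch below: the genotyping panel only interrogates a fixed hotspot list, so any sequencing call off that list is invisible to it. The panel entries and calls are invented stand-ins, not the actual OncoCarta content.

```python
# Hedged sketch of the panel comparison: calls on the fixed hotspot panel
# can be cross-checked by genotyping; calls off the panel can only come
# from sequencing. All entries below are illustrative placeholders.

PANEL = {("KRAS", "G12D"), ("KRAS", "G12V"), ("BRAF", "V600E"),
         ("PIK3CA", "E545K")}   # stand-in for the 238-mutation panel

seq_calls = [("KRAS", "G12D"), ("PIK3CA", "E545K"), ("GENE_X", "A123T")]

on_panel  = [m for m in seq_calls if m in PANEL]
off_panel = [m for m in seq_calls if m not in PANEL]
print("cross-checkable by genotyping:", on_panel)
print("only detectable by sequencing:", off_panel)
```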
So you've got three different institutions — the OICR, Toronto General Hospital, and Princess Margaret — and you need to be able to communicate between those three centers, with a lot of firewalls to go across. That was a chore in itself. We set up a system so we could track the samples. We also had to come up with some curation of the things we found. That was fairly straightforward here, because there are 19 genes and 238 mutations, and there were a bunch of clinical fellows who were interested in curating them. We built some tools to help: we pre-populated a database, split the mutations between them, and they actually curated them. But there was a lot we pulled in that's not curated, and this will be a bigger problem in the future: here we had very specific targets, but in general the consequence of each mutation has to be figured out. As for the report back to the clinician — this figure is from a paper by Steven Friend; I can't read it from here, but hopefully it's legible in your handout. If you reported this type of information back to the clinician, it would be pretty much worthless: nobody has the time to go through all of it to come up with a treatment decision; that's more of a research exercise. Our goal was to give something more like a standard report that might come out of the CLIA lab for any test they do, like their Sanger-based sequencing for other mutations: the mutation, its frequency in various tumor types, a little about it, some of its characteristics, some references, and, importantly, what clinical trials were available and, if known, what the outcomes of those trials were. So this is the sort of report that would get vetted by that expert panel and passed on to the clinician. Now, some of the data — oops, sorry; I'm going to have to skip around here, there's a printed version you can look at. We had set some goals for ourselves: we wanted at least 50% of the patients to consent, and we got a very good uptake rate. We approached 56 — this number here — 51 consented, and 50 were enrolled; I forget why one didn't, I think they got too sick to continue. In 49 we got a successful biopsy, so 98% of biopsies were successful. The median age of the patients was 57, and the median time with metastatic disease was 17 months. Under here is the median number of previous treatments: they'd been through three rounds of chemo on average, and the outlier was eight. The samples we got had a median cellularity of 60%, which is one of the reasons you didn't see 50-50 allele fractions earlier. They were all solid tumors, no leukemias, and the majority were the common types they see, but we took pretty much all comers, anyone they felt was eligible for a clinical trial. This is just showing the tissue sites that were biopsied. This blue part is the radiologists, so most biopsies required radiology; this was the bedside clinician doing it, and one was from surgery. So the first question — I don't know why this slide looks like this; it's been translated from one computer to another — was whether or not we could get DNA. If we got a successful biopsy, in most cases there was tumor present; in a few biopsies there was no tumor, and re-biopsy on some of those did get tumor.
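As a sketch of that reporting step, here is a minimal Python example that formats a report entry from a small curated knowledge base keyed by gene and mutation; anything uncurated is flagged for manual review. Every entry is an illustrative placeholder, not the project's actual database.

```python
# Minimal sketch of generating a clinician-facing report entry from a
# curated knowledge base. All content is an invented placeholder.

KNOWLEDGE_BASE = {
    ("BRAF", "V600E"): {
        "frequency": "common in melanoma; lower in colorectal and thyroid",
        "effect": "activating kinase-domain mutation",
        "trials": ["hypothetical BRAF-inhibitor trial NCT00000000"],
    },
}

def report_entry(gene, mutation):
    info = KNOWLEDGE_BASE.get((gene, mutation))
    if info is None:
        return f"{gene} {mutation}: not curated -- flag for manual review"
    return "\n".join([
        f"{gene} {mutation}",
        f"  Frequency: {info['frequency']}",
        f"  Effect:    {info['effect']}",
        f"  Trials:    {', '.join(info['trials'])}",
    ])

print(report_entry("BRAF", "V600E"))
print(report_entry("GENE_X", "A123T"))  # falls through to manual review
```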
Then, could we get sufficient DNA? If we got tumor, we usually got DNA from it: from the archival blocks, as I'll talk about in a minute, we get lots; from the FFPE biopsies we could get a decent amount; and from blood, of course, you get tons. There was a lot of variation, but we got enough to analyze. Yeah, this slide's better; these are the numbers from above, repeated. If we got DNA, we were able to analyze it 100% of the time. Mutations found: in about a third of the patients we found mutations in those 19 genes. These are the actual mutations we found, mostly knowns, but I'll point out that we found some novel ones as well. Our goal was to turn this around in three weeks, as I said. In those first 50 patients, we only got two-thirds of them back within three weeks, but if you look over time, early on it took us longer as we were learning how to do it, and in the last third we hit that three-week mark, in fact often turning around in two weeks. So we felt it was pretty successful: we showed that we can actually do this and generate sequence. One interesting thing: where available, we got archival tissue as well, and for 29 of the 50 we could readily get some. There were three cases that did not match, listed here. This one is kind of interesting: it's a cervical tumor. In the primary tumor, the archival sample, we sequenced and found a PIK3CA mutation. In the metastatic lesion that was biopsied for our study, we also found a PIK3CA mutation, but it was a different mutation. So the treatment had killed off those original cells, and these other cells had come back, but both had a PIK3CA mutation. We also saw one where the original tumor had a RET mutation that was not detected in the metastasis, and a primary lung case where no mutation was detected in those 19 genes but two EGFR mutations appeared in the metastatic tumor. This is a summary of what we found. There was concordance between the Sequenom and PacBio results, but because we're sequencing the whole gene, we actually saw some things they didn't see: these are mutations that are not on the Sequenom panel. And these were validated in the CLIA lab, so they weren't artifacts of our sequencing. I want to talk about this one. This is the 18th patient we looked at: a 48-year-old woman with metastatic breast cancer, ER-positive and HER2-positive. She had metastatic disease in the chest wall, lymph nodes, bone, and lung, had been through many rounds of chemo, and we detected a novel mutation here in AKT1; I'll explain that in a second. We validated it — these are actually the traces from my lab, you can see it there — and it was also validated in the CLIA lab with Sanger. So we were able to detect it. Here's where it is. This is AKT1; the known driver, activating mutation in AKT1 is up here, E17K, while ours was down here in the kinase domain. It's novel in that it's not in dbSNP or anything like that — it's a somatic variant, not a germline variant — and novel in that I can't find anything in the literature on it. It's not in COSMIC; it's not in any database. So we have no idea what it does. You can see it's in a highly conserved region, it's a fairly significant change — a charge change — and it's down in the kinase domain. Is it activating or not? No idea. We're doing some functional studies, and the preliminary results look like it is activating.
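Here is a small sketch of that archival-versus-biopsy comparison: for each patient, compare the mutation set from the primary with the set from the metastatic biopsy and report discordance. The patients are invented, though the first mirrors the cervical case above (same gene hit twice, different mutation).

```python
# Sketch comparing mutation sets between primary (archival FFPE) and
# metastatic biopsy per patient. Patients and calls are invented examples.

pairs = {
    "patient_A": ({("PIK3CA", "E545K")}, {("PIK3CA", "H1047R")}),
    "patient_B": ({("KRAS", "G12D")},    {("KRAS", "G12D")}),
}

for patient, (primary, metastasis) in pairs.items():
    if primary == metastasis:
        print(patient, "concordant:", primary)
    else:
        print(patient, "discordant;",
              "primary-only:", primary - metastasis,
              "met-only:", metastasis - primary)
```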
Clearly, if it's inactivating, there's no point in giving the patient an AKT1 inhibitor; it's already inhibited. But they did treat her — I was getting in trouble with that slide; these data aren't mine — with an mTOR inhibitor, and she did show some tumor shrinkage. This is the tumor. The reason I say don't read too much into the images is that these are two different views, so it looks smaller than it is, but it did actually shrink. Unfortunately, the patient developed lung infiltrates, was taken off the drug, and it turned out the cancer was still coming back; she was too sick to survive further treatment. So the patient passed away, but she was starting to respond, so potentially there was some benefit. This pathway here, incidental findings, is something we haven't had to deal with much. We limited ourselves to 19 genes, which seems like a small number, and it is: we only found mutations in about a third of the patients, and as we scale up we'll do more genes. But we didn't want to have to deal with too much of this, so we limited what we were sequencing. If you sequence a lot — and this is very controversial in the community — say we did a whole genome on all these patients, we would find a lot of things that had nothing to do with their cancer. We were interested in treating their cancer first; we want to get them on some kind of treatment. But we may find things that have to do with their overall health, or their family's health, which may be important. And the big question in the community is: do we report these back? Are we responsible for reporting them back? We want to deal with the somatic findings quickly, but we gave ourselves a longer timeframe for these incidental ones: they could be validated, there may be more of them, and we could verify them in the CLIA lab and pass them on if the panel felt the information was important to the patient. By restricting ourselves to those 19 genes, we did find germline mutations, but they were just known common variants, so we didn't bother reporting any of the germline findings, because we didn't find anything significant. But as we expand, we're going to start finding things, and the big question is how you deal with that. I don't have the answer, and it is going to be an important problem, because when you sequence the exome of any individual — just look at a few papers in the literature — you'll find roughly 100 to 200 genes with potentially deleterious mutations in them, and if you sequence the whole genome, there are something like 600 genes where you would predict a significant impact on the gene. So it's something we have to deal with. Future challenges: we started with 20 genes, and we're going to go up to 200, then 1,000. Once you get up in that range, you might as well do the whole exome, and it's actually easier to do a whole genome than a whole exome from small samples, so eventually we'll go there. This is just driven by cost: it gets a little more expensive going that way, but it only costs about five times more to do a genome than an exome, so the cost differential is not huge, though still important. We'd also like to include transcriptome information and structural variation, which we haven't so far, and we're constantly looking at new technologies as they come out. As I said, I think we've pretty much got the input problem worked out; we can make libraries from pretty small amounts. That's important.
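As a sketch of the germline triage just described — restrict to the panel genes and drop known common variants, so only potentially significant findings reach the review panel — here is a minimal Python example. The gene list, variant calls, and "common" set are all placeholders.

```python
# Minimal sketch of germline incidental-finding triage: keep only on-panel
# variants that are not known common polymorphisms. Placeholder data.

PANEL_GENES = {"KRAS", "BRAF", "PIK3CA", "AKT1"}   # stand-in for the 19
COMMON_VARIANTS = {("PIK3CA", "COMMON_SNP_1")}     # e.g. known dbSNP entries

germline_calls = [("PIK3CA", "COMMON_SNP_1"),   # common -> dropped
                  ("AKT1", "X999Y"),            # on-panel, rare -> flagged
                  ("TTN", "A1B")]               # off-panel -> dropped

reportable = [v for v in germline_calls
              if v[0] in PANEL_GENES and v not in COMMON_VARIANTS]
print("flag for panel review:", reportable)
```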
In choosing how we're going to look at 200 genes, the question is how to isolate them, and I gave you a few examples of methods earlier. In the clinical arena, we want 100% coverage; we don't want gaps and holes. And it may be that no one technology does that for you: you might use HaloPlex, but as I showed you, that only gets about 97% coverage, so you might fill in the rest with some PCR or something like that. It may be a combined approach. All right, just the last few slides. Data complexity: you've seen there are lots of different platforms, and you get many terabytes per single instrument run, so huge amounts of data are being generated, and you'll learn how to deal with some of that. The HiSeq puts out terabytes of data per run, which is fairly daunting, but after analysis you don't need most of it: there are four human genomes on this thumb drive. Once you get down to what you want to analyze, it's not so bad. The data are also very complex; we talked about all the different things in RNA-seq, for example, which is very difficult, and there are software packages, but none of them is great for looking at these sorts of things. And validation is a huge problem: you're no longer looking at a couple of things to validate, you've got thousands. Of course, as you'll learn through this course, there's a bigger picture to consider. Just because we found a mutation in AKT1 or the like, there are all these other pathways, and ways the cells can get around your inhibitor, et cetera, so you have to take the whole picture into account, and we'll talk about that with pathway analysis. Data privacy: as I said, there are four human genomes on here, and it's not encrypted, though it should be. These genomes are in the public domain, so there's no issue there, but clinical information obviously needs to be protected — a person's name, their address, all that kind of stuff. What about sequence data, though? If you've got a genome sequence, you can identify the person. It's not a trivial matter; you can't just look at the sequence and tell who it is. But there is information in the sequence that you can use. For example, there's what's called surname leakage: by looking at the Y chromosome, you can determine with some accuracy the likely surname, or at least where the person's ancestors are from in the world, and potentially what their surname is. In fact, there was a case, quite old now, probably about six years ago, of a young boy who was adopted and wanted to find his biological father. He had his Y chromosome sequenced by one of the companies that do that, compared it to a database, and from that got a likely surname and where his geographic roots were, and he was actually able to track down his biological father. That started this whole storm about protecting sequence data. The best thing you can do is encrypt: when we ship sequences around, we encrypt them and make sure they're protected as much as possible. And the big problem — this is my last slide — is actually integrating all of this. We'll learn how to do some of that here, with pathways, for example. But there are all these different data types you can put together, including bringing in the clinical data, and then all the validation follow-up you have to do.
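For the encryption point, here is a minimal sketch of protecting sequence data before shipping it, using the third-party cryptography package's Fernet recipe (symmetric authenticated encryption). The file name and contents are toy stand-ins, and real deployments also need key management, which this example ignores.

```python
# Minimal sketch of encrypting sequence data before it is shipped, using
# the "cryptography" package's Fernet recipe. Toy data; no key management.
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # must travel out-of-band, never with the data
fernet = Fernet(key)

plaintext = b"@read1\nACGTACGTAC\n+\nIIIIIIIIII\n"   # toy FASTQ record
ciphertext = fernet.encrypt(plaintext)

with open("genome.fastq.enc", "wb") as f:            # hypothetical file name
    f.write(ciphertext)

# Receiving side, holding the same key:
with open("genome.fastq.enc", "rb") as f:
    recovered = Fernet(key).decrypt(f.read())
assert recovered == plaintext
```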
So from a clinical standpoint, all of this information can eventually be brought into play to come up with a list of therapeutic targets, and from the research side, we're trying to find the significant pathways in cancer.