done by 11, right. The necessary stuff at the beginning, Francis has already gone through that. This is my disclaimer: I'll be talking a lot about technologies, and some of these are publicly traded companies. I have no equity in any of them, so I won't endorse any of them. And it's unfortunate that I'm not allowed to have any equity in them. All right. So one of the things about cancer genomics is clear: a lot of it centers around next-gen sequencing now. I've been doing sequencing for about 20 years, starting out in the good old days of a slab gel and radioactivity and just reading the bases off. Those were the good old days; you could have a beer in the lab and no one cared. But then it got more and more automated. This is still a slab-based instrument, but with fluorescent dyes and automatic lane tracking. And then the gel was removed entirely: the capillaries came in, and the machines evolved. During the Human Genome Project, when I was involved with that, you could fill rooms full of these things, so it was linearly scalable, and do about 200 million bases a day. And of course, what we're really going to be talking about a lot in this course is the sequencing revolution that started around 2005 with this instrument. We'll talk about the instruments in a minute. All of these instruments are now commercially available, except this one, which is just about to come out; we'll talk about that. So it's not just a revolution in one platform. It's huge amounts of data, as you'll see, but there are also all sorts of different platforms coming up. So what are the advantages of next-gen sequencing? Well, obviously, you don't need subcloning. When we did the human genome, there were BAC clones that we sequenced, and each one of those was shotgunned into plasmids, and every one of those plasmids had to be prepped. Now, as you'll see, you can make a bulk library. You can make one library (we usually make two, but essentially one library) and do all of your sequencing from it. The amount of data being generated is vastly improved; it's huge amounts coming out, and that enables many of the applications now. There's an increased dynamic range, because you can actually just count the sequences. So it's essentially unlimited dynamic range: the more sequencing you do, the more you can count. And you can detect rare variants, which is very important for cancer, as you'll see, by sequencing very deep. It's been readily adapted to a variety of applications. Pretty much everything you can do in genomics, with DNA anyway, and with RNA, has been ported over. Microarrays are still being used, and they still have an important place, but every application that was done on a microarray or a gel has been ported over to next-gen sequencing, as you'll see. And the cost has gone down; we'll talk about that. And, as I always say, it's ridiculous amounts of data per run, which puts huge pressure on compute resources. So it really started in about 2005 with the launch of the 454 platform. This was really the first true next-gen sequencer. I'll put a caveat on that; I'll talk about one other. But it was really the first commercial one that you'd stick in your lab. So we're going to go through a little bit of how these work. I think it's important to know where the data come from if you're going to be working with them. If you're already familiar with these, I apologize, but not everyone knows how these things work.
So we'll just walk through it briefly. Most of the platforms start by shearing up some DNA and putting on adapters, which are specific to the platform you're going to be sequencing on. That collection of molecules with adapters on them is called the library. In the case of the 454, they use a process called emulsion PCR. This is just an oil and water mix that you generate, and each little water droplet in the oil is like a PCR reaction, the equivalent of a PCR tube. There's a bead in there carrying oligos complementary to those adapters, and then by PCR you essentially coat the entire bead with copies of that one molecule. So hopefully, in each droplet, you have one bead and one library molecule, so you get just one type being propagated across the surface of that bead. If you have two, you'll get a doubled signal and that won't work. Some droplets are empty, some just have a piece of DNA, some have one bead, some have two; it's a Poisson distribution. For the readout, the 454 used what was called a picotiter plate, and still does. This is a bundle of glass capillaries fused into a plate, which is then cut and acid-etched to make little wells; the cladding of the capillary is more resistant than the glass core, so the etching leaves wells. The DNA-coated beads fit into those little wells, all the enzymes you need are packed in on other beads, and that holds it all together. So you just load the plate, centrifuge it to spin everything down, and put it on the instrument. This is an electron micrograph of what it looks like. In each of those little wells you then do sequencing by synthesis: you anneal a primer and add bases, and as each base is added there's an enzymatic cascade that releases a light signal. So this was really cool. In 2005, you'd get a couple hundred thousand sequences out, of about 100 bases each in those days. It also introduced a new type of data, which we had to get used to. We were used to the capillary sequencers, looking at traces, which we'll look at in a bit. But this is what we called flow space. There's no blocker here; it's not like Sanger-style sequencing where there's a terminator on the incoming nucleotide. So if there are five T's in a row in the template and you flow T, you'll get five T's incorporated and roughly five times the signal. Each flow introduces a single nucleotide, you see whether that base was added, and from the height of the signal you get an idea of how many there are. It's fairly good up to about five; beyond that, it's not very linear, which is one of its problems. But you can imagine: we'd been used to dealing with traces forever, and all of a sudden this kind of data was thrown at us. None of our tools worked; we had to reinvent them all. The next platform that came along was the Solexa, and here's where I said there was an earlier version: Lynx Therapeutics in California was really the first next-gen sequencing company. They didn't actually market an instrument; they sold a service. They did something similar with many beads, but with very short reads. They were merged with Solexa, partly for their IP, becoming Solexa Inc.
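Before leaving the 454, here's a minimal sketch of the flow-space idea in Python, since it trips people up. The fixed flow order and the clean, noise-free signals are idealized assumptions; real flowgrams need normalization, and as I said, the homopolymer estimate degrades above about five.

```python
# A toy 454-style flowgram decoder. Assumes an idealized, noise-free signal
# where intensity is proportional to homopolymer length.
FLOW_ORDER = "TACG"  # nucleotides are flowed one at a time in a fixed cycle

def call_bases(flow_signals):
    seq = []
    for i, signal in enumerate(flow_signals):
        base = FLOW_ORDER[i % len(FLOW_ORDER)]
        n = round(signal)        # signal height ~ number of bases incorporated
        seq.append(base * n)     # zero means this base wasn't next in the template
    return "".join(seq)

# Flows T, A, C, G, T with signals ~1, 0, 3, 0, 1 read out as "T" + "CCC" + "T"
print(call_bases([1.1, 0.1, 2.9, 0.0, 0.9]))  # -> "TCCCT"
```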
So Lynx were really the first guys; it's good to mention them. The Solexa, when it came out (this is the first version of the instrument), works a little differently. It's done on a microscope slide, and on that slide are immobilized oligos. The same sort of process happens where you have a library, your DNA strands with adapters on them. A strand anneals to the slide and gets copied, and then you repeat this process: the fragment bends over, finds another oligo, acts as a primer, goes back, and you end up with clusters. So this is cluster generation. The sequencing is then done by sequencing by synthesis, and in this case there is a blocker on the nucleotide, so you're adding one base at a time. All four nucleotides are present, it's imaged with a laser, and you can just read off the sequence. And the data came out in base space: you get bases straight out of it, with intensities and some sort of quality call. So this was a little easier data to work with, which is one of the reasons I think it took off. This is the inside of one; it's one of the older versions, but if you've never seen one, it's really quite simple. This is a microscope slide; I think I've got one here somewhere. I'll pass this one around: this is an Illumina flow cell, if you've never seen one. And you can tell the instrument looks like a microscope; it was built from microscope parts originally. This is just a microscope objective, and this is the slide here. It has channels in it, as you'll see as it comes around, and each of those channels is a separate sequencing reaction. The instrument just scans along and reads the image, and all these little dots here are the clusters of DNA. The newer versions have evolved a bit beyond that, but those are the basics; this is how the early instruments looked. Right on the heels of that, around 2007, Applied Biosystems came out with the SOLiD system. It's similar to the others: you make a library, there's an emulsion PCR, so you coat a bead with DNA fragments. You treat those beads so they stick to a slide, so it's like a hybrid of the two techniques. They're deposited on a slide, and each bead is imaged. The main difference is that the sequencing is done by ligation; this is not polymerase-based extension. Instead, and we'll walk through this in a minute, they have fluorescently labeled oligo probes that are ligated onto the growing strand. This is what the early version looked like: the probes were degenerate at these positions, with a dinucleotide in the middle, which is what gives the specificity. Now, the interesting thing was that they would have needed more dyes than they wanted to deal with to distinguish all 16 different dinucleotides. So what they did instead was two-base encoding: using only four colors, they represent all of these dinucleotides. But you can see that AA, CC, GG, and TT are all blue, so they all look the same, and they had to deconvolute that. It was actually quite clever how they did this. The way it works is that you follow the transitions: you have AA, which was blue, right? And then the next one is an AC transition. And you can walk your way through this. You can see from these data that there are different possible solutions because of the redundancy, so you need to know the first base of the sequence.
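To make that concrete, here's a minimal sketch of the color-space decoding walk in Python. It assumes the commonly published SOLiD two-base encoding table (color 0 for AA/CC/GG/TT, and so on), so treat the exact color assignments as illustrative.

```python
# Two-base encoding: each color encodes a dinucleotide transition.
ENCODE = {
    "AA": 0, "CC": 0, "GG": 0, "TT": 0,
    "AC": 1, "CA": 1, "GT": 1, "TG": 1,
    "AG": 2, "GA": 2, "CT": 2, "TC": 2,
    "AT": 3, "TA": 3, "CG": 3, "GC": 3,
}
# Invert the table: given the previous base and a color, what is the next base?
DECODE = {(dinuc[0], color): dinuc[1] for dinuc, color in ENCODE.items()}

def decode_read(first_base, colors):
    """Walk the color transitions; the known first base anchors the decoding."""
    seq = [first_base]
    for c in colors:
        seq.append(DECODE[(seq[-1], c)])
    return "".join(seq)

print(decode_read("A", [0, 1, 2]))  # AA -> AC -> CT  =>  "AACT"
```

Note that without `first_base`, the colors [0, 1, 2] would decode equally well from any starting base, which is exactly the redundancy just described.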
And the way it was done, the first base sequenced was always a known one, an A, from the primer. You'll find there are other ways around this now, too. But this was a very different way of sequencing. We call this color space; the data came out in color space, so once again, none of our tools worked, right? We had to start all over. It was very difficult data to work with originally; even the tools from Applied Biosystems were lagging, and I think that's one of the reasons the SOLiD had a difficult launch before it caught on a bit. But there were some claimed advantages to this two-base encoding. For SNP detection, for example, if this is the reference sequence and these are the other possibilities, you can see that you get very different color signals out, and you can walk through and look for errors. Oops. Can I have one more slide in there? But I guess no. No. One of the things they were trying to say was that it was more accurate: if you didn't get a consistent pair of transitions, say AA followed by AC, these dinucleotide transitions, and instead got something inconsistent, like AA and then TC, they would call that an error. And because of that, they said they were more accurate. That's somewhat debatable, but that was the idea behind it. So then came along the third-generation sequencers, and that's kind of the era we're in now; there's the next generation and then the next-next, or third generation, or G3. Certainly the Illumina and SOLiD platforms are still cranking along, but most platforms on the horizon are single molecule; that's where they're trying to get to, single-molecule sequencing. The potential benefits are ease of sample prep, since you don't need any amplification; much less sequence bias, as we're seeing in some of the data; potentially longer reads, and we'll talk about that; possibly higher throughput, although initially they don't have that; and lower cost per base. But they do have higher error rates, and we'll talk about that in a little bit. Along with the second generation that came out, let me check my visual aids here. I'll hand this out: this is actually not off the version I showed you, but off a SOLiD; you can see it's off the new SOLiDs. I'll show you a picture of one of those in a minute. It's the same sort of thing, like a microscope slide with channels in it, but a much bigger format. Around the same time, Helicos BioSciences, which is still in operation but not really viable as a sequencing company at this point, came out with something that was really ahead of its time; I guess it was almost third generation. It was a single-molecule sequencer. It was quite big; we had one upstairs for a while, and it weighs about a ton. On it, they would attach single molecules and detect them directly. It had about 50 lanes, about one or two million reads each, with reads of about 25 bases. It had about a 5% error rate, which was actually quite good for a single-molecule sequencer, but much higher than the other platforms at the time. It had trouble catching on, partly because the read length was quite short. When the Solexa first came out, it was about 25 bases too; right now you can do 150, though typically people do about 100. So the Solexa took off very quickly, and the read counts on the other platforms outstripped the Helicos very quickly as well.
But it was supposedly good for RNA sequencing; it had less bias. I think it just didn't produce enough data, or enough that was different, for it to catch on. It's still around, though. The one that came out most recently is the PacBio real-time sequencer. This is a very interesting instrument; we have one upstairs if you want to come see it. This is a substrate with very small wells in it, zeptoliter-scale wells, around 20 zeptoliters I think, with a glass surface at the bottom. The way it was described to me once: think of your microwave oven and that little gridded panel in the front, which lets you see your food cooking without cooking your face. That works because the microwaves are too large to exit through those holes. It's the same idea here: the wells are so small that the laser light can't propagate up the hole, but it does illuminate the bottom area. So if you attach a polymerase at the bottom, you can interrogate just that spot. The way it works, with a single polymerase at the bottom of the well, is that as your DNA template is going through, all four nucleotides are present at the same time, diffusing in and out of the little area being interrogated, and that's the chattering background on the lower trace here. When the polymerase actually incorporates one, it takes milliseconds to grab the nucleotide, incorporate it, and cleave off the fluorescent label, which is on the phosphates. During that time you get a signal. This is an idealized trace here: the signal comes up, and when the fluorescent part on the phosphates is clipped off and drifts away, you go back to baseline, and then the next base is incorporated. So that's how it works, and it works quite well, as you'll see; I'll show you some data in a minute. The other interesting thing about this platform: it's not a supported application, but if you read their publications, you can put other things in the well. They've done reverse transcriptase. You can put in a ribosome and actually watch proteins being made. So it's an interesting research instrument as well as a sequencer. The advantage is that it's native DNA being read, and right now it's about 2 kb reads, as you'll see. You can probably do 5 kb reads readily in the near future, and there's really no reason you couldn't do 100 kb reads eventually. On the read length? Because it's a single molecule, the signal stays the same as each base is incorporated; there's no decay in signal, unlike the other platforms. Those look at a cluster, or a bead coated in DNA, and any time one molecule isn't extended properly, it starts getting out of sync. As molecules get out of sync, because it's a population, you get more and more noise in the background. Here you're looking at single events. I guess I can talk now about some of the error rates. One error mode is that these fluor-labeled nucleotides are chemically synthesized, and you can't make anything 100% pure. So if one of them doesn't have a fluor on it and it gets incorporated, you won't see any signal at all, right? And so it looks like a deletion.
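Before I get to that, here's a toy version of that pulse-calling idea in Python, just to make the trace logic concrete. The threshold and minimum-duration values are made-up assumptions; a real base caller fits pulse shapes and handles all four dye channels properly.

```python
# Toy pulse detection on one dye channel's intensity trace. Background
# "chatter" from diffusing nucleotides gives brief blips; an incorporation
# holds the fluor in the detection volume long enough to form a sustained run.
def call_pulses(trace, level=0.5, min_frames=5):
    events, start = [], None
    for i, v in enumerate(trace + [0.0]):      # sentinel closes a trailing run
        if v > level and start is None:
            start = i                          # a run above threshold begins
        elif v <= level and start is not None:
            if i - start >= min_frames:        # long enough to be a real pulse
                events.append((start, i - start))
            start = None
    return events

trace = [0.1, 0.7, 0.2] + [0.9] * 8 + [0.1]    # one blip, then one real pulse
print(call_pulses(trace))                      # -> [(3, 8)]: only the pulse
```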
And I'll get into how we get around those missing-fluor deletions in a minute. But the signal stays the same. What does happen, as the inventor of the technology told me, is that under the laser the polymerase eventually catches fire and just dies. They think they know why, and they're working on it. If the polymerase weren't burned out by the laser, it would just go as long as a polymerase normally does, which is probably in the range of 100 kb for this one, and the signal would remain constant. One of the most recent platforms is the Ion Torrent, and it's actually a fairly interesting instrument. It's called the Personal Genome Machine, the PGM, and it was sold as relatively low cost: roughly $100,000 by the time you buy everything you need to run it, compared to the other instruments, which are more like $650,000. These are the specs, just off their website. The 314 chip is currently available, the 316 I think is just becoming available, and the 318 is not yet available. On all the slides you'll find the websites of these companies if you want to look up more information. It's quite interesting how it works. I'll start passing one around; this is one of the chips from it. It's a really clever idea: the machine is essentially a giant pH meter. It's a silicon wafer, as you'll see coming around; that is the sequencing surface right there. It has an array of what are essentially pH meters in tiny wells, and what happens in each well is that as nucleotides are incorporated, hydrogen ions are released. They draw this as if it were a single molecule, but it is not a single-molecule sequencer; obviously the signal from a single hydrogen ion would not be very strong. So this involves an emulsion PCR step: you have a bead coated in DNA, and as the nucleotides are incorporated, hydrogen is released and you get a signal. It's also an instrument where, if two bases are incorporated in one flow, you get twice the signal. So it's a really interesting concept: completely native nucleotides are being incorporated, with no modifications, so you don't have to worry about fluors, et cetera. It remains to be seen exactly where this platform will go. This is a run we did just recently, to give you an idea of the metrics it can produce. We got about 20 megabases of data, but these are the quality measures, Q17 and Q20, and you can see the quality drops off quite rapidly after about 50 bases. But it's improving. Right now the reads are about a hundred base pairs long, maybe a little more; the mean on this run, the longer one, was 126, and I think the last run we did was 106 base pairs. So it has potential. They're claiming they'll get 400-base reads in about a year, and as the chips improve, as you see on that other slide, the total capacity will go up very rapidly. So this is an interesting one to watch, I think. And just on the horizon is the MiSeq, which is sort of the mini HiSeq from Illumina; it's like a single-lane HiSeq in some respects. The chemistry is very similar; the throughput is a lot less. These are the throughputs, again off their website, of what it's supposed to produce when it arrives. They should be doing early access in the next couple of weeks.
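A quick aside on those Q17 and Q20 numbers from the run metrics: they're Phred-scaled base qualities. A minimal converter, assuming the standard Phred definition Q = -10 log10(p):

```python
import math

def phred_to_perror(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)

def perror_to_phred(p):
    """Phred quality score implied by an error probability."""
    return -10 * math.log10(p)

for q in (17, 20, 30):
    print(f"Q{q} -> {phred_to_perror(q):.1%} chance the base call is wrong")
# Q17 -> 2.0%, Q20 -> 1.0%, Q30 -> 0.1%
```

So a read that holds Q20 out to 100 bases is expected to contain about one wrong base call.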
We've got one coming in a couple of weeks. The nice thing about that instrument is the fast turnaround time, but you pay the price in less data coming out; still, there are many applications it's quite good for. So what does all this mean? Let me put it in perspective, since these numbers are sometimes hard to grasp. I was co-director of the WashU Genome Center during the Human Genome Project (that's a much younger and slimmer me right there), and after that I went to the Baylor College of Medicine for four years. At the height of their capacity, on just the old-fashioned capillary sequencers, those two places combined did about 10 million reads a month, or about five billion bases. Now, I have to update this slide all the time, but I think it's fairly accurate: the number of bases produced per month from a single machine is over 300 times the entire output of those two centers. That helps put it in perspective. And what does that mean? Rapidly declining costs. In 2005, when these instruments first came out, it was around $10 million to do a genome. You always hear about the $1,000 genome; I'm sure you've seen many articles on that. What's often forgotten is the cost of the rest of it. The cost is declining quite rapidly, but this curve is reagents only. This year we're in about the $5,000 range per genome, for reagents. What's never put into that cost are all the other things: getting the sample itself is expensive; prepping the sample; the equipment itself and its amortization; the maintenance agreements, which on some of these big instruments run around $75,000 a year; the personnel to run it; and the interpretation and informatics, which is a big one. So right now it's probably more like $30,000 or $40,000 for a genome to be fully analyzed. And what you'll learn about today is how to do this better so we can reduce the cost there. Not just today, the whole week. The whole week, yes, yes. And what's the trend? We've talked a little about the 2005-2010 era. It was really an arms race, largely between Illumina and ABI, with 454 hanging onto its niche, and it was basically: produce more reads, longer reads, higher throughput, at lower cost. This year we're seeing a broader spread of niches. There are moderate-throughput instruments, ones with faster run times, and a push for higher accuracy; the new SOLiD is trying to get even higher accuracy. Single-molecule detection is becoming quite big, and many of the new platforms I know of that aren't commercial yet are single molecule. We're seeing several new launches this year, and the heavy guns are certainly still being developed: the AB 5500XL, which we have upstairs, and the HiSeq 2000 are going up tremendously in throughput. As for applications, like I said, pretty much everything you can think of has been ported over. Obviously whole-genome sequencing is quite doable on these instruments now. We'll talk a bit about targeted genomic sequencing, but there's also structural variation, SNP and indel discovery, copy number, whole transcriptome, small RNAs like microRNAs, and epigenomics as well. All of these have been ported over to the next generation. This is more of a historic slide: the first cancer genome, sequenced in 2008. But it's useful to look at, because the numbers haven't changed that much. I need my glasses to read it, though.
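Before we dig into that first genome, one more word on the cost arithmetic above. Here's a back-of-the-envelope sketch of the "fully loaded" point; the $75,000 service contract and $5,000 reagent figures are the ones just quoted, but the amortization period and annual throughput are made-up assumptions for illustration.

```python
# Rough "fully loaded" per-genome cost, before staff and informatics.
reagents_per_genome = 5_000        # reagents-only figure quoted this year
maintenance_per_year = 75_000      # service agreement on one big instrument
instrument_cost, amortize_years = 650_000, 4   # assumed purchase and write-off
genomes_per_year = 50              # assumed throughput for one machine

overhead = (maintenance_per_year + instrument_cost / amortize_years) / genomes_per_year
print(f"~${reagents_per_genome + overhead:,.0f} per genome before staff and informatics")
# -> ~$9,750; sample acquisition, personnel, and analysis push this toward
#    the $30,000-40,000 quoted above.
```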
So when you sequence any genome, and this was whole-genome sequencing: if you sequence any person's genome and compare it to the reference genome, you'll find somewhere around two and a half to three million single nucleotide variants. This is cancer, so they were after the somatic, tumor-specific variants, and they threw away everything that was just an inherited polymorphism in the individual. They found that half of the variants were in dbSNP or in the Watson and Venter whole-genome sequences; we'll talk a bit about why that was. Remember, this is 2008. That left them with about 30,000 novel single nucleotide variants, which was typical for then, and even now. Question? Okay, so the germline is what's in your DNA from your folks: what was in the egg when you developed. The somatic variants are ones that arose during tumor formation, and we'll talk more about that in a minute. You didn't inherit those; they're changes that occurred afterwards. Where was I? So, typical at the time, and I think even today: you see a lot of whole-genome sequences out there, but no one knows what to do with most of the sequence; we don't know how to interpret it. So they looked at just the genes, the genic regions, and found 11,000 variants there. You can follow this down, how they whittled them away: they figured the synonymous ones weren't important, so they kept the non-synonymous ones, and then they did some validation. Sorry, what? Synonymous means the nucleotide changes but the amino acid sequence doesn't; non-synonymous means it changes the protein sequence. So they figured the synonymous ones were unlikely to be important. That's not always true, but when you're trying to get down to something tractable to analyze, throwing out the ones that don't change the protein sequence is a reasonable cut. You can see the false-positive rate was incredibly high here, though not bad for the time; we're trying to do much better now, and we'll talk about some of the sources of false positives. But after sequencing the entire genome, it boiled down to eight validated SNVs, and I think a couple of deletions as well, so about 10 changes in all. From those they were able to implicate FLT3 and NPM1 and find some interesting things. That was the very first cancer genome sequenced. Around the same time, on structural variants (we'll talk about how those are detected), Campbell et al. did two lung cancer cell lines using paired-end sequencing, which we'll also talk about. They found 306 germline structural variants, meaning structural variants relative to the reference that are in the person's normal DNA, and 103 somatic rearrangements, 22 of which were inter-chromosomal, so translocations, within the two tumors they sequenced. They also saw a lot of copy number variation in the tumors. At that time they got about 30 kb resolution, which isn't great, really; you could do better with microarrays. But as the amount of sequence you can generate has gone up, you can probably get down to about 5 kb resolution very easily.
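That filtering funnel from the AML genome can be written down as a simple chain of filters. A schematic sketch only; the predicate functions here are hypothetical stand-ins for real dbSNP, gene-model, and effect annotation lookups.

```python
# Schematic of the 2008-style somatic SNV funnel. tumor_snvs and germline_snvs
# are sets of variant identifiers; the remaining arguments are predicates
# (variant -> bool) standing in for annotation databases.
def somatic_funnel(tumor_snvs, germline_snvs, in_dbsnp, is_genic, changes_protein):
    somatic = tumor_snvs - germline_snvs                  # tumor-specific only
    novel = [v for v in somatic if not in_dbsnp(v)]       # ~30,000 left in 2008
    candidates = [v for v in novel
                  if is_genic(v) and changes_protein(v)]  # genic, non-synonymous
    return candidates  # still needs wet-lab validation: only ~10 changes survived
```

The point of writing it this way is that every filter embodies an assumption (for example, that synonymous changes don't matter), and each one can throw away real biology along with the noise.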
So, structural variants. It's important, I think, to understand the difference between paired ends and mate pairs. In all the libraries I've talked about, you take the genomic DNA and fragment or shear it up. For a paired-end library, you usually shear to around 200 to 500 base pairs, depending on the platform, add your adapters to make your library, and then sequence both ends of the same fragment. For a mate pair, you fragment to a larger size, maybe 1 to 20 kb if you're feeling lucky, and then you circularize it, sometimes around an adapter (this is just one way of doing it; sometimes not around an adapter), but you make a circle out of it. You then shear that and capture this part here, which has the two ends that were brought together, and you sequence it. So you're actually sequencing two pieces of DNA that are whatever your average fragment size was, say 20 kb, apart, and you're getting both in one sequencing read pair. (Yes, they will be, depending on how you sequence them.) So what are you looking for? If this is a library, and these are all the fragments in it, the fragment sizes will follow some sort of distribution. An informatics person once asked me if I could make all my fragments exactly the same size, because it would make his life easier, and I told him I wouldn't need him if I could do that, because detecting variation would then be trivial. So it's a distribution; we're getting better at making it tight, but it's still plus or minus something. What you're interested in are the extremes out here: the pairs that map much too far apart or much too close together. If this is the genome you sequenced, on the top here, and you derive some paired ends or mate pairs like this and map them to the reference genome, and they land too far apart, much further than you'd expect by chance, and you see a cluster of them (a single one you're not going to worry about), that indicates the genome you sequenced was missing something there: it's shorter than the reference, so there's a deletion. And you can go through the cases: these are the normal, concordant pairs; then insertions; deletions; inversions, where the reads are in the opposite orientation to what they should be; and translocations, where the two reads map to different chromosomes. Translocations you would think would be the easiest to find, because you're just looking for read pairs that go to two different chromosomes. It actually isn't that easy. We thought it would be, but there's a lot of background. Here's an example, and this is real data: there are three read pairs indicating there might be a translocation between here and here. But if you look all over the place, there are lots of spots with three or so such pairs, and these are just chimeric fragments created during library construction. We're getting better at that, but it's not unusual for around one percent of your fragments to be chimeric: two pieces of DNA that didn't belong together, joined during the prep. In this case we were looking for one particular event, and there were 14 read pairs that actually supported this translocation, which was a real one.
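Here's a minimal sketch of that discordant-pair logic in Python. The tuple layout, the insert-size cutoffs, and the support threshold are all assumptions for illustration; real callers also cluster the supporting pairs by genomic position rather than just counting them, for exactly the chimera-background reason described above.

```python
from collections import Counter

# Each mapped pair is (chrom1, pos1, strand1, chrom2, pos2, strand2).
def classify_pair(pair, mean=3000, sd=300):
    c1, p1, s1, c2, p2, s2 = pair
    if c1 != c2:
        return "translocation"      # ends map to different chromosomes
    if s1 == s2:
        return "inversion"          # ends are in the same orientation
    span = abs(p2 - p1)
    if span > mean + 3 * sd:
        return "deletion"           # sample is missing sequence vs the reference
    if span < mean - 3 * sd:
        return "insertion"          # sample has extra sequence vs the reference
    return "concordant"

def supported_calls(pairs, min_support=10):
    """Demand a cluster of supporting pairs: with ~1% chimeric fragments you
    see a background of two or three stray 'translocation' pairs almost
    anywhere, so single discordant pairs mean nothing."""
    counts = Counter(classify_pair(p) for p in pairs)
    return {kind: n for kind, n in counts.items()
            if kind != "concordant" and n >= min_support}
```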
But it was really hard. We knew it was real because we were seeking something we already knew about, so we could spot it. And it wasn't a clean 14 against a background of threes: there were some spots at 13, and actually some at 18, that weren't translocations. So we had to work on that. But that's the idea behind it, and there are several software packages for this; you'll be talking about structural rearrangements during the course. It's not easy. Another example, a little more recent, sticking with cancer: this one is from the WashU group, and it was a cryptic fusion. This was a patient who presented with acute promyelocytic leukemia, APL. Ninety percent of APL cases are associated with a gene fusion between PML and the retinoic acid receptor alpha, and a rapid diagnosis matters because there's a very cheap drug, all-trans retinoic acid (ATRA), that you can add to the chemotherapy, and it dramatically changes the prognosis: the five-year relapse or recurrence rate drops from 69% to 29%. So obviously, if your patient can benefit from it, you want to give it to them. This was a 39-year-old patient who came in, had a remission, and was being considered for a complete stem cell transplantation. That's one of those situations where you're trying not to kill the patient with the cure: they ablate the bone marrow and then replace it, but for a period of time you have no defenses at all, so they seal you up in a room and try to keep you from getting infected. The cytogenetics, looking down the microscope at the chromosomes, suggested a poor prognosis: they did not see the fusion. Typically they'd see an actual translocation between the two chromosomes, as in 90% of cases. But the morphology of the cells was quite consistent with an APL diagnosis, so they weren't sure what to do: treat this as APL, or as an AML with a poor prognosis? So they did whole-genome sequencing. They sequenced everything, did the analysis, and looked for aberrant paired ends and mate pairs to find translocations. What they found was a 77 kb insertion from chromosome 15 into the second intron of the retinoic acid receptor alpha gene, and this produced the classic PML-RARA fusion. It took them about seven weeks, which is relatively rapid, but for a diagnostic purpose a little slow, and it cost about $40,000; that's what they say in the paper for the analysis, and it's probably an underestimation. At WashU we were quite good at coming up with our exact costs: that covers the reagents, the machine, keeping the lights on, the people. But there was probably a lot of expert time involved that would bump it up a bit. They were able to give the patient the ATRA, and at the time of publication the patient was in remission at 15 months. This is just the region: there are the two regions in question in the normal genome, and instead the patient had this insertion, which produced the classic fusion. So they did a whole genome to get this answer and were able to give the patient the drug and get a response. But as you learn more about ways of analyzing these data during the course, think about whether there would have been other ways to get there; we can talk about that later.
I'm not going to give you the answers. All right, so transcript analysis. Well, we all know that DNA makes RNA, and the exons are spliced together to make the mature transcript, and they can be spliced together in different ways to give different isoforms. Microarrays were the typical way of looking at expression, and microarrays are great, but they're only as good as what you put on them: you have to know what probes to put down. So if, say, exon 3 here wasn't known at the time, it wouldn't be on your microarray and you'd miss it. SAGE, serial analysis of gene expression, was another technique that came out, but it relied on capturing the three-prime or five-prime ends, so really all of these isoforms would look the same. The TaqMan assay is like microarrays in that it's a very good method for precise quantification, but it's fairly low throughput by comparison; a microarray can at least look at the whole genome, which would be very difficult here, and again you have to decide up front what you're looking at and make an assay for each target. Sequencing, of course, can cover the whole transcript, and with the paired-end information you can find reads that connect two exons and get a little more information about the isoforms. So microarrays are still a good method, and this is the classic microarray experiment, but you can just divert this at the RNA stage, prep it into a library, and sequence it on the Illumina or any platform. What you get out is really a digital count: you map the sequences back to the genome, or to RefSeq if you like, the reference RNA sequences, and then you just count them: how many times did I see this? So you get a truly digital output, compared to an analog signal here. The problem with microarrays is that you can saturate them very easily: the signal rises to a certain point and then plateaus. With sequencing you have a much greater dynamic range, essentially unlimited; as long as you keep sequencing you get more data, up to a point, since the libraries themselves have biases and eventually saturate, but in a perfect library you can go as deep as you want. The methods are very easy. You extract your total RNA; you have to deal with ribosomal RNA, which is about 95% of the RNA, so one of the things we try very hard to do is deplete the ribosomal RNA so that 95% of our sequences aren't something we don't want. There are different ways of making the libraries, which I won't go into, but essentially one way is to make a cDNA library, shear it up, and then it just becomes DNA sequencing. For microRNAs, you can take that small fraction, do a size selection, and sequence it as well. So from transcriptome sequencing you can potentially get a lot more information than from a microarray. You not only get the transcript profile, but you can get differential splicing (you can get that on a microarray, depending on how you set it up); you can get differential allelic expression, where you look for variants within the sequence and see which alleles are expressed, which is much more difficult on a microarray; and RNA editing. There are some recent papers showing that editing is actually a significant effect: you have the genome sequence, you look at the RNA, and they differ at a nucleotide, and recent papers show the edited proteins are actually being made. So the protein predicted from the DNA is not necessarily what's made; yet another layer of complexity in the genome.
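Coming back to the digital-count idea: it's almost embarrassingly simple when you write it down. A toy sketch, assuming alignment has already reduced each read to the name of the gene it hit; the counts-per-million normalization shown is just one common convention.

```python
from collections import Counter

def cpm(mapped_gene_hits):
    """Expression as counts per million: a purely digital readout."""
    counts = Counter(mapped_gene_hits)          # how many times did I see this?
    total = sum(counts.values())
    return {gene: n * 1e6 / total for gene, n in counts.items()}

# Each element is one sequenced read, reduced to the gene it mapped to.
print(cpm(["TP53", "TP53", "KRAS", "GAPDH", "GAPDH", "GAPDH"]))
```

Unlike an array's fluorescence intensity, the counts never plateau on the detector side: sequence twice as deep and the counts roughly double, which is the unlimited-dynamic-range point from earlier.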
I've already talked about digital versus analog. Then there's small RNA sequencing, and obviously, with the capacity of the instruments now, you can get small RNAs from many samples into one lane, so you have to do a lot of barcoding; that's where you put on a short known DNA sequence that you can read to tell which sample each read came from. Just as an example, going back again to 2008: this was the very first microRNA experiment we ever did here. There was a paper that came out in June 2007 where they had cloned and sequenced 330,000 small RNA sequences from 250 libraries. Individually cloning and sequencing; a huge amount of work. They did about 1,300 clones from each library, across 20 different organ systems. It's a great paper. About 700 microRNAs were observed, and they had seen 100 of them in MCF-7, which is a breast cancer cell line. We were interested to use that same cell line and see what we could see. This is their data here, showing the relative abundance of the microRNAs they detected; you can see up here they had 795 reads out of MCF-7. On our very first run we got 4.6 million reads. They found 100 microRNAs; we found 213. Some are quite rare. There were 19 they found that we didn't, but if you look at the counts, 12 of those had a single count, and under different culture conditions you can get different things expressed, so those were very rare and potentially not even real. But you can see the expression levels (I had to break this into several graphs here): a very much increased dynamic range, and obviously far more microRNAs found. So that was our very first attempt. We were able to map the reads to the genome, and you can see the resolution you get out. What I've plotted here, and you'll be doing some of this, is the start point of each microRNA read. You can see there are two peaks at this one; I turned on one of the tracks in the browser, and sure enough, it was a known microRNA, which was satisfying to see. But it's interesting that there are two peaks here: a microRNA has a major form and sometimes a minor one, so these are the major and the minor. Another interesting thing: the reads don't always start at the same place, and actually, that little trough in there is where the entry in the miRNA database, miRBase, starts, and yet that doesn't even seem to be the predominant signal; we more often saw starts one base over, or at this one. What the significance of these different start points is remains to be determined. Epigenomics: many of you are familiar with that, histone modifications, methylation and so on. In this case it's histone modifications: you cross-link your DNA to your proteins, fragment it, isolate it using an antibody directed at the modification of interest, pull it down, recover the DNA, and sequence it. A very nice genome-wide analysis, and it's been shown quite well that it correlates with expression states. Again, you can just plot the reads in the browser; this is K27 versus K4 in these cell lines under two different conditions, and again, you just count: this is the pile-up here, and the y-axis is the number of times you saw each position, so you can see exactly where these marks hit.
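Both those microRNA start-point plots and these histone-mark tracks are just pile-ups of read positions. A minimal sketch, assuming aligned reads have been reduced to (chromosome, start) tuples; the locus and coordinates are made up.

```python
from collections import Counter

def start_pileup(read_starts):
    """Count how many reads begin at each genomic position (the y-axis above)."""
    return Counter(read_starts)

# Hypothetical miRNA locus: most reads start at 100, a minor set one base over,
# exactly the offset-start pattern described above.
reads = [("chr9", 100)] * 5 + [("chr9", 101)] * 2 + [("chr9", 130)]
print(start_pileup(reads).most_common())
# -> [(('chr9', 100), 5), (('chr9', 101), 2), (('chr9', 130), 1)]
```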
So, front-end technology. One of the problems is that we're generating so much sequence now that the bottleneck has shifted away from the sequencing itself. During the human genome we had to scale up prepping all those clones and sequencing them all, and that was a big endeavor. Now the sequencing has been shrunk down to a box; it's not trivial, but it's taken the pressure off the sequencing in some respects and put it on the front end: getting enough samples. We can generate so much sequence from a single run. The Illuminas upstairs, the HiSeqs, are producing around 600 gigabases per run, so that's 600 gigabases every 10 days, and you can divide that across a number of samples depending on what you're sequencing. Right now, for exomes, which means sequencing all the coding part of the genome, we're putting three samples in a single lane, and there are 16 lanes on a HiSeq, so every 10 days you're doing 48 samples. Now you have to get all those samples in at the front end, so we're working on automation for that right now. The other end, and what you're dealing with this week, is the back end: we're producing so much data, you have to analyze it. So the pressure has moved to different places. For the front end, if you're doing targeted sequencing, and I'll talk about that in a little bit, one way of course is by PCR. You can either do an entire region with overlapping PCR products, which we've done (Crystal did some of that on the retinoblastoma gene), or you can do it for a candidate gene: take all the isoforms, design primers for each of the exons, and usually we also do the so-called regulatory region, 1 kb upstream, and sequence those. There have been attempts at doing this better. One is RainDance; again, we have one upstairs if you want to see it. It's an interesting technology; in some respects it uses a step like emulsion PCR, but what they did is this: you make all the primer pairs you want, individually synthesized, and put each in little droplets. This is a very stable, proprietary emulsion; the droplets stay intact for a year. They prep all of these individually and then pool them, and that pool is the primer library you buy from them. Then on the instrument you feed in the droplets, each containing one specific primer pair, and you also feed in your genomic DNA in droplets. There's oil in the channel here, each droplet is the aqueous phase, and as they come down, right here a little electric pulse fuses them. So now your DNA is brought together with your primers, and the instrument spits it all out into a PCR tube; you cycle that, and it becomes an emulsion PCR. It's quite a clever idea. We haven't used it a lot, partly because getting the primers made is relatively expensive, so it's not something you use to just try an idea; it's for a region you know you want to work on. But we do have one upstairs. The other development was new ways of capturing regions. Back in 1991, Mike Lovett, who was involved with this effort as well, did something showing this is really not a new technique at all.
It was called direct selection, and in that case we were after cDNAs. We took cosmids, a very old cloning vector that probably no one here has heard of, representing all of chromosome 5, for example; we mapped them, sheared them up, biotinylated them, and used them to capture cDNAs. So using a labeled piece of DNA to capture another piece of DNA is not really a new idea. But now there are solid supports, from NimbleGen and Roche, and in-solution capture methods from Agilent, et cetera, where you can design the oligos. Just showing an example on a solid support: you design oligos specific to the region you want to capture, make one of those DNA libraries, apply it to the support, pull down the matching strands, elute them, and sequence them. This example is quite old now, but the picture doesn't change much. This is across a 600 kb region we were trying to capture. The real gaps you see are where the sequence is too repetitive, like here: too repetitive to design a unique probe, so we left those out. Each of these lines represents a probe; what's actually plotted is the Tm, the melting temperature, of each probe, and this is the capture coverage. You can see the capture isn't necessarily even; the coverage profile almost exactly follows the Tm of the oligos. But you can also see that even a single probe can capture its region quite effectively. It looks bimodal here because you're capturing that middle part but sequencing the captured fragments from both ends. More recently there's Agilent's SureSelect, which is in-solution (the other was on a solid support); this just shows that you can direct your probes, here to the KIT gene for example, and get coverage of all of the exons quite nicely without covering the non-exonic parts.
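A sketch of that probe-tiling logic in Python before we move on. The uniqueness test here (does the window occur exactly once in the target sequence?) is a crude stand-in for real repeat masking, and the probe length and step are arbitrary assumptions; a real design would also score Tm, which, as the slide shows, ends up driving the capture evenness.

```python
# Tile capture oligos across a target region, skipping windows we can't make
# unique probes for; the gaps that remain are the repeats left uncovered.
def design_probes(region_seq, probe_len=60, step=60):
    probes = []
    for i in range(0, len(region_seq) - probe_len + 1, step):
        window = region_seq[i:i + probe_len]
        # crude uniqueness check: keep only windows seen exactly once, no gaps
        if region_seq.count(window) == 1 and "N" not in window:
            probes.append((i, window))
    return probes
```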
All right, so you've seen there are a lot of platforms and a lot of different applications, and that brings up another area that's a bit problematic: platform complexity. We used to have sort of a one-size-fits-all situation. Most of the big centers had ABI instruments (some places used other platforms), all the same model, rooms full of them, all run the same way. Now, with all the different platforms available, you have to decide what to run. Increasingly, the ones over on this side take longer to run but produce more data, and run time versus data volume has a cost. They say sequencing is getting cheaper, and the per-base cost is, but longer runs mean different cost models; it increases the amortization burden of the equipment. If you can only run a machine a few times a year, the entire cost of that instrument for the year is borne by those few runs, whereas some of these can turn around very quickly. The PacBio, for example, which we'll talk about, can run a sample every two hours or so, and so can the PGM, which is about a five- or six-hour run. They produce less data, but much faster. So you have to decide what you want: really ask the research question, and then match it to the appropriate platform. Another problem this causes: this is very expensive equipment, so it's unlikely you're going to have all of these in your shop. We're fortunate, we do have quite a few, but that creates problems, in that each one, as you've seen, has a slightly different library prep. You can't just build a pipeline that makes one library that goes onto all your platforms. That's why a lot of the big centers have gone with a mix of these two platforms here, which take the same library, so they just crank everything through. So there are advantages and disadvantages to platform complexity. Yeah, no, microarrays still have their place. I think they're becoming displaced, but they're not displaced yet, and a lot of it has to do with what question you're asking. People always ask me what platform they should use, what sequencer they should get, and it totally depends on the experiment. You really have to do the experimental design and then think about the appropriate platform. If you're looking at expression and you know the set of genes you want, and they're well characterized, you can crank through a lot of samples on a microarray very readily. If you want to learn about differential splicing and things like that, you'll want the sequencing approach. So it depends what you want to do. Along with all the platforms comes data complexity: not only mixing and matching microarray data with sequencing data, but even within the sequencing platforms there's a huge amount of data being produced, and it's stressing compute and storage. Then there are the complex analyses; even just from RNA-seq there are all these different things you can look for, fusion transcripts, et cetera. And of course validation; we'll talk a bit about validation, or really I should say verification: you've seen something and you want to make sure it's real. You're finding so much now that there are thousands of assays that need to be done; gone are the days of looking at one gene. This is what we have upstairs; you can walk up and have a look. We don't have this one yet; it should be coming next week, that's the MiSeq. We still have three of this kind, but they're about to be converted to these; that's the HiSeq, so higher throughput. These are just being installed right now, so we haven't done much with them, and this is the PacBio, which I'll talk about in a minute. Our compute resources are listed here. It seems like a lot, but it's nowhere near what you need, trust me; even with two and a half petabytes of storage, we're always running out of space. All right, so on to cancer genomes, however you want to do your analysis. Cancer is a disease of the genome, as we all know. You start with a normal genome, a normal germline, essentially mutation-free, and what you see in cancer is random DNA changes, somatic mutations, accumulating. This may be in conjunction with some sort of cellular defect, in DNA repair for example, so you get more and more of them, or it's just wear and tear on the genome as you age. Several different things can then happen. You can get a mutation that is actually very deleterious to the cell, swings it into apoptosis, and the cell dies. You can have a change we call a passenger mutation, which doesn't confer any change to the cell whatsoever; it might even be in a gene that that differentiated cell doesn't use.
Or you can have a driver mutation. Driver mutations are the ones that give the cell some sort of selective advantage, so it grows abnormally in its environment. As it grows, it can accumulate more passengers, acquire new drivers, start growing more aggressively, and eventually the genome can carry a lot of changes: lots of copy number variation (here's a very gross example of copy number, with extra chromosomes and extra pieces of chromosomes), or more of the point mutations and small indels. The point is that a tumor ends up as a very heterogeneous population of cells with all sorts of different mutations. So a typical cancer genome project goes like this: get some tumor, and get some blood for normal DNA, or some adjacent tissue if you like. Sequence them, or use some of the other techniques we just talked about, and try to identify the genomic variants associated with the tumor: the somatic ones, present in the tumor but not in the germline DNA. Then, across a larger population, you analyze the same things, looking for what's statistically significant, and do the pathway analysis you'll be doing later in the week, to find pathways that are hit more frequently in this population; those are the ones you go after. An example of that is the International Cancer Genome Consortium, which started here in 2007. A lot of things happened in 2007-2008. We had a meeting held right here, one floor below us: 22 countries, 120 participants, a number of genome centers represented, and a lot of funders, which is very important as well. The idea was to explore the possibility of a coordinated, global effort to study cancer, which is obviously a very large problem. The rationale: it's huge, bigger than the Human Genome Project, so it's best spread across countries. There's a lot of duplication of effort in the world, and if we could minimize some of that, we'd move things along. The part that excited me most was standardization and uniform quality measures: that makes merging data sets much easier and gives us increasing power. There are many studies where you download the data and it's in a different format, or you don't know what fraction of the calls are real, what the validation or verification rate is. There are a lot of different cancers across the world, some quite regional, and we wanted to accelerate cooperation and the sharing of methodologies as well. So the basis was to do 50 tumor types or subtypes, with 500 tumors and 500 controls per subtype, and the basic buy-in was that you would create a catalog of the genomic variants. This is really like doing 50,000 human genome projects, so it's a lot of work we were proposing. There's a paper that came out in 2010 announcing it and laying out what it was about, but that was the basic premise. The numbers come from a calculation that if you want to detect events present in about 3% of samples, at that level of heterogeneity, you should be able to detect them with 95% confidence.
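The simplest version of that calculation is just binomial: with what probability do you see an event that's present in 3% of tumors at least some number of times in your cohort? The real ICGC design work is more involved (it has to account for background mutation rates and recurrence requirements), so treat this as an illustration of where cohort sizes in the hundreds come from.

```python
# Chance of observing an event at least k times in n tumors if it occurs at
# frequency f. The k = 1 case reduces to 1 - (1 - f)^n.
from math import comb

def p_at_least(n, f, k=1):
    return 1 - sum(comb(n, i) * f**i * (1 - f)**(n - i) for i in range(k))

print(f"{p_at_least(100, 0.03):.2f}")      # ~0.95: 100 samples for one sighting
print(f"{p_at_least(500, 0.03, k=5):.3f}") # ~1.0: 500 gives recurrence to spare
```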
This slide is slightly outdated now, but pretty close: this is the number of projects and the countries involved in the ICGC now. This is us here: we have a prostate and a pancreatic project, and there's also another one, medulloblastoma I think, that started in BC. And the TCGA, The Cancer Genome Atlas effort in the U.S., is part of the ICGC now. I do want to point out, and I'll talk about this more, that we're doing pancreatic, and Australia also set up a pancreatic project, so we're collaborating very closely with them through the ICGC. There's a website; you know you're real when you have a website. Francis had a lot of input into developing it. I updated my slide just for you, Francis, since there's a new version. On here you can go to each of these projects and see what it is, who's doing it, et cetera. I'm sure this will be covered by other people, so I'll just touch on it briefly, but there's a data portal, which is actually becoming quite good, and the idea is that the data are coordinated around the world. You come to this one portal, which happens to be housed here, and you'll think you're searching locally, but you're actually searching data from across the world, and because of those uniform standards, the data mesh together quite well. All right, on to pancreatic cancer. Should we take a two-minute break, or do you want me to keep going? People are leaving; if you have to go, throw up a hand. The problem is, when people leave, it takes five minutes to get them back; it's like herding cats. But I know people have drunk a lot of coffee and may want the bathroom, so take a two-minute break if you need it, or just walk in and out; I don't mind. Okay, we'll just keep going then. All right, pancreatic cancer. I'm going to talk about our pancreatic cancer project for a little bit here. It has a five-year survival rate of about two percent, which is the worst for any cancer. It's only about two percent of new cancer cases, but because the survival rate is so poor, it accounts for about six percent of cancer deaths; it's the fourth leading cause of cancer death in both males and females, or around fifth overall. It's very difficult to detect, it's highly metastatic, and it doesn't respond well to treatment, and you can see here that the new cases and the deaths are pretty much equal. So why is that?
So one is screening: there's no early detection for pancreatic cancer. Most patients are diagnosed with advanced disease; they come in, they just don't feel well, and it takes a while to figure out. But for 60 percent of patients that come through the door, it's already metastatic, and they'll live three to six months. Locally advanced, so it's in the pancreas and has started to invade the other tissue around it, that's 25 percent, with eight to 12 months, and 15 percent have resectable disease, so you can actually surgically remove the pancreas, and they have a mean survival of 15 to 20 months. So overall, two percent at five years. And it's not uncommon: I know of a case not long ago where the person had a seizure, and that was the first indication they had pancreatic cancer; when they came in, they had metastatic lesions in their brain already, and they were dead within three weeks. So it's a very, very aggressive disease. So I talked about the front end and the problems of that. Our goal as an ICGC project was to sequence 500 tumors and 500 controls. We've reduced that to 350, because Australia is also doing 350, so between us we'll do 700. But you have to come up with all those samples, and there are lots of issues with that. One is that in pancreatic cancer, as I said, only 15 percent are resected, so that cuts down the numbers: it's not the most common cancer, plus we're only getting tumors out of 15 percent of cases at this point. The other issue, and we'll talk more about this, is data privacy. We're generating so much data on an individual, and we want to make it available to the world, so you have to take privacy concerns into consideration. It requires a very specific informed consent that the patient has to sign, and because of that we had to start collecting in real time in 2008; we couldn't use any of the banked samples that people had, and most of those are FFPE, or formalin-fixed paraffin-embedded samples, which isn't really what we want; we'd like fresh frozen if we can get it. So we had to reach out to other places: we have some collaborators in Boston, we have local collaborators, and also collaborators in Rochester, Minnesota, to help us collect samples. These three sites are also creating xenografts, where you're implanting the tumor into a mouse, and I'll talk more about that; we were very interested in collecting those as well, so these three centers are helping with that. But this is part of the upfront infrastructure process, just getting the samples. Here's an outline of the project: we get the samples, biobank them, we've got germline DNA, and we want to do all the things we've been talking about, which you'll be analyzing in the next few days. We're largely a sequence-based platform, although we still do array-based copy number; well, we do a genotyping array, so we do a million SNPs, and from that you can get copy number. We also use that to guide us in the sequencing, to make sure that when we finish sequencing we're able to detect the majority of those events, and then there's a lot of validation to go on. So what are the issues with primary tumor samples? Well, very seldom do you get a chunk of tumor that's 100% tumor; it'll have stromal contamination or normal tissue invading it. In pancreatic, 20 to 80% of it is tumor, and it tends to be more at this end than that end, unfortunately. We talked about heterogeneity: just in the formation of the tumor you have all those passenger mutations, and you have not really one population of cells but many.
So you have to do a fairly deep analysis, and that's why sequencing is quite good for these analyses: you can go quite deep into the sequence and find rare events. So that is an issue related to sensitivity. This is a pancreatic tumor; if you were to take a section of it, you can see that there's quite a bit of stroma, or "normal" tissue, in quotes, because normal quite often has passenger mutations in it. But if you take the bulk material, because it's a mix, and you're looking at a mutation: in any region where the copy number has changed to three, your signal goes down to 33%. And even in something that's essentially a diploid genome, if only 20% of it is tumor, then clearly the signal you're looking for is only at 10%, so only 10% of the sequence reads will actually contain that variant. You can see that in a microarray-based assay, your normal is going to swamp out the signal you're looking for. So this is the sensitivity issue. The other is specificity. Roughly, and it depends on the tumor type, there is about one somatic mutation per megabase in the genome. The objective of the ICGC is to have a 95% verification rate: of the ones we put in the database, if you were to take 100 of them, we'd like 95 of them to turn out to be real, or better. To achieve that, we really need a very low error rate per megabase. It's difficult to achieve, but fortunately most regions of the genome behave themselves, and most of the problems we see occur in specific regions; we're starting to recognize this. There are two main sources of that error. One is sequence error itself, just from the platforms. It's not entirely random, but you hope that you get enough correct reads to swamp that out; occasionally, though, the errors can accumulate enough that you might think a call is actually real, especially if it's in the tumor but not the germline. But another problem that we've seen quite a bit is that you get a correct base call in the tumor but you miss the call in the germline, so you're considering it a somatic event when it's actually just a germline event, and this is frequently due to insufficient coverage. The rule of thumb has been to sequence the normal genome to 35-fold coverage and the tumor to about 50-fold coverage, and I think those numbers are a little low now; you probably need to go to 50x even for the normal. It's been a major problem because of the high rate of SNPs: there are so many germline SNPs, and many of them, as we said before, are rare and private to that individual, so you don't even see them in the databases of SNPs. So you think you've found a novel somatic mutation, but what you actually missed is a novel, rare germline variant. If you're interested in cancer you have to deal with those; we're getting better at that. So there are ways of dealing with the tumors. One is enrichment through coring: this is a pancreatic sample here in OCT medium, just frozen. You take a slice and stain it, and then with that stained slice you can align it and, with basically a biopsy punch, punch out where you think the most tumor is. That works reasonably well, but it's fairly labor-intensive, as you can imagine. The other thing that we did in this project, like I said, is generating xenografts: a piece of the primary tumor, before it's frozen or anything, so a fresh piece, is implanted into mice.
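To make that dilution arithmetic concrete, here is a small Python sketch; `expected_vaf` and its parameter names are my own illustration, but the two printed cases reproduce the 10% and 33% figures just described.

```python
# Sketch: how tumor cellularity and copy number dilute the variant
# allele fraction (VAF) you expect to see in the reads.

def expected_vaf(purity, tumor_cn=2, mut_copies=1, normal_cn=2):
    """Expected fraction of reads carrying a somatic variant, given the
    fraction of tumor cells (`purity`), the tumor's local copy number,
    and how many of those copies carry the mutation."""
    mut_alleles = purity * mut_copies
    total_alleles = purity * tumor_cn + (1 - purity) * normal_cn
    return mut_alleles / total_alleles

print(expected_vaf(0.20))              # 20% tumor, diploid  -> 0.10
print(expected_vaf(1.00, tumor_cn=3))  # pure tumor, CN of 3 -> ~0.33
```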
Not every time, but frequently, especially with pancreatic cancer, the tumor grows up in the mouse and can be propagated in mice. You can also take that and make a cell line, which turns out to be rather difficult, but we are getting a few cell lines. So why use xenografts? One reason, as I said, is the low tumor cellularity, the low tumor content, in some of the samples; hopefully the tumor grows up in the mouse and that enriches for it. Like I said, in pancreatic we're lucky if we have 50% or so, which is probably an average for tumor content. You can also get more material: you can propagate the mice and do more analyses. But also, these xenograft mice are good models for drug development, so we have an OICR medicinal chemistry group and a selective therapies group who are very interested in these xenografts and the cell lines as models for testing various drugs, and it's very powerful when we have the complete characterization of the genome of those same models, because then we know what drugs and pathways to hit. Just an example here, again a pancreatic primary tumor, the same one I showed you before: you can see that there's a fairly low tumor content, lots of stroma, and the xenograft itself then has much more tumor content, though it still has a fair amount of stroma, as you'll see. So the first five we sequenced, we went quite deep on: as you can see, 30x coverage of the genomes, around 90 gigabases. And you can see that the fraction that aligned to the human genome reference was variable and quite low. Normally we'd get around 85 to 90% of the sequence aligning to the genome; here we got a lot less. Well, the obvious explanation is that if you measure it, by qPCR for example, or estimate it from the sequence, you can come up with the amount of human DNA and the amount of mouse DNA in the sample, and you can see that the fraction that aligned reflects the human content: this one's quite low, and so is this one, in terms of the amount of human material that's in there. So you can ask, have we really traded one heterogeneity problem for another? We had human stromal contamination before; now we have mouse stromal contamination. It's a particular problem with the pancreatic samples: they just like to grow interdigitated. So what other problems can that cause? Here's an example of a somatic variant seen in one of our xenografts. I'm blowing it up here; there's a lot of depth of coverage, lots of reads. You can see that there's a T in the reference sequence, and then occasionally we see a G. It's quite clean: these here are the errors, but it's quite clean sequence through here, and you'd want to call this a T-to-G variant in the tumor. But if you take the 100 base pairs that surround it and BLAT it against both human and mouse, so align it to human and to mouse, the only difference in those 100 bases is that T to G. Those reads that were looking like a SNP were actually mouse reads contaminating the sample, and we call these interspecies SNPs; obviously you want to be able to weed those out. Another way of doing enrichment is antibody enrichment: you can take the xenograft and dissociate it down to single cells, then use antibodies that are specific either to the mouse, which will pull out the mouse cells, or antibodies positive for markers on your tumor, to pull out the tumor part.
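The read-level logic behind weeding out those interspecies SNPs can be sketched like this; real pipelines use a proper aligner against both references, so this toy mismatch count, with made-up sequences, is only meant to show the decision rule.

```python
# Toy sketch: compare each read against the human and mouse reference
# sequences at a locus, and drop reads that match mouse at least as well.

def mismatches(read, ref):
    """Count mismatched positions between two equal-length sequences."""
    return sum(1 for a, b in zip(read, ref) if a != b)

def is_probably_mouse(read, human_ref, mouse_ref):
    return mismatches(read, mouse_ref) < mismatches(read, human_ref)

human_ref = "ACGTTAGC"
mouse_ref = "ACGGTAGC"   # differs from human at position 3
reads = ["ACGTTAGC", "ACGGTAGC", "ACGGTAGC"]

kept = [r for r in reads if not is_probably_mouse(r, human_ref, mouse_ref)]
print(kept)  # only the read matching the human reference survives
```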
That works reasonably well. You can see here, this is actually a fairly good set of xenografts: the light blue is the non-enriched starting point, and this one started out at almost 60% human, and some of these others were actually pretty high to start. But you can see, post-enrichment, the amount of human DNA that's there; this one actually got to almost 100%. So that's going to help you. Of course, we had to get rid of those mouse variants, so we had to stop and sequence mice. We didn't want to sequence more mice, but we had to. So we sequenced the two mice that the xenografts were made from and aligned the reads to the mouse genome, to get the number of variants against the mouse reference; but the real question for us is how much aligns to the human genome. These two samples were sequenced, and roughly 0.6 to 1% of the mouse reads aligned to the human genome. Most of them do not, but a disproportionate number of them aligned to the exome, the coding part of the genome, which is only about one to one-and-a-half percent of it, and you can see a significant percentage of those align there. If you were to align the mouse reads to the human genome, just run them through your pipeline and call SNPs, this is the number you'd find: quite a few SNPs being called, and again a disproportionate amount in the exome, which is the part we're most interested in. And this is a Circos plot, which you'll be hearing more about: these are the chromosomes around the edge here, these red lines represent all of the SNPs that we detect if you just run it through the pipeline, and this inner circle is what's left after you remove the ones that we know to be mouse. So these are things you have to deal with, especially when doing xenografts, but they are useful. So these are some of the pancreatic samples we have, and these are the mutations that we found. These are frequently KRAS mutations, almost a hundred percent of them, but you can see that here's the primary tumor: we sequenced it and did not see the KRAS mutation. If you went in and manually looked really closely, you'd see that we only had 38 reads covering that position in this particular sample, and only two of them showed the KRAS mutation, and they were sort of low quality; but in the xenografts we were readily able to detect it. Another one here was three out of 227 reads that detected it, so that was a very poor cellularity sample, and one out of 29 in this other one; you can see we did get a nice enrichment here, and in the cell lines we got an enrichment as well. So this is the mutational landscape. We're still validating, this is very new data, but we had done 26 samples when this analysis was done, and this is KRAS, which is almost all of them; you actually get suspicious when you don't see it, and you go in and look really hard. But then these other genes here: there were about 300 genes that were mutated in four or more of the samples, so not necessarily the same mutation, but the gene itself being mutated, and that seems to be significant. But there were 2,200 more that were mutated in one to three of the samples, and this is pretty typical for all of the tumor projects that are going on. The recently published one was the ovarian project, where instead of KRAS it was p53, but again with a very long tail like this.
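The tallying behind that "four or more samples" cut can be sketched in a few lines; the gene names and sample sets here are made up for illustration.

```python
# Sketch: count how many samples carry a mutation in each gene, then
# split genes into recurrent (four or more samples) versus the long tail.
from collections import Counter

# per-sample sets of mutated genes (hypothetical data)
samples = [
    {"KRAS", "TP53", "SMAD4"},
    {"KRAS", "TP53"},
    {"KRAS", "GENE_X"},
    {"KRAS", "SMAD4", "GENE_Y"},
]

counts = Counter(gene for sample in samples for gene in sample)
recurrent = {g: n for g, n in counts.items() if n >= 4}
tail = {g: n for g, n in counts.items() if n < 4}
print(recurrent)  # {'KRAS': 4}
print(tail)       # everything seen in only one to three samples
```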
The question now is trying to figure out the significance of this long tail: how many of these are important, and whether there are different subtypes. I'm sure there are subtypes of pancreatic cancer we just don't appreciate yet, and we'll probably be able to start subtyping those. We'll do a pathway analysis and see how many of these might fold into a pathway, and you'll be doing that over this course, folding these into pathways to make a little more sense of it. But this is just the long tail that we see in all these projects right now. Now, verification: it's very important that we verify these calls. This is just one quick snapshot. We did this on the PacBio, and I'll talk more about the PacBio in a minute, but you'll also see what it looks like if you do it just by Sanger, because I'll show some examples. These are the primary tumors we're doing this validation on: we just selected a few randomly, made primers specific to that location, and sequenced them, to see if we can identify in the primary tumor the variants we saw in the xenograft first. You see it's 67 percent here, 83 percent there, which was pretty good; but on the PacBio, where we can sequence deeper and are more likely to see it, you can see that the numbers go up. And just to show you some examples, these are really hard to see. This is, let me get this right, this is the primary tumor here, and this is the reference, and you can see there's a little tiny blip under there; that's actually the tumor portion in the primary. It's a fairly clean trace, but there's another little blip here which is just noise, so it's really hard to distinguish the real one from the background. Here's another one; let me get this right: this is a cell line derived from the xenograft, and you can see the variant very clearly, you can see both bases here. This is the primary tumor, and again you can hardly see it in the primary. So the real question that we're trying to answer, and other people as well, is how well do the xenografts represent the primary? They represent it fairly well, but they're not exact copies of the primary tumor. This came out in a Nature paper, again from Wash U, where they sequenced, this one was breast cancer I believe, a metastatic tumor, the primary tumor, and a xenograft derived from it. So we have the primary tumor here, the metastatic tumor here, and the xenograft, and one difference you can see right away is that this translocation here is absent in the xenograft; for whatever reason, that cell was not selected when growing up in the xenograft itself, but was in the metastatic tumor. What they found, and this is a data point of one, is that the copy number changes, the new mutations, and overall the xenograft seemed to look a little more like the metastatic lesion, which may not be that surprising, since the xenograft is sort of a model of a metastatic lesion: you've taken a piece of the tumor and you're growing it up in a new site. All right, so clinical applications. Excuse me, may I ask a question about the international consortium? What's the standard for the sequencing in these projects; is there a standard technology? No, it's different; whatever technology people want to use, the platform is up to the group. The standard that we're trying to achieve is that 95 percent of the things we put in the database will actually validate and be real.
And what data are collected from all the cases and controls, what types of variation? You're going to be hearing about that in more depth, the DCC, what's available in it and how to search it, so we can defer that; but basically the premise of the ICGC is a catalog of sequence variants: single base substitutions, small insertions and deletions, and the SNPs themselves, that's the minimal requirement. Most groups are including structural variants and copy number variants as well, and some groups are doing transcriptomes. And for all the data we're putting in, and I don't know if Francis will talk about this, there are two levels of access: the somatic variants are available freely to anyone who wants them, but if you want the germline data, because of privacy concerns, you have to actually apply through the DCC and get a password. Okay, so I'll talk about clinical, or personalized, medicine for a little bit. There's somewhat of a perfect storm that has come together: somatic gene mutations and copy number variations are really the best predictive biomarkers for cancer treatment, there are rapid advances in next-gen sequencing and other technologies, and there are huge numbers of drugs out there available to target these various lesions. These things are coming together to the point where I think personalized medicine has really become a reality. When I first came here in 2007, I started talking to the clinicians about doing this; this is the main reason I came here. I felt the technology was ready; maybe it wasn't quite, in 2007, but it was clearly on the horizon. Their first question was: is it really ready for clinical application, and can you use FFPE? Formalin-fixed paraffin-embedded material is the standard for diagnosis in pathology labs: the sample is taken and dropped into formalin, which fixes the tissue but damages the DNA in many ways, which is bad for us. They then embed it in a waxy substance to get nice sections; those images you saw are little thin slices off of those blocks, which are then stained. So it works well for them, and we can work with it, but we'd rather have fresh material. I don't think we're going to change clinical practice, though, so we have to learn to work with FFPE. The other thing is turnaround: if they're going to use this information, they need it fast, within a few weeks. I won't go through all these questions, but there's the bioinformatics, how we're going to handle that, and I'll talk about incidental findings in a minute; if you want more detail on any of these things, just come and ask me. So the clinical challenge for me with these high-throughput machines was to adapt them to an individual diagnostic. Part of it is that too much data is generated on the current platforms we were using, although some new platforms coming out produce less. On the Illumina, there's so much data being produced that you either have to run a single sample on one lane, which is costly, or, if you want to pool samples, you have to wait to get enough samples to run, and that doesn't fit into the time course; the new instruments that are coming along help with that. And the other important thing is that you have to make sure that the data coming out of the platforms actually has some sort of validation, that you know the error rates and so on, before they'll accept it.
And there are different sorts of mutations that we can find. There are the actionable mutations: gene changes that have potential impact on treatment recommendations, so there may be a drug that would actually target that gene or pathway, or they have prognostic value. The druggable ones, of course, are where you actually have something you can hit the lesion with, and then there are disease-associated ones that simply correlate, and we'll talk more about that. So, some of the actionable mutations in cancer: this is just a short list of genes here that are common ones, common mutations that recur in many different cancers, and the most common cancers you see them in, for example BRAF mutations in skin cancer, in melanoma. But what's becoming more widely realized is that these mutations are not specific to any one cancer: BRAF mutations are very prevalent in melanoma, but you also see the same BRAF mutations in non-small cell lung cancer, colorectal cancer, ovarian, breast. So the question is, can these people benefit from a BRAF inhibitor? As I said, the workflow is changing. This is where we've been since 2007, really: we've got these massive machines that can do whole genome or transcriptome or what have you, and it's typically a couple of months' worth of analyses, which is far too long. But on this side here, we're now getting down to where we can do the sequencing in certainly less than a week, and I'll talk about this whole pipeline in a minute, but this is really becoming a clinical reality on this side. So one of the first questions was: these are real biopsies here, can these biopsies be utilized? This is a biopsy on a slide; the pathologist has circled where the tumor is, and then this is a macrodissection, so it's just completely scraped off and DNA is extracted, and in most cases we can get enough DNA to do our analysis. So I'm going to talk about the PacBio a little bit now; I think I have one slide left on it. This is what's called a SMRT Cell; that's what the sequencing is done on with the PacBio. I talked a little bit about how it works on that chip that's going around; that's the single unit for sequencing. I forget how many, I think there are 300,000 of those little tiny wells on there, and if you look at the bottom side of it you see a little pattern; that's the etching in it. We've gone through how it works, but each one of those goes into this machine. You can load up to 96 at a time, at least eventually; the software won't support that right now. It processes them through one at a time and does the sequencing. It has a fairly simple sample preparation, so that's nice. Right now it's producing reads of roughly around 2 kb, probably going up in the short term, at about 85% accuracy, so 15% of the bases are wrong. That sounds bad, but in a long read that's okay; for what we're trying to do, of course, we want better than that, and I'll show you how we get it. Runtime is quite short: at full tilt it can process a sample in, well, that's actually gone up a little bit, about an hour, and that's why I'm going to talk about circular consensus sequencing. So, just to give you an idea of what it's like getting a new instrument:
We got one of the first ones; we actually had machine number two that was shipped out from PacBio, a little over a year ago now. This just shows you the throughput since December of last year: this is the number of mapped reads and the number of bases that are mapped, and you can see that when we started we were getting around 20 megabases of data. Then they came out with a new enzyme, all these companies are always coming up with improvements, then we had the instrument upgrade, and you can see things actually went down with that; the upgrade didn't go quite as anticipated. We worked our way through that, and then we did a lot of work titrating a lot of the processes, and you can see that it's been going up steadily. This just ends in June, and it's sort of plateauing out, but we've had runs up in the 140 megabase range now. Read lengths have been pretty consistent: with the upgrade we did get increased read length, but the problem was we got less yield; here we were able to keep the read length the same and increase the yield. These are run metrics from about 10 runs, about a month ago. Just to zoom in on this little part: the number of post-filter reads, these are the good reads, is about 76,000, we get a few more than that now, and 63,000 of those actually map to our targets. The read length is a little under 2 kb, and the number of mapped bases here is 110 megabases of data; we typically get that to 120, and we've seen some at 140, so we're getting quite a bit of data out of these runs, and the mean mapped accuracy is staying the same, about 85, 86 percent. So, circular consensus sequencing. 85 percent accuracy doesn't sound very good; in a long read that's not a big problem, and again this is one of those situations where we have to reinvent our tools a little bit to work with these types of data, but that's coming. What we're doing with it right now is a lot of circular consensus work on PCR products: we're amplifying genes of specific interest to us, amplifying the exons, and then prepping them into a library. We put on these hairpin adapters, and that makes a closed, circular, single-stranded DNA circle; you put a primer on that, and a polymerase goes into that well and gets read. What it does is start incorporating the nucleotides, making a strand of DNA, and it goes around this circle, just keeps going around and around and around, and makes one long transcript that comes out. This is the sequence that's coming out, and as you read it you can see it goes around: through the adapter, through the reverse strand, through the adapter, through the forward strand. So you get multiple reads of that same template, and then we can use the informatics to clip out these subreads, put them together, and come up with a consensus. Because the errors that occur are largely random, by reading that same template four or five times you can get a consensus sequence that's quite accurate.
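A toy version of that consensus step, in Python, assuming the subreads have already been clipped out, oriented, and aligned to the same length (which the real CCS software handles for you): take the majority base at each position, and the random errors get voted out.

```python
# Sketch: majority-vote consensus over aligned subreads of one template.
from collections import Counter

def consensus(subreads):
    """Return the majority base at each aligned position."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*subreads))

subreads = [
    "ACGTACGT",
    "ACCTACGT",   # sequencing error at position 2
    "ACGTACGA",   # sequencing error at the last position
    "ACGTACGT",
]
print(consensus(subreads))  # ACGTACGT
```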
Sorry? Yeah, so most of our inserts are around 300 base pairs, and we're trying to figure out where the cutoff should be. We use three passes frequently, but I seem to like five a little better, if it's gone around five times. And as the read lengths increase, obviously we can increase our amplicon size and also go around more times. The record I saw was one that went around 32 times, which is a waste of time, but in a typical run, again from about a month ago, 64 percent of the templates had three or more subreads, and you can see that some of them out here have gone around many, many times. And as that consensus number goes up, you can tell just from the mappability that the accuracy is going up: an increasing proportion of the reads are able to map to the genome. To put it on a visual basis, this is what the data looked like with a consensus of one or more subreads, so this is all the data, including everything that maybe just went through once, and you can see that there are some gaps in here: these are deletions, these are errors, and these little purple things are insertions. If you make it a little more stringent, say things that you saw three or more times, you can see it starts cleaning up, and at five or more it's even a little cleaner. The major error that remains is these insertions, but overall you can see it's random; here are a couple that have been reproduced, but not at the level of the true variant here. So we had to validate this against a known platform. In the diagnostic lab, well, we're not a CLIA lab, which is a certification you have to have if you're going to do diagnostics; we're a research lab, so we wanted to compare ourselves to the CLIA lab. They use a process called Sequenom, a genotyping platform that looks at specific variants in these genes and uses mass spec as the readout. I'll skip the details, but it's quite accurate, and they've validated it and know that it works well. So we compared on 30 samples that we got from them, and these were DNA extracted from those slides, as I showed you, and we compared to see what we would see. You can see that where they saw a mutation, we saw a mutation. We missed one, this one here; that was actually a problem with this amplicon, which we have since fixed, so we just missed it because of that. But we were able to detect the rest, so we were pretty happy with that. You'll notice that the frequencies are a little different, sometimes radically different. This is the variant read percentage, the percentage of reads that showed the variant: here we saw it in 8% of the reads, while by Sequenom it was 24%; here's one that's fairly close, this one's a little lower, this one's about the same. We're not sure why that is, whether the Sequenom is off a little bit or whether we're off a little bit; it's probably partly read depth as well, since this is from a little while ago, when we were generating less data. So we've actually launched a program, a feasibility study, to see if we can do diagnostic sequencing. Eligible patients come in; these are patients with advanced, recurrent, or metastatic disease, so patients who have had cancer and been treated by the standard of care, and you're not going to change standard of care very readily, but they've now come back: either their cancer didn't respond to the standard of care, or they now have metastatic disease. And there are a lot of clinical trials going on across the street at Princess Margaret Hospital.
So they are potential candidates for these clinical trials. They have to be able to have a biopsy done, so they have to be in good enough health, or have an accessible tumor, for a biopsy, and of course they have to give informed consent. Our first patient came in on March 21st. The idea is to get a fresh biopsy; we do the sequencing, the CAP/CLIA diagnostic lab does the Sequenom, and then we compare the results. This is the flow, and our goal was to do this whole thing in three weeks. The patient is consented, then they have a biopsy; that can take up to about five days to get booked. About half of them need a radiologist to come in for an image-guided biopsy, and half are just done at the bedside. They also collect blood. Those go to the pathology lab for a quick diagnosis and to mark up the tumor parts, then to the CLIA lab for the DNA to be extracted, and then some of that DNA is sent to us; we sequence it, do our informatics on it, and compare results. If we find something that's not one of those common mutations, then the CLIA lab will validate it by Sanger in order for us to include it in the report; otherwise it has to be communicated as just a research result. We generate a report that goes back to the clinician. Just some examples. This was patient one, a 50-year-old female with mucoepidermoid carcinoma of the lung. I won't go through all this, but an image was done, and the issue here was that we didn't get enough DNA for the Sequenom analysis; it needed around, I forget the amount, something like 100 nanograms, and there wasn't enough DNA for that. We actually did a whole genome amplification and sequenced from that, and didn't find any mutations anyway, but that was the first one. Another one here, colorectal cancer: this was a person who had gone through their original treatment. Here's the mutation report: there was a PIK3CA mutation, which we saw on both of our platforms, and no KRAS mutation; originally it had a KRAS mutation, and neither of us saw it this time, so presumably it was at a low level the first time. This one was considered actionable because of the PIK3CA mutation, and it was done in the amount of time we wanted. And just the last one here: a 50-year-old male, colorectal cancer, originally KRAS wild type; we saw a KRAS mutation, and there was also a germline MET SNP, a non-synonymous change in the MET gene, which we saw because we're sequencing the entire gene while they're just looking at specific positions. The significance of that we have no idea: it's in the germline DNA, so they were born with this variant, and we had no idea what it would mean, but the case was actionable because of the KRAS mutation. Just a summary of the first eight or nine: you can see that in about half of them we actually end up with some kind of actionable mutation, and we meet our benchmarks. We've actually opened up patient 25 now, but on the first nine the numbers are more or less holding the same, and in most cases we were there on timing, getting it under that three weeks; we're doing pretty well on that. So, the future challenges. This gets back to that front end as well: as we increase the number of genes, we're looking at 19 genes at the moment, but we have a list of 200 that we want to look at, and we're expanding up to that. Doing PCR on 19 genes isn't that bad, but as we expand to 200 genes we'll start looking more at those capture technologies, and we're doing that right now. And it's unlikely that one technology is going to do it completely.
Here we use capture technologies for the research, for the exome sequencing, but we have some parts that don't capture well, and we just accept that. In the clinical setting you can't: it has to be 100%. So you're probably going to do a capture technology and back it up with some sort of PCR to fill it in. The other challenge is all these emerging platforms coming in. We get a MiSeq next week or the week after, and we'll have to validate it to make sure it's usable in this pipeline; again, it has the nice advantage of rapid turnaround. A big one here is the amount of input DNA material. We've managed to reduce that over the years: the first protocols for making libraries on the Illumina, for example, took over 10 micrograms of DNA, and now we can do 100 nanograms, so we're getting better at reducing that. Then the false negative rate, I talked about that: we want to be sure we get 100% coverage. And the incidental findings; I'll talk a bit more about that on another slide or two, I think. So part of the whole process is that when we get the sample in, we do our analysis. The patients have cancer, and we want to deal with that first: the somatic mutations, or germline mutations in genes that we know are important, like BRCA1 or 2. These we validate as part of the report to the physician, and we meet every week as an expert panel, there are 12 members on that panel, about half clinicians and half genomicists, et cetera, to discuss the findings and come up with a report to give back. We do this within three weeks, and this comes back to deal with the person's cancer. But as you're sequencing, and especially as we increase the number of genes, you're going to find mutations in the germline which may be important to the patient's overall health or their family's health, but really are irrelevant to the situation: they've got cancer right now. The question is what to do with those. They could be quite important, but again we have to go through this process of evaluating what we think we're going to pass on, and they need to be validated in the CLIA lab and then passed on to the genetic counselors, et cetera, to decide what to do. This is an interesting problem, and I think this slide gives you an idea of it. As we expand up: if you sequence the entire exome of any individual, you'll find around 75 to 200 genes with deleterious mutations in them, and if you do the whole genome it's probably more like 600 genes, since you get better coverage and find more; this is extracted from the literature, so you might find somewhere between 200 to 500 genes with deleterious mutations. So we're all carrying a very large load of variants, we'll call them mutations, variants in genes, some of which could impact our health. For most of them we don't have any idea what they do, but some we do recognize, and the question is how to report those back. All right, so on to the last topic, pharmacogenetics, for a little bit. Many of us, especially as we get older, start taking more and more drugs. You take a drug, and there are really three outcomes. One is the desired response, which we hope we all get. Another is that there's really no effect: you give the drug to the person and it doesn't matter. And in rare cases it can actually hurt the patient. All of these things are impacted by environmental factors.
A simple one is grapefruit juice: if you're on many drugs, you're not supposed to eat or drink grapefruit, because it interferes with the enzymes involved in metabolizing those drugs, and that's why on many labels you'll see "do not take with grapefruit." And of course there are genetic factors: all those mutations that we carry, if they're in genes that affect metabolism, can clearly have an effect. There's the pharmacokinetics, the absorption, distribution, metabolism, and excretion of the drug, all of which require biological processes, and the pharmacodynamics, the impact on the receptors or whatever the target is, and whether it's changed or not; I'll give some examples that intersect. It's actually interesting that adverse drug reactions, ADRs, are the fourth leading cause of hospitalization and the fifth leading cause of mortality, and these are U.S. numbers. It's been calculated that the cost of treating serious adverse drug reactions, and they're fairly rare, exceeds the cost of providing the medications themselves, so it's an important area. The potential is here: if this is your population and they all have the same diagnosis and you're going to give them a drug, there are some that you would like to be able to recognize that don't respond, so you're not helping them, or even worse, respond in a way that's toxic to them, and you want to treat them with an alternative drug, or at least change the dose, compared to the bulk of the population that responds normally. In clinical trials in cancer, for example, this is sort of a standard profile: you work out the safety and the dosage, then you do a small test and scale it up. In this case we're showing non-responders here, responders, and hyper-responders, people who may be sensitive to the drug, and your placebo and your case-control groups are hopefully roughly the same, but they don't always have to be; these are randomly assigned, and this distribution may change a bit, which may skew your results. A better design, if we can do it, if we recognize these subgroups, is to weed out the non-responders up front, since you're not doing them any good on a clinical trial anyway, and you can then have a more focused phase three trial, which is going to be faster, smaller, and less expensive. All right, some examples. Cytochrome P450 2D6, CYP2D6, metabolizes many of the drugs we take, many drugs go through this pipeline, and five to eight percent of Caucasians are phenotypically poor metabolizers because of mutations in that gene; there are 106 alleles described so far for that gene. A simple example is morphine metabolism: if you want some pain relief, you take some codeine, but that's not the active form; it's actually the morphine that is the active form, and the conversion goes through CYP2D6. In a case where someone has a mutation such that they don't metabolize well, they don't make this conversion, then obviously you're giving them the drug and they're still feeling the pain, so you're not helping them. But an even more deleterious case is someone who is hyperactive in that enzyme and converts it very readily, more than a normal person: you give them a normal dose, but the amount of morphine they get is higher, and that can actually be toxic.
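Just to illustrate how a phenotype-to-guidance lookup might be wired up in software, here is a deliberately simplified Python sketch; the categories and wording are placeholders of my own for illustration, not clinical recommendations.

```python
# Illustrative only: mapping a CYP2D6 metabolizer phenotype to the kind
# of codeine consideration discussed above. Not clinical guidance.

CODEINE_NOTES = {
    "poor":         "little conversion to morphine: expect poor pain relief",
    "intermediate": "reduced conversion: monitor response",
    "normal":       "standard dosing",
    "ultrarapid":   "excess morphine produced: risk of toxicity",
}

def codeine_note(cyp2d6_phenotype):
    return CODEINE_NOTES.get(cyp2d6_phenotype, "phenotype unknown")

print(codeine_note("ultrarapid"))
```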
And just to bring it back to cancer, another example is tamoxifen. This is the form that's taken, and this is the active form, which your body converts it to through several steps, including CYP2D6, and in clinical trials it's been shown that if you can type people, figure out whether they're going to metabolize it well or not, and subtype those groups, you can get a better outcome. All right, so data privacy and security, which I think is the last thing on my list to cover. It's really obvious that clinical information needs to be protected: your patient charts, you don't want other people to see those. That's pretty obvious. What's not so obvious, I think, is the great concern now around the next-gen sequencing data you can produce. This doesn't apply so much to things like microarray data, which are expression profiles, but the sequence data and those individual variants are like a fingerprint, and as you increase the data, as you get into millions of SNPs, the certainty of a genetic identification becomes essentially positive. So you must adequately protect the data, and encryption is probably the best security. There's a trade-off here: if you encrypt everything, if you encrypt your cluster, you'll spend all of your cluster's horsepower encrypting and decrypting, so there you have to make sure you've got good firewalls instead; but if you're going to transfer data of any large size especially, you want encryption on that. I think some of this is an overreaction; people are worried that insurance companies, especially in the U.S., will trawl through the databases, identify you, and not give you insurance. But there was an example, I think it was in 2008, of an adopted individual who wanted to find their biological father. They had SNP typing done on their Y chromosome, were able to search public databases, and found that there's a correlation between the variants seen and surnames. So they got a number of hits in the databases that they matched near to, and something like half of those people had the same last name or a variant thereof. They were able to narrow it down: they knew roughly where the person who had donated the sperm was from, and using that they narrowed it down to two or three people, and they were actually able to find and contact the person. So by sequencing their own DNA they were able to track down their biological father, and because of that everything blew up and everyone got upset about data privacy, which is a good thing to be upset about: you really do have to clamp things down and make sure that you are in control of how your data are being used. And that's why, as I mentioned for the ICGC, if you want the germline data you actually have to apply and get a password to get at it. All right, this is my last slide of the week. You're going to learn a lot this week about how to deal with all these data. There are all these data types listed here that need to be brought together for any sort of analysis you want to do. You want to collect as much of the data as you can, because it's all important; don't restrict yourself just to single nucleotide variants, for example, since in some of the cancers that's actually not that important compared to structural variation.
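On the encryption point a moment ago: for the data-transfer case, a minimal Python sketch using the `cryptography` package's Fernet recipe might look like the following. The file names are hypothetical, and key management, the genuinely hard part in practice, is glossed over here.

```python
# Sketch: encrypt a file before transfer with symmetric (Fernet) encryption.
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # store securely, separately from the data
fernet = Fernet(key)

with open("variants.vcf", "rb") as fh:        # hypothetical input file
    ciphertext = fernet.encrypt(fh.read())

with open("variants.vcf.enc", "wb") as fh:    # this is what gets shipped
    fh.write(ciphertext)

# the recipient, holding the key, reverses it:
plaintext = fernet.decrypt(ciphertext)
```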
Bring them all together, bring them in with the clinical data, figure out what you want to follow up on, and pull in the pathways. Throughout the rest of this week, this is the type of global, big-picture view that I think you'll need to be taking. I won't put that up yet; I've still got 15 minutes, so are there any questions, burning questions? If they'll be answered later in the week, I'll just say they'll be answered later in the week. So, on the flow question: they only put one base in at a time, so if they get a signal, they know that it's an A if they're flowing A at the time. A complicated question: heterogeneity. The question is how you deal with heterogeneity and how you know the data are real, to put it simply. So heterogeneity is an issue, and it's why we like sequencing: you can drill down very deep in the sequence, so you're sampling that population at greater depth. The question, though, is how low can you go. If you're looking at things where one percent of your reads have a variant and the error rate on the machine is one percent, do you believe it or not? If the errors are random, then it's unlikely they'll pile up in one spot, but we all know the errors aren't random; there are sequence biases that could do that. So typically we don't like being down around the one percent range; we're pretty comfortable in the five or ten percent range, and we find that those validate very well. On the PacBio, for example, we use a five percent cutoff: anything below five percent we don't consider real at this point in time. From a patient treatment standpoint it's probably not relevant anyway, when you've got something that's at 85 percent versus one that's at five percent. We can actually go lower than five percent, but at around three or four percent you start finding the background coming up, and we're getting better at dealing with that, at filtering it out, but right now that's where we are, so you have to set some kind of cutoff. And again, you can keep sequencing forever and get down to the point where 0.1 percent of the cells have this or that variant, but the question is, what is the relevance to the tumor you're studying? Do we adjust if we know the heterogeneity, or the cellularity, as we refer to the amount of tumor that's there? We do: if you know it's only 50 percent tumor that you're sequencing, then you can drop your expected signal accordingly; a diploid heterozygous signal at 50 percent tumor will be at 25 percent. But there's a lot of variation: even just sequencing a normal diploid genome, where every SNP should be 50-50 in your read count, you see anywhere from 20 to 80 percent, a range like that. So it does vary, and that's just sampling; the deeper you go, the better it is. Another question: at some level there is some degree of genetic mosaicism in normal cells, skin versus other tissues, so how do you choose a normal reference? Typically it's blood, because that's just easy to get. You can use adjacent tissue, but there's a field effect: around the tumor, even if you move away a little bit, you often still find some of the early changes. So sometimes, like for pancreas, you use a piece of stomach; they remove a small piece of stomach in the operation as well.
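Going back to the cutoff discussion a moment ago, here is a sketch of that cutoff-plus-cellularity reasoning in Python; the threshold and function names are illustrative, and the first example uses the two-of-38-reads KRAS case mentioned earlier.

```python
# Sketch: call a variant only if its read fraction clears a noise floor
# (we use 5% on the PacBio), and interpret the fraction against what the
# tumor cellularity predicts for a heterozygous diploid variant.

def call_variant(alt_reads, total_reads, purity=1.0, floor=0.05):
    vaf = alt_reads / total_reads
    expected_het = 0.5 * purity   # heterozygous variant, diploid tumor
    return {
        "vaf": vaf,
        "called": vaf >= floor,
        "expected_het_vaf": expected_het,
    }

print(call_variant(2, 38))                 # ~5.3%: right at the edge
print(call_variant(30, 120, purity=0.5))   # 25%, consistent with 50% tumor
```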
The bigger question is in transcriptome studies: what do you use for a normal there? That's a real problem. We're doing transcriptomes on the pancreatic samples, but the question is what to use for normal. We can get some normal pancreas: in the Whipple operation, when they remove the pancreas, most of it anyway, they take some distal pieces, and we can use those as our control. But it's very difficult to get the exact cell type, and people don't like to give up normal tissue very much. Normal brain, for instance, is hard to get, so sometimes you get it from autopsy, from people who donate to science, et cetera. That is a real question, especially in transcriptome work: what is the normal? So that brings up the next question: when you overlay the transcriptome analysis on your pancreatic whole genome sequence, how many of those mutations are actually expressed? That's a good question. When you're overlaying transcriptome data, or looking at mutations in the DNA, how many of them are expressed, and how many are important? It's a good piece of the puzzle. You sequence the entire genome and find a bunch of mutations that look really deleterious and really exciting, but unless the gene is actually being expressed, it's not going to affect the cell. So you can overlay those two data sets. You can actually call SNPs in the transcriptome data, though it's fairly difficult to get good SNP calls, partly because of the variation: in the genome the coverage is normalized, but in the transcriptome data it's all over the place, so it's really hard to call SNPs, and you tend to overcall quite a bit. But if you know a mutation is in the DNA, you can go look in the transcriptome specifically and check which allele is being expressed, and you might find that the deleterious one is not being expressed at all, or that it is expressed but degraded very quickly. These studies are not really meant to give final answers; they're meant to point you to leads that you follow up with more studies.
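A toy version of that allele-specific check, assuming the bases overlapping the site have already been pulled out of the RNA-seq alignments with a pileup tool upstream (here they're just a made-up list):

```python
# Sketch: is a DNA-level mutation actually expressed? Count reference
# versus alternate bases among RNA-seq read bases covering the site.

def allele_fractions(bases_at_site, ref, alt):
    n = len(bases_at_site)
    ref_n = bases_at_site.count(ref)
    alt_n = bases_at_site.count(alt)
    return {"ref": ref_n / n, "alt": alt_n / n,
            "other": (n - ref_n - alt_n) / n}

# hypothetical pileup at a site mutated T->G in the tumor DNA
rna_bases = list("TTTTTTTTGGTT")
print(allele_fractions(rna_bases, ref="T", alt="G"))
# alt fraction ~0.17: the mutant allele is expressed, at a lower level
```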
So would you then say it's essential to do the transcriptome at all? No, not necessarily, I don't think, because if you're doing large numbers... I guess the question, for the recording, is whether the transcriptome is an essential thing to be doing all the time. The one thing we are doing is a lot of sequencing of large numbers of tumors, so if you see a recurrent mutation, it would be kind of surprising if it wasn't important or wasn't being expressed. But regardless, whatever you find you have to follow up on, so for these recurrent mutations we would then go back and look for expression. Whether you need to do a whole transcriptome, or whether you can just do targeted qPCR and ask whether that specific gene is expressed, the more information you have, the better. What about future protein work? Yes, so proteomics is very important. Proteomics has come a long way, but you still can't take a cell, or a population like a tumor, stick it in the mass spec, and come out with a spectrum of all the proteins; you're really only looking at the top ones you can identify, your top thousand or something. Of, say, a million proteins, you can probably only assay a thousand; I'm not a proteomics person, so I'm kind of making these numbers up, it's what I hear. So that's the problem, but you can ask specific questions. Again, you can say: I've seen this mutation, even in the transcriptome; the next question is, does it make a protein? You might see it in the RNA, but you don't know that it actually gets all the way to a protein, so you can specifically go in and look for that protein and ask whether the mutated form is actually being expressed as protein. I was looking at that; I was on vacation last week, and I tried to actually take a vacation, but I was looking at some of those papers and I haven't quite formulated an opinion yet. There's a recent one that, I think, found 10,000 or something, comparing transcriptome to DNA, something like 10,000 differences between the transcriptome and the DNA. Now, they did do some proteomics, and what I didn't get to, given my vacation-level depth of reading, was how many they did and what the validation rate was, but they did do proteomics and showed that some of those are being expressed as an altered protein. If that's true, then it really adds another layer of complexity that we have to take into account, and there may be signatures in there so that you start being able to predict which ones are going to be altered. The wisdom before was that this is a rare event; these recent papers make it seem that it's not so rare and that it's something we really do have to consider. I haven't formulated a full opinion on that yet. What do you think? I read some critiques of that paper that could sort of explain it by a number of different kinds of arguments. Yeah, so, not all of them, but many of them; that was my feeling as well, that it's a significant event but not as prevalent as they were implying. Question over here: what is the validation platform for sequencing? That depends on the level of what you're trying to detect. The gold standard has always been Sanger-based sequencing, and for germline mutations that's fine, because you're seeing 50-50 alleles, so it's very easy to validate. In cancer it can be quite difficult if your primary tumor only has 10 or 15 percent tumor content.
I had one where I was given the DNA, I didn't see the tumor, and it turned out, you could tell by the sequence, that there was only 5 percent tumor DNA in there. So you do a Sanger read and you can see a little tiny bump; it was really a beautiful Sanger read, it was clean, and it was the only little bump in the whole trace, but you could never call that without knowing it was already there. So Sanger is the gold standard, but for cancer people are doing more targeted sequencing: take your amplicon and sequence it several hundred fold deep. A common platform for that has been the 454; we've been using the PacBio because we don't have a 454, the MiSeq would probably be a good platform for it as well, and maybe the Ion Torrent. You want to take it to another technology; you don't really want to use Illumina sequencing to validate Illumina. So you want another technology, something that's more targeted and focused and relatively cheap, but you want to go deep in cancer. Have we tried to sequence single cells? We haven't tried to sequence single cells. There are papers out where people have done transcriptomes, and they're starting to do more. I used to think the holy grail was to do single cells, but I think that at that point you can't tell the difference between biological variation and technical variation, so you're probably going to average together 10 cells or more anyway, so you might as well sequence 10 cells, is my opinion. But we haven't pushed it to that limit, and part of it is that you have to do a lot of amplification to get enough material to input into these things right now. The PacBio is single molecule, but you start out with a fair amount of DNA to get enough onto the machine; there's quite a bit of loss. We're getting better, there's less loss throughout the whole process, but like I said, we started with 10 micrograms of DNA for a library and we're down to 100 nanograms. Those technologies are coming, and we're trying to push them. We like using laser capture, pulling out specific cell types and sequencing them, maybe not single cells, but getting a nice clean separation of the stroma and the tumor. So we're trying to bring the two together, the high-throughput sequencing and those types of technologies, so they can meet in the middle; we're not quite there, but we're getting closer. The prostate project we're just starting now; we sign up for these projects without knowing everything, and then we go to do it: they want whole genome sequencing, and they can give me 100 nanograms of DNA, and they want all the analysis done. So we had to stop and work on that; we're always trying to drive that input down. No, I think you can do frozen. I don't have a lot of experience with laser capture; the pathologists don't like to do it, and histology on frozen sections doesn't give nice clean results, but you can get enough to do laser capture.