Disclaimers: all right, my disclaimers, which I'm required to put in. All right. So we're really all here because, as you know, there's been a huge increase in sequencing technology. Over the course of about 20 years it went from a very manual process, where you poured the gels, loaded the gels, and actually scored them by hand. There were some little devices made to help you do that, which never worked, so you just sat down at the keyboard. That's why I can't type today: I spent years typing A-C-G-T, A-C-G-T with one hand, and that's about all I can do on a keyboard. Then it went through a stage where it became a little more automated. The data collection became more automated, though you still had to do a lot of interactive work at the screen to align those lanes, but the actual base calling became automatic with fluorescent markers; before that it was radioactive. And then it became a pretty much fully automated system with the introduction of the capillaries. The capillary instruments would pump the gel in, which is like pouring the gel, run the sample for you, and call the bases, and for the Human Genome Project we were able to fill rooms full of those, in a sort of linearly scalable process. Over a short ten-year period a number of genomes were sequenced, and a lot of the sequencing technology was driven by that: C. elegans was sequenced first, then there were yeast, E. coli, human, mouse, all these things (E. coli actually took a long time; we'll get into that). But there was a real revolution around 2005. It started a little bit earlier, but that's when the first commercial platforms came out, and that's what we'll talk about. And that's not that long ago, really, right?
I mean, that's only eight years ago, and that's really why you're here today: the data we generated back then was manageable, and the data now, as you'll see, is not very manageable. Just as an example of about eight years of technology, more like ten: this thumb drive is 64 megabytes; this thumb drive is half a terabyte. Those are about ten years apart. I got the first one when I got my first cluster; they gave me a 64-megabyte thumb drive, and at the time that was huge. And as you say, your first terabyte hard drive was a million dollars, or a quarter million, and now they're, you know, a hundred bucks. So the technology has driven forward, but it's creating all sorts of problems. One of the advantages of next-gen, though, as I said, is that there's no subcloning. When we did the human genome, every template we sequenced we had to clone, grow in a bacterium, and purify, and as you can imagine that was a lot of work, so we had lots of robots to do it. Now a lot of that's done for you. The data, of course, we'll get into, but the real thing is that you're now getting huge amounts of data per run, as you'll see, so you have to have the tools to use it. I'm going to talk a little bit historically, just to make sure everyone's on the same page about the technologies and what the data look like. This was the first one that came out: the 454 instrument, subsequently bought by Roche. That's the paper. I'll just quickly go through the process of how it works, because the instruments are all very similar; I'll point out the subtle differences. Let's just talk about whole-genome sequencing for now: you're going to sequence a genome, and the first thing you have to do is break it up into pieces. That's no different from what we did with the human genome.
We broke that up into clones; here you make a library. The trick was that they couldn't detect single molecules, there wasn't enough signal, so they did what's called an emulsion PCR. You have a bead inside an emulsion, which is just oil and water: you mix it together and you get little droplets, and each of those little droplets in the oil emulsion acts like a PCR tube. Hopefully what you have in there is one of your templates and one bead; sometimes you have no bead, sometimes you have two templates, but that's the ideal. Then, just by PCR, with a little oligo on the bead that nucleates the reaction, you basically coat the whole bead with that template. These beads are then loaded into what they call the PicoTiterPlate, which is basically a cluster of glass capillaries that were fused together and then etched with acid to make little tiny wells. The template beads fit into those wells, and you pack them in, just with a centrifuge, along with other, smaller beads that carry the reagents to do the sequencing. The sequencing occurs by flowing in one base at a time, and if it's incorporated you get a light signal out, which you can detect. This is actually an electron micrograph of what the plate looked like. It still exists; you can still buy one. The data that came out were very different from what we were used to. We were used to these traces, A, C, G, T coming off, and what came off the 454 were what are called flowgrams. A flowgram is just the amount of signal per flow: you flow each base in, one at a time.
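That flow-by-flow readout can be sketched in a few lines; this is my own toy model, not vendor code, and rounding the signal to the nearest integer is a simplification of the real signal processing:

```python
def call_flowgram(flow_order, signals):
    """Convert per-flow signal intensities into a base sequence.

    Each flow introduces one nucleotide; the signal is roughly
    proportional to the homopolymer length at that position, so this
    toy caller rounds each signal to an integer repeat count.
    """
    seq = []
    for base, signal in zip(flow_order, signals):
        count = round(signal)  # e.g. a signal near 5 means five T's in a row
        seq.append(base * count)
    return "".join(seq)

# Flow order T, A, C, G repeated; the large first signal is a homopolymer run
print(call_flowgram("TACGTACG", [5.1, 0.9, 0.1, 1.1, 0.0, 1.0, 2.1, 0.2]))
# → TTTTTAGACC
```

The weakness of the platform is visible right in this model: the longer the homopolymer, the harder it is to round the noisy signal to the correct count.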
For example, here's a T: you flow in the T and you get a signal, and this one's actually quite a large signal, because if there are five T's in a row it will incorporate all five T's at once and give you more signal. So the amount of signal you get is proportional, to some extent, to the number of bases being incorporated; that's one of the weaknesses of the system, as you'll see. Then along came a company called Solexa with an instrument, which was then bought by Illumina: the original GA2. They still couldn't detect single molecules, but instead of using beads they did a little trick with the PCR right on the surface of a slide and built little clusters. It's just a way of increasing the signal so you can see it. The sequencing is done a little differently, too; it's more akin to what we did with Sanger sequencing, in that it's sequencing by synthesis. All four bases are added at a time, but only one is incorporated: if there's a string of T's, only one T will be incorporated, because it's blocked so the next base can't come in. You then wash everything away, image it with lasers, four different lasers, and get the four different colors coming out, and then you de-block, add the next base, and continue. So it's cyclical sequencing, one base at a time, and it gets through those homopolymer repeats better. The data that come out are a little more like what we were used to. This is hard to see, but it's a plot of the signal from each of those four channels, and quite often one of them is clearly the winner. And then I think there's one over here that's not so clear, where you're getting two different signals, so it's a little hard to tell which base it is. That's where, as you'll see later going through the data, you get what's called a Q value, which is really telling you the
quality of that base. (Q values are Phred-scaled: Q = −10·log10 of the error probability, so Q30 means about one error in a thousand.) These are going to be poor-quality bases: if you can't quite tell which one it is, say the A signal is higher but it could also be a G, the call carries a real error probability at that point. This is an old picture I took; it's useful because it shows you the guts of these things and how simple they really were. This is the GA2. The imaging has changed, there's a newer one in here, but the system is really the same. When they built it, it was really just an objective out of a microscope; this whole part here is just a microscope. This is a slide with chambers in it. I brought visual aids, in case you haven't seen one: that's the slide. It's got little ports on one end, and you can flow the reagents across those eight channels, and in those channels are the oligos to which you've bound your template. So there are little tiny clusters in there, and all the instrument does is pick a laser, shine it across, and scan across with this imager. So it really was very, very simple. It's a little more complex now; they've replaced this part with a more complex imaging system. There was competition in the industry, and competition is good: it drove some of the prices down. This was the ABI SOLiD, which is pretty much dead, in my opinion; there are some out there, and they're still supporting them, but the newer ones don't work very well. They did sort of a combination: they did that emulsion PCR, took a bead and put all the templates on it, then modified the ends of the templates and stuck the beads to the slide. The problem was that sometimes they didn't stick and floated away. They also had a completely different approach to sequencing: instead of using a polymerase and adding one base at a time, they used a ligation method.
They had a very complicated series of, I think, nine-mers, which they would ligate on; then they'd cleave off part and ligate on the next, so they really sequenced every fifth base, and then you'd offset by one in the sequence and read every fifth base again. So instead of a linear progression through the sequence you'd get, say, bases one, six, eleven, and then two, seven, twelve, and you really had to wait until the end to put it all together. Then there are the third-generation sequencers; pretty much anything beyond those platforms is considered third generation, but the particular focus was on single molecules. If you're reading single molecules, you don't have to build those clusters anymore, and whenever you build clusters, whenever there's any sort of PCR involved, you're introducing some kind of bias into your data. You will definitely see that in the data: the GC content will be a little different from the genome itself. So there was a lot of push to get single-molecule sequencers. I'll include this one as third generation, though it really came out at the same time as the others, and it was maybe a little ahead of its time, maybe too far ahead: this company just recently went into receivership. This was the Helicos HeliScope. It's about the size of a refrigerator, maybe a little bigger, and it weighs a ton.
It had a compute cluster that also weighed a ton; we had to keep them on opposite sides of the room, according to the structural engineers, so they didn't fall through the floor below us. But it really was a single-molecule sequencer: they would tether the molecules onto the slide and then sequence them. The problem was really that the read lengths were short. The error rates were not bad; as you'll see, the PacBio has more errors, while the HeliScope had about a 5% error rate. The main thing was that you didn't get a lot of data out of it, not as much as we were expecting, and you had to put a lot of material in to get there. So although it was single-molecule, it didn't catch on; it just wasn't able to produce enough data to compete with the current systems. Another one, which I think was truly third generation, was the PacBio system, and this one's really kind of cool. It's a single-molecule detection system. They have a thin membrane with a hole in it over a glass substrate. The hole's volume is, I think, about 7 zeptoliters, that's 10 to the minus 21 liters, and the hole is so small that light won't go through it. The way the inventor described it to me: if you look at a microwave oven, there's a little grid on the front; you can see in, but the microwaves don't cook your face while you're watching your food cook or your popcorn pop, because the microwaves are too big to come through those holes. It's the same principle here: the laser light is too large to really go through the hole, but it does light up the bottom region, and at the bottom there's a polymerase and a single strand of DNA. What you're actually watching, and it's a cool instrument, is that polymerase incorporating nucleotides. This is just a little cartoon; these are the nucleotides.
They're floating around in Brownian motion, going in and out, and that's this chatter here, as they go in and out of that little interrogation volume at the bottom. Every once in a while the polymerase decides to add a base: it grabs one to test it for fit, and if it fits, it incorporates it. That incorporation event takes on the order of milliseconds, so you see the nucleotide trapped for a while, you still see the others drifting in and out, and when it's incorporated, the fluorophore, which is attached to the phosphates, is cleaved off and drifts away; the signal goes back to baseline, and the next base comes in. Of course, this is a cartoon; real data don't look that clean. You can have errors where the polymerases are not completely uniform, so they'll incorporate very quickly, and then it's hard to tell whether those are multiple events or a single event. The polymerase can also test a base for fit and, even though it fits, let it go, after holding it long enough that it looks like a base was incorporated when it wasn't. And the major source of error is that these nucleotides are chemically synthesized, and if there's no fluorophore on one, it will incorporate but you won't get any signal at all, so you get what we call a dark base, which looks like a deletion. But it worked pretty well. It had an error rate of about 15% when they launched, which sounds huge, and it is huge, but the read lengths are very long: right now the median read length is about 5 kb, and you get reads out to 20 kb or longer. If you've got a 20 kb read, it doesn't matter so much that it's only 85% accurate; you can still do things with it. We used it early on, and we needed high accuracy, because we used it for sequencing in a clinical setting, and you can get that out of this instrument: you put on these hairpin adapters, which makes a single-stranded circle of DNA.
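The payoff of that circular template is consensus: each pass around the circle is an independent read of the same bases, so random errors can be voted out. Here's a toy model of my own, assuming independent, random errors and simple per-base majority voting (the real consensus algorithm is more sophisticated):

```python
from math import comb

def majority_error(per_pass_err, passes):
    """P(majority vote is wrong) when each pass independently misreads
    a base with probability per_pass_err (toy model; real circular
    consensus uses probabilistic alignment, not simple voting)."""
    p_ok = 1 - per_pass_err
    # The vote is wrong if at most floor(passes/2) passes are correct
    return sum(comb(passes, k) * p_ok**k * per_pass_err**(passes - k)
               for k in range(passes // 2 + 1))

# Even with a 15% per-pass error rate, the consensus error collapses
# as the polymerase goes around the circle more times:
for n in (1, 5, 15, 39):
    print(n, majority_error(0.15, n))
```

Even under this crude model, a few dozen passes drive the per-base error to effectively zero, which is how those circular reads reach very high consensus accuracy.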
When you sequence around that circle, the polymerase just keeps going around and around, and you get one long read out, which you can cut up informatically: you basically get the forward read, the reverse read, and around again. We've had them go around 40 times, and you get a consensus sequence out of that which is very accurate, 99.99%, so you can get high accuracy out of the PacBio. Another one that came along was the Ion Torrent Personal Genome Machine (PGM). This was a bit of an upstart, and it's a completely different technology: it's non-light-based, which is important, so it isn't detecting fluorophores anymore. Right now the read lengths are closer to 400 base pairs. This is just a table of the outputs they say you'll get, and what we're actually getting; usually on the lower chips we were getting more, and on this one we're getting about what they say. But it works very differently. The thing coming around the room is the chip, which is just a silicon wafer, and essentially it's an array of pH meters. Above all these pH meters on the bottom you've got little wells again, and this is where your bead
This is where your bead There so it does do an emulsion PCR as you'll see and the bead sits in there And it's just a pH meter and why is that pH meter useful if the bead is sitting in the well and And a piece a nucleotide is incorporated as you incorporate a nucleotide you release a proton And as the protons are released it changes the pH and you can detect that event They always if you go to their website everything always kind of looks like it's a single molecule molecule detection It is not it you can imagine trying to catch a single proton event wouldn't be very useful So it is a bead doesn't motion PCR it also flows one base at a time So it's like the 454 and so it does incorporate if there's five T's it'll incorporate five T's and you should you should get Five times a signal it is reasonably linear, but it is there the Achilles heel of this platform Yeah, pardon our homopolymer repeats as you can see here It has a little trouble deciding how many T's there are when they get along long above about five It's getting much much improved on the on the pgm The next one that came along with the aluminum I seek this is a I guess I skipped the high seat, but I'll get back to it the my seek which was sort of a little brother to the aluminum platform So the idea here was it would produce more data In a shorter time So the idea was really was to get her sort of a 24-hour turnaround You're getting less data overall than the the big big brother takes like two weeks to run But you'd get much more data and interesting thing about it was if you compared it when it came out This is old data, but it still still applies When you compare it back when it came out in September to its bigger brother We try and always try apples and apples. It's really sometimes tough But it was pretty easy because the same platform in this one But obviously you look at things like insert size distribution. 
You expect it to be the same for the same library; there's no reason for it to be different. You can look at the percentage of reads that align, that are on target: those are the green bars here, and as you can see, with the MiSeq we were actually getting more reads on target. Then you ask why. The number of indels per cycle is the same on both. But here's where some of the difference was: this is the quality value, the Q value, for each cycle. Above Q30 is where you want to be, and you can see it starts out pretty good, and then about a hundred bases out, probably right about cycle 90 or so, it dropped below Q30. Then you do the reverse read: you have to take the other strand, make the second strand of that template, and sequence again, and you can see the data are usually a little worse on that second read. This is the MiSeq data, going out to 150 bases instead of 100, if you can't read it, and you can see it stayed above Q30 all the way out to 150, dropping off only a little at the end. So we were actually getting better reads off the baby machine. I'm not sure anyone really knows why, even now, but I think it's mainly the reagents: it's a 24-hour turnaround versus a two-week turnaround, so the reagents are only on the machine for 24 hours instead of two weeks. That's probably the big difference. You can also do a 250-base read on the MiSeq, but somewhere around, I don't know, 175 or so, the quality starts really dropping; you can get a 250-base read, but I think the 150 is still much more useful. This is the new big brother of the PGM: the Ion Proton. It has a short runtime. A couple of years ago, when it was first being discussed,
they were promising the thousand-dollar genome; I'll get into that in a minute. It produces, they say, around 10 gigabases of data on the initial version that's come out, and with the next chip they're saying about 60 gigabases. The chip going around is from the PGM; this one is very similar, the same technology, just more little pH meters on a bigger chip, to try to get more out of it. What they're hoping is that eventually you'll be able to get a genome in a few hours, but I think it's got a lot of development time to go. The data coming off it are actually not as good as the PGM, the little brother, but they'll improve. The HiSeq 2500 is probably the latest entry. It's very similar to the other one, available as an upgrade, or you can buy the instrument outright, and it runs in two different modes. It can, in about 24 or 27 hours, produce about 120 gigabases of data, which would be about 40x coverage of a human genome. Or you can flip it over to the other mode, the usual way of running it, over the course of about 10 to 14 days, and get around 600 to 700 gigabases. You can do a slightly longer read in the short mode, again because of how long the reagents sit on the machine. So if you want fast turnaround and fast genomes, you can do this. You'll see that the flow cell coming around only has two channels on it, so it only loads two channels.
There are two flow cells on board, so you can do two different samples, but you're committed to that, and you get a little less data. They also charge a little more for the reagents, about a 15% premium, but if you need an answer fast, it's quite useful. And still on the horizon, it's been on the horizon for a long time, are the nanopore technologies, and Oxford Nanopore is one of them. Here they actually use a protein pore, and the DNA passes through it. You measure the potential across the membrane, and as the DNA goes through you get changes in the signal. What they're really detecting is about three bases at a time blocking the pore; that's where the signal change comes from. They haven't launched yet; they're getting close, and I think there are some betas out there. The problem they've had is with accuracy, in that they can't distinguish all of these three-nucleotide words, as they call them, from one another. The difference between that and Pacific Biosciences is that these are not random errors: if they can't distinguish one triplet from another, they can't say what it is, whereas with the PacBio you just do more coverage and the errors go away. But it's pretty cool technology. This is one of their latest sequencers; it looks like my half-terabyte thumb drive, a little bigger, and it plugs into your computer, and that's actually where the sequencing would be done. You can also fill up racks of these things, almost like a cluster, with the cartridges you stick in with the pores. They didn't say much this year at AGBT, which is one of the major meetings for genomics and sequencing; this is what they said two years ago. This year they were pretty quiet.
They didn't say very much, and we're all still waiting for them. All right, so you can see that it's a very complex space, and what I get asked a lot is: what instrument should I buy? It depends on what your project is and what you want to do, and it's really unknown where these new ones are going to fall. If you want lots of data and you don't mind waiting 10 to 14 days, you want to be in this range, and it depends how much money you have to spend: these are expensive, around $650,000 instruments. If you want fast turnaround but not a lot of data, you can use the PacBio for some applications, especially if you need long reads; that's about $750,000. This one I brought in just because it's reasonably cheap: you could probably get into sequencing for about $200,000, everything you need, once you buy it. I have no idea what they're charging for these anymore, probably about $150,000, but they produce a lot less data. So where you land in here depends on your project. And just to put sequencing in perspective a little bit: this is from just after the end of the Human Genome Project, the peak of the genome centers. I was a co-director here, and then I was here as well, these two genome centers, and we had rooms full of those capillary sequencers. Between the two centers we could produce around 10 million reads, or about 5 billion bases, a month. Now one of these instruments will produce about 360 times that and take probably two people to run. All those people were involved, a lot of bioinformatics is in there and everything, but that puts in scale where we're at with data. The other thing that happened is that the price dropped, and we've all heard about, you know, the thousand-dollar genome coming. Right at the beginning of the next-gen introduction,
it cost about ten million dollars to sequence a genome; the price had already dropped a lot for Sanger sequencing, and then when this came in you could do it for about ten million, and it's been dropping pretty steadily since, as you can see. When people talk about a thousand-dollar genome, they're really only talking about reagents. They're not talking about everything else: getting the sample, the personnel to run it, the informatics, which as you'll see is a big part of it, sample prep, maintenance agreements. My maintenance agreements are 1.2 million dollars a year, so I have to take that into account. So, the question is: why was it ten million, and why has the price been dropping? There are a couple of reasons. The biggest problem early on was the amount of data you got out of the machine; you didn't get a lot. I mean, the first next-gens were impressive: they'd make a hundred thousand reads at a hundred base pairs long, which was very impressive at the time. It's not so impressive now. So in order to do a genome you'd have to do a lot of those runs, and because the instruments were new on the market and there wasn't much competition, they charged a lot for the reagents. That was the cost. As more instruments came in, competition was one of the things that drove the price down. The other was the yield per instrument, how much data you'd get for each run; now we're getting about 700 gigabases off a single HiSeq run. The price has gone up a little, but not proportionally: as we got more data, they didn't charge proportionally more, so the cost per base went down. We're at a point now
I think we've bottomed out, at least for a while, until some new disruptive technology comes in and really forces them to compete. There's really no competition for Illumina; Illumina is pretty much doing what they want right now, and in fact, if you were to draw the curve on reagents, it actually goes up a little: they've reduced the discounts we get for bulk purchasing, so our cost has actually gone up about five percent this year. So I think we've kind of bottomed out. But the thing is, sequencing is not the cost driver in your experiments anymore. If you're going to pool a hundred samples in one lane, the cost of sequencing is nothing, it's like 70 bucks, but you have to make a hundred libraries, and those libraries are going to be the driving cost, for sure. So they always talk about just sequencing reagents, and we're a long way from a thousand dollars. As for applications, you're going to be going through them on the informatics side, but pretty much everything that was done by any other means, such as arrays, has been ported over to sequencing. Arrays are still quite useful, and still cheap, but from sequencing data you can do copy number and structural variation, and of course call SNPs and indels; small RNAs; epigenomics, methylation was mentioned; and you can also do ChIP-seq and transcriptome sequencing. You'll see many examples of that throughout the week. This is an old slide. It was the first cancer genome sequenced.
It was an AML, done at WashU, and I keep it in here because the numbers haven't changed that much. If you sequence anyone in this room, whole genome, and compare it to the reference sequence, you'll see about two and a half million differences between you and the reference. For cancer sequencing, all you really care about are the ones that are somatic, and you might find somewhere in the range of sixty to thirty thousand, depending on the cancer type. Then what everyone does is look at the genes, so you keep filtering it down and down. For many tumors the false-positive rate was pretty high, though AMLs, leukemias generally, are actually pretty simple in their overall context. But you'll find that by the time you filter this down, you'll end up with, depending on the tumor type, let's say a hundred coding nonsynonymous changes that you want to look at. A little bit about targeting. If you want to capture certain regions, such as the exome for example, you first have to define what the exome is. A lot of companies offer exome kits, Agilent, NimbleGen, and Illumina too, but if you look at what they define as the exome, they're all different. So be careful when you're picking one: look at the content they provide and make sure the regions you want are covered. This just shows that the data are quite good. This is KIT, the oncogene, targeted with the Agilent system, and you can see that the reads fall primarily on target. At this scale, if you were to zoom way in, you would see reads that are off target, but the majority are on target, and you get pretty good, relatively uniform coverage of all the exons. So that's if you want the whole exome.
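To make "coverage" of a capture target concrete, here's a back-of-envelope depth calculation; all the numbers below are hypothetical, purely for illustration:

```python
def mean_target_depth(lane_gb, on_target_frac, target_mb, samples_per_lane):
    """Back-of-envelope mean coverage over a capture target.

    lane_gb: lane output in gigabases; on_target_frac: fraction of
    sequenced bases that land inside the target; target_mb: target size
    in megabases (an "exome" is very roughly 50 Mb, depending on whose
    definition you use); samples_per_lane: how many libraries you pool.
    """
    per_sample_bases = lane_gb * 1e9 * on_target_frac / samples_per_lane
    return per_sample_bases / (target_mb * 1e6)

# Hypothetical: a 35 Gb lane, 70% on target, ~50 Mb exome, 4 samples pooled
print(mean_target_depth(35, 0.70, 50, samples_per_lane=4))  # ~120x each
```

The same arithmetic shows why a small gene panel can be pooled so heavily: shrink `target_mb` from 50 to 0.5 and the depth per sample goes up a hundredfold for the same lane.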
There are reasons you might not want the whole exome. If you're sequencing the whole exome, you have to sequence it to a certain depth, and most people put as many as four exomes in a single lane on a HiSeq. We actually only do one, because in cancer we want to go very deep, but you could probably get away with two per lane. But if you just want to sequence a hundred genes, you can put many, many samples per lane, right? So there are other technologies for that, and one of them is RainDance. This is multiplex PCR, very, very high-throughput PCR. It's kind of like emulsion PCR, except that they've pre-made the little droplets: they take the oligos, synthesize them, package them up in little droplets, and mix them all together to make a library of all these little primer droplets. Then you combine your sheared DNA with the primers, and down here the two droplets get fused, and then it basically just dumps them out into a PCR tube and you cycle it. So it's an emulsion PCR at that point, but instead of the randomness of what you have in each droplet, it's very controlled: when the two droplets are fused, you will have some DNA and some primer from your library in every single droplet, so the efficiency is quite good. This just shows some of the coverage; it depends on how much sequencing you want to do, but just using RainDance the coverage is quite good: you can see 100x coverage of over 95 percent of the target. There are other methods available, such as HaloPlex.
Halo Genomics was bought by Agilent, which now offers it. For all of these kits you can go to a website, put in your targets, and a design tool will estimate what the coverage will be. This one uses a pool of restriction enzymes to cut the DNA; then they ligate on an adapter that is specific to those sticky ends for the regions you want, and these adapters actually overlap into the region of your targets to get complementarity; then it fills in, and then you can PCR. It then fits nicely onto a MiSeq. This example is 19 genes only, one of the first panels they came out with, only a 61 kb target, and you can very easily put it on a MiSeq and pool 10 or more samples, because back then we were getting about two gigabases of data, with very high mapping on target. So you can go from getting your DNA to getting data out in two or three days. It was a bit of a game-changer, and the coverage was quite good. Same sort of thing here: about 100x coverage from one MiSeq run at the time; you'd get much more coverage now, the curve would be further out. This line here with the poor coverage is actually mouse: we threw that in, because we were doing xenografts, just to make sure the mouse did not also give much product. AmpliSeq is another one. This is from Life Technologies, designed specifically for the Ion Torrent platforms, though with a little tweaking you can run it on any platform you want. I didn't normalize these two, because I just wanted to show you that this is an FFPE sample versus blood, and there's less data generated from the FFPE; the blood just had more reads, so more coverage, right?
But the thing to notice is that the coverage is not uniform. This is their version 1; version 2 is better, but it still has this problem of non-uniformity. You can see it's reproducible, though: if you look at the two colors, pretty much wherever one is high, the other is proportionally the same. This is just giant multiplex PCR, I think they can do 4,000 amplicons or something like that in one tube, and you can imagine that each primer pair is not quite as efficient as the others. It's getting better, but still be aware of this when you talk about coverage. Here's the average coverage: on the FFPE it looks good, around 500x, but that average is up here, and you can see a lot of areas below it, because the average is being skewed by some regions that are very heavily covered. All right, enough of methods. So, cancer genome projects. In a typical cancer genome project you get some normal DNA, frequently blood, and some tumor DNA; you sequence both; magic happens here in all the variant identification; and you do a bunch of patients and look for commonalities: genes that are frequently mutated or, more importantly, pathways that are mutated. What are some of the complexities of that? Well, cancer is a disease of the genome. It starts out as a normal genome, and then mutations occur, and the mutations that confer a growth advantage drive it forward. There are mutations that kill the cell, and those clones drop out, but as the tumor progresses new drivers appear, and what you end up with in the end is something very complex: the genome itself is rearranged and has multiple copies of various chromosomes.
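On the coverage point a moment ago: a mean of 500x can coexist with many under-covered targets, because a few over-amplified amplicons drag the average up. Median depth and "fraction of positions above a threshold" are more honest summaries. A toy sketch with made-up depths:

```python
# Mean coverage can look fine while many targets are under-covered, because a few
# hugely over-amplified amplicons skew the average. The depth values are invented.

depths = [30, 40, 50, 45, 35, 60, 55, 40, 3000, 2500]  # two over-amplified amplicons

mean_depth = sum(depths) / len(depths)
median_depth = sorted(depths)[len(depths) // 2]  # upper-middle element; fine for a sketch
frac_over_100 = sum(d >= 100 for d in depths) / len(depths)

print(round(mean_depth), median_depth, frac_over_100)
```

Here the mean is over 500x while the median is 50x and only 20 percent of positions clear 100x, which is exactly the skew being described on the slide.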
Here there are three copies of one chromosome, with a little bit of something else translocated onto it. And the whole population is no longer uniform, so you have this heterogeneous population to deal with, and you have sensitivity issues. This is a pancreatic tumor, and you can see there's a lot of other stuff here besides tumor, all the stroma in here, and that affects your ability to sequence it. If you were to grind this up and sequence it, take this one down here for example, and say that the ploidy is normal, so there isn't much structural variation going on, but it's only 20 percent tumor: then your signal is only 10 percent. If you find a somatic variant, it's only going to be in 10 percent of the reads, and the other 90 percent are coming from the normal. As you'll see as you go through this course, detecting down around 10 percent is not so bad, but if you want to go much below that, and this sample is heterogeneous, you start getting into noise and your false positive rate goes up greatly. There are some great papers on this; this one is from WashU, which followed some of this. This is in leukemias, which are much simpler and easier to follow, and I think these figures may come from that group.
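The cellularity arithmetic above (20 percent tumor gives a 10 percent variant signal) follows from a simple mixture formula. A minimal sketch, assuming the simplest case in the talk: a diploid tumor and normal with one mutant copy.

```python
# Expected variant allele fraction (VAF) for a heterozygous somatic SNV,
# given tumor cellularity (purity). Assumes diploid tumor and normal and
# one mutant copy, i.e. the simple case described in the talk.

def expected_vaf(purity, tumor_cn=2, mutant_copies=1, normal_cn=2):
    """Fraction of reads carrying the variant in a tumor/normal mixture."""
    mutant = purity * mutant_copies
    total = purity * tumor_cn + (1 - purity) * normal_cn
    return mutant / total

print(expected_vaf(0.20))  # 20% tumor -> only 10% of reads show the variant
print(expected_vaf(1.00))  # pure tumor -> the familiar 50% heterozygous signal
```

The copy-number arguments are there because real tumors are rarely diploid; amplification of either allele moves the expected VAF away from this simple halving.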
I'm not sure exactly who produced these figures, but they clearly show that at the original diagnosis there were a number of cell populations present at various proportions. So the sample is heterogeneous to begin with, and then the patient was treated, so it goes through a bottleneck: some of the clones die off and some don't, and this one in particular came back and persisted, but also continued to mutate, so you get even more heterogeneity. The population changed between diagnosis and relapse, and there are a lot of studies ongoing now looking at those differences, but the main point is that heterogeneity is going to be your enemy in many of the analyses you want to do. All right, variant detection, just talking about that for a second. There are two things to consider. You align all the reads and you call variants, the differences from the reference. If you're looking at a germline heterozygous variant in your patient sample, you should see roughly a 50-50 mix of reads supporting the reference and variant alleles; those are relatively easy to call. If you're looking at paired-end data, you have one read here and its mate facing the other way. Most of the paired-end insert sizes will fall within a fairly tight distribution, but you'll get some outliers, and from those outliers you can start looking at structural variation. The easy one to think about is a translocation, where one of your reads is on one chromosome and its mate is on another, indicating a translocation.
You wouldn't want to call that from one pair; you'd want multiple pairs that all say the same thing, to build evidence for what you're looking at. But there are lots of problems in the data, and you'll be going through some of those. The first one, obviously, is PCR. A lot of these processes are PCR based. You can make a PCR-free library, but typically there are some PCR steps involved in making the libraries, and then when you make the clusters on those flow cells I sent around, those little clusters are PCR based as well. So you'll see reads that start at exactly the same place, and what we do is look at pairs: if both reads of a pair start and stop at the same place, we assume it's the same template, collapse that down, and get rid of the duplicates. But there are PCR artifacts you'll see. The biggest problem is misaligned reads: the aligner puts something in the wrong place and it looks like it's creating a variant, and if that happens often enough you really start to get evidence for a variant that isn't there. If it happens more in your tumor sample than in your normal sample, it can look somatic, especially if you're trying to call things down around 10 percent. Some of the things that help you identify these artifacts are strand bias, where you only see the variant on one strand and not the other, or clusters, where one read has three different variants in it. You don't want to blindly filter all of those out, some of them are real, but those are things to watch out for. There are some tricks. There are about three different papers doing very similar things; this is just one of them. If you incorporate a random tag at this point, when you're making your libraries, then through all your PCR and sequencing those random tags carry forward: if you see reads that start and stop at the same place and the random tags are the same, then you collapse them;
whereas if the random tags are different, you say they're independent templates and you keep them. This allowed them to go much deeper in what they could call, because they could identify truly independent events. One thing you really have to think about is that if you're doing targeted sequencing with PCR, every read starts and stops at the same place, so you can't collapse duplicates at all unless you do something like this in amplicon sequencing. All right, so now we've got relatively cheap sequencing, relatively easy, high throughput. That was around 2007; the first of these platforms had launched and things really started to take off, and the idea was: let's see what we can do to really push cancer genomics forward. So a meeting was held here in Toronto to talk about whether we should try to coordinate cancer sequencing efforts internationally. There are a lot of reasons for that; the one I'm most interested in myself is standardization of how we do our variant calling, because, and I don't know if someone's going to talk about this later, the bottom line is that if you give the same reads to different centers, right now they'll all call different answers.
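The random-tag collapsing just described is easy to express in code: reads are grouped by start, stop, and tag, rather than by coordinates alone. A minimal sketch with invented reads; real pipelines work on alignments, not tuples.

```python
# Duplicate collapsing with random tags (UMIs): reads with the same start/stop
# AND the same tag are PCR copies of one template; reads with the same
# coordinates but different tags are independent molecules and are kept.

def collapse(reads):
    """reads: list of (start, stop, umi) tuples. Returns the unique-template count."""
    return len({(start, stop, umi) for start, stop, umi in reads})

reads = [
    (100, 250, "ACGT"),  # template 1
    (100, 250, "ACGT"),  # PCR duplicate of template 1: collapsed away
    (100, 250, "TTGA"),  # same coordinates, different tag: independent template
    (300, 450, "GGCA"),  # template at another position
]

print(collapse(reads))                     # tag-aware: 3 templates survive
print(len({(s, e) for s, e, _ in reads}))  # position-only collapsing: just 2
```

The gap between the two numbers is exactly why tags matter for amplicon data, where every read shares the same start and stop by construction.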
It's interesting when you look at the data; you'd be absolutely amazed how different the calls are from the same data, even using the same pipeline. But if we can standardize and agree on what quality means and how to do this, we'll be able to merge data sets across cancer types very readily. Out of that meeting came the idea that it would be a good idea to coordinate, and frequently that's all that happens at those meetings, but this one actually ended up with a consortium. The goal of the consortium was to sequence about 50 different tumor types, with their matched controls, doing about 500 of each. So it's like doing 50,000 human genome projects, but it was doable. There are now a number of countries involved: I think there are 39 projects, across 17 countries. This is the list of all the projects. This is Canada; I'll talk a little bit about the pancreatic project, and we're also doing a prostate project and medulloblastoma. I'll point out that Australia is also doing a pancreatic project, the same tumor type, and we coordinate quite closely. There's a website where you can get access to the data that's been deposited, and it's a growing amount of data. TCGA, The Cancer Genome Atlas in the US, is part of the ICGC, and their data is deposited here as well, mirrored from their site. You can go and pick any of the projects and find out more about it: who's doing it, what their goals are, and so on. Just really briefly, the OICR is just up the street, and it actually looks like this now. This is an artist's drawing; the building has been on the go for five years or so now, and it got delayed.
But we're moving into the fifth and sixth floors of it from our current space here. We'll keep that space, but this has been a long time coming, and it actually looks like that now. If you go up to the corner, my office is right there. I get the nice corner office; it was just given to me. The OICR has a lot of programs in it. I'm going to talk mostly about genome technologies and bioinformatics and biocomputing, and a little bit about the high-impact clinical trials, but it covers everything. What you're interested in is the cancer genomics part of it, but behind all that there are immunotherapies, a medicinal chemistry group that takes some of the targets and turns them into better drugs, a big imaging program working on how to detect things earlier and in new ways, even epidemiology: we have a big Ontario Health Study ongoing. So it pretty much covers all ends of it, right down to the clinical trials. Okay, to show what we have, these are my toys. This is what the genome technology platform looks like. We have ten HiSeqs, two of which are 2500s; we could do roughly 650 whole genomes a year on that platform. We've got two MiSeqs, which are the fast-turnaround machines. We have a PacBio, which is the long-read technology. We've got a couple of Ion Torrents, and we do have a Proton, though we haven't run it yet; it's just being installed. So we have these various platforms. The PacBio is an obvious choice for long reads and different applications, but it's also nice when you're doing verification. You do your sequencing, call variants, and get a list of them, and there will be false positives in there for sure, so one thing you'd like to do is go back and use a different technology to analyze them. The cleaner the data are up front, the cleaner the data are at the back, and that makes your informatics a lot easier.
So we tend to do that on a separate platform. We'll do full genome sequencing here, discover a bunch of variants, then order an AmpliSeq reagent to amplify across those positions in the samples, and sequence them on the Ion Torrent. If we see the same variant, it gives us the warm and fuzzy feeling that we've actually called something correctly. I should also point out our compute: about 8,000 cores, and three and a half petabytes of storage, which sounds like a lot, but it's not; we're always trying to delete things. All right, so pancreatic cancer. I'll give you a brief overview of one of our projects. When the ICGC started, we had to decide what cancer type we were going to do. We chose pancreas because there's quite a bit of local expertise: the hospital next to us is one of the primary sites in Ontario for what's called a Whipple operation, which is the primary curative surgery for pancreatic cancer. If you're not from here, or not from Canada, Ontario has a single-payer system, roughly 14 million people all under one health care system, and they've centralized certain care in certain places. For pancreatic Whipples the primary site is right up the street here, and they find that the more Whipples you do, the better you get and the better it is for the patients. So we get access to a lot of samples, and we decided to go for pancreas because of that local expertise. It also has a five-year survival of about two to five percent.
So it's one of the most lethal cancers. It's not that common, about one in 50 new cancer cases, but because its survival is so poor it actually accounts for about six percent of cancer deaths, and I think it's the fifth biggest killer after the common ones. One of the reasons is that when people come in ill and get diagnosed, only about 15 percent of them are candidates for surgery, because in the rest it's already too advanced. Even with surgery, the mean survival is only about 15 to 20 months. If it's locally advanced, they really can't do much: major blood vessels come together right there, and if the tumor is wrapped around them they can't operate, and those patients last about a year. Sixty percent come in with metastatic disease, where it's already too late, and these tumors do not respond very well to drugs. Gemcitabine is the primary one used, but not very many people actually respond to it. So we needed samples. I apologize, going onto your template slides messes some of my formatting up. Anyway, one thing you need for any of these projects, and it should say 375 here, is lots of samples, and lots of good samples. With the ICGC we want specific consent for DNA sequencing, and that eliminates a lot of the old samples, because historically people collected samples without ever thinking they'd be doing whole genome sequencing. So we had to start collecting from the beginning, and we reached out to several centers to collect as many samples as we could. Our target was 375; Australia's target was also 375, so between the two projects combined we'd have 750. So we set up a pretty typical project.
We had primary tumors resected, and we got germline material from the biobank, usually blood, sometimes adjacent tissue. These are all from the 15 percent of cases that were surgically resected; we tried to capture all of them that came through. Then we did exome and whole genome sequencing, looking for structural variants, copy number, transcription, epigenetics, all the things you'd want to do. But when we started the project, it became clear that although we could get samples, we couldn't get really good samples. Pancreatic cancer, and prostate cancer is the same, is difficult to deal with because of the cellularity. We talked before about how that affects your sensitivity: the amount of tumor in the sample is actually quite small. This is just a subset of our tumors, but it's representative of our entire spectrum. This red line is 20 percent, so all of these tumors are less than 20 percent cellularity, which means your signal is only 10 percent, and that's before you even think about heterogeneity. Even at 40 percent, you can see almost all of them fall below that line. We did a mixing experiment: we took a cell line and its matched normal and mixed them together at various proportions. We knew from sequencing the cell line roughly what the ground truth was, and we looked to see what we could detect of the variants we knew were in there. You can see that right around 20 percent cellularity, our pipelines were only picking up about half of the variants that we knew were there.
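That mixing-experiment result (only about half the known variants detected at 20 percent cellularity) is roughly what a simple binomial model predicts. A sketch with illustrative thresholds; the 30x depth and four-supporting-reads cutoff are assumptions for the example, not the project's actual pipeline settings.

```python
# Why low cellularity hurts: the chance a caller even *sees* enough variant
# reads drops fast as the allele fraction falls. Simple binomial model with
# an illustrative requirement of >= 4 supporting reads at 30x depth.

from math import comb

def detection_prob(depth, vaf, min_alt_reads=4):
    """P(at least min_alt_reads variant reads), with reads ~ Binomial(depth, vaf)."""
    below = sum(comb(depth, k) * vaf**k * (1 - vaf)**(depth - k)
                for k in range(min_alt_reads))
    return 1 - below

print(round(detection_prob(30, 0.25), 2))  # decent purity: detection is reliable
print(round(detection_prob(30, 0.10), 2))  # 20% tumor -> 10% VAF: many variants missed
```

Under these assumptions, detection is near-certain at 25 percent VAF but drops to well under half at 10 percent, which is in the same ballpark as the false-negative rates the mixing experiment showed.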
So obviously all these samples here are not going to be very good; especially down here it's very difficult, with a very high false negative rate. But we pushed on with it, because that's all we had at the time. As I said, we were collaborating with Australia, and Baylor College of Medicine was also doing pancreatic sequencing, so we hooked up with them, and between the three of us we did an analysis of about 90 samples, which was published at the end of last year. It's a typical project result. These are the known drivers in pancreatic cancer at this end; we saw those, of course, and then you get this long tail of genes that are each mutated in maybe five percent of the cases. This is typical of any sequencing project right now, and the thing we've got to figure out is what this long tail means. Are any of these drivers? They can't all just be passengers, because they're too common. And it could be subtypes; we think we're starting to see subtypes of pancreatic cancer. Going into it, we had no idea how many subtypes there might be. There was a study done in 2008 which was a pretty heroic effort, actually: straight PCR-based, Sanger-based sequencing of the exome. This was from Hopkins; they did this on several cancers, and for pancreatic they did 24 samples, which were all cell lines or xenografts, for reasons that will become apparent in a minute. So we compared ourselves with that. The red genes here are the things that they saw and we also saw in a significant proportion of our samples; the black ones are the ones only they saw, and the blue ones are things that only we saw. Now, why did we see more?
Well, we did more samples, but I think it's largely also because ours is all based on primary tumors, whereas theirs were all cell lines and xenografts, so there could be some bias in what gets selected to grow in those systems. At least ours were all done on primaries. I won't go through the details of the paper, but we did actually find something new in pancreatic cancer: the axon guidance pathway came up as one that was being frequently hit, and there was also evidence from other functional assays showing that these are important genes in this pathway, which gives some credence to the idea that axon guidance is really involved. The SLITs and ROBOs in the axon guidance pathway were already known in cancer; they've been seen mutated in lung cancer, and in our own data some of these things made sense. Here's more evidence that we seemed to be on the right track: in this pathway, ROBO1 signals downstream for decreased cell adhesion, increased Wnt activity, and increased cell motility, all things that cancers like to do, and ROBO2 inhibits that. So over here, if ROBO2 expression is high, those pathways are inhibited and the patients do better; if it's low, those pathways are not being inhibited and the patients do more poorly. And ROBO3 is an inhibitor of ROBO2, so when ROBO3 expression is high, you see the survival curves flip. This made sense in our patient data set, so it does seem to be an important pathway. All right, so enrichment. We talked about sensitivity; this again is that same pancreatic cancer sample, and you can see that most of it is not the cancer we want to study. So we tried a few things. You can take a slice off the top, do an H&E stain, look for the region that's got the most tumor in it, and punch it out
with a biopsy core and then try to sequence that. It didn't help much in pancreatic cancer; it's uniformly bad, so almost anywhere you punch you don't get much advantage, maybe a five or ten percent increase, but it's still not enough. You can use antibodies against cell-surface markers; that works on fresh-frozen material. You can take markers that are specific to epithelium, which is what the tumors derive from, and pull those cells down. That worked somewhat, but it has its own issues: the tumor may not be expressing that cell-surface marker any more, so you may just pull out subsets. So we introduced xenografts into our project. This is where you take the primary tumor, put it in a mouse, and grow it up, and from those you can also derive cell lines, so it gave us more reagents. We believe, or we hope, that the primary tumor and the xenograft are quite representative of each other. They will be different, this is something that's growing in a mouse, right?
So it is different, but hopefully we get more material, and we can grow up as much as we want. Importantly, we're also trying to develop preclinical models. These are models where, once we've sequenced the tumor and it's growing in the mouse, if we find a pathway like axon guidance and we've got a drug that can hit it, we can go to the mouse and actually try to prevent the tumor growth. And did it work? I'm not a pathologist, but they tell me this is the primary tumor here, and in the mouse you can see there's a lot more tumor tissue, and they tell me the morphology is very similar; I'll take their word for it. But did it work for us, did it help? This is an example of sequencing in primary tumors, xenografts, and cell lines, just looking at KRAS, which is pretty much a universal mutation in pancreatic cancer. There are exceptions, about three percent don't have KRAS mutations, but 97 percent do. You can see that sequencing the primary, in our pipeline, we detected the KRAS mutation in only about half of them, whereas in the xenografts
we detected it always, and in the cell lines we detected it always, so clearly there was an enrichment going on. If you go back and look at the raw data, for example here, there were 227 reads covering the KRAS locus and three of them showed the mutation, which is way below what our pipelines could call. So there was definitely an enrichment. One of the problems that came along with that is that pancreatic cancer doesn't like to grow as just a lump of tumor; it likes to grow all interdigitated with the stroma, so it recruits mouse tissue as stroma. You can see here the estimated amount of mouse tissue that's in there. So we now had another problem: although we didn't have normal human tissue in there, we had a lot of mouse tissue, and what we found, of course, is that a lot of your reads don't align to the human genome, so you do a lot of sequencing that you just throw away. But more importantly, this is a region of alignment here, reads aligning nicely to the region, with just a few random errors. It's nice clean sequence, and you can very clearly see a variant here, at, I don't know what the percentage is,
it looks like almost 50 percent. So you'd call that as a T-to-G variant at that position. But if you take the hundred bases around it and align them to human and mouse, they align quite well, and the only difference between the two species is that T-to-G. These are just mouse reads aligning to the human genome, and the variant caller sees them as a variant. So we had this problem with the mouse. We sequenced mouse again, and using that data we took the mouse genome sequence, aligned it to the human genome, and ran it through a pipeline. You get around one percent or so of the mouse genome aligning, which isn't very much, but as you'd expect it prefers to align to the exome, where there's more conservation in the genes. And if you just run it through your pipeline and call SNPs, you'll call somewhere between 300,000 and 600,000, around half a million, of what we call interspecies SNPs. These are things that will come through your pipeline, and if you're comparing a normal to a xenograft, they're going to look like somatics, because they're only in the tumor sample. So we had to develop pipelines to filter that out. This is just one example, showing calls with our pipeline before and after removing all the ones we knew were mouse. We're also now getting much better at removing this informatically: we use a software package called Xenome, which uses k-mer analysis to remove the mouse reads, and it works quite well, so the data is getting much cleaner, but I still think we have some bleed-through. All right, so laser capture and flow sorting. These are our two techniques to get at the tissue: if this is a tumor, we can pull out just the tumor parts and ignore the rest, or we can sort it and get out pure tumor. That's great.
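On the interspecies-SNP problem above: the simplest defense at the variant level is a blacklist of positions where aligned mouse sequence differs from human, against which somatic calls are screened. A minimal sketch with made-up coordinates; note this only filters calls, whereas Xenome removes the mouse reads themselves by k-mer classification.

```python
# Sketch of an interspecies-SNP filter: drop any somatic call that lands on a
# position where the aligned mouse genome is known to differ from human.
# Coordinates and alleles here are invented for illustration.

mouse_mismatch_blacklist = {("chr7", 55242465), ("chr12", 25398284)}

calls = [
    ("chr7", 55242465, "T", "G"),   # matches a known human/mouse difference: filtered
    ("chr17", 7577120, "C", "T"),   # not on the blacklist: kept
]

kept = [c for c in calls if (c[0], c[1]) not in mouse_mismatch_blacklist]
print(len(kept), kept[0][0])
```

In practice the blacklist would be the several hundred thousand interspecies positions mentioned in the talk, generated once by aligning the mouse genome to human and calling the differences.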
Problem solved? Not really. When we first started sequencing, you needed microgram quantities of DNA. If you're going to sequence a mouse, for example, you can snip off the tail; the original pipeline would take about 10 micrograms going in, and now a typical automated pipeline needs three micrograms. But from these sorts of biopsies, a fine-needle aspirate, or laser capture of specific regions, or pieces off an FFPE block, you're looking at nanogram quantities, so there was a big mismatch between the two pipelines. We could get material, but we couldn't get enough to do real whole genome sequencing. We could get enough for targeted work, a few genes would be fine, but if we really wanted to sequence whole genomes there just wasn't enough material. So we spent quite a bit of time in my group working on that. Right now, as it says somewhere in here, it's about three micrograms: we have an automated robotic pipeline that makes libraries for us, but there's a lot of waste in it, so we need about three micrograms going in, and you'll never get that from flow sorting or laser capture. We also had the prostate biopsy project, where they typically could give us 100 nanograms. We didn't know that when we started the project, but that's what they started giving us. From flow sorting we're looking at thousands of cells. We're also interested in circulating tumor cells, which I'll talk about a little later, and we've been asked by several projects about single-cell analysis.
That would clearly have to be amplified, but still, there's not a lot of material, and with FFPE, which is the bulk of the samples out there in the world, you get low amounts of poor-quality DNA. So we had to put a lot of effort into that, and we can now routinely take 15 nanograms and make a really good quality library by hand, but it is a manual process, so it's hard to do lots of them. Formalin-fixed, paraffin-embedded: most of the cancer samples out there, the big collections, are like this. The sample comes out, the pathologist drops it in formalin, which is a terrible thing to do from the nucleic acid side, but it holds the structure they're used to looking at, and then they embed it in paraffin. Actually, this one back here, I have to go over there, this one is embedded in a thing called OCT, optimal cutting temperature compound. It freezes at a temperature that allows the tissue to be sliced very nicely, but pathologists won't use it because the morphology looks a little different, and if they're trying to make a diagnosis they want what they're used to, and what they're used to is FFPE. So, formalin-fixed and then paraffin-embedded. If you're working with these samples, which really are the bulk of the collections out there, you'll find that if they're older than about 10 years you'll get really poor data out of them, and it's really hard to make libraries from them. If they're three to five years old, you can get pretty decent material out of them. And that really has nothing to do with age.
It has more to do with the way the samples were processed. Formalin is terrible to DNA and RNA: it crosslinks it and breaks it up. About 10 years ago they were using unbuffered formalin, and they'd throw the sample in, go away for a week or go on vacation, then come back and take it out. Now it's more like 24 hours; sometimes, if it's a Friday, it'll sit through the weekend, but at most places someone will come in and take it out after 24 hours, and they use buffered formalin. It still damages the DNA, but it's not as bad as it was. These are just things to be aware of. When you're using older samples, the best thing is to take a slice off the top and throw that one away, because of the oxidation that occurs, and use the next slice. So you can get decent material out of them, but not very much. We could do manual preps, but of course we're not the only ones interested in this, and there is something out there called Nextera, which Epicentre developed; they were bought by Illumina primarily for this protocol. It uses a transposon that integrates into the DNA and brings in a little bit of sequence that is engineered to be a tag, which you can then amplify to put on the sequencing adapters you need, and this becomes a library. It shears the DNA by what they call tagmentation: wherever there's an insertion there's a double-stranded break, so the DNA gets broken into smaller pieces. The standard input is 50 nanograms, which is quite nice. It does have a PCR reaction involved, but it's a low-input protocol and very easy, minutes instead of hours. You can do 96 at a time very easily, and it's not that expensive, about 70 bucks a library. So we played around with that a while back.
It does have a bit of a bias: the transposon likes to insert in AT-rich regions, so there is a bit of a GC bias. I don't think I have a slide to show you, but we've been looking at that hard lately and it's not that bad; it's acceptable. This is what a library looks like, with some of the metrics, so you can see some of the data that comes off. This is a 50 nanogram input. This is the Bioanalyzer run showing you the fragment sizes; these two spikes are the standards. You can see the peak, and it's fairly reproducible between the two different libraries here, but it's fairly broad. That's one of the characteristics of Nextera libraries: the peaks are quite broad. This has some implications for structural variant calling. If you're trying to call a translocation between chromosomes, it doesn't really matter how far apart the reads are; you can just tell they're on different chromosomes. But if you're trying to detect insertions and deletions, you want to use the distribution of fragment sizes, and the tighter that distribution is, the better your data. So it's a trade-off: we can make libraries rapidly, but they're not great for calling small structural variants. Large ones are fine; if it's 100 kb you'll still be able to detect it. This is the sequencing output, in case you haven't seen these. This is the insert size here, so this library is about 443 bases. And this is an important metric, the reads per start point. It's exactly what it says: a measure of the average number of reads that started at the same place in the sequence. It gives you a measure of the complexity of the library, and it relates to that PCR-duplicate collapsing. You want it as close to one as possible.
This is very good, 1.02. You definitely want to be below 1.2. If it's above 1.2 you're going to have trouble with your library, because you saturate the library very quickly. So you're sequencing, and here's four lanes of the same library, but your yield decreases the more you put in; you just keep getting the same thing over and over, and you collapse it out. So if it's higher than 1.2, you probably want to make a new library. This is all generated, you know, we pull it together in our own format, but this comes off the instruments, just off your first alignment. You need some kind of QC pipeline to quickly look at what you're getting. This is the yield, the total yield in gigabases per lane of aligned reads, and we watch that too; all of these metrics are looked at. We've got the percent mapping, the percentage of the reads that actually mapped to the genome. These ones are a little low, actually; I think at this low end here we're reading through into adapters, which makes them hard to align. We're working on making that a little tighter. We played around with it quite a bit, and 50 nanograms is the typical input, but you can do it with five nanograms. There's a funky little peak here, I don't know why, but it turns out, as you can see, it's quite broad. You can even do two nanograms and get decent libraries. You have to control the ratio of the transposon to the DNA, because that's going to determine your insert size.
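To make the reads-per-start-point metric concrete, here is a minimal sketch in Python. This is not the actual QC pipeline; the alignment-record format is invented for illustration. It computes the metric alongside percent mapped:

```python
from collections import Counter

def qc_metrics(alignments):
    """Compute two simple library QC metrics from alignment records.

    Each record is a (chrom, start) tuple for a mapped read, or None
    for an unmapped read.  Reads per start point is the mean number of
    reads sharing the same start coordinate; values much above 1.2
    suggest PCR duplication, i.e. low library complexity.
    """
    mapped = [a for a in alignments if a is not None]
    percent_mapped = 100.0 * len(mapped) / len(alignments)
    starts = Counter(mapped)                      # reads at each start point
    reads_per_start = len(mapped) / len(starts)   # mean reads per unique start
    return percent_mapped, reads_per_start

# Toy example: 5 mapped reads over 4 unique start points, 1 unmapped read
reads = [("chr1", 100), ("chr1", 100), ("chr1", 250),
         ("chr2", 40), ("chr2", 90), None]
pm, rps = qc_metrics(reads)
print(pm, rps)   # rps = 1.25: this toy "library" would look over-amplified
```

In a real pipeline these counts would come from a BAM file rather than tuples, but the arithmetic is the same.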
So we can actually move this peak around pretty much anywhere you want. It always stays broad, but by changing the ratio of transposon you put into the DNA you can shift it; it's quite a linear relationship, actually. Making me do math this early in the morning... There's about six picograms of DNA in a cell, so a thousand cells is six nanograms, and two nanograms is one third of that, so it's about 300 cells, if I did my math right. Maybe I'm off by a factor of ten, but I don't think so on that one. So we sequenced these libraries, and the five nanogram one wasn't bad: pretty decent insert size; again, the standard deviation was higher than we would like, but it's so simple we'll accept it. The number of reads was here, the percent mapped was pretty good, average for these types of libraries, and the reads per start point got a little high at 1.1, but that will still work. The yield wasn't bad. For the 1.2 nanograms, you can see that the reads per start point was unacceptable; this one was over-amplified. We did too many cycles of amplification, so we dialed back on that, and we can get more like this out of 1.2 nanograms. And it's an important point that Francis raises: as we push this down further and further, you have to start thinking about how many genomes you're sampling. You can do a hundred picograms and do targeted sequencing or whole genome off that, but how deep do you want to go? If you've only got a hundred genomes in your sample and you're doing targeted sequencing, there's no point in going 10,000x deep; you're just wasting it.
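The cell-count arithmetic above, assuming roughly six picograms of genomic DNA per diploid human cell, is easy to check:

```python
PG_PER_CELL = 6.0   # approx. genomic DNA per diploid human cell, in picograms

def cells_from_input(nanograms):
    """Roughly how many genome copies a given DNA input represents."""
    return nanograms * 1000.0 / PG_PER_CELL   # 1 ng = 1000 pg

print(cells_from_input(2.0))   # 2 ng input: ~333 cells
print(cells_from_input(0.1))   # 100 pg input: only ~17 genomes sampled
```

This is why sequencing a 100 picogram input to 10,000x depth buys nothing: you would just resequence the same ~17 template molecules over and over.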
You've saturated it completely. We didn't even worry about this until about six months ago, and then one day we started asking how many genomes are really there, because we're always looking for a thousand-x or five-thousand-x coverage, and when we started thinking about it, it's just crazy. So you do have to think about how many genomes you're putting in there. All right, laser capture. This is recent, what we've been trying to do to solve our problems. This is again a pancreatic cancer, marked up by pathologists; those regions are then captured. You can see they're cut out, leaving the stroma behind. This was just deep KRAS sequencing, to see how it worked out. The number of cells is estimated, the number of cells he's getting us, so these are various numbers of cells and various preps, and some of these numbers are a little different. But roughly, here's the yield; this is an important one. On most of the samples we were getting at least 100 nanograms, some 200 nanograms, and that's enough for us to work with and do the studies we want to do. The deep KRAS: this one actually had two KRAS mutations in it. You can see the heterogeneity: there was a subset of cells there that didn't have as much, and without this enrichment there's no way we would be able to detect those. The first problem was that we don't have a lot of material coming off; can we extract the DNA? The green line here is your theoretical yield, based on six picograms per cell; this is 50% yield.
We have a couple of failures, but most of them now are falling in here, so we're actually getting the DNA out, and we can sequence it. This is KRAS-targeted, but you can see that now the mutant is around 40 percent of the reads. This was a sample that was only about 10 or 15 percent tumor content, so if this were heterozygous it would only be about seven and a half percent of the reads, and it's up to 40 percent, which is pretty close to 50-50. So this is pretty pure stuff. And, you know, there's always variation here; some come out beautifully at 50-50. The depth here isn't very big, and this one's about the same, so it might be that either they're not quite pure, or just by chance in these two samples the sampling is giving us around a 40 percent variant fraction; if we went deeper it might get closer to 50-50. The other advantage is you can get some very clean data. So this is copy number; we'll zoom in on a couple. This is one here where we've lost some of chromosome 17 and have an amplification as well, and if we zoom in on that amplification we can see what it's doing.
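A quick sketch of the purity arithmetic behind that argument: for a heterozygous somatic point mutation in a copy-neutral diploid region, the expected variant allele fraction is half the tumor purity. This is a simplification that ignores copy-number changes and subclonality:

```python
def expected_vaf(purity):
    """Expected variant allele fraction for a heterozygous somatic SNV
    in a diploid, copy-neutral region: only the tumor cells carry the
    variant, and they carry it on one of two alleles."""
    return purity / 2.0

def purity_from_vaf(vaf):
    """Invert the relation to estimate tumor content from an observed VAF."""
    return min(1.0, 2.0 * vaf)

print(expected_vaf(0.15))      # 15% tumor content: ~7.5% of reads, as in the talk
print(purity_from_vaf(0.40))   # observed 40% VAF: ~80% tumor content after capture
```

The inversion is only a rough estimate; at low depth the observed VAF has wide sampling error, which is exactly the "just by chance" caveat above.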
It's ERBB2, an important driver in cancer, so that one's interesting. And then this one's quite interesting: a complete loss of one copy of chromosome 18. Imagine if only five percent of the data were actually from the tumor; it would be hard even to see this chromosome loss. But even within the remaining copy we can clearly see a deletion here, and if you map that deletion to the genome (we haven't mapped the exact breakpoints yet) it looks like it would probably take off the tail end of SMAD4, and SMAD4 is a common driver in pancreatic cancer. So, loss of SMAD4; it's looking like it's working. I think the take-home message is: the purer your sample, the better. If you work on leukemia, good for you; you can get lots of sample, nice and pure. All right, circulating tumor cells; I'll throw some current stuff in here. This is an instrument from Illumina. It's a pre-release, not a commercial instrument; it's called a MagSweeper. When they delivered it I didn't expect it to look as nice as it does, since it's not really a commercial instrument, but it does look like a finished instrument.
These are magnets, and these are plastic sleeves; the magnets go into the plastic sleeves, and then, using antibodies linked to paramagnetic beads, we can sweep through the sample (that's why they call it a MagSweeper) and collect the circulating tumor cells based on cell-surface markers. It goes through a series of washes and takes about an hour. This is a cell under phase contrast; all this background here is the magnetic beads that come through in the process and come off the magnet, but this is a cell that's been captured. If we use a red-glowing antibody to EpCAM, which is an epithelial marker, you can clearly see this is a cell of epithelial origin. The biggest problem here is that Phil has to sit down for hours under the microscope with a pipette and collect those; it takes him several hours to get around 20 of them. The typical patient samples we're looking at now may have anywhere from 10 to 50 circulating tumor cells in three mils, and we're trying to collect those. We haven't done much analysis on them yet, because we're still fine-tuning our protocols for taking 15 cells and working with them. I'll also talk about circulating tumor DNA. If you take a cancer patient and take some of their blood, you can detect circulating tumor cells, or at least epithelial cells, which are presumed to come from the tumor, and in many cancer types there is a correlation between the stage of the cancer and the number of cells. What we want to do is actually analyze those cells and look for the mutations, and the same with the circulating tumor DNA, which we'll get into in a second. Obviously, if you could detect these things, it's a great way to screen.
It's relatively non-invasive, just a needle stick. You can take some blood and screen it, looking for KRAS mutations, for example, which are common in pancreatic cancer and common in colorectal cancer. You can almost see it being a screen for most of the common variants in the population; it's getting cheap enough that you could almost screen everyone routinely. But what it's really useful for is as a monitoring tool. If someone comes in and you treat them for their cancer, you can then take blood over time, look for the mutations you knew they had at the beginning, and see if they come back. So it's more of a monitoring tool right now than a diagnostic tool, though someday it may be cheap enough. The question is, we need to look at enough normal people. If we took everyone's blood in this room and sequenced it, would we find some KRAS mutations? My prediction is probably yes, but you shouldn't be worried about it; I think our bodies are clearing cancer all the time. The problem is that as we turn up the resolution, the lower we go, the more we see. We're getting more questions than answers right now, and that's why it's a really exciting time, and it's great that you've taken this course and can analyze the data. As we go down we see more and more heterogeneity, and heterogeneity is very important. We're starting to drill really deep and see that in the relapse tumor those mutations were already there. So the question is: you treat someone with a drug; do they become resistant over time because the cells have accumulated mutations, or were those mutations already there, and all you've done is kill off those cells while this other subpopulation continues to grow? There's a great paper, the original BRAF paper.
I don't have it in here, but the original BRAF-in-melanoma paper is quite interesting to look at. There's a picture of a guy in there who looks like he's got golf balls under his skin all over him, and then they treat him with the BRAF inhibitor for that specific V600E mutation and they all disappear, in 15 weeks I think; the guy looks almost normal in 15 weeks. They show another picture at 23 weeks, I think it is, and he's got them all back. Every single one is back, and if you look at the pictures carefully, they're in the same spots. So these aren't new tumors coming up; they're all back, and all back at about the same rate. So I don't believe that all those tumors developed resistance to the drug; there's something else going on, and it was probably already there. But that brings up questions: was every single metastatic tumor heterogeneous, already carrying the mutation that gave the resistance? They didn't sample them, so we don't know what the mutations were. But it brings into question how a metastatic tumor arises. Is it a single cell that comes and seeds it? Then you think, well, how can that have much heterogeneity? Or is it a clump of cells that seeds it? These are open questions right now. We don't know, but it's a really fascinating paper to read. Sorry, yes, we are starting to do that right now. It's hard to get the material, but I've hooked up with a guy who's doing rapid autopsies. These are patients who consent, and within two hours of death they're brought into the morgue and essentially taken apart. They get kilograms of material from them; they get everything out of them.
It's really fascinating. They take the liver out and section it into half-centimeter sections, lay those down and image each one, and then they cut it into cubes and image those. So if I get a metastatic tumor from him, or a couple from the same liver, I know exactly where they were. It's a fascinating data set; he's got about 75 of those now. They've got primary tumor, metastatic tumor, invading lymph nodes, all that stuff. So I'm looking at lung and liver mets to see what the difference between those is. There are a number of studies; if you search the literature you'll find people who have sequenced mets and primaries, but usually it's one primary and one met, and they're trying to make conclusions from that. We're doing five mets from liver and five mets from lung in the same patient, plus the primary, trying to figure out how related they are, and why some patients never get liver mets. Why don't they? That's the kind of thing we're interested in. But the circulating DNA: we all have circulating DNA. Cancer patients end up with more, just because of the necrosis of the tumors, and if you look at that DNA, you don't get a huge amount out, but what you get is very fragmented as well. You might not be able to read the scale here, but that's about 180 base pairs, and you see at least sort of a mono-, di-, tri-nucleosome pattern.
It's degraded down to nucleosome size. That has important implications if you're trying to do targeted sequencing: a 500 base pair amplicon is probably not going to work, because there's not much material that big. The trick is making sure that when you isolate it, you don't lyse the normal circulating leukocytes and the like; you can tell, because there would be this giant peak here of intact, contaminating genomic DNA, which is just the normal DNA. So again, this is getting down to: how low can we go, and what are we trying to detect? This is another piece of equipment we have that's again pre-commercial release, but hopefully will come to market in the not-too-distant future. Here's a sample in which only 0.01 percent carries the mutant allele. If we sequence that, there's no way we can call it, right? If you look specifically at that position, like KRAS, you see it and you go, yeah, there it is, 0.01 percent. But then you say, okay, now let my pipeline tell me where else it sees 0.01 percent changes, and your false positive rate would be through the roof. So what this device does is isolate the mutant allele and enrich it, so when you sequence it your signal-to-noise is much better. This little cartoon, if it works (this is from them, I didn't do this), shows how it works. Here's your population of fragments, and these oligos, which are a perfect match to the mutant allele, so they have a single-base mismatch to the normal allele, are attached to a gel within the system. You take your samples, put them on the gel, and then they apply an electric field which is kind of rotating; this is their technology, called SCODA, and it makes a differential separation of the mutant and the normal allele. And if it works, I do have a little movie; let's see if it'll play.
Oh, I can't do this. Maybe it won't work. It worked the other day, on a PC; why wouldn't it work on a Mac? I don't get it. Maybe I'll crash out of here at the end and see if I can get it to work; I'll just move on for now. So this is their data, but we've actually done some spike-in experiments and are getting very similar results. This is 0.01 percent; these are synthetic samples they've mixed together, but it's 0.01 percent of a common KRAS mutation in a plasma sample. If you sequence it just on the MiSeq, you can see you get 130,000 reads that are the G, the wild type, and only 19 reads that are the variant. Because it's at such a low frequency, there's no way you're going to call that in next-gen data. And this is a wild type, just someone's plasma, period, and you can see the numbers are almost identical; there are 11. So those are sequencing errors, probably mostly sequencing errors and a few real ones, right? Using the on-target enrichment: after you run that chip and it pulls out just the mutant alleles, you can see here, this is the wild type, and there are very few of them compared to the mutant now, and in the plasma sample you can see a very big difference. So you're detecting real events here. It's a cool system, but not quite ready for prime time. All right, the trends we've talked about. This really should be updated; this has gone now through to 2013. The machines are getting bigger and cheaper, sort of flattening out now, as I said, but it was a bit of an arms race: who can produce the most data, who can produce the longest reads. Illumina won that race at the time, but there have been more and more systems coming out.
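To see why 19 variant reads out of roughly 130,000 is uncallable without enrichment, compare the observed variant fraction to the error background in the normal plasma. A small sketch using the approximate read counts quoted in the talk (the normal-sample total is assumed to be about the same 130,000):

```python
def variant_fraction(ref_reads, alt_reads):
    """Fraction of reads supporting the variant allele at one position."""
    return alt_reads / (ref_reads + alt_reads)

# Approximate numbers from the talk: spiked plasma vs. normal plasma
tumor_vaf  = variant_fraction(130_000, 19)   # the spiked-in mutant allele
normal_vaf = variant_fraction(130_000, 11)   # sequencing-error background

# The "signal" is barely above the error floor, so a naive caller cannot
# separate them.  Mutant-allele enrichment (e.g. SCODA) improves the
# mutant-to-wild-type ratio rather than the per-base error rate.
print(f"tumor: {tumor_vaf:.6f}  background: {normal_vaf:.6f}  "
      f"ratio: {tumor_vaf / normal_vaf:.2f}x")
```

A variant seen at less than twice the background error fraction is indistinguishable from noise without some orthogonal trick like physical enrichment.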
This is the Ion Proton, the HiSeq 2500, the PacBio, the Ion Torrent PGM, and eventually Oxford Nanopore. These are more moderate-throughput instruments with faster run times, and they have a lot more clinical applications. In our workflow on the research side, we don't really care if it takes a couple of months to analyze a tumor, because we've got a number of tumors in the pipeline and they're coming out all the time; if it takes a couple of months to go down through this whole pipeline, we don't care. But on the clinical side, especially in cancer, you have to be quick. These patients need to be put on something quickly, so we really have a week, though it's actually turned out to be less, as I'll show you. So here we are in Toronto, and just up the street is where the OICR is, in the MaRS building. This is the hole in the ground, which now looks kind of like that. We're surrounded by all these hospitals, and in particular I want to talk about our relationship with Princess Margaret, where we did some clinical studies. Typically, diagnoses have a lot to do with the site the tumor came from, plus limited molecular profiles; there are standard molecular profiles done depending on where the tumor was from. We mentioned melanoma and BRAF, especially V600E. Some of these BRAF mutations are targetable with drugs, so-called actionable, and they occur at multiple tumor sites. But what we're really interested in is this picture here, which I'm blowing up. This is BRAF again, very frequent in melanomas and targetable, but also relatively frequent in colorectal cancer. And so the question is, will those patients respond to a BRAF inhibitor?
The short answer is no, they won't respond, and it's because the EGF receptor is being expressed here where it's not being expressed here, and that gives a workaround for the drug. But if you hit them with an EGFR inhibitor and the BRAF inhibitor, at least in vitro, it should work, and those combinations are heading to clinical trial. So the idea is: if we can profile a tumor and not care so much where it's from, are there actionable mutations that the patient would respond to? We started a little project with them a couple of years ago, taking biopsies. These are advanced-stage metastatic patients; we're not going to change the standard of care right now. These are patients who have typically been through around three rounds of chemo and have relapsed or didn't respond. They have to be able to get a biopsy to us, and of course they have to consent to the study. We needed to calibrate our system. We had helped the diagnostic lab put in a Sequenom, a genotyping platform using mass spec, and they had a panel that would screen 238 mutations in 19 oncogenes. So we did a comparison; this is the PacBio, as I talked about earlier. We just asked, could we detect those? They gave us 30 blinded samples, some with no mutations, and didn't tell us which were which, and we tried to detect mutations in the same 19 genes. We sequenced the whole gene, not just the spot where the mutation was. Could we detect it? This shows the PacBio CCS, the circular consensus, where we read around multiple times to get high accuracy. We detected them all but this one. We went back and looked, and it was just our primer design; one of the amplicons wasn't working well.
So we fixed that so we could detect it. Virtually everything they detected, we could detect, so we felt the system was ready to go. Again, apologies for the font change here, but this was the original project with the PacBio, just to find out how well it would work. We were hoping half the patients would say yes; almost all of them did. The median age is 57, and they'd had disease for quite a while. Here's the number of treatments: as I said, the median number of prior treatments was three, and the most was eight, I believe. There were all different tumor types; we weren't doing leukemias, so all different solid tumor types. The cellularity of what we got was pretty good. These were FFPE biopsies, marked up by a pathologist, and then the tumor portion was scraped out, with an estimated cellularity of about 60 percent. We didn't actually extract the DNA; the CLIA lab did. Most of the biopsies required some sort of radiological intervention to guide them, so that was one of the bottlenecks of the whole thing. I already talked about FFPE and the ability to get samples out. Sometimes it's not a very big sample, and the amount of DNA we got out was quite variable, but in general we were able to get enough material. From blood you can get tons, of course; from the fresh biopsies and the archival samples a lot less. Still, we were getting somewhere around 500 nanograms on average, and that was enough for this targeted sequencing. If we were able to get DNA, we were able to successfully analyze it. We did find mutations in the first set of patients, and most of them were actionable, because we were looking at oncogenes that have a lot of drugs aimed at them, but we did find some novel ones; I'll talk about that. And the goal was to report back to the clinician within 21 days.
That's 21 days from the time the patient consented: a biopsy is done, goes through pathology and the CLIA lab, the DNA gets to us, we do the sequencing, and we report back what we found. We're not a CLIA lab, so they can't use our results for patient management; the results have to be validated in the CLIA lab and a report generated, all within three weeks. That was our goal, and two-thirds of the time we made the 21-day goal. As for the outliers: sometimes one of the outlying centers would consent a patient but then schedule them for a biopsy three months later, which kind of blows our timeline. We won't go through all of these slides, but of the first 50 patients, we were able to get enough material from 43, and six of them had their treatment decisions impacted. This shows the spectrum of mutations, and these are the ones that were impacted: we got partial response or confirmed partial response (that's the clinicians talking), and this is what we found. We found it in the biopsy and in the archival sample; in most of these cases there was about 90% concordance between the archival sample and the biopsy taken at a later date. These are some of the novel ones; I'll talk about that first one, AKT1, in a minute. The study continued for about two years, and this is the overall picture: we enrolled about 140 patients. At that point we were looking at 19 oncogenes on the PacBio system, and then we switched over to the MiSeq. The PacBio was the first real fast-turnaround instrument (we only had three days for the sequencing, and it can sequence in about an hour), and then we went to the MiSeq and sequenced overnight. Then we expanded from 19 genes to 54 genes, and pretty much the impact is around 20% of the patients.
For those patients we found something that was considered actionable. Towards the end of the project we were doing it this way; I talked about the RainDance here. We had this hotspot panel they developed that covers 54 genes, and if you look in COSMIC, in those 54 genes the amplicons would cover around 13,500 entries. I talked about how it's done: the droplets are brought together, emulsion PCR, break the emulsion, and then put it on the MiSeq. You could do up to eight samples per run, though quite often it was only a single sample, because you can't sit around waiting for seven more samples; if you've only got one, you process what you get that week. We opened this up across Ontario, so there were a number of sites sending samples, and the logistics were being worked out; that went quite well. From an informatics standpoint, some things that seemed like minor challenges were actually quite difficult, like developing a tracking system. These are all redacted (they're identifiable fields, which is why the black boxes are there), but all these centers had to send samples in, we had to know they were coming and get alerted, and you're working across multiple firewalls, so it was quite a big deal to put together. But we had a nice tracking system where we could track samples, input the information, and generate a report, and the goal was to have, as much as we could, an automated process for generating the reports. These were then reviewed by an expert panel composed of clinicians and genomicists like myself, and there had to be a minimum of three clinicians in the room to decide whether there was an actionable mutation to report back. How am I doing time-wise? 10:30? We'll make it. So this is one example here.
It's a 53-year-old woman with platinum-resistant low-grade serous ovarian cancer, already metastatic, who had had multiple treatments before. We detected a KRAS G12D mutation, and she was started on a phase I clinical trial. Princess Margaret Hospital is one of the major sites in North America for clinical trials, so these patients are going to go on some clinical trial regardless; what we're trying to do, rather than flipping a coin, is use this information to guide them onto the appropriate trial. And this patient responded. This is the AKT1 mutation that we found. We could find no information on it anywhere, not in the literature, not in COSMIC, no matter where we looked. Looking at the gene itself: there is a relatively common AKT1 mutation, E17K, which is an activating mutation. Ours was in here, in the kinase domain, in an incredibly conserved region. It looks like a pretty severe change, a charge change, and, as I said, the position is absolutely conserved down through zebrafish, so you would predict it's probably detrimental to the protein. What you can't say is whether it's activating or inactivating: does it turn the gene on or off? If you have an inhibitor for this gene and the mutation is actually turning the gene off, there's no point in giving the person the inhibitor; all you're going to give them is the side effects with no benefit. But based on this information,
they were put on an inhibitor of the mTOR pathway, and although these are not in the same plane, as was pointed out to me by radiologists, it did actually shrink the tumor. Unfortunately, the patient continued to have issues, was ill, had to be removed from the drug, and subsequently died. All right, incidental findings. If you're doing exome sequencing, whole-genome sequencing, or even targeted sequencing, there's a big debate right now about what to do with all the information. If you sequence the exome of anyone in this room, you'll probably find 75 to 200 potentially deleterious variants that you would predict would impact gene function. If you do a whole genome, you'll probably find about 600 genes in any one of us that are impacted. Now, that doesn't mean it really affects us greatly; it may depend on your lifestyle, how much fat you eat, say, and if you don't, the genes that are impacted may never matter to you. But the question is how to interpret all that, and once it's interpreted, are you required to give that information back? This is hotly debated right now. There are ways you can think about getting around it: if you set up an analysis pipeline, there are genes you may not want to look at, so you could say, okay, I'll sequence the exome, those genes are in there, but I'm only going to let the pipeline put through the cancer results, for example, to treat this patient. From the point of view of cancer care,
that's all we care about up front. We've got to get this person on a drug within three weeks, so we're not going to try to interpret all the other stuff; we just want to find out whether there's an actionable mutation here. But in the longer term, these are things that people may or may not want to know about, and so consent forms have to start addressing that. Some people, especially with Huntington disease, do not want to know the outcome. And we're not a clinical lab either; we're a research lab, and for these tumors we're sequencing, many of the patients are dead in pancreatic cancer, but in other cancers these patients are still alive. If we find things that can impact their long-term health, cancer aside, or things that might be of interest to their siblings, for example a BRCA1 mutation, what's our responsibility to feed that information back to them? It's an open question at the moment. The American College of Medical Genetics has a list of, I think, about 50 genes where they say you're just morally responsible for reporting those back, but the mechanism to report back, especially on the research side, is not well defined. So, informatics: what did we have to generate just to do this project? I already talked a little about the tracking system itself. We made a consequences database, really geared at the common mutations, and early on it was easy.
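One way to implement the idea above of only letting the pipeline put through the cancer results is a report-time gene filter. This is a minimal, hypothetical sketch; the gene sets are illustrative and are not the real ACMG or clinical panels:

```python
# Illustrative gene sets -- NOT the real clinical panels
ACTIONABLE_CANCER_GENES = {"KRAS", "BRAF", "AKT1", "EGFR", "ERBB2"}
MASKED_INCIDENTAL_GENES = {"HTT", "APOE"}   # e.g. findings the patient did not consent to

def filter_for_report(variants):
    """Pass through only variants in the actionable cancer panel;
    explicitly drop genes masked for incidental-finding reasons."""
    report = []
    for v in variants:
        if v["gene"] in MASKED_INCIDENTAL_GENES:
            continue                      # never surfaces, even though it was sequenced
        if v["gene"] in ACTIONABLE_CANCER_GENES:
            report.append(v)
    return report

calls = [{"gene": "KRAS", "change": "G12D"},
         {"gene": "HTT",  "change": "CAG expansion"},
         {"gene": "TTN",  "change": "missense"}]
print(filter_for_report(calls))   # only the KRAS G12D call reaches the clinician
```

The point of structuring it this way is that the consent decision becomes a configuration choice (which sets the genes belong to) rather than something each analyst has to remember.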
We had 19 genes and 238 mutations from that panel that we had tested, but we needed a system that would automatically generate a report. So we brought in all the information and then enlisted the help of a bunch of clinical fellows, who took a subset of the mutations, went through and read all the papers, and tagged everything together so that we could quickly generate the report. But you can see that this would amplify very quickly as you start doing exomes; there isn't that great annotation out there, so that's a great need. And what you don't want to do, from Steve Friend's publication: you can't return this to a clinician. You need a report that's clean and simple. This is all the information that might be of interest and help interpret the results, but the clinician is not going to sit down and try to understand it for several hours; he's got five minutes. He wants a report. So somehow we have to do all of this up front, automated, in order to generate that report. And the report we generated was quite simple: it gave a little information about frequency, gave background, gave links to some of the literature, and also noted whether there were clinical trials available that they might put the patient on if there was an actionable drug. All right. So we talked about data complexity; we talked about data volumes increasing. As you'll see, probably some of your exercises will be "dumbed down", in quotations, to not use a full dataset; you might just zero in on part of it. I guess you're using the cloud this year, so maybe you can do more comprehensive analyses, but clearly some of these runs and analyses, like even alignment, take 24 hours, right?
For a whole genome, depending on how much data you have, you're really starting to stress the compute and storage resources. The cloud has its own issues; again, it's a debatable question whether it's secure enough for people's information. You've got someone's whole genome up there, all their genetic information: is it acceptable to put that on the cloud? You know, all your banking information is on the cloud, so hopefully it is. Then the analyses are very complex. Even RNA-seq, for example: it's not just about getting the transcript profiles. You get better data than with an array, but you also want to look at differential splicing, differential allele expression, RNA editing, fusion transcripts; you can call variants, and all of that. So from just one run you get a lot of data that's complex, and you have to figure out how to put it together. Validation: we talked a little bit about how the terms really should be distinguished. Verification is where you're confirming that the variant you detected in that sample is true; validation would be, I'm going to take that, because I think it's an important variant or an important gene, and test it across a cohort of a thousand samples. So you need lots of assays to deal with that. And as you go through the week here, you've got to consider the bigger picture. It's not just about collecting all the variants and looking for a common variant or even just one common gene; there are some good examples of that, people get lucky, but in general you've really got to put it in the context of the whole thing, in pathway analysis. We talked about data privacy. So let's get back: what needs improvement? Everything. Absolutely everything, so take what you learn here, go away, and make something better. Alignment needs to be improved, and things aren't bad;
it's not a terrible situation, but there are mistakes made in alignment, all of those propagate through, and I think they are a major contributor to false positives. Calling low-frequency variants: really digging down into the low-frequency heterogeneity. Structural variation detection is poorly handled right now. There's lots of software out there; there are 70 aligners that you can go and read about, but for only half of them can you actually download a functional piece of software, and probably a quarter of those actually work. So there's just lots of room for improvement. RNA-seq differential splicing, variant calling, and RNA editing are all such things. And the big one, I think, is functional annotation: all these variants we're detecting, what are they actually doing? Clearly you'll never completely eliminate the need to go into the laboratory; like that AKT1 mutation, we're going to have to do some lab work on that. Structural variation detection has been limping along, so translocations, large insertions, large deletions, things like that.
They're poorly detected right now. And then the big challenge is bringing it all together: how do you take all this different information, including the clinical data, and put it all together to come up with a diagnosis? All right, unless you have questions. [Audience question about how deep to sequence.] It depends what tumor type and what the cellularity is. Yeah, that's pretty good. You've got lots of real estate there, so you can go really deep. I think you need to be in the 500× to 1,000× range, and going higher than that probably doesn't help you a lot. You can go 10,000× and you'll see all sorts of things, but you're starting to believe the false positives, because the random errors aren't completely random, and they start piling up. FFPE introduces a little complexity in that the DNA is damaged, and so you do start seeing a higher background error rate. And you have to look at every base in the amplicon. So if you're doing amplicon-based targeted sequencing, you need to plot the A/C/G/T counts at every single one of those bases across as many normals as you can, and start to understand what the variation is, and then across all of your FFPE samples. You'll start seeing that it's not just that A's in general tend to be destroyed; it's sometimes sequence context, so a particular trinucleotide signature will be hit more frequently than others. You've got to start understanding that profile, and once you understand the profile, you can set your bottom end at how low you can go.
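The per-base error profiling just described can be sketched as follows: for one amplicon position, tally the non-reference base fraction across a panel of normals, then set a detection floor at, say, the mean plus three standard deviations, and only trust observed allele fractions above that floor. The numbers below are toy values for illustration, not real FFPE data, and mean-plus-3-SD is just one reasonable choice of threshold.

```python
# Sketch: estimate the background error at one amplicon position from a panel
# of normals, and use it as the floor for calling low-frequency variants.
from statistics import mean, stdev

# Non-reference base fraction observed at this position in each normal sample
# (toy values; a real profile would come from pileup counts across many normals).
normal_error_fractions = [0.002, 0.004, 0.003, 0.005, 0.003, 0.004]

def detection_floor(fractions, n_sd=3.0):
    """Lowest allele fraction distinguishable from background at this site."""
    return mean(fractions) + n_sd * stdev(fractions)

def is_callable(observed_vaf, floor):
    """A candidate variant is only believable above the site-specific floor."""
    return observed_vaf > floor

floor = detection_floor(normal_error_fractions)  # roughly 0.007 here
# A 5% variant clears this floor easily; a 0.4% one is within the noise.
```

Because FFPE damage is sequence-context dependent, the floor has to be computed per site (or per trinucleotide context), not as one genome-wide number.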
It's pretty easy to call variants down to 5%. You can probably call variants quite readily down to 2%, but around 2-3% your false positive rate starts to go through the roof, and it's a trade-off of how much you want to know about the heterogeneity versus how much work you want to do in validation to show that they're real. So something around a 5% cutoff, if you've got enough samples and you just want to capture most of the events; if you miss it in one sample, you say, I'll catch it in another one at a higher frequency. So it depends on the study design, but yes, you do have to be a little careful with FFPE in these old samples. Anything over 10 years old, you almost can't get data out of; you'll get about a 30% success rate just getting it sequenced. Five years should be good, but you're definitely going to see a higher background error rate. Let's take a quick question. [Audience:] Not all of the genome is sequenced still, and there's a lot of bias from the PCR processes you have to use for sequencing. How much are we actually missing? When you do a whole genome analysis, how much of the genome do we actually capture? Well, it depends how you define the genome. The genome has been sequenced; it was done clone-based, right? But there are still gaps in there, and also about 70% of that sequence comes from one individual and the rest comes from about another 12, so it's a very small subset of the population. There are chunks of the genome that weren't in those individuals and are missing, and there's an effort to figure out how we can better represent the genome, but until we do, I don't know what percent it is precisely. When you do a whole genome sequence and align it with the standard aligners,
you can get alignment to a lot of the genome, low-to-mid 90s percent covered. There will be more and more errors in certain regions, more and more misalignment, but you could easily be looking at good data on 85 percent. The stuff that's not aligning or not really usable is mostly in very repetitive regions, and it's dubious whether that's important. It could be, but you can't even develop primers to go in and verify it; that's part of the problem, because it's so repetitive. So, and I'm sure you'll see this in your analysis, you'll see mucins mutated, you'll see olfactory receptors, things like that, which are just gene families of high similarity, and those are mostly just misaligned reads. Some of them may be real, but if you're doing cancer and an olfactory receptor is mutated, do you care? Maybe; if you see it in a lot of your samples, it's either a systematic error or it's actually important and we don't understand why. So it's a trade-off between filtering your data and not over-filtering your data, trying to be open-minded and not biased in your interpretation. Saying "olfactory receptors can't be involved in cancer" could be wrong; one could actually be a real driver in cancer, we just don't understand why. That's right, quick question. [Audience question about whether validation assays are dogged by the same kind of bias.] Well, if you do whole genome sequencing on the Illumina, for example, you randomly shear the genome up, and if you've got enough material going in, you can make a non-PCR-amplified library. But let's say it's got some amplification in it, and you sequence the genome, you see a variant, and then you do something like AmpliSeq to verify it. That's now PCR-based, so it's no longer a random shear;
you're actually targeting a region, so it's a different technology from that standpoint, in that it's PCR-based, and if you sequence it on the Ion Torrent, it's a different technology for the readout. But you can still get fooled that way, because of highly similar gene families like the mucins, for example. You might reproduce the variant and start to believe it, because your PCR primers don't actually amplify what you think they do; they're amplifying multiple sites even though you don't think they are, and you're just amplifying that other site and then you see the read. But usually you can get a hint of that from the frequency. There are some really bad areas where you just can't get primers to work, even designing them by hand, but you'll amplify a second locus at low efficiency, and you can tell because, say, one eighth of the reads match what you saw and it's really just one of these other genes. You can BLAT those regions to look at their similarity, and usually you can find your variant just by doing that. But most of the time the data quality is pretty good now. [Audience comment:] It's not the same library; you're going back to the original DNA. Yeah, and there are groups that are doing Illumina sequencing for verification as well; that's becoming more acceptable because the data quality is so good. And if you did whole genome and then you make a targeted capture reagent, say a long-oligo capture, and you capture out those parts and sequence them and you see the same thing, people believe that. But in some regions you're going to get fooled just because of the similarity, and that's where you really have to drill down on the informatics side and search the genome for similarities and see if you can figure out how that misalignment occurs. One of the things you'll see from misalignment is a cluster of three variants in a row, and then you'll go find the gene family;
you'll find, well, there are those three variants right there. But you can't eliminate all clusters of three, because some of them are real. It's a trade-off. How about a coffee break? Okay, we're going to take a coffee break until 11. Again, the washrooms are in the hall. Feel free to get yourselves set up; a couple of people just joined, so sign in and we'll move you around. And John's going to be around for a while, so if you need to ask him any questions, specifically about your project.