Good morning, everybody. So it's my pleasure to start the first bit of science here for the course of this week. So as I mentioned, all these slides are available under Creative Commons, and the condition here is that you give credit to whoever helped make these slides. And here's my bit of credit: I've benefited from some great teaching materials by my colleague Ben Langmead. Some of his slides are here at this URL on his lab website, and also from my friend Aaron Quinlan, whose slides are up on GitHub. So we're reusing a bit of material from both of those. So this is the first module of the entire week-long workshop, and the goal of this module is really to give a foundation for what we're going to talk about for the rest of the week. This is going to be an introduction to how high-throughput sequencing works: what the sequencing instruments are actually doing, how they generate data, what the data looks like, and some of the caveats and challenges of working with these very large data sets. I have around maybe 40 to 45 minutes of content to talk about, and I think we're scheduled here for a full hour, so please ask questions as we go. I like to be interrupted by questions; it makes the flow a little bit easier, and it makes sure that I know that I'm giving the right information for you guys. So please interrupt. As Ann mentioned, one of the big benefits of this course is that you get to interact with the instructors, who have been working on this for a very long time. I've been in bioinformatics and genomics for around 10 years now, but I'm only going to be here for the first two days. So if you have any questions about DNA sequencing, or how these algorithms work at a really low-level, computer-science way, please ask me, but get your questions out in the next couple of days. So we're going to start with a bit of basic biology to motivate why we want to sequence genomes. Many of you know this in your heart just because you have a background in biology.
But the central idea here is that we want to understand the information that's encoded in genomes and how that gives rise to different phenotypes, phenotypes being the observable characteristics of individuals. So we've known since the mid-20th century that DNA is the storage of information within the cell. That information gets transcribed into various levels of different molecules. The central dogma says that DNA gets transcribed into RNA, which encodes protein sequences that get translated into these amino acid sequences that then go and fold and carry out their biochemical function. Now, throughout this week, we're going to hear about all these different levels. So in the first two days, we're going to hear about DNA: we're going to understand what genomes are and how we sequence genomes. And then when Obi and Malachi come on Wednesday and Thursday, they're going to talk about RNA and how we analyze these actual RNA molecules. And then we'll hear a little bit about proteins later on in the week, and how we actually link these different data sets together into a unified picture of how cells and how organisms work. So DNA is going to be really the focal point for my lectures over the next two days. Everybody's probably familiar with the DNA structure. It's quite interesting to give this talk at Cold Spring Harbor because Jim Watson, obviously, with Francis Crick, discovered this structure that we're looking at today. I don't think he's here; I don't know how much time he spends at Cold Spring Harbor these days. But really, we're all benefiting from Jim Watson and Francis Crick's work here. So DNA is a double helical structure; it has a sugar-phosphate backbone, which gives structure to the molecule. And then the individual rungs here on the DNA ladder, if you'd like, are made up of the four nucleotides, A, C, G, and T, and they're linked together, linking these two complementary strands. Now, genomes are quite large. These are incredibly large biomolecules.
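The central dogma described above can be sketched as a toy in a few lines of code. The tiny codon table here is just an excerpt of the real genetic code, included only for illustration:

```python
# Toy illustration of the central dogma: DNA -> RNA -> protein.
# CODON_TABLE is a small excerpt of the genetic code, not the full table.
CODON_TABLE = {"AUG": "M", "UUU": "F", "GGC": "G", "UAA": "*"}

def transcribe(dna):
    """Transcribe the coding strand of DNA into mRNA (T becomes U)."""
    return dna.replace("T", "U")

def translate(rna):
    """Translate mRNA codons into amino acids, stopping at a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":          # stop codon ends translation
            break
        protein.append(aa)
    return "".join(protein)

rna = transcribe("ATGTTTGGCTAA")   # -> "AUGUUUGGCUAA"
print(translate(rna))              # -> "MFG"
```

Real translation, of course, involves the full 64-codon table, reading frames, and much more cellular machinery; this only shows the flow of information.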
The smallest genomes that we work with, viral genomes, are maybe around 10,000 to 20,000 bases. Bacterial genomes are on the order of megabases, millions of bases. But the human genome, and the very large genomes that we work with, are around three gigabases. This is an incredible amount of information; it's an incredibly long molecule. The human genome is actually packed into the cell by wrapping around nucleosomes made up of histones, and these get compacted into fibers that are then wound together and packed into the visible chromosomes that you would see under a microscope during metaphase. But we're really interested now in just what the sequence of these genomes is. What is the order of these four nucleotides, A, C, G, and T, and what do they tell us about how that cell functions? So we're going to be trying to figure out the sequence of individual genomes. And why we want to do that is because variation within genomes gives rise to variation within phenotypes. A phenotype can be something like hair color or eye color, or it can be a more complex trait like your height or your predisposition to getting certain diseases like cancer. Now, we sequenced the human genome as a field about 15 years ago. And again, being at Cold Spring Harbor sort of brings the history into it, because if you go to the bar at Cold Spring Harbor, where the reception is tonight, there's a guitar signed by all the people who were involved in sequencing the human genome. Have a look at that later for a bit of history. But since we sequenced that human genome and finished it, there's been this explosion of data, where we wanted to sequence more and more people such that we can understand all of the diversity within the human population and figure out how the differences between individual genomes give rise to these observable characteristics, the phenotypes. So one of the types of variation we're going to hear about is single-nucleotide polymorphisms.
Matthew is going to talk about this, I think, later on this afternoon. These are just single bases that differ between two genomes. So if you compare two genome sequences, they're going to be identical over very long stretches, broken up by these places where there's just a single nucleotide that's different between the two individuals. And if those differences are in an important region, like they change the coding sequence of a gene, or they're in some sort of regulatory region that changes how much a gene is expressed, or whether it's expressed in a certain cell type, that might then go on to change these observable characteristics through this messenger RNA. So that's really why we want to sequence genomes. There are larger types of variation as well, like structural variation, where there are big rearrangement changes between genomes. Again, we're going to hear about that in the first two days. But first, what I really want to talk about is how we sequence genomes, and how sequencing technology has changed over the last 10 years to allow us to sequence human genomes at very large scales, where a single research institute can now sequence 10 to 20,000 human genomes per year. And to talk about DNA sequencing, let's go back to the structure of DNA. DNA is a directional molecule: we have what we call a 5-prime end of DNA and a 3-prime end of DNA. The 5-prime end is denoted by a phosphate here, and the 3-prime end is denoted by this hydroxyl here attached to this sugar. Now, these alternating lengths of phosphates and sugars, which are deoxyribose, give us this backbone of DNA, which allows it to have this helical structure. And then in the interior here, we have the actual nucleotides, A, T, C, and G, and they're complementary: A pairs with T, and G pairs with C, and vice versa. So these are on the interior of the DNA structure, and they're linked along this backbone here. Now, we're going to color the nucleotides here just to identify them.
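The complementarity and directionality just described can be captured in a small sketch: because the two strands are antiparallel, the complementary strand read 5-prime to 3-prime is the reverse complement of the template.

```python
# Sketch of DNA complementarity: A pairs with T, C pairs with G.
# Because the strands are antiparallel (5'->3' vs 3'->5'), the complementary
# strand read 5'->3' is the REVERSE complement of the template.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (5'->3')."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

print(reverse_complement("AACCG"))   # -> "CGGTT"
print(reverse_complement("ACGT"))    # -> "ACGT" (its own reverse complement)
```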
Green is A, purple is T, red is C, and this blue is G. Now, the first sequencing technology was developed by an English biochemist named Fred Sanger. Fred Sanger is notable for actually having won two Nobel Prizes: he invented protein sequencing in the 1960s, and followed it up later by inventing DNA sequencing. We're going to talk about DNA sequencing here and go into the details of how this method works. Now, appropriately, it's called Sanger sequencing after its inventor. And the way that it works is that it uses a lot of the machinery in the cell that replicates DNA, but modifies it chemically such that we can figure out what the sequence is. Crucially, it uses an enzyme called DNA polymerase, which copies a strand of DNA using the complementary strand. So if we have our template strand here, DNA polymerase comes in, and it will find the complementary base to the next one and add it onto this growing chain by linking up the phosphate and sugar molecules that make up the individual nucleotides. So it would come in here, find the complementary base to T, which is an A, and add it here after the C. Then it would find the complementary base to the C, which is G, and add it here, synthesizing this complementary strand in the 5-prime to 3-prime direction. Now, how Sanger sequencing works is that when you add your nucleotides into your reaction mixture, such that DNA polymerase can find them and add them to this growing strand, Sanger noticed that if you chemically modify the nucleotides to make what are called dideoxynucleotides, where rather than a hydroxyl at this position of the sugar there's just a hydrogen, that will actually inhibit the elongation of this chain. So what you can do is take a mixture of normal nucleotides, and spike in a low concentration of these chain-inhibiting nucleotides, the dideoxynucleotides.
And if one of those chain-inhibiting nucleotides gets added in, it will just stop progression of that DNA sequence. So whenever one of those gets added, we stop here at this G, and we can no longer progress; that chain is not going to grow anymore. So what Sanger did is he noticed that if you do four separate reactions, one for each of the dideoxynucleotides, so one reaction each for dideoxy-A, T, G, and C, and then use a standard molecular biology technique called gel electrophoresis to sort those DNA strands by size, you can then read off the sequence by looking at the pattern of these bands here. So we know in this case that the shortest DNA sequence in this collection was in the reaction tube where we used ddC, and we know that the last base of that sequence therefore must be C, because that was the reaction where we terminated the strands with C. Now, if we move to the next shortest, we see that it's in the G column here, and we know that this DNA molecule ended with a G. We know the next two in size order are T, T, and then A, and then T, C, T, T. So essentially, by setting up these four reactions, copying DNA using these chain-inhibiting nucleotides, where we know what the last incorporated base is going to be, and then sorting the products by size, we can just read off, essentially by eye, the sequence of these nucleotides. And that's how Sanger sequencing works. It's brilliant, he got a Nobel Prize for it, and it revolutionized pretty much all of molecular biology. Now, while Sanger sequencing was brilliant, it's what we now would think of as a pretty low-throughput technique. To do Sanger sequencing, a technician or a graduate student or a postdoc would have to run these four reactions, run these gels to separate them by size, and then read off the bands of sequence here. And you might be able to sequence a few hundred bases of DNA, but as we already talked about, the human genome is incredibly large, three billion nucleotides in length.
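The gel-reading procedure above can be sketched as a toy: each terminated fragment ends in a known dideoxy base, so sorting the bands by size and reading off the terminating base of each recovers the sequence. The fragment lengths below are invented to match the example just described:

```python
# Toy simulation of reading a Sanger gel: sort the terminated fragments by
# size (shortest = earliest position) and read off each fragment's known
# terminating dideoxy base. Band lengths are made up for illustration.
def read_sanger_gel(fragments):
    """fragments: list of (length, terminating_base) pairs, one per band."""
    return "".join(base for _, base in sorted(fragments))

# Bands observed across the four reaction lanes (ddA, ddC, ddG, ddT),
# matching the example in the text: C, G, T, T, A, then T, C, T, T.
bands = [(1, "C"), (2, "G"), (3, "T"), (4, "T"), (5, "A"),
         (6, "T"), (7, "C"), (8, "T"), (9, "T")]
print(read_sanger_gel(bands))  # -> "CGTTATCTT"
```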
So we want to be able to sequence that entire thing, and you're not going to be able to do that 500 bases at a time with a single grad student. So in the 20 years after Sanger invented DNA sequencing, there was a lot of movement towards automating this chemistry and making it so that we could do this at very high scale. And this is one of the sequencers that was produced to automate this technology. The key innovation is that rather than using four separate reactions and reading the bands off like this, they invented fluorescently labeled nucleotides, where if you shine a laser on the nucleotide, it will emit some light. And if you use four different fluorophores of four different colors, one for each nucleotide, and then separate them in a capillary tube instead of a gel, you can just go along what's called this trace and look at the color of each one of these peaks to figure out what that DNA sequence was. So here's a timeline of how DNA sequencing progressed. Fred Sanger invented this in 1977, and as I said, one hardworking technician or grad student might be able to sequence around 700 bases per day. At that rate, it would take about 120,000 years to sequence the human genome, and obviously that's not a very practical amount of time; you'd need quite a lot of grad students to do that. In 1985, the first automated sequencer came out. This was called the ABI 370. That could sequence around 5,000 bases per day. So just automating a lot of this chemistry got quite far, but still it would take around 16,000 years to sequence the human genome, not really practical yet. In 1995, the ABI 377 came out. This had a lot of improvements: bigger gels, better chemistry and optics, more sensitive fluorophores (these colored nucleotides), and faster data processing. This now went up to 19,000 bases per day, which is around 4,400 years for the human genome.
And then finally, the workhorse, the main sequencer that we used to sequence the human genome, the ABI 3700, which multiplexed 96 reactions: you could do 96 sequencing runs at the same time, in 96 different capillary tubes. That could sequence 400,000 bases per day, almost half a megabase, and got the time to sequence a human genome down to around 200 years. We then, as a field, ran many of these instruments in places like Cold Spring Harbor, the Broad Institute, and the Sanger Institute to sequence the human genome, which took around, I think, maybe 10 years of data production, at a cost of around $3 billion. Okay, the human genome was revolutionary. We realized that we had drastically overestimated the number of genes that was gonna be in the human genome. There's quite a famous contest where people tried to predict how many genes were encoded in the human genome. I think the estimates ran as high as 80 to 100,000. And then when they sequenced it and ran their gene-finding algorithms on the human genome, it came out to be around 20,000. That number keeps being revised, but we understood that there are many fewer genes than expected, and a lot of the complexity in human genomes comes from things like alternate expression, alternate splicing, and different isoforms of the same genes. Something we also learned is that, while there's a lot of variation in the human genome, to really understand the full picture of how human phenotypes arise, we need to sequence a lot of genomes. And sequencing a genome at $3 billion per genome isn't really practical. So people worked really hard on developing new technology that would drive the cost of sequencing down. And in the next bit of this lecture, I'm gonna talk about some of the technologies that were developed in really the last 10 to 12 years that have driven the cost of sequencing down to a point now where we're sequencing human genomes for around $1,000 per genome.
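The years-to-sequence figures on this timeline are roughly consistent with a back-of-the-envelope calculation if you assume the genome must be sequenced to around 10x redundant coverage, which is an assumption on my part; the exact coverage factor behind the slide's numbers isn't stated:

```python
# Back-of-the-envelope check on the timeline figures: years to sequence a
# human genome at a given daily throughput, assuming ~10x redundant coverage
# is needed (an assumption; the slide's exact coverage factor is not stated).
GENOME_SIZE = 3_000_000_000   # ~3 gigabases
COVERAGE = 10                 # assumed redundancy

def years_to_sequence(bases_per_day):
    total_bases = GENOME_SIZE * COVERAGE
    return total_bases / bases_per_day / 365

print(round(years_to_sequence(700)))       # manual Sanger sequencing, 1977: ~120,000 years
print(round(years_to_sequence(400_000)))   # ABI 3700, 96 capillaries: ~200 years
```

Without the coverage factor (i.e., reading each base exactly once), the numbers come out roughly ten times smaller, which suggests the slide's figures do assume redundant coverage.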
So the first high-throughput sequencer, one we're not gonna talk about in detail, is called the 454 instrument, which was eventually acquired by Roche. But in 2006, the main sequencer that is now nearly ubiquitous, and on which almost all sequencing data is now generated, came out: the Solexa sequencer. It's now called Illumina; Illumina acquired Solexa around 2007. The other two that we're gonna talk about in some detail are the Pacific Biosciences sequencer, which came out in 2011, and the technology that I work with quite a lot in my lab, a nanopore-based sequencer from Oxford Nanopore Technologies. Just for completeness, though, the ABI SOLiD in 2007 was another short-read sequencer that offered very high throughput, as was the Complete Genomics sequencer that came out in 2010, and the Ion Torrent sequencer, also in 2010. We're not gonna talk about the 454, the SOLiD, Complete Genomics, or Ion Torrent: as short-read sequencers, they essentially got outcompeted by the Illumina sequencers, and nearly all short-read data is now Illumina. So we're gonna focus on that one, but we're also gonna talk about the long-read sequencers, as they have some unique properties and unique applications that we're gonna wanna talk about. So key to Illumina sequencing, just like Sanger sequencing, is the use of DNA polymerase to copy a strand of DNA from a template. So here's how it looks again: we've got our single strand of DNA template, we've got free-floating nucleotides in solution, we use DNA polymerase, it finds the complementary nucleotides and synthesizes this complementary strand in the five-prime to three-prime direction. Now, one of the reasons that Sanger sequencing was low throughput, and kind of maxed out after 20 years of development at around half a megabase of throughput per day, is that you're imaging individual reactions in, say, a capillary tube.
Now, what Illumina did, or what Solexa did, as it was the founders of Solexa, is that rather than using capillary tubes to separate molecules by size, they developed what they call cycle-by-cycle chemistry, where they image a huge array of sequencing reactions that are all happening simultaneously, a single base at a time. Now, the way that this works is you take your DNA sample, which includes many copies of the genome, and you chop this DNA sample up into single-stranded fragments, typically in the range of, say, 300 to 400 base pairs. You then attach these templates to the surface of what we call a flow cell, which is essentially a microscope slide. You attach them by using linker molecules that are bound to the surface of the microscope slide; the templates come in and bind to these linker molecules. We then perform PCR, the polymerase chain reaction, which is a way of amplifying DNA, to take those molecules that are bound onto the slide and create what we call clusters of those molecules. So here we're going to have an example: we're going to sequence two DNA fragments, this one here and this one here. So after the second step, where we attach these DNA molecules to our flow cell, we then run PCR in place, which just amplifies each into many copies of that same molecule in this region of the flow cell. So here we've now got six copies of this one and six copies of this one. This is just to enhance the signal that we're going to eventually observe from the sequencer. Now, just like Sanger sequencing, we're going to use color-labeled nucleotides. These are fluorescently labeled nucleotides that will emit light when you shine a laser on them. We're gonna have one color per base. And what happens is that in the first cycle of sequencing, we flow these colored nucleotides in, we add them to the flow cell, and we add DNA polymerase.
And just like Sanger sequencing, it's going to find the nucleotide that's complementary to this red base, which is green here, that's A. And it's gonna find the nucleotide complementary to this blue base, which is this yellow one, which is G. We're then gonna shine a laser on them, they're gonna emit light, and that's gonna be captured by a camera that's observing this microscope slide from above. And this is a very expensive digital camera looking down a microscope, but it's able to take this very large image of all of these reactions that are happening on the flow cell. Now, the difference from the Sanger sequencing chemistry is that this inhibiting group is reversible in the Solexa sequencing. So what you can do, after you've imaged the first base, is add a chemical that cleaves off the inhibiting group, and then the reaction can proceed to the second base. And each one of these chemistry steps, where you add in the nucleotides, shine a laser on them to emit light, and then cleave off this inhibiting group, is what we refer to as a cycle in Illumina terminology. So after we sequence the first cycle, you then repeat the procedure: you add new nucleotides, shine a laser on them, capture the image, then you remove that inhibiting group and move on to the third cycle. So we're reading this sequence base by base with each one of these cycles. And this is what the actual output of the sequencer looks like. We're imaging these clusters from the top down; they look like circles here. At the first cycle, we observed green for this one, yellow for this one. Second cycle, we observe red and red; third one, yellow and blue, and so on. Yeah, good question: what makes sure that the complementary base you attach is the very next one in the first and second cycle? So DNA polymerase goes from five-prime to three-prime, so in the first cycle, you can only add onto that first base.
And then when you remove that inhibiting group, you can then add to the second one. But there is a catch, which I'm not gonna go into yet; on the next slide, I'll explain the catch. So when we get these images, we need to translate these colored circles into the actual base-called sequence. And the software that does this is what we call a base caller. It just takes these images, does some feature detection to figure out where these clusters are and how the clusters correspond across the different images (we've captured one image per cycle), and then it looks at the color and tries to predict what that base was, just using the color here. So in this case, we had a five-base-pair read with five nucleotides here. We captured five images across the five cycles, and we translated that into the complementary sequence, T-A-C-A-C, which is what you actually get from your sequencer. This would be in a FASTQ file, which we'll talk about, and it's what we call the sequencing read, or just the read for short. In this simple example it was only five nucleotides in length; for real sequencing data, it's usually around 100 nucleotides for the Illumina sequencer. Okay, so here's the catch. We have this chemistry operating on these clusters of molecules, but this chemistry isn't perfect. These enzymes, these chemicals that cleave off this inhibiting group, may not cleave it off for some of the molecules in the cluster, or, when we flow these nucleotides in there, one might not actually have an inhibiting group on it, which means that the molecules in the cluster might get out of sync. So let's say we're on the second cycle for sequencing this cluster. Two of the DNA templates in here are on schedule; they're on the second cycle. One of them is lagging behind; maybe it didn't have its inhibiting group cleaved off. And one of them has jumped ahead. Okay, so in this one we sequenced two bases this cycle, this one we sequenced one, and this one we sequenced zero.
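In the idealized case, the base caller's final step reduces to mapping one observed color per cluster per cycle to a base. A minimal sketch of that step, using the color scheme from the slides (green = A, purple = T, red = C, blue = G), which is a slide convention rather than any standard:

```python
# Minimal sketch of the last step of base calling with ideal data: one color
# per cluster per cycle, mapped to the incorporated base. Real base callers
# work from raw intensities and must model noise and phasing; the color
# scheme here is the one used in the slides, not a standard.
COLOR_TO_BASE = {"green": "A", "purple": "T", "red": "C", "blue": "G"}

def call_bases(colors_per_cycle):
    """colors_per_cycle: the observed color of one cluster at each cycle."""
    return "".join(COLOR_TO_BASE[c] for c in colors_per_cycle)

# Five cycles observed for one cluster:
print(call_bases(["purple", "green", "red", "green", "red"]))  # -> "TACAC"
```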
Now, this causes uncertainty in the image that we capture. When we take this microscope image after using the laser to make these fluoresce, we see that two of the molecules are red, one is green, and one is yellow. And this gives a mixed signal that the base caller needs to try to deconvolve informatically. And usually they're pretty good at doing this, especially in the first cycles. But the chance that these molecules get out of phase, or out of sync, increases as the reaction proceeds; there are more chances for them to get out of sync. So what we say is that the signal purity goes down as a function of the cycle number, and that means that sequencing errors for Illumina sequencing are more common in the later cycles of your sequencing run. Yeah, good question. Usually the way the chemistry is set up is that a single DNA molecule will seed that cluster and then get amplified into this bundle of, say, 10,000 molecules. Ideally all of those have identical sequence. Now, you can have errors in PCR, and particularly if an error happens at a very early stage, it can amplify into a cluster with mixed identity. And like you said, it would be a mosaic of different molecules there. Then it would look like an impure signal as well, where for some cycles it's mixed, a mixture of red and green, and that can lead to base calling problems too. But usually that's quite rare, and you only get these PCR artifacts at some motifs. The usual error mode is that the molecules get out of sync like this. Now, the sequencer will helpfully try to quantify how certain it is that the base call is correct, and it reports what we call a quality score, which is a confidence level, or a probability associated with each nucleotide, that is the machine's estimate of whether that nucleotide is correct or in error. And later on, when we talk about genome assembly tomorrow, we're gonna see a profile of the error rate along the sequencing reads.
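These quality scores are Phred-scaled: Q = -10 log10(p), where p is the estimated probability that the base call is wrong, and in FASTQ files each score is stored as a single ASCII character, typically Q + 33. A short sketch of the conversions:

```python
# Phred quality scores: Q = -10 * log10(p), where p is the estimated
# probability that the base call is wrong. In FASTQ files the score is
# stored as one ASCII character, typically at offset Q + 33.
def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def ascii_to_phred(char, offset=33):
    """Decode one FASTQ quality character into a Phred score."""
    return ord(char) - offset

print(phred_to_error_prob(20))   # -> 0.01 (a 1-in-100 chance of error)
print(phred_to_error_prob(30))   # -> 0.001
print(ascii_to_phred("I"))       # -> 40, i.e. a 1-in-10,000 error chance
```

So the drop in quality scores towards the end of a read corresponds directly to a rising per-base error probability.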
And you'll see the error rate goes up and the quality scores go down as you get towards the end of the read. Now, why I'm going into the data in detail like this is that if everything in bioinformatics just worked the first time, it would be easy; we wouldn't need courses like this. A lot of the difficulty in bioinformatics is, when something doesn't go right, trying to figure out what happened. Just like in the lab, if your PCR doesn't work or your gel doesn't run, you need to figure out why; in bioinformatics, when your analysis doesn't work, you need to figure out why. And I feel that to figure this out, you need to have an understanding of how the data is generated and all the processes that can go wrong during data generation. And when you're looking at, say, IGV plots of genome alignments, and you're trying to figure out, say, if you're looking at cancer genomes, whether a mutation is true or not, you'll be looking at sequencing reads and their quality scores, and having this intuitive understanding of how the data is generated is quite important. All right, let's summarize Illumina sequencing. The advantage of Illumina sequencing is that it has by far the best throughput. I have up on this slide here that a single Illumina run will give you 600 gigabases of data over an eight-day run. This slide's a bit out of date now, and one of the problems with giving these talks is that the sequencing technology progresses so fast that I'm always updating this talk. When we teach this course again in two months, I'm probably gonna have to update these slides again, just because the sequencing technology will have improved by then. For Illumina technology, let's say for now it gives you something like 600 gigabases to a terabase of data per run, for a cost of around $10,000 per run. It has the best accuracy: this error rate that we just talked about, these base substitutions caused by these mixed signals, happens at a rate of around, let's say, one in 200 bases.
So maybe a single hundred-base-pair read has between zero and one errors, which is very, very good. We're gonna talk about some sequencers that aren't as accurate next. And among these really high-throughput short-read sequencers, Illumina had a better read length than the Complete Genomics and SOLiD sequencers, and this combination is one of the reasons Illumina came to be the dominant player in sequencing technology. Also, library preparation is fast and very robust: the chemistry for just taking extracted DNA, adding all the adapters, and all the preprocessing you need to do to put the molecules onto the sequencer works very well now, and you have very few run failures. The disadvantage of Illumina sequencing is that because we're sequencing cycle by cycle, there's this inherent limit to the length of the reads you get. The longest Illumina reads you'll get are around 150 to 200 bases. And while that's good for a lot of the work that we're gonna hear about, if you're trying to do de novo genome assembly, which is where my big interest is, for large genomes which have a lot of repeats, this read length is really limiting. The dominant repeat in human genomes is the Alu family of retrotransposons. Their length is around 350 bases; that's longer than the read length of an Illumina read. So these repeats will cause you to have uncertainty in your genome assembly. There'll be ambiguity in your assembly graph, and you won't be able to resolve these repeats with such short reads. All right, so I'm gonna talk now about some of the long-read sequencers. The first one that came out was the Pacific Biosciences sequencer. It's quite neat technology. It's based on fluorescence and DNA polymerase, like the other technologies that we talked about, but it's doing what we call single-molecule sequencing.
So unlike Illumina, which amplifies the original template DNA molecules into clusters, the PacBio sequencer is imaging just a single molecule at a time, and this allows it to sequence much longer pieces of DNA. The way the PacBio works is that DNA polymerase is embedded at the bottom of a well here, and single-stranded DNA can come in, get captured by the DNA polymerase, and then the polymerase will synthesize the complementary strand. Our fluorescently labeled nucleotides are just free-floating in solution; they're gonna diffuse into this well, and the microscope that's capturing this fluorescent signal is just measuring how much fluorescence there is in this individual well. Now, they're gonna diffuse in here, and when the complementary base gets captured by the polymerase, it takes some time to incorporate that nucleotide into this synthesized strand of DNA. So it gets held there by the polymerase, and what we see is this peak, where we see a jump in the fluorescence for the color of the nucleotide that's getting incorporated into that growing strand. So here's what it looks like, the signal trace over time for two incorporation events. The intensity has this low background rate from nucleotides that are just diffusing in and out of the well. But then, when there's a true capture event for an A, we see this jump here, and then for some amount of time, say a few milliseconds, we see this long A signal. Then, when the incorporation is complete, the fluorophore gets cleaved off and diffuses out of the well, and the signal drops back down to the background intensity here. Then a T comes in, gets incorporated, we see a jump in the T signal, then it gets cleaved off, and it goes on like this. Now, because this isn't this blocking cycle-by-cycle chemistry, and it's all happening in real time, we don't have this limit on read length. If you put in 10,000-base-pair fragments of DNA, the PacBio sequencer will read those 10,000-base-pair fragments of DNA.
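The pulse-calling idea just described can be sketched as a toy: treat an incorporation event as a run of samples where one color channel jumps well above the diffusion background and stays there long enough. The thresholds, durations, and trace values below are invented for the sketch; real PacBio pulse calling is far more involved.

```python
# Toy pulse caller for a PacBio-style trace: an incorporation event is a run
# of samples where one channel stays above the background threshold for a
# minimum duration. All numbers here are invented for illustration.
def call_pulses(trace, threshold=30, min_samples=3):
    """trace: list of (base, intensity) samples over time.
    Returns the bases whose signal stayed above threshold long enough."""
    calls, run_base, run_len = [], None, 0
    for base, intensity in trace + [(None, 0)]:   # sentinel flushes last run
        if intensity >= threshold and base == run_base:
            run_len += 1
        else:
            if run_base is not None and run_len >= min_samples:
                calls.append(run_base)
            run_base, run_len = (base, 1) if intensity >= threshold else (None, 0)
    return "".join(calls)

# Background noise ~5, two real incorporation events (A then T), plus one
# brief spurious G spike that is too short to call:
trace = [("A", 5), ("A", 50), ("A", 55), ("A", 52), ("G", 40),
         ("T", 48), ("T", 51), ("T", 49), ("T", 47), ("C", 4)]
print(call_pulses(trace))  # -> "AT"
```

This also illustrates the error mode mentioned next: a molecule that lingers in the well by chance looks just like a short, weak pulse, so any fixed threshold will sometimes call it wrongly.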
The caveat of this is that the signal isn't as clean as what we're seeing here, and it can be difficult in some cases to see whether there was a true incorporation event, or a molecule had just diffused into the well and lingered for longer than you'd expect by chance. So the error rate of the PacBio sequencer is around 10 to 15%, and the throughput, when compared to Illumina, is quite a bit lower; it's around five gigabases per run for the PacBio. So you get fewer data and a higher error rate, but much longer reads. Here's just a view of what the read lengths look like. This is a project where we collaborated with Cold Spring Harbor, with Mike Schatz's lab, to sequence a breast cancer cell line, and this is a histogram of the read lengths for this breast cancer run. We see that there are quite a lot of sequences that are longer than 10,000 bases, and even 20,000 bases, with the longest read that we obtained in this data set being around 71 kb. This is much better if you want to resolve repetitive genomes, because you don't have this problem of these short Alu repeats, or other types of repeats in the human genome, confusing your algorithms. So tomorrow I'm going to talk more about assembling data with long reads, and we're gonna actually have it in the tutorial section: you'll have a chance of taking PacBio data and assembling it using a program called Canu. But just as a plug for how much better this is for doing assembly, this is a paper that was published in Genome Research last year on assembling a human genome using PacBio data, and they show just how much better the assembly is than if you did the same thing with short reads. But we'll go into that more tomorrow. Okay, the last sequencer we're gonna talk about, the nanopore sequencer, is interesting in that it's a portable genome sequencer.
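Read-length histograms like the one just described are commonly summarized with the N50 statistic: the length L such that reads of length at least L contain half of all the sequenced bases. A small sketch, with toy read lengths invented for illustration:

```python
# N50: the read length L such that reads of length >= L contain at least
# half of the total bases. The lengths below are invented for illustration.
def n50(lengths):
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length

print(n50([2000, 4000, 6000, 10000, 18000]))  # -> 10000
```

Note that N50 weights by bases rather than by read count, so a few very long reads pull it up much more than the median.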
It's small enough that you can take it anywhere you want to apply the sequencing, rather than bringing your sample to the large genomics facilities that have these big Illumina and PacBio instruments set up within them. So I was involved in a project taking this portable sequencer, this is the nanopore device here, to Africa to perform in-field surveillance of the Ebola outbreak. Probably a lot of you will remember the Ebola outbreak in West Africa from 2013 to 2016. This is one of the people who worked on the project, Joseph Bohr; he's running a sequencing run here. We took the nanopore sequencer into field clinics and field hospitals, amplified the virus, and then sequenced it directly in the field to give the epidemiologists who were trying to control the outbreak a view of how the outbreak was spreading in essentially real time. We could get the results back to them in a few days, because we took the sequencer directly to where it was needed. Now, the nanopore doesn't use DNA polymerase to sequence; it's more of a biophysical device than a biochemical one. The way it works is that we have a protein nanopore, a protein with a channel running through the center of it, which is embedded within a membrane, shown in black here. This channel in the protein nanopore allows charge-carrying ions, like potassium ions, to pass from one side of the membrane to the other. That flow of ions induces an electric current, and this current is measured at around four kilohertz, or 4,000 samples per second. Now, this channel is wide enough that single-stranded DNA can pass from one side of the membrane to the other, and as the single-stranded DNA passes through the constriction in the pore, it partially blocks the flow of current. The instrument records how much current is flowing, and we see the passage of the DNA molecule as a decrease in current.
And the amount of current flowing through depends on the properties of the DNA sequence that's in the pore. Something my lab works on is taking the measurements from this device and then trying to predict what the DNA sequence was just using these current samples. So here's what the data generation process looks like. At some time T0, we have a sequence in the nanopore, G-C-T-A-C, and we see a current signal of around 60 picoamps for a duration of about half a second. What we hope to see is that as the DNA moves through the pore by a single base, new sequence comes in, C-T-A-C-G, and we see a drop in the current depending on the new properties of the DNA. So now the current has dropped to around 45 picoamps, and the DNA continues to slide through the pore, new sequence comes in, T-A-C-G-A, and the current changes again and again and again. So, is there a constant current for each base? It's not quite per base. If we go back to the schematic, the depth of this channel is around five bases, and the signal you see depends on what those five bases are, which makes life for the informaticians quite a bit more difficult: you're not reading single bases at a time, it's a convolution of bases. And it's not really a linear function where an A adds so much signal; there are interdependencies between the different nucleotides, which makes it quite hard to model. So the way we model it is that we sequence DNA with a known nucleotide sequence, and then we say, okay, when the sequence A-G-G-T-A-G passes through the pore, we think the current signal should be drawn from this Gaussian distribution here, which is about 59 picoamps with a standard deviation of about a picoamp and a half. If we build up this profile for all possible six-base subsequences, which you do by sequencing known DNA, you can then reverse the process and infer what the sequence was from these current measurements.
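The k-mer-to-Gaussian "pore model" idea just described can be sketched as follows. The k-mers and current parameters below are invented for illustration, not real model values:

```python
import math

# Toy version of a nanopore pore model: each k-mer maps to a Gaussian
# (mean pA, stdev pA) learned by sequencing DNA of known sequence.
# These parameters are invented for illustration.
PORE_MODEL = {
    "GCTAC": (60.0, 1.5),
    "CTACG": (45.0, 1.5),
    "TACGA": (52.0, 1.5),
}

def log_likelihood(kmer, current_pA):
    """Log-density of observing current_pA if kmer sits in the pore."""
    mu, sigma = PORE_MODEL[kmer]
    z = (current_pA - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi))

# A ~60 pA observation is far more consistent with GCTAC than CTACG.
print(log_likelihood("GCTAC", 60.0) > log_likelihood("CTACG", 60.0))  # -> True
```

Inverting the process, as the lecture says, means searching for the sequence of k-mers whose model currents best explain the whole observed trace.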
Right now we're using probabilistic models called hidden Markov models, and I'm happy to talk to anybody who's interested in the actual nuts and bolts of how we base-call nanopore data, this inference problem of going from raw current samples to a base-called sequence. We're also starting to use newer methods called neural networks, which you might have heard about as a framework for machine learning that shows incredible accuracy on things like image processing and speech recognition; we're applying those models to the problem of base-calling nanopore data as well. Okay, so just like PacBio, the nanopore sequencer is measuring single molecules, which means the signal-to-noise ratio isn't as good as the Illumina sequencer, and that means the accuracy of our reconstruction of these base-called sequences is lower. But for nanopore data it's been improving over the last few years. Here are three different versions of the nanopore chemistry: two versions using what we call the R7 pore, and one using the R9 pore, and this is a histogram of accuracy for sequencing the E. coli genome. For the early R7 data, the accuracy was around 80 to 81%, so about a 20% error rate, which again is much higher than Illumina sequencing. But with the introduction of the R9 pore, accuracy went up to around 90 to 95%, a five to 10% error rate depending on your sample, which is a bit better than PacBio. The PacBio errors, though, are more uniform, more random, whereas the nanopore errors are somewhat more biased: some DNA sequences are much harder to call accurately because of the biophysics of how the DNA sequence appears in the pore, whereas the PacBio chemistry's errors are due just to the random diffusion of those nucleotides. So here's what a nanopore flow cell looks like. I didn't bring one of the sequencers with me, I probably should have, but they're little desktop instruments, about the size of a USB stick.
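To make the HMM base-calling idea concrete, here is a tiny, idealized Viterbi decoder over 2-mer states. Real base-callers use 5-mer or 6-mer states and also model event durations, skips, and stays; everything here, including the emission means, is invented for illustration:

```python
import itertools

BASES = "ACGT"
# Invented emission model: each 2-mer gets a distinct mean current (pA).
KMERS = ["".join(p) for p in itertools.product(BASES, repeat=2)]
MEAN = {k: 50.0 + 3.0 * i for i, k in enumerate(KMERS)}
SIGMA = 1.0

def emit_logp(kmer, x):
    z = (x - MEAN[kmer]) / SIGMA
    return -0.5 * z * z

def viterbi_basecall(events):
    """Most likely base sequence for a list of event current levels."""
    # score[k] = best log-prob of a k-mer path ending in k; backs for traceback
    score = {k: emit_logp(k, events[0]) for k in KMERS}
    backs = []
    for x in events[1:]:
        new, back = {}, {}
        for t in KMERS:
            # a legal move shifts the pore window by one base: s[1] == t[0]
            best_s = max((s for s in KMERS if s[1] == t[0]),
                         key=lambda s: score[s])
            new[t] = score[best_s] + emit_logp(t, x)
            back[t] = best_s
        score, backs = new, backs + [back]
    # trace back from the best final k-mer
    k = max(score, key=lambda t: score[t])
    path = [k]
    for back in reversed(backs):
        k = back[k]
        path.append(k)
    path.reverse()
    return path[0] + "".join(p[1] for p in path[1:])

# Noiseless events generated from the path GC -> CT -> TA decode to GCTA.
events = [MEAN["GC"], MEAN["CT"], MEAN["TA"]]
print(viterbi_basecall(events))  # -> "GCTA"
```

The neural-network base-callers mentioned above replace these hand-built Gaussian emissions with learned ones, but the decoding problem is the same.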
So on this part here is the array of actual nanopores: 2,048 nanopores, any 500 of which can be sequencing at one time, so you can sequence 500 DNA molecules simultaneously. In 2015, we were getting around 500 megabases of sequence from the nanopore, with a read length around 6 kb. With the latest chemistry, we now get around five gigabases of sequence, again with about 90 to 95% accuracy, as I mentioned, and the read length has increased quite a bit as well. We can now quite routinely get 10,000 base pair reads from the nanopore, and if you're really careful with your DNA sample preparation, we've gotten up to 500,000 bases off of a single DNA read, almost half a megabase of information in one continuous molecule, and that's something we're working on quite hard. Now, as I mentioned, the sequencing is quite portable; here's an extreme example of that. Yes? So can more DNA molecules compete for the pore once something has gone through? Yeah, yeah, exactly. So once the pore becomes free, another DNA molecule can come through. There's a lifetime to the pores, they can become blocked or the membrane can pop, but usually you sequence a few thousand DNA molecules per pore. Okay, does that answer it? Yeah, but then with those kinds of read lengths, could you just put whole transcripts through and sequence entire transcripts at a time? Exactly, yeah. That's something they just came out with: there's a kit for doing direct RNA sequencing. Obi and Malachi will go into this a bit more on Wednesday, but usually when you sequence RNA, say on the Illumina, you're actually sequencing cDNA; you reverse transcribe the RNA to cDNA, then you sequence the cDNA. With the nanopore, you can actually put the RNA molecule through the pore and sequence it directly. So anything like base modifications of the RNA, you can in theory read those. What direction does the RNA go through?
It goes through, I believe, three prime to five prime; DNA goes through five prime to three prime, but RNA goes through three prime to five prime. The reason is that you have to attach tethering molecules to attract the DNA to the pore, and with RNA they attach them to the polyA tail. I'm not sure what the error rate with RNA is; I'd imagine it to be similar, probably a little higher to start with, because a lot of the improvement in error rate came from building better models of the data, and to do that you need a lot of training data. RNA sequencing has only been available for the last few months, so the models probably aren't trained as well, but just from the setup of the system, it doesn't seem that different to me, so I'd imagine the error rate should be similar. But you said they use the polyA tail for the tether, so for non-polyA RNA, this is not gonna work? I don't think so, no. You'd have to do cDNA and then go back, yeah. Right, so this is just an extreme example of how portable the nanopore sequencer is. NASA is interested in using DNA sequencing for things like diagnosing infectious disease during long-term space missions, so they sent the nanopore sequencer up to the International Space Station. This is an astronaut named Kate Rubins, who ran the nanopore sequencer on the space station, showing that you get essentially the same quality of data as you do on Earth. It essentially works the same way; you need to attach it to something using Velcro so it doesn't float away during your sequencing run, but other than that, you can do sequencing in space. Okay, let's summarize the different sequencing technologies. With Illumina, you get 100 to 200 base pair reads, and you get a lot of reads, up to 600 gigabases per run, with a very low error rate, around maybe half a percent to 1%. PacBio and Oxford Nanopore are single-molecule sequencing without amplification; you can sequence incredibly long pieces of DNA.
You get between five to 10 gigabases per run, but with a higher error rate. Something I didn't mention in detail, and after listening to what your interests are I should have, is that PacBio and Oxford Nanopore can both detect modified bases, so methylated bases. If you put human DNA through the nanopore without amplification, you can distinguish between cytosine and 5-methylcytosine just based on these current signals, on how much current flows through the pore, without having to do things like bisulphite treatment. Again, this is another big interest in my lab, and we just published a paper showing how we can do that with fairly high accuracy from nanopore sequencing. You can do it from PacBio sequencing as well. And again, just my caveat that all of these things are improving essentially constantly; so while I say 600 gigabases and five to 10 gigabases here, take that with a grain of salt, it's probably improved since I started giving this talk. That's my last slide. Do you have any more questions? One in the back? Yes, DNA polymerase. So the PacBio is, sorry? It's measuring fluorescence. Yeah, it's measuring fluorescence, but the methylation signal is essentially the duration of how long it takes DNA polymerase to incorporate a T, and if the template is methylated, that incorporation time is slightly different. So there are certain methylation types that PacBio is quite good at detecting; others it's not as good at detecting, just because there isn't a really strong signal in how much the incorporation time increases or decreases. But it's based on the duration of this pulse. Right now, in our model, we just call 5-methylcytosine, the predominant human methylation mark. We know there's a different signal for adenine methylation and for 5-hydroxymethylcytosine, but our model and the software that's available only calls 5mC right now. But we're planning on calling other types of methylation.
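The current-based methylation calling described above boils down to comparing likelihoods under two models of the same k-mer, one with cytosine and one with 5-methylcytosine. A toy sketch; the Gaussian parameters are invented, and real callers model full k-mer context and combine evidence across many events:

```python
import math

# Invented current models for one k-mer with an unmethylated vs
# methylated cytosine: (mean pA, stdev pA).
MODELS = {"C": (60.0, 1.5), "5mC": (63.0, 1.5)}

def call_methylation(current_pA):
    """Return the label whose Gaussian gives the observation higher likelihood."""
    def logp(label):
        mu, sigma = MODELS[label]
        z = (current_pA - mu) / sigma
        return -0.5 * z * z - math.log(sigma)
    return max(MODELS, key=logp)

print(call_methylation(59.8))  # closer to the unmethylated mean -> "C"
print(call_methylation(63.5))  # closer to the methylated mean -> "5mC"
```

With equal standard deviations this reduces to picking the nearer mean; the value of the probabilistic framing is that it extends naturally to unequal variances and to summing log-likelihoods over many observations.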
It's just a matter of getting training data to train our models, and 5mC is by far the easiest one to generate training data for. Yes? Yeah, so the question, just for everybody else, is about DNA damage. This is an idealized view of what the nanopore signal looks like, but natural DNA has damage: there can be thymines that are cross-linked to each other, and you can have abasic sites where there isn't even a nucleotide there, and that changes the currents we observe. So sequencing naturally occurring DNA typically has a higher error rate than sequencing amplified DNA, just because the amplification gets rid of all these extra artifacts that are present in the DNA. The downside is that amplification also gets rid of the methylation, and I'm at a cancer research institute: we want to sequence a lot of cancers, and we want to look at methylation patterns in cancer. So typically we want to sequence the natural DNA, while paying this cost of a slightly higher error rate. Right now the speed is 450 bases per second. The very first instrument they released commercially was 30 bases a second, and then they've progressively moved it up to 70 bases a second, then 250, and now it's 450 bases per second. I think ultimately they want to go to 1,000 bases per second. Is that just three or four? Yeah, so something I didn't mention is this pore complex: this is what they call the reader protein, and this is the motor protein, which is actually acting as a brake to stop the DNA from going through too quickly. If you didn't have this on here and just allowed the DNA to pass through the pore, or if you had a solid-state nanopore, one that isn't biological, it would go through at something like 20,000 bases per second, which is far too quick to actually register any signal changes. So this is a DNA helicase, which unwinds the DNA, and it's slower than...
The electronics can only sample at, I think, 4,000 samples per second, so if the DNA is going through at 20,000 bases per second, obviously you're not getting... yeah, you're skipping a lot. And that's why I think the limit is 1,000 bases per second: you need a couple of redundant samples per base to be able to accurately call the sequence. Yeah, good question. So for nanopore, the flow cells I showed here are about $900 if you buy one at a time, and you'll get around five to 10 gigabases of data. If you buy in bulk, I think 48 at a time, they're around $500 per flow cell, so there are some discounts if you buy a lot of them. PacBio, I don't know the exact cost per run for the Sequel. Do you know? Anybody know? No? Well, the machine is quite expensive. So one thing about the nanopore: the instrument itself is essentially $1,000, so you don't have to put a lot of money up front. For PacBio, the machine is hundreds of thousands of dollars. I don't know what the reagent cost per run is for the RS II; I think it's around $300 per run, something in there. And do you get, again, like five gigabases per run? Maybe not, for various reasons. Sure, yeah. So there are two companies, one of which has sort of taken over the market, that do library preparation before Illumina sequencing that allows you to get much longer-range information, and the main one now is called 10X Genomics. What they do is use droplets: single DNA molecules are put into these droplets and barcoded, such that you know all of the Illumina reads came from the same original source DNA molecule. These source molecules can be hundreds of kb in length, so you get much longer-range information, because you know all of these Illumina reads came from the same roughly 100 kb molecule. If you're doing things like genome assembly, where you need long-range information, that's quite handy.
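The sampling-rate and throughput arithmetic from this discussion can be written out explicitly. The 48-hour run length below is my assumption for the estimate, and the theoretical ceiling is far above real yields because pores are not active the whole time:

```python
# Samples per base: amplifier sample rate divided by translocation speed.
sample_rate_hz = 4000
for speed in (450, 1000, 20000):  # bases per second
    print(speed, "b/s ->", sample_rate_hz / speed, "samples per base")
# At 20,000 b/s you'd get only 0.2 samples per base: unusable.

# Theoretical throughput ceiling: 500 active channels at 450 b/s.
# Real runs yield far less (~5-10 Gb) since pores aren't always sequencing.
channels, speed, run_hours = 500, 450, 48  # run length is an assumption
theoretical_bases = channels * speed * run_hours * 3600
print(theoretical_bases / 1e9, "Gb theoretical maximum")
```

The gap between the ceiling and the observed yield is essentially the pore duty cycle plus pore and membrane lifetime.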
If you want to phase SNPs, which means determining which parental haplotype each SNP came from, these 100 kb molecules are quite useful, and also for things like structural variation, which we'll hear about later on; the much longer-range information is useful there. Now, these instruments are a pre-processor for Illumina data: it's a way of preparing a library in a special way before running it on the Illumina sequencer. And I think the cost is something like $500 per run to build a 10X library to then go onto your Illumina sequencer. There was another company called Moleculo, which I was involved in and which Illumina eventually bought, that did a similar technology where you barcode individual fragments of DNA. Those fragments were much shorter, around 10,000 bases, and then you'd reassemble them informatically to get the full 10,000-base sequence. That's not used so much anymore, and 10X is the main long-range linked-read application.
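Informatically, the linked-read idea reduces to grouping short reads by their droplet barcode so that each group is known to come from one long source molecule. A minimal sketch with invented read records:

```python
from collections import defaultdict

# Toy linked-read grouping: each short read carries the barcode of the
# droplet it came from; reads sharing a barcode came from the same long
# (~100 kb) source molecule. The (barcode, read_id) records are invented.
def group_by_barcode(reads):
    groups = defaultdict(list)
    for barcode, read_id in reads:
        groups[barcode].append(read_id)
    return dict(groups)

reads = [("ACGT", "r1"), ("TTAG", "r2"), ("ACGT", "r3"), ("ACGT", "r4")]
print(group_by_barcode(reads))
# r1, r3, r4 share barcode ACGT, so they came from one source molecule.
```

Downstream tools then use these groups, rather than individual reads, as the unit of long-range evidence for assembly, phasing, and structural variant calling.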