Okay. Good morning, everybody. As I introduced myself before, I'm Jared Simpson. I'm going to be talking to you about how DNA sequencing works. As Ann mentioned, all the teaching materials for this course are available under a Creative Commons license. One of the terms of this license is that you give credit to the source of the material. So in that spirit, I'd like to thank Ben Langmead and Aaron Quinlan, whose open course materials were used as part of this lecture. So this lecture is going to be about how high throughput sequencing works. Now, I'm fascinated by sequencing technology. The development of the Illumina sequencer around 2008 was my entry point into bioinformatics. When I was in Vancouver at the BC Genome Sciences Center, we were one of the first institutes to get the Solexa sequencer, which could sequence 36 base pair reads, and I started working on methods for this sequencer. It was really the reason that I moved into bioinformatics. So I'm going to tell you about how these sequencers work, and then about some of the upcoming sequencers, some of the long read sequencers like the PacBio and the Oxford Nanopore sequencer, which is something that my lab works on upstairs. So before I start, I just want to poll the audience. Who has worked with Illumina data at all before? Okay, about half of you. Who's worked with long reads like PacBio or Nanopore? A few people. Okay, good. So something that I really want to get across in this lecture is not just how these sequencers work, but how they can go wrong: some of the error modes that you'll see and some of the pitfalls that you can run into when you are looking at your own sequencing data. Before I go into that, though, I think it's good to start by reviewing why we want to sequence genomes in the first place. So let's start with the famous central dogma of molecular biology, which describes the flow of information through a cell. We start with DNA, which gets transcribed into RNA, which then gets translated into proteins, which fold into their shapes and go on to carry out whatever their biological function is in the cell. Now, the reason this is such a good starting point is that it links this information, which is encoded in DNA, through to the observable phenotypes and to the actual biological function of the cell. So the idea is that if we understand what the DNA sequence is, we can try to link that to the phenotype in some way. A lot of people in their introductions said they're interested in cancer. We want to sequence DNA to figure out how proteins might be disrupted, either through frameshift mutations or some difference in structure, and how that leads to cells growing out of control and cancer. So, very basic review. Here's a DNA molecule. It's a double-stranded molecule where the individual bases are these colored bars in the middle here, and they come in pairs. Adenine, represented in our sequencing data as A, binds to thymine, represented as T; guanine, represented as G, binds to cytosine, represented as C. They're running along this inner track between the two parallel strands which make up the sugar-phosphate backbone of DNA. Now, the important thing is that because DNA is double-stranded, if we know the sequence of one strand, we can automatically infer the sequence of the other strand because of the complementary relationship between the bases.
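To make that complementarity concrete, here's a minimal Python sketch (my own illustration, not from the slides) that infers one strand from the other:

```python
# Complement pairing from the lecture: A<->T, G<->C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Infer the opposite strand, reported 5' to 3'.

    The two strands are antiparallel, so we complement each base
    and then reverse the result.
    """
    return "".join(COMPLEMENT[base] for base in reversed(seq))

print(reverse_complement("ACTG"))  # -> CAGT
```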
And of course this gives DNA the property that you can copy it: you just separate the two strands and then copy each one by introducing the complementary bases. Later on, when we talk about the actual DNA sequencing technology, this ability to copy DNA by using one strand as the template for the other is going to be one of the important features that these DNA sequencers exploit. So this is a little bit of a human-centric slide. This is a human genome. It's roughly three billion bases in length, so there are three billion of these individual nucleotides along this inner track of DNA. The human genome is subdivided into 23 chromosomes, which have sizes in the range of a few hundred megabases each. Now, the human genome is extremely large and needs to be packed into the cell by these higher order structures, where it gets wrapped around nucleosomes, which then fold and wind together to form the individual chromosomes. So we want to sequence these genomes essentially end to end, but unfortunately sequencing technology isn't good enough that we can just put an entire chromosome onto the sequencer and read out the sequence as one 250-million-base-pair string. What we have to do is fragment the DNA into many millions or even billions of pieces, read the sequence of those individual fragments, and then computationally reconstruct the genome sequence. That's the genome assembly problem; it's something my lab works on, and I'm going to tell you about it tomorrow. Now, we're not just interested in the sequence of human genomes or plant genomes or bacterial genomes or whatever you're interested in; we're also interested in how genomes vary between individuals within a population. If we took two chromosomes from any two people in this room and compared them to each other, they would be the same at about 99.9 percent of positions, with about 0.1 percent, or 1 in a thousand bases, different. These variants come in all shapes and sizes. When it's just a single mismatch between two chromosomes, we call it a SNP. When there are larger tracts where bases are inserted or deleted, we call those indels. When there are large changes where material has been swapped between chromosomes, we call those translocations, or different types of structural variation. We're going to talk about all the different ways of finding genomic variants later in the course, particularly this afternoon. But the reason we want to understand these variants is that they give rise to differences in phenotype. Individual variants will influence things like your height, your propensity to get different types of diseases, your hair color, and so on. So by sequencing large numbers of individuals, we can link those individual sequence variants to the phenotypes that we're interested in. Okay, so I'm now going to start talking about sequencing DNA. Something I didn't mention at the beginning: this is a lecture, but please feel free to interrupt and ask questions if anything comes up, if anything's unclear, or if you want me to elaborate on any of the points that I'm making. So again, here's another view of the DNA molecule. DNA is a directional molecule: we can assign a direction to it. There's one end that we call the five prime end and one end that we call the three prime end. The five prime end has a phosphate group at the terminus, and the three prime end has this sugar here.
And the backbone of DNA, and I can't point to both screens at once, is made up of alternating phosphate and sugar groups: phosphate, sugar, phosphate, sugar. The individual nucleotides are linked onto the sugars here. So when we sequence DNA, we're going to start from one end, usually the five prime end, and then read off the order of the individual bases, base by base. So here, on the strand on the left as you're looking at it, we have an adenine, so we put an A into our output file. That's followed by a cytosine; we put a C in. That's followed by a thymine; we put a T in. Then a guanine; we put a G in. Now, if we've read off the sequence of that strand, which was ACTG, we automatically know the sequence of the other strand, which is just read starting from its own five prime end at the other end and reading up the sequence along here. So DNA is always read in the five prime to three prime direction, and that's how we represent it in our output files. Now, DNA sequencing originated in the 1970s. The most prominent early DNA sequencing method was developed by a biochemist at the Laboratory of Molecular Biology in Cambridge, whose name was Fred Sanger. Sanger's method of sequencing, which uses chemically modified nucleotides, was developed to figure out where individual DNA strands ended. What Sanger did is he took the four nucleotides and chemically modified them by removing one of the hydroxyl groups on the sugar of the DNA. By removing these hydroxyl groups, if you spike these modified nucleotides, which are called ddNTPs, into a reaction, and DNA polymerase comes along, copies the template, and incorporates one of these modified nucleotides, the reaction stops at that point. It doesn't allow the sugar-phosphate backbone to be elongated, so the strand is extended up to a certain point, and then the reaction stops. Now, if you've only modified one of the four possible nucleotides, let's say A, whenever you see a DNA fragment that stopped at a certain point, you know that a modified A was introduced at that point. So Sanger's method, which won him his second Nobel Prize, was to chemically modify the four different bases, A, C, G, and T, spike these modified bases into four different reactions where the DNA is going to be copied, and then in each of those reaction tubes, all the strands would stop at A, or all the strands would stop at C, or G, or T. This lets you know what the last base of every molecule in those four different tubes is, whether A, C, G, or T. The second thing you need to know is the ordering of the fragments by size. So what Sanger did is he'd run those four reactions out on a gel to separate the sequences by size, and then he'd look at which fragment was the shortest, which was the second shortest, the third shortest, the fourth shortest, and that, together with the lane, gives you the last base of every one of those fragments. So here's what a Sanger gel looks like. Each lane is labeled A, T, G, or C for which reaction it came from, and you just look at each one of these bands and say, okay, the shortest fragment ended in a C, the second shortest ended in G, the third shortest ended in T, the fourth shortest ended in A, and then T, C, T, and so on. So you've got these strands separated by size, where you know what the last base of each sequence was, and just by reading off the positions of these bands, you can figure out what the sequence of the molecule was.
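To make the gel readout concrete, here's a toy Python sketch (the fragment data are invented to match the example just described): given each fragment's length and terminating base, sorting by size spells out the sequence.

```python
# One entry per band on the gel: (fragment_length, terminating_base).
# Shorter fragments run further down the gel, so reading bands from
# shortest to longest spells out the sequence.
fragments = [(4, "A"), (1, "C"), (6, "C"), (2, "G"), (3, "T"), (5, "T"), (7, "T")]

sequence = "".join(base for _, base in sorted(fragments))
print(sequence)  # -> CGTATCT, matching the example read off the gel
```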
So you don't replace all the A's with the modified A's; you put them in at, say, 1% concentration, so that there's some small chance a modified A will go in at any given position. And because you're copying billions, uncountably large numbers, of molecules, just by chance you'll have many molecules that terminated at that base, and also many that don't contain a modified A there. So you get a spectrum where every position is terminated in some fraction of the fragments, and then you can read them off by size. So, is that clear how Sanger sequencing works? It's fairly rare now; people don't use it so much anymore unless you're doing validation of things, but it's an incredibly important technology, and it's really key to understanding how things like the Illumina sequencer work. Now, the throughput of Sanger sequencing wasn't so great. It required a lot of manual molecular biology, and you'd only sequence maybe a few hundred bases of DNA per day. This was good enough to sequence things like small viral genomes, which was the first genome sequenced in Sanger's lab, but it was pretty obvious that this technology needed to be scaled up so you could sequence at much higher throughput. So over the next two decades, there was a lot of work, both in academia and industry, on automating this type of sequencing so it could be run at much higher throughput. Where Sanger's method used these gels and four different reactions to separate the molecules by size, the automated technology put a lot of the pipetting onto robots, and it replaced running four different gels with fluorescent dyes attached to each of the four nucleotides. You could then separate the DNA sequences by size in a capillary tube, and you'd get the famous Sanger trace, where each of these colored peaks is the terminally labeled fluorescent marker of a nucleotide, and you would just read off the color of each peak. So when you'd see a peak of red light, you'd associate that with a T, blue with a C, and so on. So we replaced counting bands on a gel with these capillary tubes, and we're just looking at the color of each peak to figure out the sequence of the DNA. That's a good question. It's probably something to do with the concentration, like how many molecules actually had that terminal fluorescently labeled C or T there, but also the strength of the fluorophore that gets excited. Usually, I think back in those days, the 70s and 80s, they probably cloned things into some vector, popped the insert out with restriction enzymes, and just sequenced the region of interest. You can also PCR amplify things: just design primers, amplify the region to high concentration, and then you sequence whatever your amplicon is. In the Human Genome Project, which I'm going to come to later, they did what's called whole genome shotgun sequencing, where you essentially just randomly fragment the genome and then sequence the random fragments. But there you need really high throughput, or else you'd have very low coverage of your genome of interest. All right, so here's a timeline of the major advances in Sanger sequencing. In '77, Fred Sanger developed this, and a hard-working technician in the lab could maybe sequence 700 bases per day. If you wanted to sequence a human genome, which is 3 billion bases in length, it would take that technician about 120,000 years, which is not really a feasible project.
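As an aside, the arithmetic behind these "years per genome" figures is easy to check. The quoted numbers line up if you assume they include roughly 10-fold redundant coverage of the genome; that coverage factor is my assumption for illustration, not something stated on the slide:

```python
GENOME_SIZE = 3_000_000_000   # ~3 billion bases
COVERAGE = 10                 # assumed ~10x redundancy; makes the quoted figures line up

def years_to_sequence(bases_per_day):
    """Days of sequencing at a given throughput, converted to years."""
    return GENOME_SIZE * COVERAGE / bases_per_day / 365

print(round(years_to_sequence(700)))      # manual Sanger: ~120,000 years
print(round(years_to_sequence(400_000)))  # ABI 3700 (coming up): ~200 years
```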
In the next 10 years, as this technology started to become automated, the first automated sequencer came out, developed by ABI: the 370. That could sequence an order of magnitude more per day, about 5,000 bases, and it would drop the time to sequence a human genome down to about 16,000 years. Ten years later, the 377 came out with various improvements, like bigger gels, better optics, more sensitive dyes, and faster computers to do the analysis. This went up to about 20,000 bases a day. And then a few years later, the real workhorse of the Human Genome Project, the ABI 3700, came out, which parallelized the sequencing by allowing you to run 96 sequencing reactions simultaneously. This was a big increase in throughput, to about half a megabase, around 400,000 bases, per day. There you could sequence the human genome in about 200 years. So collectively, the Human Genome Project got many of these ABI 3700 sequencers spread around the world; places like the Sanger Institute, the Broad Institute, and WashU did a lot of the sequencing for the Human Genome Project, and they were able to sequence the human genome over the next four or five years, with the final publication coming around 2003. But it was clear that we didn't want to sequence just a single human genome. It was a great resource; we found tens of thousands of protein coding genes by sequencing a human genome. But as I mentioned earlier, one thing we're primarily interested in is making comparisons between genomes. We want to understand how genomes vary across the human population. So we needed to scale up sequencing technology such that we could sequence many more genomes. And this led to the development of new sequencers, which collectively we call next generation sequencing, or second generation sequencing. There are a lot of different terms, all used interchangeably: massively parallel sequencing, high throughput and ultra high throughput sequencing all collectively describe the sequencing technology that came after Sanger sequencing and that we could run at a much higher scale. So the first high throughput sequencer was the 454 instrument, developed by Roche. That was followed in 2006 by the Solexa sequencer; Solexa was acquired by Illumina, and the technology is now commercially available as the Illumina instruments, the HiSeq and the NovaSeq. In 2007 there was the ABI SOLiD, and in 2010 the Complete Genomics sequencer. Then in 2011 we started to get long-read sequencers like the PacBio, from Pacific Biosciences, and a few years ago the Oxford Nanopore sequencer, in 2015. So I'm only going to talk about the sequencers that are shown in green here: the Illumina sequencer, PacBio, and Nanopore. Those are the three most commonly used sequencers. For short-read sequencing, Illumina has completely dominated the market, and people basically only use Illumina sequencing. That's going to be the focus of this course. But if you're doing things like trying to assemble complex genomes, very large genomes, the long-read sequencers are now quite commonly used as well, so I'll spend a bit of time on those. So a key part of Illumina sequencing is DNA polymerase. DNA polymerase copies single-stranded DNA: it takes free-floating nucleotides in solution and the single-stranded template, finds the complementary base for each position, and synthesizes the complementary strand in the 5' to 3' direction, as we showed before.
So here we've got our single-stranded template, we've got our nucleotides, and DNA polymerase is going to put a C here, then another C, then another C, then a T, G, and so on. It's going to take that template sequence, move along it, and synthesize the complementary sequence 5' to 3'. Now, Illumina sequencing has taken some of the ideas behind Sanger sequencing and massively parallelized them, such that instead of 96 reactions occurring at the same time, millions or billions of reactions occur simultaneously. The way Illumina sequencing works is you take your DNA sample and fragment it into billions of pieces; you can do this by using enzymes to shear it or with ultrasonic waves, and you get a lot of short pieces around 400 bases in length. You then take those individual templates and attach them to the surface of a microscope slide, where they stick up perpendicular to the surface, as shown here. Next, you run PCR in place to take those individual single-molecule templates and expand each into a cluster of clones, where there are thousands of copies of that individual molecule in one region of the slide. The reason we do that is that we're going to read a fluorescent signal from each of these clusters, and copying the molecules boosts the signal so that we can detect it during sequencing. Just like Sanger sequencing, we rely on fluorescently labeled nucleotides, so each of the nucleotides is going to have an individual color. Here A is colored with a green fluorophore, C with blue, G with orange, and T with red. You add those labeled nucleotides and DNA polymerase onto the slide, and the DNA polymerase starts synthesizing the complementary strand base by base. Each step of the synthesis reaction is called a sequencing cycle. In the first sequencing cycle, it's going to add the first base that's complementary to whatever our template is here. Then you excite the fluorophore with a laser, and it makes a flash of light here, which is captured with an incredibly expensive digital camera attached to a microscope. So here we'd register a flash of green light from this reaction, and here a flash of orange light from this one. Now, just like in Sanger's sequencing method, these fluorescently labeled nucleotides have a blocking group on them that won't allow the reaction to proceed to the next step; it won't allow you to sequence the second cycle. But unlike in Sanger's method, this is a reversible blocking group: you can add chemicals to cleave it off, which then allows the reaction to proceed. So you do that, and then you rerun this chemistry, where you add the labeled nucleotides, take a picture, and then cleave off the blocking group. Then you sequence the second base. Then you do that again: third base, fourth base, fifth base. So the reaction proceeds in a stepwise fashion, base by base, where in the first cycle we sequence the first base, in the second cycle the second base, in the third cycle the third base, and so on.
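Here's a toy Python sketch of those cycles (dye colors as in the lecture; the template is invented): each cycle incorporates one base complementary to the template and records that base's color, which is what the camera sees.

```python
# Dye colors from the lecture: A=green, C=blue, G=orange, T=red.
DYE = {"A": "green", "C": "blue", "G": "orange", "T": "red"}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def cycle_flashes(template, n_cycles):
    """Color flash recorded at each sequencing cycle for one cluster.

    The camera sees the dye on the *incorporated* base, which is the
    complement of the template base at that cycle.
    """
    return [DYE[COMPLEMENT[base]] for base in template[:n_cycles]]

print(cycle_flashes("TGCAT", 5))  # invented template
# -> ['green', 'blue', 'orange', 'red', 'green']
```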
Now, for each one of these sequencing cycles, we get a picture of the flow cell and what's happening on it. So in the first cycle we've registered two flashes of light: one green for this cluster, one orange for that one. In the second cycle, we registered a different pair of colors, and so on down the cycles. Just by looking at these images cycle by cycle, we can figure out which base was added to each of those individual clusters. Now, in this example we're only showing two clusters. On an actual Illumina flow cell, there will be billions of these dots of light, each one having a different color and each one corresponding to an individual DNA molecule. Yeah, so on the microscope slide there are adapter molecules attached, which are fabricated by Illumina, and they only allow a connection from the three prime end of the adapter to the five prime end of the molecule. So it's chemically constrained which way the template can attach. And so this is the five prime end of our template, and this is the five prime end of the complementary strand being synthesized. Yeah, so your original DNA fragment is double-stranded, and you don't know whether you've sequenced the top strand or the bottom strand. You do know that you're always sequencing five prime to three prime, because that's the way the chemistry works. But you're right that the adapters could ligate to either end of the fragment, and you could sequence either strand of the DNA. I'm going to leave that question for a few minutes because I will come to it; it's very important to understand. So the next step of the process is to start to do the informatics. There's software that runs on the sequencer called a base caller, which is going to take this stack of images, detect where these clusters, these colored dots, are, line them up across the images, and predict what the nucleotide sequence was that was sequenced. We call this software a base caller. This is software that Illumina provides; people don't need to do their own base calling. It runs on the instrument, takes the machine's measurements, and predicts what the actual nucleotide sequence was. It writes the base calling results into a file called a FASTQ file, which is the starting point of all your analyses. It's just a huge file which contains the nucleotide sequences that the base caller detected.
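Since the FASTQ file is where your analysis starts, here's a minimal sketch of reading one by hand (in practice you'd use an existing library; the file name below is hypothetical). Each record is four lines: an '@' header, the bases, a '+' separator, and one quality character per base.

```python
def read_fastq(path):
    """Yield (read_id, sequence, quality_string) records from a FASTQ file."""
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:          # end of file
                break
            seq = fh.readline().rstrip()
            fh.readline()           # '+' separator line, ignored
            qual = fh.readline().rstrip()
            yield header[1:], seq, qual

# Hypothetical usage:
# for read_id, seq, qual in read_fastq("reads.fastq"):
#     print(read_id, len(seq))
```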
Okay, so now I'm going to come to your question. Something we need to understand is how this process can go wrong. All sequencing technologies generate errors; there's no sequencing technology that will perfectly tell you the exact sequence with 100% accuracy. All of them have sequencing errors, and this is the dominant error mode for Illumina sequencing. In the slides I just showed you, we were looking at individual molecules, but actually we're sequencing this cluster of molecules, which all have the same sequence. The chemical steps of removing the blocking group or synthesizing the complementary strand don't happen perfectly. What can happen is that when we're sequencing this cluster, some of the molecules can lag behind; here this one is still on the first base, the green base, and some of them can jump ahead, where this one is on the third cycle down here. So when we excite the fluorophores with the laser and take the picture of the cell, we get a mixed signal: here we'd see a signal which is 50% red, 25% green, 25% orange. The base caller needs to take this and figure out what the true sequence was while accounting for this ambiguity. In the first few cycles this is typically pretty easy: a lot of the intensity it registers will be purely one color. But as you move down the sequencing reaction to the later cycles, like the 100th cycle, there's more opportunity for these strands to either lag behind or jump ahead. So as the sequencing reaction progresses, you get more and more mixed signal that's difficult to resolve. This gives you the characteristic error profile of Illumina sequencing, where the error rate steadily increases as the sequencing reaction progresses. The first base is typically very high accuracy, maybe one error in 10,000, whereas the very last base is lower accuracy, maybe one error in 50 or so. So does that answer your question about whether these reactions proceed in lockstep? Does it cleave off the blocking group and the fluorophore? That's a good question. It does? Yes, it does. Thank you from the audience. So in these examples I've only shown you about five cycles of sequencing. Typically you'll run the Illumina sequencer for about 100 to 150 cycles, and that determines what your read length is: you'll get about 100 to 150 bases for each of these reads. Now, if you're working with human genomes or very large plant genomes, they're typically quite repetitive, and a 100 base pair read might not be long enough to uniquely assign it to some location on your reference genome. This is going to be covered a bit more in the mapping lecture later. So what we need are slightly longer reads, or hopefully much longer reads, to resolve these more repetitive regions of the genome. Because Illumina is fundamentally limited to this short read length, they came up with essentially a molecular biology trick to increase the amount of information you get from each sequencing reaction. The idea is to read both ends of a DNA molecule. You take your 400 base pair fragment, put adapters on either end, it gets added to the flow cell, and then you sequence one end of the fragment for 100 to 150 bases, turn it around, and sequence the other end for about 100 to 150 bases. Now, you don't know the sequence in between these read pairs; if it's a 400 base pair fragment, there might be 100 bases in between that you don't have information for. But the bioinformatics programs that you'll be working with understand that reads come in pairs with some unknown sequence in between, and they can use that to constrain where they assign the reads on the reference genome. This helps increase how accurately you can map your reads. Is that all clear, what read pairs are? How does it help? Yeah, so the dominant type of repeat in the human genome is the Alu transposon, which is about 300 bases in length. If you have one read that falls inside an Alu sequence, it could map to millions of different places in the reference genome. If you sequence another read that's 200 bases upstream of that Alu, there's a good chance it's going to match somewhere uniquely in the reference genome. So by the fact that the one read is unique, you can then place the other read into the right copy of the Alu. It's just constraining it, and you essentially have two chances at having a unique match.
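One practical note, which is a standard fact rather than something from the slides: the base caller reports its per-base confidence in the FASTQ quality string as Phred scores, Q = -10 log10(p), where p is the estimated error probability. The per-cycle error rates quoted a moment ago translate like this:

```python
import math

def phred(p_error):
    """Phred quality score for an error probability: Q = -10 * log10(p)."""
    return -10 * math.log10(p_error)

print(round(phred(1 / 10_000)))  # early cycle, 1 error in 10,000 -> Q40
print(round(phred(1 / 50)))      # late cycle, 1 error in 50 -> Q17
```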
All right, so let's summarize Illumina sequencing. Its advantages are that it has by far the best throughput. The NovaSeq, which I've shown an image of here, can sequence up to a terabase of DNA in a run by generating about 8 billion paired end reads. It has very good accuracy; the overall error rate is less than 1%, something like 1 error in 200 bases. And compared to the other second generation short read sequencers, like the SOLiD and the Complete Genomics sequencer, it has better read lengths. Another advantage, because the Illumina sequencer has been so popular and has been developed for over 10 years now, is that the library preparation has been very well optimized. It's quite robust: it can deal with things like FFPE samples, which are typically very degraded and damaged, and still give you fairly reliable information. The disadvantage, though, and something we'll come back to a few times, is that there's an inherent limit to the read length, which is around 150 bases, because of the cycle by cycle chemistry. Okay, I'm going to move on from Illumina sequencing. Any more questions about Illumina before we go into the long read sequencers? No? Okay. So when you're doing genome assemblies, which is something I'm primarily interested in, even if we have these paired end reads, they're typically not long enough to give you highly contiguous assemblies; we can't assemble megabase-sized contigs for our genome. So a class of technology that various companies are working on is what we call single molecule long read sequencers. The first one that came out was the Pacific Biosciences sequencer. This is based on similar principles to Sanger and Illumina sequencing, in that we have fluorescently labeled nucleotides, and they're going to be measured as DNA polymerase synthesizes a complementary strand of DNA. But unlike the Illumina sequencer, with its cycle by cycle chemistry, there are none of these blocking groups, and the nucleotides are free to be added as they're found and matched by the polymerase. So it's happening essentially in real time. In the PacBio sequencer, DNA polymerase is immobilized at the bottom of a tiny volume, which they call a zero-mode waveguide. As the fluorescently labeled nucleotides diffuse in and are grabbed by the polymerase, they're held in place, and this is registered as a flash of light over the time that it takes for the DNA polymerase to incorporate that base. So this trace at the bottom here is showing the signal at the bottom of this well. Here nothing's happening, nothing's happening, and then you see this flash of green light over this duration, which shows that an A was incorporated. Then nothing happens, and we see another flash of light where a T was incorporated. You can see these at the bottom here. We're just looking for these blocks of fluorescence signal, which indicate when a nucleotide was incorporated. Yeah, so here nothing's happening; it's just background, things diffusing into the well, being registered, and then diffusing out. But when one gets grabbed by the polymerase, it takes some time to actually link it into the synthesized strand, so you see this flash over time, which has a certain duration. You can calculate what the duration distribution should be based on the kinetics of DNA polymerase. But yeah, you register when a base gets incorporated by these flashes of light.
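Conceptually, detecting those incorporation events is a pulse-finding problem on the fluorescence trace. Here's a toy Python sketch of the idea (the threshold, duration cutoff, and trace are invented; real PacBio base calling is far more sophisticated):

```python
def detect_pulses(trace, threshold=20.0, min_duration=3):
    """Find (start, end, mean_level) pulses in a fluorescence trace.

    A pulse is a run of samples above `threshold` lasting at least
    `min_duration` samples; a long-enough pulse in the green channel
    would be read as an incorporated A, and so on.
    """
    pulses, start = [], None
    for i, value in enumerate(trace + [0.0]):   # sentinel closes a trailing pulse
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            if i - start >= min_duration:
                mean = sum(trace[start:i]) / (i - start)
                pulses.append((start, i, mean))
            start = None
    return pulses

# Toy trace: background noise, one real pulse, and one too-short blip.
trace = [1, 2, 1, 30, 32, 31, 29, 2, 1, 28, 30, 2, 1]
print(detect_pulses(trace))  # -> [(3, 7, 30.5)]
```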
Now, this sequencing technology is measuring single molecules, unlike Illumina, which measures clusters of molecules, and because of that it has a much higher error rate. Sometimes a base will get grabbed by the DNA polymerase, it will try to get incorporated, fail, and get thrown out, and that can be registered as a false insertion in the sequence. Likewise, sometimes the reaction is so quick that you don't actually register a signal, and that can show up as a false deletion. So the dominant error types for PacBio are insertion and deletion errors, and they occur at a rate of about 12 to 15%. The error rate of PacBio is much, much higher than Illumina's. But because we're not using the cycle-based chemistry, the read lengths can be incredibly long: you can sequence 10,000 base pair reads with PacBio, compared to 100 base pair reads for Illumina. When this was developed, it was really revolutionary for genome assembly, because even with a 10,000 base pair read with 15% errors, it's much easier to map that to a reference genome or to find overlaps between adjacent regions if you're doing a de novo assembly. I'll come back to that in the assembly lecture tomorrow, when I describe how much it helps to have long reads. The current PacBio sequencer is called the Sequel. It gives you 10,000 base pair reads or longer and about a 10 gigabase yield per run. Again, that's orders of magnitude less than Illumina, but you get much longer reads for it. The other advantage of PacBio is that the error rate is essentially not systematic; it's essentially completely random, which helps when you're doing a genome assembly. The next technology we're going to talk about, the Oxford Nanopore sequencer, has a problem with systematic errors: the errors cluster at particular sequence motifs, which makes it harder to do a genome assembly. Something I didn't talk about is that you can also detect base modifications. If DNA is methylated, that changes the kinetics of how long it takes for a base to get incorporated into the synthesized strand, and by analyzing these subtle changes in duration, you can detect whether particular bases were modified or not. The disadvantages are quite well known: a high error rate in the reads and, compared to other sequencers, a much higher cost per base. GC bias? Yeah, so Illumina fundamentally requires amplifying DNA, either on the flow cell or, in some of the library preps, in a PCR step, and that can introduce some bias in GC-rich or GC-poor regions of the genome. PacBio, I believe, is relatively unbiased because it doesn't have any PCR amplification during the library prep. Some protocols do, but typically, if you're just sequencing genomic DNA, you don't. And for the next sequencer, nanopore, which I'm more familiar with, most protocols don't have PCR, so it won't have those types of GC biases. All right, the third and final technology we'll talk about is the nanopore sequencer. The most striking thing about the nanopore sequencer is that it's tiny and inherently very portable. This is a project that I was involved in where we took the MinION sequencer to Guinea in West Africa to sequence the Ebola outbreak while it was happening there in 2014 to 2016. We were essentially trying to do real-time surveillance of how the virus was spreading by sequencing directly in the field clinics and hospitals in Guinea. This is Joseph Bohr; he's running the MinION experiment. He's pipetting the sample into the top of the flow cell, which is shown here. The DNA then gets passed through an array of nanopores, which is down here.
The signal is measured and sent to the host computer, which is just a laptop connected by a USB cable, shown off-screen. So the nanopore sequencer is interesting to me because it's fundamentally a very different method of sequencing: it doesn't rely on fluorescence, unlike the other sequencing technologies. This is a schematic of an experimental nanopore, and there are a few components of the system that I want to point out. In black we have a membrane, which is shown here, and embedded in the membrane, in orange, is the protein nanopore. The protein nanopore is just a protein with a channel running through it, and this channel allows molecules to pass from one side of the membrane to the other. Ions can pass through that channel, which induces an electrical current that we can measure with very sensitive electronics. But the width of this channel is also just wide enough that single-stranded DNA can pass from one side of the membrane to the other. The fundamental idea behind nanopore sequencing is that as this DNA is passing through the channel, and particularly as it passes through this constriction in the channel, it partially blocks the flow of these charge-carrying ions, which gives you some information about what sequence is in the pore when those current samples are taken. The instrument is continually measuring the flow of current at around 4,000 samples per second, four kilohertz, and we'll try to predict what sequence is in the constriction based on how much current we observe. Now, there's one other component of the nanopore system that I must mention. In red, shown up here, is what's called the motor protein. This is a DNA helicase, which is unwinding the double-stranded DNA, and it's actually acting as a brake, stopping the DNA strand from going through the pore too quickly. If you didn't have a motor protein on there, the single-stranded DNA would pass through the pore at around one million bases per second, which is far too quick to register distinguishable current measurements. So this motor protein, this helicase, slows it down to about 450 bases a second, such that we have a chance to register individual current measurements. So here's an illustration of how the sequencing process works and what the data looks like as we see it. At some initial time, when you start your experiment, we have some short sequence of DNA in the pore, GCTAC, and the instrument is sampling the amount of current that's going through the pore. Here it's sampled about 60 picoamps of current for a duration of about half a second. The DNA helicase is going to push the DNA through the pore one base at a time and introduce a new sequence context. So now it's moved over by one base, and the sequence CTACG is in the pore, and what we hope to see is a drop in current which reflects the properties of this new DNA sequence in the pore. In this illustration, it goes down to about 47 picoamps. Then as the DNA moves through the pore again, it jumps back up to just over 50 picoamps, and again and again. So what we hope to see in our nanopore data is these movements of current up or down, which correspond to movements of DNA through the pore. And what we need to do when we're analyzing the data is take these current measurements and predict what the DNA sequence was, just based on the currents we've observed.
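A first analysis step is usually to segment that raw trace into step-like "events", one per pore position. Here's a toy sketch of the idea (the jump threshold and trace values are invented, and real event detection uses proper statistics rather than a fixed cutoff):

```python
def segment_events(samples, jump=2.0):
    """Split a nanopore current trace into step-like events (toy model).

    Start a new event whenever the current jumps by more than `jump`
    picoamps from the running mean of the current event.
    """
    events, current = [], [samples[0]]
    for s in samples[1:]:
        mean = sum(current) / len(current)
        if abs(s - mean) > jump:
            events.append(mean)     # close the previous event
            current = [s]
        else:
            current.append(s)
    events.append(sum(current) / len(current))
    return events

# Toy trace like the lecture's illustration: ~60 pA, then ~47, then ~51.
trace = [60.1, 59.8, 60.0, 47.2, 46.9, 47.1, 50.8, 51.2, 51.0]
print(segment_events(trace))  # -> three event levels
```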
So what we do to base call the sequence is this: we've got trained models for all possible 5-mer sequences that could be in the pore. There are 1,024 different five base pair sequences over the DNA alphabet, and we've trained a set of current distributions that describe how much current we should see based on which 5-mer it is. Then, when we're actually trying to base call, we invert that: we take the current observations and all these distributions, and we try to fit the sequence of DNA that best matches the current observations. So the base order will affect the current? Yes, exactly; it's not just which five bases are in the pore, it's the order of those five that matters. Now, you have hit on something important: if you have long runs of the same base, homopolymers, let's say there are 10 A's going through the pore, we're not going to register these jumps in current up or down. So the systematic error mode that I alluded to before is essentially insertion and deletion errors in homopolymer runs, because the only way we can resolve those is to look at how long the signal lasted, rather than at movements of current up or down. That's much weaker information, and it's very, very difficult to model. I know this because I work on it a lot. And we're only able to reconstruct the genome sequence to about 99.9% accuracy, whereas with other technologies it's more like 99.999% accuracy. Yeah, you had a question too? Yeah, I was just wondering, experimentally, how do you know this 5-mer maps back to this specific current pattern? That's exactly it. This slide illustrates it. Essentially what Oxford Nanopore has done is take DNA with a known sequence, chemically modified to make it very easy to distinguish, sequence all possible contexts, and then train essentially a Gaussian distribution that describes, for each one of those contexts, what amount of current we should see. So this is actually a trained model for a 6-mer, AGGTAG. When that sequence is in the pore, we should see current drawn from this distribution, which is a Gaussian with a mean of about 59 picoamps and a standard deviation of a few picoamps. But you're fundamentally right that we need to sequence known DNA. We can't just analytically predict, by molecular dynamics simulations or something, that when this sequence is in the pore we're going to see this much current; that doesn't work very well. We need to sequence known DNA and do it empirically. If you have a homopolymer of A's, we'd still have a Gaussian distribution; it would be shifted to some other level, but it's still fundamentally the same shape of distribution. It does, yeah. So my lab published a paper on this last year, where we produced software that can detect methylation. Classical human methylation is 5-methylcytosine followed by G, CpG methylation, and we've shown that there's a perturbation of a few picoamps when cytosine is methylated, which is enough to allow us to detect whether cytosine is methylated or not, similar to PacBio, where you can detect methylation as well.
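To make the model-fitting step concrete, here's a toy sketch of scoring one observed event current against a handful of Gaussian pore models and picking the best match (the 5-mers, means, and deviations are invented, and a real base caller scores whole sequences of events jointly, not single events in isolation):

```python
import math

def gaussian_loglik(x, mean, stdev):
    """Log-likelihood of observing current x under a Gaussian pore model."""
    return -0.5 * math.log(2 * math.pi * stdev**2) - (x - mean)**2 / (2 * stdev**2)

# Invented pore model: mean and stdev current (picoamps) for a few 5-mers.
pore_model = {
    "GCTAC": (60.0, 1.5),
    "CTACG": (47.0, 1.5),
    "TACGT": (51.0, 1.5),
}

observed = 47.3  # one event's mean current
best = max(pore_model, key=lambda kmer: gaussian_loglik(observed, *pore_model[kmer]))
print(best)  # -> CTACG, the 5-mer whose model best fits the observation
```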
A sort of open question in the field is what types of methylation we can detect and how strong the signal is. So what my group is spending a lot of time on is sequencing DNA that's been enzymatically methylated, or methylated in some predictable way, such that we can train these distributions knowing whether the DNA is methylated or not. So again, we're using this data-driven approach to figure out what the signal should be if DNA is methylated. Some very preliminary but exciting work is that you can sequence RNA directly through nanopores. Rather than converting it to cDNA, you can pass RNA molecules through the pore, and we're working on whether we can detect RNA modifications, which are incredibly difficult to detect with other methods, just by looking at these current patterns. Any more questions? One more in the back. That's a good question. I don't think so, at this point. The pore is the same for RNA and DNA; this part of the system is the same when you sequence RNA and DNA, but the motor protein is different for RNA. And RNA is actually sequenced from three prime to five prime, unlike DNA, which is sequenced five prime to three prime, like on other sequencers. I think because of that, you couldn't mix DNA and RNA in one sequencing run. Yeah, we typically shear to around 10,000 bases, and that would be the mean of our read length distribution. But essentially it's unlimited, because if you can prepare and present extremely long molecules of DNA to the pore, they'll go through and you can sequence them. The longest read anybody's ever sequenced is 2.2 megabases, but that requires incredibly gentle handling of the DNA to avoid shearing it during library preparation. And you typically take a pretty good hit in yield if you go for these ultra-long reads, as we call them. So yes, you can sequence incredibly long pieces of DNA, but you don't get as much yield, so we typically shear to around 10,000 bases. Let's go forward. So, summarizing Oxford Nanopore sequencing: there are two available sequencers. There's actually a third one that I didn't show here, but the two main ones are the MinION, which gives you between 5 and 15 gigabases of output, similar to a PacBio Sequel flow cell, and the larger version of the instrument, called the PromethION, which can give you 30 to 90 gigabases and is designed more for sequencing large genomes like human genomes. The advantages are obviously the portability. These sequencers are much cheaper to buy than Illumina or PacBio sequencers, so it's easier to get into nanopore sequencing now: the MinION is around $1,000, and then about $1,000 per flow cell. The read lengths are incredibly long; you can sequence up to megabase reads. And you can detect base modifications and sequence RNA, as I just mentioned. The disadvantages are the high read error rate; deconvolving the base sequence from these current measurements is very difficult. The error rate is about 15%, similar to PacBio, but the errors are much more systematic: they cluster together in these difficult motifs, things like homopolymer runs, or sequences that just by chance have very similar current distributions as they go through the pore. So a big open question in the field is how to get over this systematic error rate and improve the quality of our genome assemblies from nanopore sequencing. Tomorrow, in the assembly practical session, you'll have a chance to assemble Illumina, PacBio, and nanopore data and compare the strengths and weaknesses of those different sequencers. All right, just to wrap up: I've given you an overview of how genome sequencing works and talked about the three key platforms and their strengths. A question I often get is: what platform should I use to sequence my sample? Of course, that depends on the question that you're interested in.
If you're sequencing large numbers of human cancers, you probably want to sequence with Illumina. If you're trying to do a de novo assembly of some very repetitive plant genome, you'd probably want to use something like PacBio or nanopore, with some Illumina data to increase the accuracy. So before I turn it over, I think to the coffee break, are there any more questions about sequencing technology?