Good morning, everybody. So this is the first lecture. This is going to be an introduction to how high throughput sequencing works. The purpose of this lecture is to introduce you to some of the sequencing technologies that are commonly in use, and more importantly, to introduce how they work at a physical level and a molecular level, so that when you get sequencing data of your own and you're analyzing it and something doesn't quite look right, you understand a little bit about how errors are generated and how these sequencers actually operate, so you can troubleshoot and figure out what's gone wrong. Just a bit of preamble before I start. All the lecture material in CBW is available under a Creative Commons license, which means that you can reuse and share this material as long as you credit it back to the original authors. In this case, the authors of some of the slides I'm going to use are Aaron Quinlan and Ben Langmead, so I appreciate their work in putting together some of these lecture materials. They maintain a great set of resources for teaching things like computational genomics, and I've benefited from some of their slides. So as I said, this is an introduction to high throughput sequencing. There isn't a lab for this module; this is more of a lecture and an introduction to how the technology works. But the things that we'll talk about here in the next hour really underpin the rest of the course over the next two days. If anything's unclear or you want me to go into more detail on any topic, feel free to just put your hand up and ask as I go. I like lots of questions; it breaks up me droning on and on. So please, if anything comes to mind, feel free to ask. A great place to start when talking about DNA sequencing is the central dogma of molecular biology. Just from listening to the intros when everybody described their projects, I don't think I need to go into this in a lot of detail.
Most of you are biologists and understand this. The central dogma of molecular biology describes how information flows from DNA through RNA to protein to carry out a cell's tasks. DNA gets transcribed into RNA, which gets translated into proteins. Those proteins fold into a huge diversity of shapes to create enzymes and all the biochemistry that goes on in the cell. Now, this is a model of how information flows in the cell, but it's a particularly good one: it allows us to make predictions about how changes in DNA sequence are going to affect the observable phenotypes of an organism. So the implication is that if we can sequence DNA at large scale, by sequencing many individuals in a population or many different species, we can understand something about how these changes in a genome underlie different phenotypes. So the molecule we're essentially going to be interested in today is DNA. I'm sure many of you, probably all of you, are familiar with the structure of DNA. It's a double helical structure made up of two strands that are bound together. The backbone of each strand is a sugar-phosphate backbone, and the interlinking molecules are the nucleotides. In bioinformatics, we represent them by four different letters, A, T, G, and C: A for adenine, T for thymine, G for guanine, C for cytosine, where A binds to T and G binds to C to make up the rungs in the DNA ladder. Now, sequencing genomes is rather hard because genomes are typically very, very large. The human genome is three billion base pairs, or three billion nucleotides, in length. Typical bacterial genomes are a few million bases, viruses are tens of thousands of bases, but a lot of us are interested in sequencing these very large genomes, which are very difficult to read. So the human genome, which is three billion bases, is packed into higher order structures to fit into a cell. So the individual strands of DNA (is the pointer showing up on both screens? Yes.)
They get wrapped around protein complexes called nucleosomes, which are made up of histones; these nucleosomes then get packed together and wound into fibers, and these are then wound again into the observable chromosomes that you can see if you look down a microscope at metaphase. Now, the reason we want to sequence human genomes is primarily that we want to understand genetic variation. A genome sequence on its own is great; we can do things like predicting where the protein-coding genes are in the genome. But really what we want to do is sequence a lot of genomes, compare them to each other, figure out where they're different, and then link those differences to observable traits. So if we sequence any pair of individuals in this room and compare the genomes, we'll find differences at about 0.2% of positions. Most of these are small differences, like single-nucleotide polymorphisms, but we know that these SNPs are crucially linked to observable differences. There are SNPs that control things like our height. These are SNPs of very weak effect, where millions of them go into controlling a complex trait like height, and there are also SNPs that control simpler traits like eye color. So essentially what we want to do is sequence genomes, compare them to each other, look for where they differ, and then, by comparing enough samples to build up statistical significance, say that this difference is linked to this trait. Now, to do this at large scale, we need to be able to sequence genomes very cheaply, and to talk about genome sequencing, or DNA sequencing, I'm going to go back to a picture of what DNA looks like. DNA is a stranded, directional molecule: there's a five prime end, which is denoted by this phosphate group in the backbone, and there's a three prime end, which is denoted by a hydroxyl off the sugar here.
Now, when we read DNA, we read it from the five prime end down to the three prime end, so we read it in this direction. In this case, we read off the identity of the nucleotides that are linked to the sugar-phosphate backbone: this is an adenine, so we'd say there's an A, this is a C, this is a T, and this is a G. So we'd say the sequence of this DNA molecule from five prime to three prime is ACTG. We could also read the same DNA molecule in the opposite direction, starting from the five prime end of the complementary strand here and going to the three prime end here. We say that the sequences of these two strands are reverse complements of each other, as we can infer the sequence of the strand opposite the one we sequenced by taking the complement of every base, the base that it binds to, and then reversing the sequence, because we always want to represent sequences from five prime to three prime. Reverse complementing DNA sequences is very common in bioinformatics. When you work with sequencing data in a FASTA or FASTQ format, you'll see an ASCII representation of the DNA that was sequenced using just the four letters, and that's always going to be five prime to three prime; if you reverse and complement it, you get the sequence for the opposite strand. Now, for essentially as long as we've known about the structure of DNA, we've wanted to be able to read the sequence of genomes, and we had to wait until around the 1970s for a molecular biology technique to be developed that could determine what a DNA sequence was. The predominant early sequencing method was invented by Fred Sanger, who was at the MRC Laboratory of Molecular Biology in Cambridge, and he invented the sequencing chemistry which is now referred to as Sanger sequencing, for which he received his second Nobel Prize.
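Since reverse complementing comes up so often when working with FASTA/FASTQ data, here's a minimal sketch of the operation in Python (complement every base, then reverse, so the output still reads five prime to three prime):

```python
# Reverse complement of a DNA sequence (5' -> 3' in, 5' -> 3' out).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("ACTG"))  # -> CAGT
```

Real toolkits also handle ambiguity codes like N, which this sketch leaves out for clarity.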
So Sanger sequencing is an incredibly clever method of figuring out the DNA sequence of a molecule, and it has two key steps. DNA polymerase is a key enzyme involved in copying DNA, and Sanger, who was a biochemist, developed specially modified nucleotides called dideoxynucleotides, which replace the three prime hydroxyl group with a blocking group that prevents a DNA chain from elongating any further. So if you give one of these dideoxynucleotides to DNA polymerase, it will be copied into a strand of DNA, and then, because there's no hydroxyl, the strand won't be able to extend any further. It'll stop at the defined position where the dideoxynucleotide was incorporated. Now, Sanger's sequencing method, as it originally worked, used four different reactions, one each for dideoxy A, T, G, and C, run in parallel, where the dideoxy version of the nucleotide would be spiked into the reaction mixture at a low level, like a few percent. If you add DNA polymerase to copy DNA within that reaction mixture, it will introduce one of these stopping bases at random positions in the growing strand of DNA. And you know that, because you only have, say, dideoxy A in the first reaction well, all of the strands in that well end with an A. So you now know the last base of all of those fragments, but you don't know exactly where those last bases were introduced. The second technique that Sanger used is gel electrophoresis, which sorts DNA fragments by their size. I'm sure many of you have run gels before: if you put a sample of DNA in a gel within a voltage gradient, the DNA migrates through the gel according to its size, where shorter fragments migrate further and longer fragments don't migrate very far.
That allowed Sanger to sort these fragments, each with a defined stopping position, by size, and to read off the sequence of the DNA by identifying what the last base of each fragment is. So in this example here, the shortest fragment, shown in this band here, was in the reaction well with ddC, so we know that the shortest fragment ends with a C, and we'd put a C into our output. The second shortest fragment ends with a G, so we'd put a G into the output. The third and fourth end with T, the fifth ends with A, and so on. You just go down these bands, looking at which reaction well they're in, to sequence that DNA fragment in Sanger's method. Now, this was absolutely revolutionary. It was a fantastic breakthrough. It was used to sequence the first genomes, which were typically viral genomes because they were very small, and you could also sequence isolated fragments of DNA. But it was pretty clear that we wanted to be able to sequence much larger amounts of DNA at a much higher scale. So over the next 20 years, Sanger's technique was refined primarily through automation, which allowed the sequencing chemistry to be performed at scale. The key innovations here were that rather than using four different reactions, one for each dideoxynucleotide, the reaction is done in a single tube with fluorescently labeled nucleotides instead. You have each nucleotide labeled with a unique color, you separate the fragments using a capillary tube, and you get the distinctive Sanger trace, which is what's shown on the bottom, where you just read off the sequence of the nucleotides by looking at the colors of each of these peaks. Let's say blue is G, red is C, and so on, and you go down these peaks, reading off the sequence of the genome that way. Now, this level of automation and the simplification of the chemistry allowed the throughput of Sanger sequencing to vastly increase.
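The read-off step above can be sketched in code: sort the chain-terminated fragments by size, then concatenate their terminal bases. The fragment list here is a made-up toy matching the band order in the slide example, not real data.

```python
# Toy Sanger gel read-off: each terminated fragment is recorded as
# (length, last_base), where last_base is the dideoxynucleotide of the
# reaction well its band appeared in. Sorting bands from shortest to
# longest and reading the terminal bases reconstructs the sequence.
fragments = [(3, "T"), (1, "C"), (6, "G"), (2, "G"), (5, "A"), (4, "T")]

def read_gel(fragments):
    return "".join(base for _, base in sorted(fragments))

print(read_gel(fragments))  # -> CGTTAG
```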
So when Sanger invented his sequencing in 1977, a very hardworking technician might sequence around 1,000 bases per day. If you wanted to sequence the entire 3 billion base pair human genome at that rate, it would take you about 120,000 years. Obviously, this isn't very practical. The first automated sequencer was the ABI 370. This was able to sequence around 5,000 bases per day, a roughly 5-fold increase in throughput, which would drop the time to sequence a human genome down to about 16,000 years. Still not very practical. The sequencers continued to be refined: the ABI 377 had bigger gels, better chemistry and optics, more sensitive dyes, and faster computers ran the base calling. This increased throughput by another 4-fold, up to around 20,000 bases per day. Finally, in 1999, a highly multiplexed version of the ABI sequencer, the ABI 3700, was developed, which could sequence 400,000 bases per day. If you ran one of these sequencers around the clock, you could sequence a human genome in around 200 years. This sequencer was the workhorse for the Human Genome Project. Rather than having one sequencer run for 200 years, the genome centers bought very many of these sequencers, tens to hundreds of them, ran them around the clock, and were able to sequence the human genome over a span of about 10 to 15 years. And this became really the pinnacle of Sanger sequencing technology, where you could sequence around 400,000 bases per day per instrument. Now, when we sequenced the first version of the human genome, it gave us this huge amount of information, but it was also clear that there was a lot of variation within genomes. And to capture that variation and link it to phenotypes, as I mentioned earlier, we would have to sequence many more genomes at large scale. The original human genome cost around $3 billion and took around a decade to sequence, which isn't a scalable process.
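The back-of-the-envelope arithmetic behind these figures is worth making explicit. The snippet below computes time-to-genome at each instrument's throughput; note that at 1x coverage the numbers come out smaller than those quoted above, because sequencing a genome in practice requires redundant coverage (the quoted figures are consistent with roughly 10-15x).

```python
# Years to cover a 3-gigabase genome at a given per-day throughput.
GENOME_BASES = 3_000_000_000

def years_to_sequence(bases_per_day, coverage=1):
    return GENOME_BASES * coverage / bases_per_day / 365

# Rough per-instrument rates from the lecture.
for name, rate in [("manual Sanger", 1_000), ("ABI 370", 5_000),
                   ("ABI 377", 20_000), ("ABI 3700", 400_000)]:
    print(f"{name}: {years_to_sequence(rate):,.0f} years at 1x coverage")
```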
We're not going to scale to sequencing very many genomes at $3 billion a pop. So companies and academic groups were actively working on what we now call next generation sequencing technology, which would drastically lower the cost and increase the throughput of DNA sequencing. This is going to be the core content of this module, and indeed the rest of this workshop: talking about these next generation sequencing instruments, how they work, how the data looks, and then, of course, how we work with the data. You'll hear a lot of different terms used for next generation sequencing, or the type of sequencing we now use. You might hear massively parallel sequencing; this was a very early term for it. High throughput sequencing is now the preferred term. Ultra high throughput sequencing, next generation sequencing, second generation sequencing: all of these terms are essentially used to describe the sequencers we now have available, which can sequence genomes very, very cheaply. I prefer high throughput sequencing. A lot of people still use next generation sequencing, but that's falling out of favor now that we're going beyond Illumina technology to the single molecule sequencers, which we'll talk about a little bit later. So the key technologies we're going to talk about are the Illumina sequencer, which was introduced around 2006; the PacBio sequencer, which came out around 2011; and then the Oxford Nanopore MinION, which I heard a few people are interested in, which came out a few years ago. My group primarily works on developing algorithms for this sequencer, so I'll spend a little bit of time describing how it works later on. Among other sequencers of note, the 454 deserves recognition as it was the first high throughput sequencer. It came out in 2005; they used it to sequence Jim Watson's genome in a very prominent early paper. Unfortunately, that sequencer is now discontinued.
A few properties of its data were quite a bit worse than Illumina's, so Illumina essentially took the whole high throughput short-read sequencing market. I'm now going to describe how the Illumina sequencer works. Just like Sanger sequencing, it relies on copying DNA and on fluorescently labeled nucleotides, so DNA polymerase is a key part of Illumina sequencing. If we have a single-stranded DNA template with free-floating nucleotides in solution and we add DNA polymerase, it will synthesize the complementary strand in the 5 prime to 3 prime direction, according to the sequence of what we call the template strand. Another term you might hear is Illumina sequencing referred to as sequencing by synthesis; this is what they call the chemistry. The way Illumina sequencing works is you take a DNA sample and fragment it into many pieces, around 300 to 400 bases. These single-stranded fragments, which we call templates, get attached to the surface of a microscope slide. So in place, we have these templates; let's say there are two in this example. Then there's a step called bridge amplification, which is PCR run directly on the microscope slide, which copies each template into a cluster of identical clones. So we've got one template here, we run PCR, and we amplify that into a cluster where all of the sequences are identical, and there are around 10,000 of them within that cluster. Now, Illumina sequencing is a cycle-based chemistry. What they do is inject a mixture of color-labeled nucleotides, again with one unique color for each nucleotide, and add DNA polymerase, and the DNA polymerase will attach the complementary base to each one of these templates. You can imagine these are clusters, but we're just drawing them as single templates here for clarity.
So there's a very expensive digital camera attached to a microscope, which images the microscope slide and detects which color of base was just incorporated into the growing strand. For this cluster, there was a green base; for this cluster, a yellow base. Now, just like Sanger's chemistry, which had the chain-terminating group on the nucleotides, the ddNTPs, the Illumina chemistry also has a molecule which blocks the chain from growing any further. But the clever bit of the Illumina chemistry is that this is reversible: they can add a chemical that cleaves off that blocking group, which allows the synthesis reaction to continue. So all of these steps, adding in the color-labeled nucleotides, imaging which color was incorporated with the camera, and then cleaving off the blocking group, are referred to as a cycle in the Illumina chemistry, and that sequences a single base for all of these clusters on your slide. You then repeat this cycle to sequence the second base, then the third base, then the fourth base, then the fifth base. So here's what it looks like: in cycle one, we've detected a green flash of light from this molecule and a yellow flash of light from this one. In cycle two, we detected red in both. Cycle three, cycle four, cycle five. This repeats for however long your template is, usually around 100 cycles to get a 100-base read. Then at the end of it, some software called a base caller takes these images, matches up clusters across the different images to say which clusters correspond to the same collection of DNA molecules, and detects which color each one is in each cycle to emit the base-called sequence. So here, the pattern of colors for this DNA fragment was green, red, yellow, red, yellow; if you take the complement of those to figure out what the sequence of the original strand was, it was TACAC.
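This last step (colors per cycle, then complement to recover the template) can be sketched directly. The color-to-base mapping below is purely illustrative; it was chosen only so that the slide's example pattern (green, red, yellow, red, yellow) comes out as TACAC, and real instruments use different dye assignments.

```python
# Toy per-cluster base call: map each cycle's flash color to the
# incorporated base, then complement to recover the template strand.
COLOR_TO_BASE = {"green": "A", "red": "T", "yellow": "G", "blue": "C"}  # illustrative
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def call_template(colors):
    synthesized = "".join(COLOR_TO_BASE[c] for c in colors)
    # The synthesized read is complementary to the template strand.
    return "".join(COMPLEMENT[b] for b in synthesized)

print(call_template(["green", "red", "yellow", "red", "yellow"]))  # -> TACAC
```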
And that gives you the sequence of that cluster. Now, this is happening massively in parallel. These example images only have two clusters, but typically in an Illumina sequencing run there are around a billion clusters on the flow cell, all being sequenced a single base at a time, which is what gives Illumina sequencing its enormous throughput and low cost. Now, I mentioned before that it's really important to understand how sequencing errors occur. The dominant mode of error in Illumina sequencing is substitution errors, where the base caller predicted the wrong identity of a base. These are caused by the templates within a cluster getting out of sync. We have this bundle of DNA molecules which are all being sequenced at the same time, and we're hopefully sequencing them all at the first base, then the second base, then the third base, and the fourth base. But if these blocking groups either aren't present on the nucleotide or don't get cleaved off, some of the molecules in the cluster will lag behind, and some of them will jump ahead. So in this example, we're on cycle two. There are two molecules which have a red base here, complementary to the green second base of the fragment; there's one that's lagging behind on the first base still, and there's one that's jumped ahead and is on the third base. What happens is that when the microscope takes a picture of the slide, it's going to see 50% red, 25% green, 25% yellow. This gives you a mixed signal, and the base caller then needs to figure out which is the true base: is it red, is it green, or is it yellow? So base callers typically have fairly sophisticated probabilistic models of this process of molecules lagging behind or jumping ahead, to deconvolve the mixed signal they're getting at each position.
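To make the phasing problem concrete, here is a deliberately naive version of what a base caller faces: given the observed color fractions at one cycle, call the dominant color and report how pure the signal was. Real callers replace this with probabilistic models of the lag/lead process; the fractions below follow the lecture's 50/25/25 example.

```python
# Naive call from a mixed cluster signal: pick the dominant color and
# report its fraction as a crude purity measure. Real Illumina base
# callers model phasing/prephasing probabilistically instead.
def naive_call(color_fractions):
    color = max(color_fractions, key=color_fractions.get)
    purity = color_fractions[color]
    return color, purity

call, purity = naive_call({"red": 0.50, "green": 0.25, "yellow": 0.25})
print(call, purity)  # -> red 0.5
```

As molecules drift further out of sync toward the end of the read, the purity drops and calls like this become less reliable, which is exactly why the error rate rises along the read.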
While it's doing this inference, the base caller reports how confident it is in the assignment of each individual nucleotide as what we call a quality score, which is just the base caller's estimate of the probability that that base call is incorrect. Looking at quality scores when you're looking at real data is crucial for trying to determine whether, say, a low-level somatic mutation is a true mutation or just a sequencing error. You typically want to use these quality scores to inform how much you trust the data you're looking at. Now, another thing that's important to note about Illumina sequencing is that the error rate is not uniform; it depends on where in the DNA fragment the base is. At the very start of the read, there haven't been many opportunities for these fragments to jump ahead or fall behind, so the error rate tends to be lower at the beginning of the read. But as you go from the 5 prime to the 3 prime end of the read, the error rate increases, as there are more opportunities and these molecules can fall further and further behind, so the signal becomes less pure as you go along the read. OK, so to summarize Illumina sequencing: it has the best throughput by far of any available sequencer. I've written 600 gigabases for an Illumina run here; it's now, with the NovaSeq, up to over a terabase for, say, a $10,000 to $15,000 run. The accuracy is the best of any sequencing technology: because it's using this cluster of molecules, there's a lot of redundancy in the signal, which allows it to accurately call bases. The error rate is less than 1%; maybe around 1 in 200 bases are incorrect. And the sequencing technology is pretty robust: you usually get pretty good runs, and library preparation methods are well worked out, so you get reproducibly good data from Illumina.
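These quality scores are the standard Phred scores, Q = -10 log10(P_error), stored in FASTQ files one ASCII character per base with an offset of 33. A minimal sketch of encoding and decoding them:

```python
import math

# Phred quality scores: Q = -10 * log10(P_error).
# FASTQ stores each score as chr(Q + 33).
def phred(p_error: float) -> int:
    return round(-10 * math.log10(p_error))

def decode_fastq_qual(qual_string: str):
    """Turn a FASTQ quality string back into integer Phred scores."""
    return [ord(c) - 33 for c in qual_string]

print(phred(0.001))              # -> 30 (a 1-in-1000 chance the base is wrong)
print(decode_fastq_qual("II!"))  # -> [40, 40, 0]
```

So a Q30 base has an estimated 0.1% chance of being wrong, and a run of `I` characters in a FASTQ quality line means Q40 calls.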
The big disadvantage, and something I'm going to come back to during the rest of the talk, is that there's an inherent limit to the read length. Because you're doing this cycle-by-cycle chemistry, where you sequence the first base of every cluster, then the second base, and each cycle takes around 10 to 20 minutes, you can only sequence around 100, maybe 150 bases of your DNA fragments. Now, the reason this is a drawback is that the human genome is huge, 3 billion bases, and it's incredibly repetitive. Some of the genomes that you're working with, like plant genomes, are also very, very repetitive, and 100 base pair reads aren't enough to completely resolve where a read might have come from in the genome. When we talk about genome assembly tomorrow, we'll come back to just how much these genomic repeats affect the genome assembly problem. So there's a lot of the genome that's inaccessible to Illumina sequencing because of the read length. So, is a lane the same as the flow cell? Right, yeah, it's a good question. Each flow cell has multiple lanes on it. I don't know what the most recent number is; they've changed it. When I worked with a lot of Illumina data, it was usually eight lanes per flow cell. Does anybody know, for a NovaSeq or an X Ten, how many lanes are on one of those? I heard somebody whisper something. 16? OK, we'll go with that, thank you. The NovaSeq flow cell is a giant thing; that's how they're scaling up, they're able to image these very, very large regions of these slides. OK, so we know that Illumina sequencing doesn't solve all problems. It's incredibly cheap; we can sequence a human genome for around $1,500 now. But as I said, we're leaving a lot of information out, because we don't have the ability to resolve the most repetitive regions of the genome.
So companies and groups have been working on what we call third generation or single molecule long read sequencers, which give much longer range information that can be useful for things like complete genome assemblies. The first one to come out was the PacBio, or Pacific Biosciences, instrument. It does not require amplification, unlike the Illumina sequencer. The way the PacBio works is that it immobilizes DNA polymerase at the bottom of what they call a zero-mode waveguide, which is just a little chamber here. This DNA polymerase will capture a fragment of DNA and then copy it in place at the bottom of this well. Now, just like Illumina sequencing, individual nucleotides have a fluorescent label, a unique color for each nucleotide. As DNA polymerase captures these fluorescently tagged nucleotides and holds them in place to incorporate them into the growing strand of DNA, we can detect that as a flash of light by imaging from the bottom of the zero-mode waveguide. Nucleotides are diffusing in and out of this chamber constantly, but when one actually gets captured by the DNA polymerase and incorporated, we see it as a pulse of light with a fairly long duration. In this image here, this is the signal intensity of the four different colors over time. When we see this pulse of green light, that means the DNA polymerase captured an A nucleotide and incorporated it over this period of time; the duration is based on the kinetics of how long the incorporation reaction takes. Then the signal drops back down, we don't see anything, then it jumps up again when it incorporates this T, goes back down, and carries on. So rather than having a cycle-by-cycle chemistry, this is essentially happening in real time as the DNA polymerase copies the DNA, and each individual base is sequenced much more rapidly. We don't have to wait 10 to 20 minutes to sequence the next base.
It's going to get sequenced whenever it diffuses in and is captured by the DNA polymerase. So the PacBio instrument is able to sequence 10,000 to 20,000 base pairs of DNA at once; the read length can be up to 20,000 bases. But because we're making single molecule measurements, there's a much lower signal-to-noise ratio, and that means the error rate of the sequencer is much, much higher. The error rate of PacBio is around 10% to 15%. And unlike Illumina, the error mode is mostly insertions and deletions, where a base was either erroneously introduced into the base-called sequence or deleted from it. Those are caused by pulses of light that didn't lead to complete incorporations into the growing strand. Something that's quite cool about PacBio, and it's also shared by the nanopore sequencer we'll talk about, is that you can detect base modifications, like 5-methylcytosine, directly from these signal pulses that we observe. So you get this extra layer of information with PacBio, and you also get much longer range information. The primary drawback, aside from the error rate, is that the throughput is much lower than Illumina sequencing: a PacBio cell will give you around five to ten gigabases of data per run. So if you're going to sequence a genome with PacBio, it's typically much more expensive than sequencing with Illumina. This is a histogram of the read length distribution for a PacBio run on a breast cancer cell line that was sequenced here at OICR in collaboration with Cold Spring Harbor. We see that there are many reads (where's my pointer?) that are greater than 10,000 bases, even greater than 20,000 bases, and the longest read in this data set was around 70,000 bases, which allows you to resolve a lot of these repetitive regions that aren't accessible by Illumina sequencing.
We're going to talk about genome assembly more tomorrow, but just to introduce that topic: the main challenge with genome assembly is that genomes are very repetitive, and 100-base pair reads aren't enough to resolve a lot of these repeats. 10,000 to 20,000 base pair reads often are, so there have been a lot of papers in the last few years using PacBio sequencing to improve genome assemblies of human genomes and many other species. There's a paper that I particularly like which shows the advantages of long read sequencing, and again, we'll touch on that tomorrow. So the last sequencer we're going to talk about is the Oxford Nanopore MinION. I really like this sequencer; it's incredibly unique and gives you a lot of different capabilities. The nanopore sequencer is the first portable sequencer. It's roughly the size of, say, a smartphone, you plug it into your computer with a USB 3 cable, and you can bring it to wherever you need to sequence and do sequencing directly in the field, on site. This sort of inverts the paradigm of how genome sequencing is done. Usually, you collect samples, ship them to a genome institute like the Broad Institute or here at OICR, they're sequenced centrally in this factory-style operation, and then the results are disseminated back to whoever collected the samples. But because the nanopore sequencer is portable and relatively stable, you can bring it anywhere in the world and do the sequencing directly in place. One of the best examples of this is a project I was involved in, taking the MinION to Guinea in West Africa to sequence Ebola samples directly in field hospitals and clinics during the Ebola outbreak from 2014 to 2016. This was a project led by Nick Loman's group in Birmingham in the UK. Here, a researcher, Joseph Boré, is loading a sample onto the MinION for sequencing. So the MinION has a flow cell; it's this black object, which is attached to the base station here.
Joseph is pipetting the sample in here; it gets drawn over the array of nanopores, which is this yellowish rectangle here. I'll show you the flow cell in a bit more detail later. Using this in Africa, we were able to sequence Ebola genomes with a much quicker turnaround, on the order of a few days rather than the few weeks or months it would take to ship the samples to Europe or North America for sequencing. So we were able to give information back to the epidemiologists who were trying to control the outbreak much more quickly. Now, not only does the nanopore sequencer give you these unique capabilities, I just think the sequencing technology is quite cool as well. It does away with the idea of using fluorescence to detect which nucleotides were incorporated, and instead directly measures the physical properties of DNA. So here's an illustration of how a nanopore sequencer works at a very low level. The nanopores are protein nanopores; they're taken from things like bacterial cell walls, where a protein with a channel running through it allows molecules to pass from one side to the other. In Oxford Nanopore's MinION, these protein nanopores are embedded within a membrane, which is shown in black, and the channel is wide enough that single-stranded DNA can pass from one side of the membrane to the other. Now, the movement of DNA from one side of the membrane to the other is controlled by what they call a motor protein, which acts as a brake that stops the DNA from going through the pore too quickly and slows it down to around 500 bases per second. And not only is single-stranded DNA going through the pore, but charge-carrying ions from the salt buffer are also flowing from one side of the membrane to the other. The flow of current is continuously monitored at around 4 kilohertz, 4,000 samples per second, which is the observable output of the sequencer.
Now, the fundamental principle here is that the amount of current you observe depends on the identity of the portion of the DNA molecule that's sitting in the constriction in the pore. Some sequences will allow more current through, some will allow less, and we can use this information to predict what the DNA sequence was in the pore when those samples were taken. Now, this would be an illustration of the sequencing process, but unfortunately my slides haven't survived the conversion from Keynote to PowerPoint. Essentially, the idea is that at some point we have some sequence in the pore, say GCTAC, and we observe some current samples, say around 60 picoamps. As the DNA moves through the pore by a single base, the current signal changes; it's gone from 60 picoamps down to around 45 picoamps. As it moves again, we see different current samples, and so on until we've seen the whole DNA fragment. Now, what my group is involved in is building probabilistic models of how much current we should observe depending on what DNA sequence is in the pore. The way this works is you sequence known DNA, DNA with a known sequence, and you build up a profile of how much current you observe when each short fragment of DNA, a k-mer like AGGTAG, is in the pore. We fit a Gaussian distribution to this, and we say that when this sequence is in the pore, we expect to see around 59 picoamps. We can then build probabilistic models that use these distributions to predict what the DNA sequence was from the observed current samples. Now, Oxford Nanopore's sequencer was known for having a very high error rate, and the data was very difficult to work with, but over time this has slowly improved. The early Oxford Nanopore data had an error rate of around 20%, or an accuracy of about 80%. They changed from what they call the R7 pore to the R9 pore, which improved accuracy to around 90% to 95%, about a 10% improvement.
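The Gaussian k-mer model described above can be sketched in a few lines. Everything in this table is made up for illustration (real pore models cover all k-mers, and real basecallers chain these per-k-mer scores over the whole signal with an HMM or neural network rather than scoring one measurement at a time), but it shows the core idea: each k-mer has a learned mean and spread of current, and candidate sequences are scored by likelihood.

```python
import math

# Hypothetical per-k-mer current model: (mean, std dev) in picoamps,
# as would be learned from sequencing DNA of known sequence.
# These three entries and their values are made up for illustration.
KMER_MODEL = {
    "AGGTAG": (59.0, 2.0),
    "GGTAGC": (45.5, 1.8),
    "GTAGCT": (62.3, 2.2),
}

def log_likelihood(kmer, observed_pa):
    """Log probability of an observed current level (pA) under the
    Gaussian distribution for the k-mer currently in the pore."""
    mu, sigma = KMER_MODEL[kmer]
    return (-0.5 * ((observed_pa - mu) / sigma) ** 2
            - math.log(sigma * math.sqrt(2 * math.pi)))

def best_kmer(observed_pa):
    """Most likely k-mer for a single current measurement."""
    return max(KMER_MODEL, key=lambda k: log_likelihood(k, observed_pa))

print(best_kmer(60.0))  # AGGTAG: its 59 pA mean is closest to 60 pA
```

In a real basecaller these per-event likelihoods are combined across the entire trace, with overlapping k-mers constraining each other, which is what makes the prediction far more accurate than scoring each measurement independently.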
So it's a little bit better than PacBio, but the error rate is somewhat less uniform; it's more biased toward difficult sequencing contexts like homopolymer runs, so when you calculate a genome assembly, the consensus accuracy is a little bit lower. So here's what a nanopore sequencer looks like. This is the flow cell; this is the port that you use to pipette the sample in; this is the array of nanopores. There are 2,048 nanopores that can each individually sequence a fragment of DNA, although they can only sequence in groups of around 500 at a time. If you multiply the sequencing speed, 500 bases per second, by the roughly 500 pores you can sequence from at once, the theoretical throughput of the nanopore sequencer over a run is around 30 to 40 gigabases. Typically in the field, we're seeing between 5 and 10 gigabases of yield, which is comparable to the latest PacBio chemistry. Just like PacBio, the read length is much, much longer than Illumina's. The nanopore will essentially sequence any length fragment of DNA: if you can present extremely long fragments of DNA to the nanopore, it will pull those fragments through and sequence them. Typically we aim for around 6- to 10-kilobase fragments, but there are groups trying to optimize the read length who have gotten 900-kilobase reads, almost a megabase, which is like a quarter of the E. coli genome in a single read. As an illustration of just how portable the nanopore sequencers are, NASA took one of the sequencers up to the International Space Station to show that it can sequence in microgravity conditions. This is the astronaut Kate Rubins, who ran the experiment. The sequencing data was essentially comparable in quality to sequencing performed in parallel on the ground; essentially, the nanopore sequencer works in space the same as it works on the ground. The only difference is you need to attach it to the wall with a little bit of Velcro so it doesn't float away from you during sequencing. Yeah?
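The back-of-envelope throughput estimate above can be written out explicitly. The 48-hour run length is my assumption (it isn't stated in the lecture), chosen because it puts the result near the quoted 30 to 40 gigabase range:

```python
# Back-of-envelope nanopore throughput, using the lecture's numbers
# plus an assumed 48-hour run length (not stated in the lecture).
BASES_PER_SECOND = 500   # translocation speed per pore
ACTIVE_PORES = 500       # pores that can sequence simultaneously
RUN_HOURS = 48           # assumed total run length

theoretical_bases = BASES_PER_SECOND * ACTIVE_PORES * RUN_HOURS * 3600
print(f"{theoretical_bases / 1e9:.1f} Gb")  # prints "43.2 Gb"
```

The gap between this theoretical figure and the 5 to 10 gigabases seen in practice comes from pores sitting idle between reads, pores dying over the run, and libraries running out of loadable DNA.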
Does the nanopore only allow DNA through, or can other, non-DNA molecules go through the pore and potentially give you a signal? Yeah, so something they're working on is detecting other molecules as well. The big ones right now: they have DNA working very well, and they're directly sequencing RNA as well, so you don't need to reverse transcribe RNA to cDNA; you can sequence RNA directly. Protein sequencing is a big one that people have been interested in for a long time; it's much harder to sequence protein than DNA or RNA. And they're also trying to develop ways of detecting arbitrary small molecules in solution as well, but again, that's very hard. So the two that are working are DNA and RNA; protein and other small molecules are planned. So with other things in solution that can pass through the nanopore, how do you know that you're not getting signal from a stray protein that's going through the nanopore rather than DNA? Yeah, so there are a few parts to that. The size of the pore is about a nanometer and a half wide, so a folded protein is not going to fit through. And you also leverage the very long length of DNA: in these current traces you'll see blips of current, where the current drops and then comes back very, very quickly, and you just ignore those. If you have a 10,000-base-pair fragment of DNA going through the pore at 500 bases per second, you expect to see an incredibly long signal, so you know there was a DNA strand rather than just something that happened to go through the pore by chance. Okay, so to summarize sequencing technologies: Illumina has 100 to 200 base pair reads and up to 600 gigabases per run (this is a bit out of date now; it's over a terabase), with a very low error rate, predominantly substitutions. PacBio and Oxford Nanopore are single-molecule sequencers which don't require amplification; they can sequence DNA reads that are longer than 10,000 bases.
They can give you around 5 to 10 gigabases per run, but with a higher error rate. And both technologies can detect modified bases like 5-methylcytosine natively, without requiring things like bisulfite conversion. This slide is difficult to maintain because sequencing technology moves so quickly; even since I put this slide together a few months ago, the yields of basically all of these sequencers have improved. So take this with a grain of salt and know that it changes very rapidly, particularly in terms of yields. So that's all I have for sequencing technologies. I'm happy to take any questions before we move on. Yep. I'm not sure if I answered the question. Sure. So then you just take the specimen and prep it somehow and then put it in? So there's no amplification? There is amplification. So the ultimate goal of these sorts of infectious disease projects is just to take a blood sample, or some sample from an individual, extract DNA or RNA, and then identify what's in the patient. Unfortunately, a lot of these viruses are present in such low copy number that you just don't get enough reads to do that identification. So for Ebola, there was a PCR amplification step first, to both amplify and also enrich for Ebola sequences. We published a follow-up study that just came out yesterday, sequencing the Zika virus in South America and the tropics. And Zika was much, much harder than Ebola because it was present in much lower copy number, very, very low, like 20 copies of the genome per milliliter of blood or something, so they had to amplify very, very heavily. Ultimately, though, the goal would be to have a sensitive enough method that you could just do metagenomics, but we're not there yet.