Okay, sequencing technologies. So of course, the most important part of the introduction of this course is what the different sequencing technologies related to next generation sequencing actually are. If you think about next generation sequencing, you often also think about the applications, obviously. I think maybe 90 to 95% of the applications can be sorted into five or six different topics. One of them is, for example, transcriptome characterization, where you actually sequence transcripts, so the genes that are expressed in a tissue. Another application is epigenome characterization, for example ATAC-seq or bisulfite sequencing. Another one is DNA-protein interactions, for example ChIP-seq. Then whole genome sequencing, for example if you want to do an assembly: very relevant for bacterial genomes, but more and more eukaryotes also have an assembled genome nowadays, just because sequencing has become cheaper and we now have the computational infrastructure to do it. Another application can be variant detection, for example if you are interested in the relation between the absence or presence of specific variants and specific phenotypes, for example disease resistance. Metagenome characterization can be related to whole genome assembly, depending a little bit on the application, of course. Maybe you can think of other applications. I have a hard time thinking of really other applications, so that's why I'm asking you. If not, then I continue to the next slide. So a typical experiment in one of those applications usually starts with the experimental design. You get your samples ready, you extract your DNA or your RNA depending on your starting material and your type of application, you do a specific library preparation, and then you have the sequencing. And after the sequencing has finished, the bioinformatic analysis starts. So after you have the sequencing reads, the bioinformatic analysis starts. Usually what you do first in the analysis is some quality control, just a screen on the FASTQ file, so on the reads you have: whether you got the quality you expected. One very important quality measure is, for example, the base quality of the reads. That is a quality measure for how sure the sequencer was that the base it called, for example an A or a T, was actually the base that was sequenced. We'll go into that later on. After you have done the quality control on the sequencing reads, the alignment starts. Very often people have a reference genome, and the reads you generate, you align them to that reference genome, depending of course a little bit on the application. So if you are interested in assembly, then you cannot do an alignment yet; of course, first you want to do the assembly. Then, depending very much on the application, you do all kinds of downstream analysis. For example, if you look at an RNA-seq data set, you're going to count the number of alignments per gene, and that can be a proxy for the expression of that gene, and that can then of course answer certain biological questions. That particular part of the downstream analysis will not be part of this course, at least not primarily, because it is so different between all the different applications of next generation sequencing.
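Coming back to that base quality measure for a moment: it is stored in the fourth line of every FASTQ record as ASCII characters. Here is a minimal Python sketch (assuming the common Phred+33 encoding; the quality string is invented for illustration) of how those characters translate into Phred scores and error probabilities:

```python
# Minimal sketch: decoding Phred+33 base qualities from a FASTQ quality line.

def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string into per-base Phred scores."""
    return [ord(ch) - offset for ch in quality_string]

def error_probability(q):
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)

quality_line = "IIIIHHGF#"  # invented quality line for a 9-base read
scores = phred_scores(quality_line)
print(scores)  # 'I' is Q40: a 1-in-10,000 chance the base call is wrong
print([round(error_probability(q), 4) for q in scores])
```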
After and during the analysis, you always want to do some visualization. In this context, with visualization I mean track-based views of the genome: you have a certain part of the genome that you're interested in, and you're going to look there, for example, at the alignments or the variants that you have called in that particular region. So in this context, visualization means really looking at a track in the genome, at a part of the genome, and seeing what kind of features are there. And of course, the visualization doesn't necessarily have to come after the downstream analysis. You also very often use it as a quality control, for example after the alignment. To generate those sequences, there are many ways to do that, and these different methods have definitely evolved over time. This is an image from 2016, so it's already six years old, which in the world of next generation sequencing is already pretty ancient, but I think it nicely depicts the evolution of these different methods. What you see on the top left in green are the ABI sequencers. They have been discontinued, most likely, or I'm pretty sure, mainly because Illumina was too strong a competitor: the technology was very similar to Illumina, but I think it was slower and just more expensive compared to Illumina sequencing. The Illumina sequencers you see in brown. I should, by the way, also explain the graph: on the x-axis you see the read length on a log scale, and on the y-axis you see the throughput on a log scale, really meaning that if you go a little bit up, you already go ten times as high in throughput, for example. What you see is that the Illumina sequencers evolved from relatively low throughput, very similar to ABI, to longer and longer read lengths and also higher and higher throughput. After that, you see some kind of divergence into low throughput machines and higher throughput machines, and that can be important for scalability; we'll go into that later on. The blue ones are the Ion Torrent machines. Maybe some of you have one of those in the lab, because they are usually relatively small and relatively easy to operate, so you can have an Ion Torrent machine in the wet lab. They generate reads of at most about 400 base pairs at relatively low throughput, but high enough for many different applications. Then we have the 454 sequencers in orange, not used very often anymore. And on the bottom right we see Sanger sequencing. That's not next generation sequencing, that's first generation sequencing, with a pretty high read length as you can see, higher than most of the modern second generation sequencers, but very low throughput. On the top right, on the very right of the image, we see the long read sequencers, with PacBio and Oxford Nanopore Technology especially. So I'm going to mute you, Jan. There we go. Yes. Okay. So on the right, PacBio and Oxford Nanopore Technology sequencers. They have evolved very much in the last six years, as has Illumina, by the way. With the PromethION, Oxford Nanopore Technology really made a huge step in throughput, and also a little bit in read length, but mainly in throughput, and nowadays it is the machine with the highest throughput of all sequencers. Then we have the PacBio Sequel II, which has very much increased in both read length and throughput, also making it more competitive with the second generation sequencers.
A disadvantage of these long read sequencers is that by default they have a relatively low base quality, which means a relatively high error rate: they make quite a lot of mistakes during sequencing. That is the case for both PacBio and Oxford Nanopore Technology. However, the error rate keeps decreasing as these technologies evolve, and PacBio actually has a pretty smart solution, because what they do is sequence the same molecule multiple times, and because the errors are random, if you make a consensus out of that molecule you actually get a very high base quality, comparable to, for example, Illumina reads. And that is very interesting, of course, because you get long reads that have a very high base quality, and still with a pretty high throughput. Those are the HiFi reads from PacBio. During this course, we will not talk about Sanger sequencing a lot, at least not so much about the technology. We will mainly focus on second generation sequencing. Nowadays, the most frequently used platforms for second generation sequencing are the Ion Torrent machines from Thermo Fisher and the Illumina sequencing machines. And then third generation sequencing, usually also called long read sequencing, with the sequencers from PacBio, Pacific Biosciences, and Oxford Nanopore Technology. I have a question for you. I'm going to stop here and share my browser. Here we go. My question for you is: although Sanger sequencing has been around since the 70s, so for a very long time already, it's still used by quite a lot of labs. Why do you think that is? What most of you answered is: because the output quality is still unchallenged. I would actually disagree, because the base quality in Sanger sequencing is definitely not higher compared to Illumina or Ion Torrent sequencing. So I disagree with that. The second most given answer is: because the per-base sequencing costs are very low. That's actually also not true, or at least I would disagree with that. If you sequence a bunch of samples on a NovaSeq 6000, which is one of the highest throughput machines, or maybe an Oxford Nanopore PromethION, then the costs per base are definitely much, much lower than for a standard Sanger sequencing run. So I also disagree with that. The correct answer is: because it's very scalable. If we go back to our graph, there we go. What you can see over here on the y-axis is the gigabases per run. Let's say nowadays the smallest Ion Torrent or Illumina sequencers generate about 7 to 8 gigabases per run, and you have to generate all of those gigabases, and that's really a lot. Gigabases, so, to say this correctly, a gigabase is a billion bases. You have to generate all these bases, which means that if, for example, you are cloning and you're just interested in whether you have cloned what you actually wanted to clone, or if you have generated an amplicon and you want to know the sequence of that amplicon, then you actually only need, I don't know, 300 bases, or maybe a little bit more. Let's say you need 2,000 bases and not more than that. If you put that on such a machine, it would still be very expensive to sequence, because you would just generate a lot of the same thing: you'd have to generate that whole amount of reads anyway.
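To make that cost argument concrete, here is a tiny sketch with invented, order-of-magnitude numbers (none of these prices are from the lecture; real prices vary a lot):

```python
# Invented, order-of-magnitude numbers purely to illustrate the
# scalability argument; real prices differ and change constantly.
ngs_run_cost, ngs_bases = 1_000.0, 7e9   # hypothetical: one small NGS run, ~7 Gb
sanger_cost, sanger_bases = 5.0, 800     # hypothetical: one Sanger reaction, ~800 bp

print("NGS cost per base:   ", ngs_run_cost / ngs_bases)    # tiny per base
print("Sanger cost per base:", sanger_cost / sanger_bases)  # much higher per base

# But for a single 800 bp amplicon you effectively pay per *run*, not per base:
print("one amplicon on NGS:   ", ngs_run_cost)  # you still buy the whole run
print("one amplicon by Sanger:", sanger_cost)   # scales down to the small job
```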
With Sanger sequencing, you only ever generate a very low number of bases, so you can scale it much better. If you're only interested in a few amplicons, for example, then it's still cheaper per base to use Sanger sequencing. However, if you want to generate a lot of data, per base it is of course much cheaper to take the high throughput sequencers. Nowadays, Oxford Nanopore Technology is definitely competing with Sanger sequencing, because its MinION sequencers also have relatively low throughput and are relatively cheap to generate a library for, so they are kind of competing with Sanger sequencing. But other than that, Sanger sequencing is still unchallenged in terms of scalability, especially if you are only interested in sequencing a few bases. Is that clear? If you have any questions or remarks regarding that, please raise your hand. No? Great. So now we go a bit more in depth into the different technologies. Again, we will focus mainly on Illumina sequencing, but I want to give you a rather broad overview of the different sequencing technologies. We'll start with Ion Torrent sequencing, also a type of massively parallel sequencing. Sukanya has a question. Yes: is high or low throughput completely independent of the amount of input? So, is low input completely independent of whether you're going to do high throughput or low throughput sequencing? Yes, in a way, yes. What we consider low throughput still means really a lot of bases that can be sequenced. Meaning, I don't know the exact numbers, but if you have only a few nanograms of DNA, for example, there are still really a lot of bases in there; it's just a challenge to be able to actually sequence those bases. What usually happens is that those machines still need relatively a lot of input, so for low input library preparations there are quite a few PCR cycles before you actually start the sequencing. That is not targeted PCR: you try to amplify the entire library, meaning non-targeted PCR, but still there are some PCR steps there that increase the input amount. But in principle, even very low input samples still contain really a lot of diversity in terms of RNA or DNA, a lot of different molecules to sequence. Okay. So, Ion Torrent sequencing. I think it's about the same age as Illumina, maybe even a bit older, because the technology is, I guess, relatively simpler. A nice thing about Ion Torrent sequencing is that already from the beginning it had relatively low throughput. You could also scale it up, but the initial library preparation has relatively low throughput, I think a few million reads per library prep. Another nice thing is that it's a relatively small machine, really a benchtop, so it fits in many wet labs, and that's also why it's still quite popular. So what happens during Ion Torrent sequencing? Let me just start with the library preparation; what library preparation exactly is, we will go into more deeply when we discuss Illumina. You start with a biotin label that you can use to attach the fragment to a magnetic bead, which makes it easier to pull out sequences. There's an adapter, and you have your DNA in there, usually a random piece of DNA if you are doing whole genome or RNA sequencing, or it can be a specific targeted sequence
if you look, for example, at amplicon sequencing. And then on the other end there is another adapter. You use the magnetic beads to pull out those sequences and use emulsion PCR to amplify a single fragment, so that within one emulsion droplet all copies are the same. That emulsion is loaded on a chip, so within each of the wells on the chip you have an emulsion droplet containing copies of exactly the same sequence. What happens then is that this chip is flooded with, for example, only G's. If a G is incorporated, you get a difference in pH in that specific well, and that difference in pH can be measured by very sensitive equipment. Depending on the strength of the pH signal, you can actually see how many G's were incorporated: if multiple G's are incorporated, you get a stronger signal than if one G is incorporated. It all happens one nucleotide at a time. So if you flood with a G and a G is incorporated, you get a signal; if you then flood with an A but it cannot incorporate that A, you do not get a signal. You always flood with A, G, C, T, and then again A, G, C, T, and based on the signal you get for one of the different nucleotides, you know which nucleotide is there. So here in the image, you get two protons released because two T's are incorporated, and you get a graph like this: one G incorporated, then two T's, then two A's, two C's, and then an A. These peaks depict the flooding of the chip with the different nucleotides. If you look at the output of Ion Torrent sequencing, it's pretty interesting: you can go up to 400 base pairs of read length, which is longer than most Illumina sequencers. It is quite scalable, so you can go to relatively low throughput at not too high cost, and you can also go to relatively high throughput, not comparable to the huge Illumina or Oxford Nanopore Technology machines, but still it's quite scalable. However, Illumina also started producing relatively low input machines, so there's definitely some competition going on there. An essential disadvantage of Ion Torrent sequencing is that it has difficulties with sequencing homopolymers. If you have, for example, five T's in a row, it finds it very difficult to differentiate between five T's and six T's, and that's because of this analog signal that has to be translated into a number. So very often, if you look at Ion Torrent data and you have a homopolymer in your sequence, and homopolymers occur very frequently, you also see a lot of insertions and deletions around it, which are actually sequencing errors. Siba has a question. Yes, my question is about the use of the bead: is it just being used to fish out the sequences, or does it also have a role in where it places them on the pH-sensitive chip? Yeah, good question. I'm pretty sure, but I would have to check to be honest; it's been a very long time since I've used this machine, so I can look it up for you. I'm pretty sure this magnetic bead only captures a single fragment, and that single fragment is then used for PCR and then loaded on the chip. Okay. But I will have a look to be entirely sure. Yeah, thank you. Thanks.
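Before moving on to Illumina: as a toy illustration of the flow-based readout described above, here is a hypothetical Python sketch (flow order and signal values are invented). It also shows why homopolymers are the weak spot: the signal encodes only a count, and a noisy 5.4 versus 5.6 rounds to different homopolymer lengths:

```python
from itertools import cycle

# Hypothetical, idealized flowgram: the chip is flooded with one
# nucleotide per flow, and the signal is roughly proportional to the
# number of bases incorporated in that flow (0 = no incorporation).
FLOW_ORDER = "AGCT"
signals = [0.1, 1.05, 0.0, 2.1,   # -> G, TT
           2.0, 0.0, 1.9, 0.1]    # -> AA, CC

def decode_flowgram(signals, flow_order=FLOW_ORDER):
    sequence = []
    for base, signal in zip(cycle(flow_order), signals):
        count = round(signal)      # homopolymer length from signal strength
        sequence.append(base * count)
    return "".join(sequence)

print(decode_flowgram(signals))    # GTTAACC
# For long homopolymers the signal is noisy: 5.4 rounds to 5, 5.6 to 6,
# which is exactly why indel errors cluster at homopolymers.
```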
Okay, so that was Ion Torrent sequencing; then Illumina sequencing. Illumina sequencing is the most frequently used method for generating sequencing data nowadays. Of course, third generation sequencers are gaining popularity, but this is the most popular one, I would say, and it has really high throughput: the NovaSeq 6000 can generate about 500 billion bases per run, which is really massive. And they just introduced an even higher throughput machine; I forgot the name, and the exact throughput I'd have to look up, but it's even higher. So it's the most used platform today, definitely. How does it work, and what kind of data do you get out of it? Typically you generate reads between 50 and 300 base pairs in length, so relatively short: shorter than Ion Torrent, shorter than Sanger sequencing. Therefore Illumina sequencing is also often referred to as short read sequencing. The nice thing about this sequencing is that you can sequence paired-end, which means that you can sequence the same fragment, the same DNA fragment, from both sides. You do not know how far those two sequences are from each other, but you do know that they come from the same fragment, so if you align them back to the genome, they have to be relatively close to each other. That already helps with a lot of different questions. Yeah, does that mean the size of a fragment should not be more than 600 base pairs? Not necessarily. If you want to sequence the entire fragment, yes. But what people often do is sequence longer fragments, and you often know the average fragment size before you do the sequencing, so you know roughly what length of sequence there would be in between the two reads, and that can already be valuable information. Just to compare to single-end sequencing: there you only sequence 150 to 300 base pairs and you have no idea about the sequences in the neighborhood of that sequence. If you sequence paired-end, then you know that the two sequences are paired, that they should be relatively close to each other, even though you do not know the exact distance between them. And that is already much more information than no information. But why do companies ask for 150 to 1,000 or 800 base pairs? When we are sending samples to a company, they say the optimal length should be 150 to 800 or something, of the fragment size. Yeah, so it depends a little bit on the question you have. If you are, for example, doing RNA sequencing and you're interested in splice variants, what you actually want is usually a relatively large fragment size, because then you can link exons to each other. If the sequencing is paired-end and one read aligns to one exon and the other read aligns to another exon, which can be spliced out or not, then you know that these two exons, or these two reads, come from one specific transcript. Got it. For example, yeah. Thank you. So there's of course huge throughput, but you are often not interested in generating, for example, 500 billion base pairs from one and the same sample. Maybe if you have a huge genome and you want to do variant analysis in that genome, so you do whole genome sequencing, you might be interested in generating 500 billion bases for that one sample, but usually what you try to do is load multiple samples on the same sequencing machine, and then you have to know which sample those reads actually came from.
And we do that by barcoding. We'll go into it later on, but it's very typical: when this massive throughput sequencing started, you had to be able to do multiplexing, meaning being able to load multiple samples in the same sequencing run. So what does the Illumina library preparation look like? Ignore the image on the top right, that shouldn't be there yet. What you usually do is, let's say you have your DNA; for example, you have reverse transcribed your RNA. What you often do first is shear it, because often those fragments are too long to directly do, for example, paired-end sequencing with, so you cut them up into smaller pieces. After that, also a little bit depending on your application, you do a size selection, and that's where you actually define the fragment size you're interested in, usually somewhere between, let's say, 200 and 800 base pairs, depending on your application. When you have done this size selection, you ligate the adapters to the fragments of the right size. These adapters are known sequences that are later on used, for example, to do PCR and to add other sequences to your fragments. Then barcodes are added if you do multiplexing, and most of the time you want to do multiplexing, so you want to sequence multiple samples at the same time in the same run. And the P5 and P7 oligos are added; these P5 and P7 sequences are required to anneal to the actual sequencing lane. So they are also known oligos, typical for Illumina, that are used to be able to anneal to the sequencing lane. Very typical is to then do a PCR, depending for example on the amount of input you had, usually between 8 and 16 cycles of PCR. Usually you try to have as few cycles as possible, because PCR can bias the fragment distribution. There's a question from Daniela. Yes: what is the difference between the barcode and the unique molecular identifier? This is always confusing. Okay, we will not go into that in this presentation, but I can explain it here. Well, I should say both the barcode and the unique molecular identifier are added to the fragment before PCR. However, the unique molecular identifier is a completely random sequence, and because it's completely random, it identifies only the molecule, meaning the fragment; the barcode identifies the sample the fragment actually came from. So you have a lot of different fragments with a barcode that is exactly the same, while, for simplicity, you can assume that every fragment has a unique molecular identifier, a unique UMI. Later on, after you have done the PCR, if you see the same fragment with the same UMI twice, you know that those two reads originally came from the same original fragment, so they are PCR duplicates. Can you also know this with the barcode? Come again? Can you also know this with the barcode? No, because all the fragments that come from the same sample have exactly the same barcode; all the fragments from the same sample each have a unique UMI. Okay. Yep. Okay, then you typically do a PCR, and then you go on to the sequencing; with the PCR you generate enough input material for the actual sequencing.
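A toy Python sketch of that distinction (all sequences invented): the barcode routes a read to its sample, while the barcode-plus-UMI-plus-fragment combination flags PCR duplicates:

```python
from collections import defaultdict

# Invented reads: (barcode, UMI, fragment sequence).
reads = [
    ("ACGT", "TTAGC", "GGGAAACCC"),
    ("ACGT", "TTAGC", "GGGAAACCC"),  # same sample, same UMI -> PCR duplicate
    ("ACGT", "CCGAT", "GGGAAACCC"),  # same fragment sequence, different molecule
    ("TGCA", "TTAGC", "AAATTTGGG"),  # different barcode -> different sample
]

samples = defaultdict(list)  # demultiplex: barcode -> fragments
seen = set()                 # deduplicate: (barcode, UMI, fragment)

for barcode, umi, fragment in reads:
    key = (barcode, umi, fragment)
    if key in seen:
        continue             # drop the PCR duplicate
    seen.add(key)
    samples[barcode].append(fragment)

for barcode, fragments in samples.items():
    print(barcode, fragments)
# ACGT ['GGGAAACCC', 'GGGAAACCC']   <- two distinct molecules kept
# TGCA ['AAATTTGGG']
```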
For the sequencing itself, there's a nice movie from Illumina. It is the best movie I could find on the internet, so apologies for the commercial content, but it is what it is. Just to give you a recap of the movie: what happens is, you get this bridge amplification, and what that bridge amplification does is create spots on the sequencing lane, and those spots contain copies of the exact same fragment. We've talked about these PCR steps over here: those are PCR steps for the library preparation, done just in solution. Then you get a second PCR, and that's the bridge amplification, which generates spots of sequences that are exactly the same. You need that bridge amplification, creating these spots with copies of exactly the same fragment, to get enough signal from the base incorporation: those bases are incorporated all at the same time in those identical fragments within that specific spot. So the sequencing primer anneals, a base is incorporated, and depending on the base that is incorporated you get a light signal, a fluorescent signal. That fluorescent signal is picked up by basically a very sensitive camera, and the order of the fluorescent signals is translated into a sequence. So a single spot, a group of bridge-amplified fragments, generates the sequence of a single read. All right, some definitions, because we have already heard the word fragment and, I think, insert size already. This is how the library template looks before you start the sequencing: you have the P5 and P7 that are used to anneal to the sequencing lane, then you have the adapters that are used, for example, for PCR in general, but also for annealing of the sequencing primer. We have the barcode; sometimes you have dual indexing, and in the movie they actually mentioned two barcodes, so you can also choose to have a barcode on both the five prime and the three prime end, but one barcode can definitely be enough to have multiple samples sequenced in the same lane. And then you have your actual unknown fragment, and that is the sequence you generate the reads from. That black part, that unknown sequence, is what you call the insert. Sorry, the black part including the adapters is usually called the fragment, but sometimes people also call just the black part the fragment; it's used differently, so depending on who you ask, the fragment is, let's say, the unknown sequence coming from your organism plus the adapter sequences, or it is only the unknown sequence. But if you talk about insert size, that is always the unknown sequence inserted in between the adapter sequences, so then you always talk about this unknown black part. And then you have a measure called the inner distance, and that is the part of the insert that is not covered by either read.
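These definitions boil down to simple arithmetic. A small sketch with invented numbers relating fragment size, insert size, read length, and inner distance:

```python
# Invented example numbers for a paired-end library.
adapter_total = 120   # P5/P7 plus adapters flanking the insert, in bp
insert_size = 500     # the unknown "black part" between the adapters
read_length = 150     # each of the two paired-end reads

fragment_size = insert_size + adapter_total
inner_distance = insert_size - 2 * read_length  # unsequenced middle part

print("fragment size: ", fragment_size)   # 620 bp
print("inner distance:", inner_distance)  # 200 bp covered by neither read
# If inner_distance were negative, the two reads would overlap.
```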
There's a question from Pums. So the read primer attaches near the adapter or near the barcode? Because we need the barcode information as well, right? Yeah, so actually the sequencing primer for the barcode is a different sequencing primer than the one that is used for, for example, read 2, which is actually the reverse complement of the adapter on the other side, so there the sequencing runs the other way around. This is done to be sure that you really explicitly generate the barcode sequence, so the barcode sequence is not mixed with the actual fragment sequence you want; it's a completely separate sequencing step. So the software in the machine will link the barcode and the read that was synthesized in the process? The barcode is sequenced in one way and the read in another, so these are two different processes, right? They have different sequencing processes, yes. So how are they clubbed together, in software? Because they come from the same spot. Each spot has a certain position on the sequencing lane. The sequencer starts with the actual sequencing of the fragment, for example generating 150 base pairs. Then it anneals the sequencing primer for the barcode, and the machine knows: okay, now I'm going to sequence the barcode. So you get a sequence generated, and then you know: from that same spot we have this read, we also have the reverse read, and we have this barcode. Okay, that's how the three are associated. Come again? The machine does everything, the attaching and everything? Yeah, because the signal comes from exactly the same spot, it has these three sequences and knows that they belong to each other. Got it. Thank you. Right, my question: if you report the number of reads from a sequencing run, do read 1 and read 2 from paired-end sequencing count as one? Again, sorry? So, in the data analysis you see these read numbers, or sometimes they say spot numbers, how many reads you have in the whole sequencing run. And as I've seen it, if it's paired-end sequencing, the two reads of a pair count as one. Yeah, indeed; talking about definitions, that's also a good example. With paired-end sequencing, let's say you want to generate 20 million reads: you usually say, okay, I want to generate 20 million paired-end reads, which means that you get two FASTQ files, both with 20 million reads, so 20 million forward reads and 20 million reverse reads. You usually do not say: I'm going to generate 40 million reads which are paired, because they come from the same fragment and from the same spot. Ideally you would say: I generate 20 million reads from 20 million spots, because then you know that they always belong to each other and you can, for example, pair them. Does that answer your question? So basically, two reads count as one, is that it? Yeah, that depends on how you would like to define it, indeed. But let's say if you generate 20 million paired-end reads, in total you get 40 million sequences. I guess that's the best way to describe it.
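A minimal sketch of that bookkeeping (file names are placeholders): each pair appears once in the R1 file and once in the R2 file, with four lines per record in an uncompressed FASTQ:

```python
def count_fastq_reads(path):
    """Count records in an uncompressed FASTQ file (4 lines per read)."""
    with open(path) as fh:
        return sum(1 for _ in fh) // 4

# Placeholder file names for the forward (R1) and reverse (R2) reads.
r1 = count_fastq_reads("sample_R1.fastq")
r2 = count_fastq_reads("sample_R2.fastq")
assert r1 == r2, "paired files should contain the same number of reads"

print(f"{r1} read pairs = {r1 + r2} sequences in total")
```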
Yeah, I just wanted to know. And is the directionality given by the P5 and P7, meaning that all pairs are directional? Yeah, the adapters; read 1 is always from the same side, definitely. Indeed, I wouldn't know it by heart, but I think read 1 is sequenced from the P5 side of the fragment, which means that if you manage to make a library in such a way that a certain side of the fragment always ends up at, let's say, the P5 side, then you can indeed sequence directionally. And that is actually used very frequently in RNA sequencing, for example, where you do a poly-A capture and you generate your library in such a way that, for example, the P5 side is always in the direction of the poly-A tail. I don't know by heart which side is which, but at least you then know, if you generate read 1, which strand of the RNA molecule it came from. And that can be very valuable information, for example if you are aligning your reads back to your genome and you have, for example, overlapping open reading frames, because then you can actually assign reads to a specific transcript. Especially in small genomes, I think, for example Drosophila, that can be super valuable, and I'm pretty sure in other organisms as well, or plants in general. Michael, you do not have a question anymore, I suppose, right? Great. Thanks. Okay, a question. We talked about barcodes already quite a bit, so I hope you now have a pretty good idea of what these barcodes are. What is the purpose of barcodes in Illumina sequencing? Or indexes; they are basically synonyms. There we go. Very nice. So with barcodes it is indeed possible to sequence multiple samples in one flow cell. So let me ask, oh, there's a question from David. Yeah, I just wanted to ask if you are at some point going to talk about a direct comparison of the different sequencing technologies, like where their strong suits are, for example handling low quality DNA as an input; where you could point out: for this purpose or this application, this one is most of the time a little bit the better approach rather than the other one. Not specifically what you mentioned, I think. We will of course talk about the technologies and the general advantages and disadvantages of the technologies. It depends very much on the application and on the choices you make, especially if you talk about, for example, library preparation. But I hope that with the knowledge of the different sequencing technologies, there are already some hints there for specific problems and what you can do about them. For example, we just talked about this PCR right at the end of the library preparation: if you have low input, then you usually have to do a lot of PCR cycles, which can bias the distribution of the fragments, so some fragments get amplified more preferentially than others. So if your application requires no bias between fragments, so a very even distribution across your genome, for example, then low input, and thus a lot of PCR cycles, might actually not be a good idea. So by explaining these concepts, there might be answers to your questions, and if there's anything specific, we can discuss that later on; you can ask it on Slack or channels like that. Okay, thanks. Yeah. Okay, so Illumina has some limitations: it has relatively short reads, but very high throughput.
The maximum read length is about 300 base pairs, but usually the base quality, so the sequencing error rate, by the end of those reads is already pretty bad: you have already made quite a lot of mistakes at the end of those reads. And that gives some limitations, of course. We have been living in that Illumina world for the past, let's say, 10 years, and there were always challenges, for example on how to reconstruct repeats. If you have 150 base pair Illumina reads and you have a repeat that can be, I don't know, many kilobases long, then you have no idea how long that repeat actually is. That's very difficult and gives a lot of challenges during, for example, alignment. The same counts for isoforms: if you have different isoforms of a gene and you're interested in the differential expression of those isoforms, then Illumina usually gives you limited information, because you cannot get the entire transcript, so you do not know which isoforms are actually in your sample. The same counts for structural variation: for example, very big insertions or deletions, also together with short repeats, are very difficult to resolve. And if you're interested in which variants are inherited together over long stretches of sequence, then Illumina sequencing is also very limited. The same counts for genome assembly: if you work with a bacterial genome and you compare, for example, longer reads to short reads, then you have probably already seen that you can generate much nicer genomes with those longer reads, just because you have much more information on what the genome of that organism actually looks like. And of course, definitely the same counts, maybe even more, for eukaryotes, which often have very complex genomes with a lot of structural variation and repeats. Illumina, we have been paying you a lot of money over the last, let's say, 15 to 20 years: why can't you come up with a method that just generates longer reads? Well, the reason for that lies in the technology itself. What you very often find when you look at, for example, quality control plots of Illumina reads is that towards the end of the read the base quality declines, which means that there are more errors at the three prime end of a read. So as the sequencing goes on and on, the base quality becomes lower and lower and errors become more and more frequent, until at some point it becomes so bad that you have no idea what you are actually sequencing; it's basically just generating a random sequence. This is typical for Illumina sequencing.
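You can see this decline yourself by averaging the base quality per cycle across all reads, which is essentially what FastQC's per-base quality plot shows. A minimal sketch (the file name is a placeholder; Phred+33 encoding assumed):

```python
from collections import defaultdict

# Placeholder input: an uncompressed Phred+33 FASTQ file.
totals = defaultdict(int)
counts = defaultdict(int)

with open("sample.fastq") as fh:
    for i, line in enumerate(fh):
        if i % 4 == 3:                      # every 4th line is the quality string
            for cycle, ch in enumerate(line.rstrip()):
                totals[cycle] += ord(ch) - 33
                counts[cycle] += 1

for cycle in sorted(totals):
    print(cycle + 1, totals[cycle] / counts[cycle])
# Typically the mean quality drops towards the last cycles (the 3' end).
```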
And the reason for that lies in the spots. The spots are generated by bridge amplification, as you have seen in the movie: the same fragment is amplified by bridge amplification, so all the sequences that are very close together have exactly the same sequence; they are amplified copies of each other. That is required to get enough signal for the base incorporation to be picked up by the camera: if you would incorporate a base in a single sequence, you would never be able to pick it up; you need this bridge amplification to get enough signal. That means that the incorporation of bases has to happen at exactly the same time for all of those fragments that are within a spot, and because they are PCR products of each other, you expect that their sequences are all exactly the same. So for example, let's say your original fragment starts with a C; then for all of the copies a C gets incorporated and you get a clear, in this case red, signal. Then the next base is incorporated and you get the signal for, for example, an A, and so on and so on. But the process of base incorporation is not flawless; there are mistakes. For example, sometimes for one of the fragments a base is not incorporated for some reason, so it starts lagging behind, or maybe two bases are incorporated, so it actually runs ahead of the rest. And of course, if it lags behind over here, or runs ahead, a completely different base is likely to be incorporated over there, so you get a mixture of signals. That is what you see over here on the bottom right: an out-of-phase signal. These base incorporation errors build up towards the end of the read: the further you sequence, the more errors build up and the more mixed the signal gets, until it's so bad that the camera cannot figure out anymore whether the color it's measuring is actually red, yellow, or green. Then the base quality becomes very low and the error rate becomes very high, because the base caller is just not certain whether the base it thinks it sees was actually the base in your original fragment, yes or no. Illumina of course worked on that, to improve it as much as possible, but at some point, because you have multiple fragments together, you always get this out-of-phase signal. So the technology is therefore always limited to shorter reads. The third generation sequencers tried to solve this by maximizing the signal from a single molecule base readout, which means you do not do this bridge amplification anymore. What you try to do is get the signal, for example from base incorporation, from a single molecule, not an amplified one, and be able to actually measure that. Then you do not have this bridge amplification anymore and you do not rely on base incorporation in multiple fragments at the same time; you only look at the base incorporation, or the signal, from a single molecule, and therefore you do not get an out-of-phase signal. If there's a mistake, of course you see that mistake, you measure it, but the base after it does not depend on the mistake that was made before, as it does in Illumina sequencing. There are two widely used platforms that can do that, single molecule based readout, and they are based on very different concepts, which is pretty cool I think: we have PacBio single molecule real-time sequencing and we have Oxford Nanopore Technology. Nanopore technology is not based on the incorporation of a base, so it's not based on sequencing by synthesis, but on changes in electrical current when a molecule moves through a pore. The molecule moves through a pore, and the electrical current that is measured over that pore changes based on the sequence that actually moves through it. All of those nucleotides cause a different change in electrical current, and based on the differences in that electrical current that are measured, you can actually read your sequence.
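As a toy picture of that current-based readout (all numbers invented; real basecallers use neural networks on noisy continuous signals, not a lookup table): each k-mer sitting in the pore produces a characteristic current level, and basecalling means mapping measured levels back to bases:

```python
# Invented k-mer -> current table; the molecule advances one base per event,
# so successive k-mers overlap by k-1 bases.
kmer_current = {"AAA": 80.1, "AAT": 92.3, "ATG": 101.7, "TGC": 66.4}

def call_bases(measured_levels, model=kmer_current):
    """Pick, for each measured level, the k-mer with the closest model current."""
    seq = ""
    for level in measured_levels:
        kmer = min(model, key=lambda k: abs(model[k] - level))
        # after the first event, each new event contributes one new base
        seq = kmer if not seq else seq + kmer[-1]
    return seq

print(call_bases([80.0, 92.8, 101.0, 67.0]))  # -> "AAATGC"
# With noisy levels, nearby k-mers get confused: that is where the
# single-molecule error rate comes from.
```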
Oxford Nanopore Technology is also very well known for scalability: it has machines on both ends, so you have very small sequencers that generate not a lot of bases, and you have huge ones like the PromethION, which has the highest throughput in the field. The accuracy, however, is still not the same as Illumina sequencing: it still makes quite a few errors, so you're looking at 95 to 97% accuracy, though that is increasing more and more. Nowadays they're claiming Q20 protocols, base quality 20, which would mean an accuracy of 99%, so one mistake per 100 bases. But still, for a lot of applications that is accurate enough. Then PacBio sequencing, which is based on base incorporation, so very different from Oxford Nanopore Technology and in a way a bit similar to Illumina sequencing; however, it is based on incorporation in a single molecule. They try to amplify the signal coming from the base incorporation by putting the template into very small wells, called zero-mode waveguides. At the bottom of each zero-mode waveguide a polymerase is attached, and your sequence moves through the polymerase; you then just measure the signal that comes from the zero-mode waveguide. It works on circular molecules, and that has quite some advantages, which I will explain later on. The accuracy is about 90%, I think by now a bit higher, but still definitely not comparable to Illumina sequencing if you have a single pass through that zero-mode waveguide. However, what is possible with PacBio sequencing is to generate HiFi reads. What you do then is sequence the molecule not once but multiple times, and because the errors are completely random, if you make a consensus out of all the sequences you generate from the same molecule, you can actually get a very high base quality, so a very high accuracy. It works like this: you have your fragment, usually relatively long, let's say somewhere around 50 kilobases, so 50,000 base pairs, much longer than Illumina fragments. Then adapters are attached that actually make it possible to create a circular molecule, and that circular molecule moves through the polymerase at the bottom of the zero-mode waveguide. Because it passes that polymerase multiple times, you sequence it multiple times, so you get multiple readouts of the same fragment, and therefore you can combine those into one and get a very high accuracy. At some point the polymerase is exhausted, and then the sequencing stops. Therefore, if you want very long reads with PacBio sequencing, you usually have only a few passes, or maybe only one; if you sequence shorter reads, you get more passes and therefore higher accuracy. In nanopore technology it doesn't really matter how long the reads are, so if you want ultra-long reads, Oxford Nanopore Technology is most likely the technology to choose.
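To see why those extra passes help so much, here is a small, idealized Python model (my own illustration, not PacBio's actual algorithm): it assumes errors are independent and random per pass, which is the key property mentioned above, and asks how often a simple majority vote still gets a position wrong:

```python
from math import comb

def consensus_error(per_pass_error, passes):
    """Probability that a majority vote over independent passes still calls
    the wrong base. Idealized: errors are independent, and all erroneous
    passes are assumed to agree, which overestimates the true error."""
    p = per_pass_error
    threshold = passes // 2 + 1  # votes needed for a (wrong) majority
    return sum(comb(passes, k) * p**k * (1 - p)**(passes - k)
               for k in range(threshold, passes + 1))

# A single pass at ~10% error improves quickly with more passes:
for n in (1, 3, 5, 9):
    print(n, "passes ->", f"{consensus_error(0.10, n):.2e}")
```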
Okay, I think we're almost at the end of the presentation. Yeah, my question is: does the third generation, especially the nanopore, need an amplification step in the preparation phase before you do the actual sequencing? Not necessarily. The standard library prep requires a PCR amplification step, but there are ways to circumvent it; you then still require quite a bit of input material. For example, there are two library preparation methods that are frequently used. One is direct RNA sequencing, where you really directly sequence RNA, so not cDNA, and therefore there's also no amplification step in there. That can be very interesting if you are interested in base modifications, because you can also pick those up with nanopore technology; so if you're interested in base modifications of RNA molecules, you can use that method. Another method that is gaining popularity is that you actually use CRISPR to cut out a piece of your genome, which can be basically any length, that can then be sequenced by Oxford Nanopore Technology, so you do targeted sequencing without using PCR. You usually need quite a lot of input DNA for it, but you do not have any PCR steps in there, which can be quite interesting if you are worried about PCR bias in your amplicons, for example. How much nucleic acid do I need for nanopore sequencing, just to give a rough estimation of the input required? Yeah, that changes all the time. We recently tested this CRISPR protocol, and then you really need five micrograms of genomic DNA, so that's really a lot. There are also low input library preps for Oxford Nanopore Technology, and then you can go down to the nanograms. So it depends on the kit you are aiming to use. Thanks. If you use low input, obviously there are PCR steps, of course. Very good. So, this is just something that I mentioned in the presentation: what kind of invention led to the ability to sequence those very long reads? Okay, most of you have answered, I'll stop. All right, so most of you are right: most answered differentiating bases at a single molecule scale, and that's exactly the case. Both Ion Torrent and Illumina sequencing depend on a PCR step on the same fragment to increase the signal intensity enough to be able to pick it up. Single molecule sequencing, or long read sequencing, always reads a single molecule. It is not the case that the chemistry doesn't get exhausted: if you look at sequencing by synthesis, even for PacBio sequencing, at some point the polymerase gets exhausted and therefore stops; also the pores in Oxford Nanopore Technology get exhausted at some point, so that you cannot use them anymore. So that kind of thing would still be a limit for single molecule sequencing. As for unlimited amplification: I think there is always a limit to PCR, so that's also not the correct answer. The correct answer was: differentiating bases at a single molecule scale. Last question before the break: why are error rates for long read sequencing relatively high? Okay, I think everyone has answered. We have a tie between "differentiating between nucleotides at a single molecule scale is challenging" and "longer sequences have a larger chance of PCR errors". Longer sequences having a larger chance of PCR errors: in a way that is true, but in many cases, or, I'm not sure actually, in many cases that is not really the limit, and it also does not cause these lower accuracies. The lower accuracy is really about the readout of the bases, and that is challenging because you are trying to pick up the signal of a single molecule.
That is super tiny, of course. So it is very challenging to pick up this very tiny signal and convert it into a base, into an A, T, C, or G. That's the most challenging part of long read sequencing.