Welcome back, everyone, also if you're watching this on Moodle or on YouTube. In the first part, we talked a little bit about the assignments and about the command line, and we did some terminology. So I'm not going to go through this slide anymore; we'll just go to the next one. All right, like I told you guys, if I were not doing bioinformatics, I would be doing history. So when does the history of DNA start? The history of DNA starts in 1869, with a discovery by Friedrich Miescher. He discovered something which he initially thought was a new type of protein. But the thing was that this protein had a massively high phosphorus content. So when measuring, he found that there was much more phosphorus than you normally find in proteins. And the weird thing was that the substance was actually resistant to proteolysis. So when he took an enzyme which breaks down proteins, this enzyme did not break down the substance that he found. And he called this substance nuclein, because he found it inside the nucleus of human white blood cells. So that's the first step in the discovery. Then in 1919, the Russian-born biochemist Phoebus Levene proposed that nucleic acids were composed of a series of nucleotides. The chemical properties of the substance were unknown at that point, so people started doing experiments to figure out the structure of this nuclein. And what they figured out is that it was similar to proteins, because proteins are built from 20 different amino acids, which are strung together. And they found out that the nuclein was actually composed of more or less four different substances, and that these four different substances all contain a sugar molecule and a phosphate group. So that's the next step in the discovery of DNA. So first is finding it, thinking, oh, it might be a new protein. Oh, we can't cut it. That's interesting.
But at this point, they were aware that it was not a protein, but a different biochemical substance. So big discovery in 1919. And of course, if you remember the previous lecture, this is after the theory of chromosomes. So there's a parallel development. On the one side, you have the biochemistry track. And on the other side, you have more or less the track from Mendel and people like Thomas Hunt Morgan, who worked out in a mathematical kind of way how inheritance works. So the next step was actually in 1950; nothing much happened, biochemically speaking, in the 30 years in between. In 1950 Erwin Chargaff made a very interesting find. He broke down DNA molecules and then started counting how many of the different types of nucleotides there were. And he found that the number of adenines is very similar to the number of thymines, and the number of Gs in the sequence is very similar to the number of Cs. By doing this in a lot of different experiments, he figured out that, well, that's interesting, because every time I have 100 As in the mixture after breaking it down, I also find 100 Ts. And of course, with this discovery, it didn't take very long before people started figuring out what DNA actually looks like. So in 1953, we have James Watson and Francis Crick, who worked from an X-ray diffraction experiment. In such an experiment you crystallize DNA, and when you shoot X-rays at this crystal, you can determine more or less the 3D structure of it. And there is this famous photo called Photo 51, which is on the next slide, which actually led to the discovery of the helical structure of DNA. So with this photo combined with the knowledge that Erwin Chargaff developed, with the number of As being similar to the number of Ts, all of a sudden they were able to figure out, oh, but it's a double helix. Every A is coupled to a T.
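As a small illustration of why Chargaff's counts fall out of base pairing, here is a minimal Python sketch. It assumes perfect Watson-Crick pairing; the example strand `GGGATTC` and the function names are made up for illustration and do not come from the lecture slides.

```python
from collections import Counter

# Watson-Crick pairing: every A pairs with a T, every C with a G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand):
    """Return the opposite strand, written 5 prime to 3 prime."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def duplex_base_counts(strand):
    """Count the bases over both strands of the double helix."""
    return Counter(strand + reverse_complement(strand))


counts = duplex_base_counts("GGGATTC")  # hypothetical example strand
print(counts["A"], counts["T"], counts["G"], counts["C"])
```

A single strand on its own does not need to have equal numbers of A and T; the equality only appears when you count over the whole double-stranded molecule, which is what Chargaff was breaking down.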
Every C is coupled to a G. And it just winds around. And of course, nowadays, we know that DNA has two strands. One strand runs from 5 prime to 3 prime, and the other one runs kind of opposite to it. One of them is called the positive strand, so the coding strand. And the other one is called the negative strand or the template strand. But how they figured this out, and I always find this amazing, is based on this photo. So it was this photo, combined with the knowledge of having equal amounts of As and Ts and of Cs and Gs, that actually led James Watson and Francis Crick to the structure. So James Watson and Francis Crick, indeed, not the other way around. I always say James Crick and Francis Watson, which is wrong. But Rosalind Franklin and Raymond Gosling actually made the photo. So when this photo was shown to Watson and Crick, they directly kind of put all of the things together and figured out, oh, but if this is the diffraction pattern, then it has to be a helical structure. And that is because of the kind of structure that you see here, where you have like an X-shaped pattern. So this is one of the most famous photographs in modern biochemical history, Photo 51. So what we learn is, of course, that it is a helix, right? And we know that it's right-handed, but a helix can come in many different forms. So when we talk about DNA, we generally talk about standard DNA, the form described by Watson and Crick, which has a distance of about 3.4 nanometers from one turn to the next. You have the minor groove, in which the two strands are relatively close together, and you have a more open section, which is the major groove. So there are two grooves. But if you look very closely at DNA, then there are actually three different types of DNA in existence. So you have A-DNA. It is a short, tightly wound right-handed helix, right?
So here one turn is about 2.8 nanometers, compared with the 3.4 nanometers of standard DNA. And what you can see is that you have tightly wound DNA. And this tightly wound DNA, so the A form of DNA, is not the type of DNA that Watson and Crick were looking at. This is DNA which has been wound more tightly than normal. You see the two strands in two different colors, and you see here that the strand is kind of overcoiled, so it kind of fuses into itself. Then the structure that was proposed by Watson and Crick is now called B-DNA, which is this form. And in this form, you see that it has the same winding direction, so it is a right-handed helix. But here you can clearly see that there's a minor groove, where the two strands are relatively close together. And then you have the major groove here, where the two strands are relatively far apart. And this is important for the functioning of DNA, because things like proteins mostly interact with the DNA in the major groove and not in the minor groove. So when you have a protein binding to a DNA molecule, it generally binds here. And it generally cannot bind here, because the groove here is not big enough for the protein to touch the DNA properly. And then we also have Z-DNA, which is very different from A-DNA and B-DNA, because it is a left-handed helix. So it's the same two DNA strands, but instead of having a right turn, it now has a left turn. And this is very uncommon, but there are some species which have Z-DNA in their genome. And this, of course, has all kinds of biochemical implications. But just so you guys know, it's not just one type of DNA. You always see the model of DNA being perfect, right? Because in pictures, they always show you this one picture of how DNA looks, so the right-handed form, where 5 prime to 3 prime goes right-handed while the other strand winds around it. But there are actually three different types of DNA.
So, a little bit of the history of sequencing.
So the first sequencing was done by Maxam and Gilbert, who are two different people, Maxam and Gilbert. And they developed this Maxam-Gilbert sequencing method to determine the order of the bases in DNA. The next development is the chain termination methods, which we will discuss in great detail. Then in 1979, we have whole-genome shotgun sequencing, which was a big advance. Because instead of having to cut out DNA, put it in bacteria, replicate the DNA to large amounts, and then sequence it, the idea here was that you first take your DNA from a cell, you cut it up into small pieces, and then you amplify the small pieces, after which you sequence them. And this is a much faster methodology than first putting it into bacteria, growing the bacteria, and then extracting large amounts of DNA. Although the chain termination methods generally generate longer reads, so the fragments of DNA that you sequence are bigger compared to whole-genome shotgun, right? Because shotgun means that you chop it up into little pieces, and these pieces are just little compared to the chain termination method. In 1986, the first semi-automated sequencer was developed. And this meant that individual labs could actually start doing sequencing by just buying one of these machines. You would just extract your DNA, you would put it into the machine, and the machine would more or less tell you the sequence afterwards. Then in 1998, Phred scores were developed. Because, like I told you guys, everything which comes out of a sequencer is called a read, and a read does not only have a base pair order, but every base pair also comes with a certainty. And this certainty is encoded using a Phred score. And there's a slide about Phred scores as well, I think.
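To make the idea of a per-base certainty concrete: a Phred score Q relates to the probability P that the base call is wrong as Q = -10 * log10(P). The sketch below shows that relation in Python. The `offset=33` default assumes the common Phred+33 ASCII encoding used in FASTQ files; the lecture does not say which encoding is meant, so treat that as an assumption, and the function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_quality_string(quals, offset=33):
    """Turn an ASCII quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quals]


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

So a score of 20 means a 1 in 100 chance that the base is wrong, and a score of 30 means 1 in 1,000.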
In 2000, we started doing massively parallel signature sequencing, which allows you to scale up the sequencing process by not just sequencing a single DNA fragment, but by sequencing thousands or even millions of fragments at the same time. And this is very similar to what happened with microarrays. So microarrays were miniaturized, and at the same time sequencing was miniaturized as well. So instead of having an Eppendorf tube where you would get a single sequence from that tube, what you would do is put little drops on a little glass plate and get these little clusters. These little drops that you put on the glass plate are then individually sequenced. So this, of course, allows you to scale up. And in 2004, we have the 454 Life Sciences parallel pyrosequencing. And we will talk a little bit about pyrosequencing, because it's different from whole-genome shotgun sequencing, so it's an interesting sequencing technique to talk about.
So the first sequencing technology is Maxam-Gilbert sequencing. Maxam-Gilbert sequencing is an interesting sequencing technique because it reads the sequence directly from a gel. What you do is you chemically treat your DNA, which generates breaks at a small proportion of one or two of the four nucleotide bases in each of the four reactions. So you take your cup, which has the DNA. You split it into four different cups. And then you do the chemical treatment, and the chemical treatment cleaves the DNA at certain points. So it cuts the DNA, and then using the sizes of the fragments, you can start sequencing. So for example, here we have a little bit of a DNA sequence: G-C-T-A-C-G-T-A. And there is a 32P at the start, so there's a label there, which is a phosphorus atom that is radioactive. So in one of the cups, the A plus G cup, you add a chemical, and this chemical cuts only where there is an A or a G. So the first time that it cuts is here at the A. So you get a fragment, which is G-C-T.
Then the next time that it cuts, it cuts here at the G, so the G after the C. And then you get a fragment which is five base pairs long. And then it also cuts at the A at the end, so you also get almost the whole fragment, which is seven base pairs long. So this is one of the reactions. What you do is you just cut the DNA, and then you bring it onto a gel. And what the gel does is separate DNA based on its size, right? So what happens is that you get a band at seven, because it cut at the last A. You get a band at five, because it cut here at the G. And you get a band at three, because it also cut here at this A, right? And you do the same thing with a chemical which cuts only at G. We have another chemical which cuts at C. And we have another chemical which cuts at a C or a T. So you have four of these cups. All of these cups are then brought onto a gel, and you just pull the fragments down using electricity. So it pulls the DNA towards the positive pole, because DNA is negatively charged, and because larger fragments have more resistance, they travel slower. So what you will see is that, based on this, we can now do the sequencing, right? If we look at the fragments of length one, then the two bands that we got were in the C lane and in the C plus T lane, right? So here we can figure out that, well, we have two bands, so that means that this is actually a C, right? Because if it had been a T, then of course there would only have been a band in the C plus T lane. We see that here. So here we see the second base, the one after the G, right? Because the G itself, so the first base pair, is kind of gone.
Yes, so what we can see is that indeed, using this kind of four-cup reaction, with four different chemicals which cut either at two bases or at one base, we can figure out the sequence, right? So we can see that the first base pair is a C, the second base pair is a T, the third base pair is an A. And this was done on a very, very dense gel, polyacrylamide rather than the agarose we normally use. Nowadays we use agarose gel of around 1.7%, because we want to see fragments which are like 200 to 300 base pairs long. But here, since you want single base pair separation, what they used was a much denser gel, and also a lot more voltage. Normally when you run something in the lab on an agarose gel, you put like 150 or 200 volts on the gel, but these gels were run at several thousand volts, so relatively dangerous. And of course, the way to make it visible was by using radioactive elements as well. But it was the 1970s, so that was perfectly fine. Nowadays we don't do it like this anymore. Remember that the original sequence needs to be read from the bottom to the top. Any questions about this? Do you think I explained it well? I always find this a little bit difficult, right? For me it's really difficult to see if you understand it. But I can promise you now that there will be a Maxam-Gilbert sequencing gel on the exam, right? So you just get a gel like this, without the letters here and without the numbers here, and at the bottom or at the top it will just say which two base pairs or which base pair was cut. Then you guys need to be able to figure out the base pairs, and remember that when you write down the sequence, you write it down from the bottom to the top. So the sequence here, what was sequenced, is C, T, A, C, G, T, and A.
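Since reading such a gel is exam material, here is a small Python sketch of the logic, using the example sequence from the slide. It is a simplified model under stated assumptions: the label sits at the 5 prime end, a cut at a base leaves a labelled fragment containing everything before that base, and every possible cut shows up as a band. The lane names and function names are my own, not from the slide.

```python
# The four reaction cups and the bases each chemical cuts at.
LANES = {"G": "G", "A+G": "AG", "C": "C", "C+T": "CT"}


def run_gel(sequence):
    """Return, per lane, the set of labelled fragment lengths (the bands)."""
    gel = {lane: set() for lane in LANES}
    for bases_before_cut, base in enumerate(sequence):
        if bases_before_cut == 0:
            continue  # cutting at the first base leaves only the label: no band
        for lane, targets in LANES.items():
            if base in targets:
                gel[lane].add(bases_before_cut)
    return gel


def read_gel(gel):
    """Read the gel from the bottom (shortest fragment) to the top."""
    calls = {
        frozenset({"G", "A+G"}): "G",    # band in both purine lanes
        frozenset({"A+G"}): "A",         # band only in the A+G lane
        frozenset({"C", "C+T"}): "C",    # band in both pyrimidine lanes
        frozenset({"C+T"}): "T",         # band only in the C+T lane
    }
    lengths = sorted(set().union(*gel.values()))
    sequence = []
    for length in lengths:
        lanes = frozenset(lane for lane, bands in gel.items() if length in bands)
        sequence.append(calls[lanes])
    return "".join(sequence)


gel = run_gel("GCTACGTA")
print(sorted(gel["A+G"]))  # bands in the A+G lane
print(read_gel(gel))       # sequence read from bottom to top
```

For the slide's sequence the A plus G lane has bands at 3, 5 and 7, and reading bottom to top gives CTACGTA, with the first G missing, exactly as on the gel.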
And that of course is the original sequence here, right? And again, DNA is always written from five prime to three prime. All right, so there are no questions. Next sequencing technology.
So pyrosequencing is another way of sequencing. Instead of splitting your sample into four different samples, having different chemicals which cut the DNA, bringing it onto a gel, pulling it down, seeing the sizes of the different fragments, and then, based on the sizes, figuring out which base pair was cut at which point, pyrosequencing is slightly different, because here you use nucleotides. So you add nucleotides to a mixture which contains the polymerase, ATP sulfurylase and luciferase, right? And what happens is that when the polymerase incorporates one of these nucleotides, it gives a little bit of light because of the luciferase reaction. So what you do is you take your cup, and then you add the A to your cup, and you see that there's no light flash. So you know there was no T there, right? Because you always look at the opposite base pair. The next step is you add a G to the reaction mixture, and then you get a little peak of light, right? So now you know, because I added a G, that the position that the polymerase was at contains a C, right? The opposite. Then you add the T, no flash of light. You add the C, no flash of light. So now, after adding all four of them, we know that the first base pair of the sequence is a C. And then we just start all over again, right? So we add an A, and we get a little flash of light. We add a G, we add a T, right? We see no flashes of light. So now we know that the second base pair is a T. When we add a C we get a flash of light, right? So the next one is a G. And you can see that the order is always the same: always A, G, T, C, A, G, T, C, A, G, T, C, right?
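The add-one-nucleotide-and-look-for-light loop can be sketched in a few lines of Python. This is an idealized model, assuming a fixed A, G, T, C dispensing order, a noise-free detector, and a light signal exactly proportional to the number of bases incorporated; the function names are illustrative only.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
DISPENSE_ORDER = "AGTC"  # always the same order, as in the lecture


def pyrogram(template):
    """Return (dispensed nucleotide, light units) for each addition."""
    flows = []
    position = 0
    flow = 0
    while position < len(template):
        nucleotide = DISPENSE_ORDER[flow % len(DISPENSE_ORDER)]
        incorporated = 0
        # The polymerase keeps going as long as the added nucleotide
        # pairs with the next template base (this gives double peaks).
        while position < len(template) and COMPLEMENT[template[position]] == nucleotide:
            incorporated += 1
            position += 1
        flows.append((nucleotide, incorporated))
        flow += 1
    return flows


def template_from_pyrogram(flows):
    """Read the template back: a flash means the opposite base was there."""
    return "".join(COMPLEMENT[nucleotide] * light for nucleotide, light in flows)


print(pyrogram("CTG"))
print(template_from_pyrogram(pyrogram("CTTG")))
```

For the template C, T, G this reproduces the walk-through: no light for A, a flash for G, nothing for T and C, then a flash for A, and finally a flash for C.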
So based on this we can now see what the sequence is, because the polymerase only moves forward when it can incorporate the nucleotide that you added. And of course the leftover nucleotide needs to be broken down, which is done by apyrase, right? Because it's a luciferase reaction, you have to quench the reaction every time after you add a base pair. So you add a base pair. You look to see if there's a flash of light. And then you quench it, because you don't want the As to stay in the reaction mixture, right? And of course, sometimes if there are two Ts in a row or two As in a row, you get like a double light flash. So on your oscilloscope you have to see how high the peak is, and depending on the height of the peak you can also figure out if there are two Ts in a row, or three Ts, or four Ts. So this is the way that pyrosequencing works. The pyro comes from the pyrophosphate which is released on incorporation, and luciferase turns that into light so you can figure out the next base pair. And you just keep adding all four of them. If you see a flash of light, you know that the opposite base pair was in the template.
Sanger sequencing. So Sanger sequencing was more or less the common way of doing it like 20 years ago. This is not a very high throughput mechanism, but it is a very accurate one. And a lot of things nowadays are still sequenced using Sanger sequencing. So if you do a PCR, you get a little fragment of DNA, and you send it to a sequencing company, then the sequencing company will not do next generation whole genome sequencing; they will generally do Sanger sequencing. So how does Sanger sequencing work? Well, first you make your reaction mixture. So you have your primer and your DNA template. You add DNA polymerase, you add ddNTPs with a fluorochrome, and you add dNTPs, right? So the difference is that when a ddNTP is incorporated by the polymerase, the polymerase cannot continue, because a ddNTP doesn't have a free 3 prime hydroxyl group, so the next base pair cannot be attached. With dNTPs it can.
So if a dNTP is incorporated, then no label is added, but the reaction can still continue. So what you do is you take your primer and you take the template. The template is the one that you want to sequence. The primer is the beginning of the sequence, which you should know. So the first thing which happens is that the primer anneals to the template, and then it's just a basic PCR reaction in a way, and some fragments will be elongated. So the polymerase will bind to the primer on the template and it will start elongating. But the elongation will randomly take a ddNTP or a dNTP. So if it first picks a ddNTP, right, then this ddNTP will directly terminate the reaction. So you get a fragment, and this fragment is labeled with a certain color. The next fragment incorporated one normal dNTP, so one normal base pair, and then it incorporated one of these terminating base pairs. Since this happens randomly, you will get fragments which are all of different sizes. So you will get fragments which are one base pair longer than the primer, and you get fragments which are 10, 20, 30, 40, 50, sometimes a hundred base pairs longer, right? Because every time there's a chance that a normal dNTP gets incorporated and the reaction continues, and sometimes you incorporate one of the ddNTPs, which will add a fluorophore there, so a color, a marker, and which will stop the reaction at this point. So what then happens is, you have a single cup, you added all of this stuff, right? You then did a PCR elongation, so you raise the temperature to around 70 degrees. The polymerase starts working and starts producing all of these fragments. So in the end, in your reaction mixture you have all of these different fragments, each colored in a certain way and each of slightly different length. And of course you don't have one of each, but you might have 10 of these, 20 of these, five of these, and so on.
So what you then do is you run this through a capillary gel. So you run this through a very small tube which has a gel in there, which then separates the fragments by size, right? Because the smaller fragments will run faster than the longer fragments, they will hit the detector first. So you have a little laser at a certain point along this capillary, and this laser will excite the fluorophore, and the fluorophore will then give you the color, and this is detected. And what you then see in the end is a chromatogram like this, because one by one the fragments go through, smaller fragments first, longer fragments later, and one by one you will get a color, so you will see an intensity. So if there is for example a G there, then you will get an orangey color. If there's an A here, you will get a greenish color. So those are the three main sequencing-by-hand techniques, right? You can do these in the lab, although Maxam-Gilbert sequencing is not allowed anymore: since you use radioactive molecules, you're not allowed to do it in a standard S1 lab, so no one uses Maxam-Gilbert sequencing anymore. Pyrosequencing is still used sometimes for very small fragments, because it's relatively cheap and you can do it yourself. And Sanger sequencing is generally what you get if you just send in a fragment to a company, although we still have one of these machines upstairs. So it's nothing more than a very small tube which has a gel in there, and then the molecules are just pulled through the gel by electrophoresis. And then here you have a laser and a detector, and that's it. So it's a very cheap machine to build yourself. You can actually build one at home if you want. And then here, of course, based on the detector, you get this chromatogram based on the color. So these are kind of the classical ways of sequencing, and Sanger sequencing is still used a lot.
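The chain termination idea, random stops followed by sorting on size, can be simulated with a short Python sketch. The assumptions here are mine and not from the lecture: each position has a fixed 20% chance of picking a terminating ddNTP, the input is the newly synthesized strand rather than the template, and the detector is perfect. A fixed random seed keeps the run reproducible.

```python
import random


def sanger_fragments(new_strand, n_molecules=10000, p_terminate=0.2, seed=0):
    """Simulate extension products as (bases past the primer, dye colour)."""
    rng = random.Random(seed)
    fragments = []
    for _ in range(n_molecules):
        for length, base in enumerate(new_strand, start=1):
            if rng.random() < p_terminate:
                # A labelled ddNTP was picked: the chain stops here and
                # the fragment carries the colour of its last base.
                fragments.append((length, base))
                break
        # Molecules that never pick a ddNTP stay unlabelled and invisible.
    return fragments


def read_capillary(fragments):
    """Shortest fragments reach the laser first; read the colours in order."""
    colour_by_length = {}
    for length, base in fragments:
        colour_by_length[length] = base
    return "".join(colour_by_length[length] for length in sorted(colour_by_length))


fragments = sanger_fragments("CTACGTAGGCTA")  # hypothetical synthesized strand
print(len(fragments), read_capillary(fragments))
```

Because termination is random, short fragments are far more common than long ones, which is one way to see why the signal fades towards the end of a real Sanger read.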
So nowadays we always talk about next generation sequencing. And next generation sequencing is actually used for four different types of sequencing. One of them is resequencing, right? So if you sequence a human, then you will use next generation sequencing, because you're trying to determine millions or billions of base pairs, right? You cannot do this by Sanger sequencing or by any of the other methodologies, because there's always a limit to how big your fragments can be and how long you can wait. So, four different types of next generation sequencing. First, resequencing of known samples. So if you know that it's a human, then you chop it up into little pieces and you start sequencing it. We nowadays have transcriptome sequencing, which is called RNA-seq, right? So if you are not interested in DNA and the order of DNA molecules, but you're interested in RNA and the order of RNA molecules, because for example you're doing virology and the virus that you're working with is an RNA virus, then you would do RNA sequencing. And in humans, of course, if you want to measure the activity of genes, you can also use sequencing to measure the messenger RNA content in a cell. Furthermore we have ChIP-seq, which is for when you are interested in where proteins bind to the DNA. So you first use a chemical which fixes every protein to the DNA, then you fragment the DNA, and then you use an antibody which targets the protein of interest to fish out your pieces of interest, because you want to know where this protein bound the DNA. So that is called ChIP-seq. And nowadays we also do this for epigenomics, so for the state of the genome, right? Because the genome is As, Cs, Ts and Gs, but these As, Cs, Ts and Gs are not all equal, right? You can have a C which is unmethylated, or you can have a C which is methylated.
So if you want to check out the epigenome, if you want to know which base pairs in the genome have been methylated and which are unmethylated, then you can also use next generation sequencing, generally called bisulfite sequencing, because you use a bisulfite treatment to do this. So, four different types of fields where next generation sequencing is used. Just sequencing DNA, or resequencing it. RNA-seq to measure, not really the activity, but the amount of messenger RNA in a cell. You can do DNA-protein interactions, to figure out where a protein binds the DNA, by fixing the protein to the DNA and then using an antibody to pull out these sections. And you can use epigenomic sequencing, which is often determining the methylation state of the genome, so looking if a C is methylated or if it's unmethylated.
So the next generation sequencing methods that we will go through one by one are real-time sequencing, Ion Torrent sequencing, and pyrosequencing, because there is a next generation sequencing method based on pyrosequencing as well. There is sequencing by synthesis, which is the most common sequencing technique at the moment. This is what the big Illumina sequencers do. We have sequencing by ligation, also called SOLiD sequencing. And then we have chain termination sequencing, which is kind of the multiplexing of the original Sanger sequencing method. All of them will have a lot of things in common. But real-time sequencing from Pacific Biosciences is the new hot technology, because it allows you to do single molecule sequencing, and the way that it works is kind of like this. If you want to do real-time sequencing, what you have is a single polymerase molecule on a glass plate. So this is a hundred nanometers. You have a glass plate and there are all kinds of little holes in there, the wells.
And in each of these holes there is a single polymerase, which is already tricky in itself, right? How do you get a single molecule there? And how do you determine that there's only one and not two? But what happens is that each of these little holes is a reaction mixture. So one single DNA strand is introduced into each of these wells, and then you again use these hexaphosphate nucleotides, so nucleotides which have a label, right? A fluorescent label. And because you give all four of these nucleotides in the reaction mixture together with your DNA, the polymerase will start extending the DNA. So here, for example, there's a G, but the G is not the next base pair, so it will not be incorporated. And here the next one happens to be a G, right? So if the G is incorporated, the polymerase will bind this new G to the extending DNA strand. And by doing that, it will release this labeled phosphate, and this is seen as a little fluorescent pulse. And you will get these pulses very quickly, because normal polymerases do around 1000 base pairs in like one minute, right? So in a single minute you get 1000 of these flashes, and the color of these flashes will tell you which base pair was incorporated. And this is a very novel technology in a way, because it allows you to sequence a single DNA molecule. So if you do single cell sequencing, where you only have like two copies of the same genome, then this is a way to still get information. But of course, there's a lot of noise here, because the polymerase goes really fast, so it goes like a Pac-Man thing. And every time that it closes it adds one base pair and you get a little flash of light, and then the next one is done already. So in this picture, of course, the time axis is very much stretched out.
You have something like 1000 flashes per minute, and of course you need a computer to monitor which one of these flashes is which. And you don't do this in a single well, but you do this literally in hundreds and hundreds of wells at the same time. So it's a very novel technique which allows you to sequence a single DNA molecule, and it also allows you to do single cell sequencing. So just as a little summary, for each of the sequencing technologies I made one of these little slides. So this is single molecule real-time sequencing. It's developed by Pacific Biosciences. The read length is really, really long, so you can get 10,000 to 15,000 base pairs in a row. Because this thing does like a thousand per minute, and you can keep the reaction going for like 15 minutes, sometimes even up to 40 minutes, but it is very uncommon to get 40,000 base pairs from a single sequencing run. So if you want to sequence one million base pairs, then that will cost you between 13 cents and 60 cents. The advantage here is that it has the longest read length of any NGS, so next generation sequencing, technology. The problem is that it has relatively moderate throughput, because in a single well you can only do a single molecule, and you need really, really expensive equipment. These glass plates with these little holes and these polymerases attached to them are really, really expensive to make. So the setup is really expensive, but once you have everything ready, then it's relatively cheap per one million base pairs. And if you imagine that the human genome is around 2.6 billion base pairs, you then have to multiply this by around 2,600 to figure out how much a single human genome will cost at 1x coverage, and not 10x coverage. Normally when you sequence, you want to have every base pair read at least 10 times to be really certain that the base pair is true.
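The cost estimate is just a multiplication, and a tiny Python sketch makes the steps explicit. It uses the numbers quoted in the lecture (a genome of about 2.6 billion base pairs, 13 to 60 cents per million base pairs, 10x coverage); the function name is made up for illustration.

```python
def genome_cost_usd(genome_size_bp, cost_per_megabase_usd, coverage):
    """Sequencing cost: megabases in the genome x price per megabase x coverage."""
    megabases = genome_size_bp / 1_000_000
    return megabases * cost_per_megabase_usd * coverage


low = genome_cost_usd(2_600_000_000, 0.13, coverage=10)
high = genome_cost_usd(2_600_000_000, 0.60, coverage=10)
print(f"${low:,.0f} to ${high:,.0f}")  # prints "$3,380 to $15,600"
```

The same function can be reused for the other technologies by plugging in their price per million base pairs.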
So this technology will cost you, if we take the highest price, 0.6 times 2,600 times 10. So if you want to sequence a human genome using this, it will cost you around $15,000 to $16,000. So relatively expensive equipment, and the base pairs are not that expensive, but still, one single human genome using this methodology will cost you around $15,000 to $16,000.
The next sequencing method is Ion Torrent sequencing, and I don't know a lot about this. I never worked with sequencing data from Ion Torrent. So I'm relatively a noob, so I took this nice little image. The way that it works is that you use a semiconductor chip which works as a very small pH meter, right? And again it's more or less the same as what we saw before, because you add nucleotides, and if a nucleotide is incorporated it releases an H plus, so a positively charged hydrogen. So a single proton is released by the reaction, and this proton you can measure using the chip, right? And if the next two base pairs are A A, right, so you add Ts to the reaction, then two Ts are incorporated, which will give you two of these hydrogens, so you would get like a double peak. So it's very similar to... Maintenance cost? You mean the maintenance cost of this one? This is really expensive, but the maintenance cost is included here in the price per one million base pairs, right? Because you buy the equipment once, so the cost of the equipment is negligible, but the maintenance cost is included in the price per one million base pairs. That's also why it's so variable, right? It depends if you do it in China or if you have it done in the US. And there's a wide range of costs for these kinds of things. So ion semiconductor sequencing, interesting sequencing technology, I never used it. I know that the chip is an ion-sensitive sensor, so it picks up the change in pH in each well.
And here you measure the positively charged hydrogen atoms that come out as a result of the sequencing. And again, here you follow the same strategy, right? So you add an A, you see if a hydrogen atom is released. You add a T, you see if a hydrogen atom is released. You add a C, you see. So you just do it one by one. One by one you add the nucleotides, and then in the end you get a signal of A's and T's and so on, depending on what was incorporated. So for the ion semiconductor method, which is Ion Torrent sequencing, the read length is very short. So you can get something like 400 base pair reads. The cost is around $1 per one million base pairs. I say here that it's less expensive, and that is because of the cost of the equipment, because a quadrupole is not that expensive, and you can do it relatively fast as well, because 400 base pair reads are also relatively fast to generate. The big issue with this whole technology is that you have these homopolymer errors. And homopolymer errors happen when you have a lot of T's in a row, right? So having five T's or having three T's or having seven T's, it's really hard to distinguish that, right? Because you have a certain peak when you have one proton expelled. If you have two protons expelled, you have a double peak, right? But if you have seven or 12 in a row, then it's very hard to see the difference between seven and 12 hydrogens being expelled. The machine is really, really accurate, but the accuracy is such that it can really accurately determine if you have one, two or three. It's very bad at determining if you have 50 T's in a row versus 51 T's in a row, because the peak will look very, very similar. So homopolymer run errors mean that if you are sequencing a fragment, and this fragment has the same base pair behind each other many times, then you are unable to determine how many there are. So you know there is a stretch of C's in a row somewhere, more than 30, right?
But I don't know if it's 31, it could be 35, it could be 32 as well, right? So that's a homopolymer error. So that means that you can determine the individual base pairs very well, but as soon as you have multiple identical base pairs behind each other, like five or 10 C's, it becomes really difficult to see if there are five or if there are six. So that's called a homopolymer error. So, next generation sequencing using pyrosequencing. We already discussed pyrosequencing, so this is just to show you guys in a bit more detail how it works. So what happens, and I also described it here: you have the first step where the polymerase is added, then you add one of the four dNTPs. So one of the dNTPs is incorporated. Then you have sulfurylase, which then recruits the luciferin, and then the luciferin gives you a little bit of light, and then the whole reaction is quenched using the apyrase. So the sulfurylase is the one that allows it to flash, and then the apyrase is the one that allows it to quench, so that the light is gone, because otherwise the reaction would just continue. So here in more detail: the first step is adding one of the nucleotides, and then DNA polymerase adds the correct one. The incorporation releases pyrophosphate, and then ATP sulfurylase converts this to ATP in the presence of blah, and so on, but you don't have to know this in detail. But it's the same thing as the pyrosequencing that we used to do in the old days, only now you have machines that don't do one of these reactions, but you have machines that do 10,000 reactions in parallel, right? So instead of sequencing one molecule and having a cup where you just add the things, you now have a machine, and this machine has something like 10,000 wells, and in every well the same reaction is taking place. So pyrosequencing is also called 454 sequencing. The read length is relatively good.
It's up to about 700 base pairs, but it is really, really expensive to do, because one million base pairs will cost you around $10. And that is because of the enzymes that you need, right? Because you need polymerase, which is kind of common. You always need polymerase to extend something, but you also need sulfurylase. You also need luciferase and you need apyrase. So these chemicals, these proteins, are really, really expensive. That makes this a very, very expensive sequencing technology. But for a while it was the standard, in a way. It gives you relatively long reads, right? 700 base pairs. It is very fast as well, but runs are very expensive. And again, this one also suffers from homopolymer errors. And this is again because you are using light, flashes of light, right? You cannot determine exactly if you have a flash with an intensity of 12 or a flash with an intensity of 13. So those are the homopolymer errors. They occur with any technology that has to read the number of incorporated bases from the strength of a single signal, whether that signal is light or hydrogens. So sequencing by synthesis is the standard nowadays. This is done in Illumina machines. So what happens is, first you prepare your library. So you take a very, very small amount of DNA, and this DNA is fragmented into short pieces. And then each of these pieces gets two adapters. So it gets an adapter on the five prime end and it gets an adapter on the three prime end. Then what you do is you put these onto little glass plates. So you have a glass plate, and those glass plates carry the complementary sequence of the library adapters, right? So you add a little piece of DNA, generally about 11 base pairs long. And then on this glass plate, the opposing 11 base pairs are there. So then you load your library onto the plate, and these little things, so the blue parts, will attach to the blue parts. And if you don't do single-end sequencing but paired-end sequencing, you also have a little complementary sequence for the other side.
And then what happens is that you get these little loops of DNA. So what then happens is again the same thing. You have a polymerase which extends it, and the polymerase builds in these four nucleotides, which are colored. And then, depending on if you do four-channel sequencing, you see these little clusters, because these clusters are grown. So you use polymerase first, or no, you use polymerase to extend the little clusters. So instead of having a single DNA molecule, you attach a single molecule and then you grow it into a forest of identical molecules, and then you sequence by going from one end to the other end. So you have a primer which binds the purple end and that just sequences down. So what happens is that if you look at the plate, it looks more or less like this, and this happens in cycles. So you do a single extension, then another extension. And so what you see is that this dot here is first green, which means a T, then you get a different color, which means a G, then you get a red color, which means a C, and so on. So this is the standard sequencing which we do now. Sequencing by synthesis, like I said, is the default. It's done by Illumina. Read length is around 50 to 300 base pairs, and it is the cheapest sequencing technology currently, because one million base pairs will cost you somewhere between five cents and 15 cents. That means that if we take the lowest price, right, and we calculate how much a human genome would cost at 10X coverage, so reading each base pair 10 times, then a human genome will cost around $1,300 at five cents. At 15 cents it will be three times as much, but generally costs are around five cents. So it is cheap, and you get high sequence yields, which means that you get a lot of reads. The problem here is that the equipment is relatively expensive and you have to start with relatively high DNA concentrations in the beginning, which generally is not an issue, right?
If you take tissue, but for forensics this might be an issue. For forensics, there are other technologies which are more expensive, but which allow you to start with a much smaller amount of DNA. And this is especially difficult in forensics, because you only have a little bit of DNA on a T-shirt that you recovered from the victim, and you can't get more, right? But for normal biomedical research, like sequencing a mouse, it doesn't matter, because you can get as much blood from the mouse to extract DNA from as you want. But for forensic science, it's not the best technology, because you're generally working with trace amounts of DNA and it needs a relatively high DNA concentration. All right, next one, sequencing by ligation. So sequencing by ligation is also something that I have never done, but what happens is again very similar. So you have interrogation probes. These probes have different colors on the end, right? So you have a probe which is, for example, TT, and the TT probe is colored blue, right? So you see that you have the first and the second base pair. Those are the base pairs that are introduced into the DNA, that are ligated to the DNA. So instead of using a polymerase, they use a ligase. And what happens is that two base pairs at a time are ligated to the DNA. So you use your primer, you use a ligase. Two base pairs are ligated to it, and then the newly incorporated probe is cleaved, right? Because we know the cleavage sequence. So you have the two base pairs that get incorporated. Then you have a sequence, which is the cleavage sequence, where it's cut off. And then once it is cut off, you have the color of the base pairs. Of course, little pieces remain. So what you get is actually not exactly the whole sequence, but every three or four base pairs you get to know what two base pairs were.
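Why several rounds with shifted start positions recover the whole sequence can be checked with a small sketch. The numbers here are assumptions for illustration: each ligation step is taken to read 2 positions and then skip 3, so the pattern repeats every 5 positions, and the function name `positions_read` is hypothetical.

```python
def positions_read(read_length, offset, read=2, skip=3):
    """Positions (0-based) that one round of ligation steps interrogates.

    offset -- position where the first probe of this round starts
    read   -- number of positions identified per ligation step (assumed 2)
    skip   -- number of positions left unread after each step (assumed 3)
    """
    covered = set()
    start = offset
    while start < read_length:
        for position in range(start, min(start + read, read_length)):
            covered.add(position)
        start += read + skip
    return covered


if __name__ == "__main__":
    read_length = 25
    seen = set()
    for offset in range(5):  # five rounds, each shifted by one position
        this_round = positions_read(read_length, offset)
        seen |= this_round
        print(f"offset {offset}: {sorted(this_round)}")
    print("every position read:", seen == set(range(read_length)))
```

A single round leaves gaps, but the union over the five shifted rounds contains every position, and most positions are read twice, once as the first and once as the second base of a probe.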
So it reads two base pairs, then three or four base pairs of nothing, then two base pairs, then again a few base pairs of nothing. And of course, if you do this multiple times, then you can start at any position and you can figure out what the whole sequence was. So what it kind of does is that you have little gaps in the individual runs, but you can fill those by just doing multiple runs. It has 16 different probes, and of course colors are shared between the probes, but it allows you to figure out the sequence as well. So SOLiD sequencing, like I said, I never did it. It is supposed to be relatively cheap, but it's relatively slow and it has a massive issue with palindromes. So palindromes are sequences which are the same if you read them from left to right or from right to left. And the big issue is that a lot of the things in DNA are actually based on palindromes. Palindromes are all over the DNA, because palindromes are very common when it comes to things like transcription factor binding. A lot of things that bind DNA are dimers, often homodimers, right? So you have a left side and a right side molecule that bind together, and by binding together they bind to DNA. But then that means that the sequence that they read is TGA-AGT, because you have the same protein twice, which combines to couple to the DNA. So in DNA you have a lot of palindromes. So that makes SOLiD sequencing relatively unsuitable for sequencing DNA, which is just a shame, because it's an interesting technology, but it never got really popular or anything. But I just wanted to show you guys that you don't have to use a polymerase. You can also use ligation methods, which just ligate primers or pieces of DNA to the DNA and then cut off the part which had a little light signal. Chain termination sequencing, also called Sanger sequencing. So this is just the scaling up of standard Sanger sequencing like we saw. The read length here is about 400 to 900 base pairs.
You will never do this for a human genome, because a million base pairs will cost you around $2,000. It provides relatively long reads, which is really good, but it's expensive and very impractical to do Sanger sequencing in parallel. So I just wanted to mention that it is a possibility, but generally you only use Sanger sequencing if you have short fragments and you want to know their sequence, right? You did a PCR, you took out a little part of the genome, and now you just want to determine this sequence. You're not going to do a whole genome sequence with it. All right, so this last hour I tried to kind of throw all of this at you, and it's probably complex, and all of these techniques have something in common, because they either use light or they use hydrogens or they use luciferase. So if you wanna get a good overview, right, because I can't teach you all of the different sequencing technologies in an hour, you can read this paper. It's called "Next-Generation Sequencing: From Basic Research to Diagnostics". I will make sure that the link is clickable when I upload the PDF to Moodle. So you can just click the link and get the paper, and at home you can take an hour or two to just read through the different sequencing technologies and how exactly they work. And the paper comes with some of the figures that I already showed you. So I hope that that will allow you guys to understand exactly what happens, instead of just having to listen to me for an hour going through it. But what I want you guys to know is how these technologies kind of work, and you don't have to know all of this in detail, right? I'm not going to ask these things in detail on an exam, right? "ATP sulfurylase converts PPi to ATP in the presence of...", no, that's too far. I just want you to know that there are different types of sequencing technologies and that they share similar things, right? Some use polymerases, some use ligases.
Others use an immobilized polymerase that you pull your DNA strand through. But the idea is that there are a lot of different techniques, and that all of these techniques give you information, and all of them have their advantages and disadvantages. If you're interested in very long reads, then you would go for the more expensive sequencing which uses the immobilized polymerase. And this field also changes very quickly. In five years' time there will be 20 new sequencing techniques, and they will have different advantages and disadvantages, and also different prices, because prices go down every year. Good, so I've been talking for an hour again. So for you guys and for the people on YouTube, I will take a short break. Let me first stop the recording. I will be back in 10 minutes. So, stop.