Good, so we're recording again, so welcome back. In this second part of the lecture I want to quickly go through the history, so the discovery of DNA, and then the discovery of the different sequencing technologies, and I will explain to you guys how they more or less work. So in 1869, which is a long time ago, the Swiss chemist Friedrich Miescher made a discovery, and he initially thought that he had discovered a new type of protein. However, when he did measurements on this new type of protein that he had extracted from cells, he found that it had a relatively high phosphorus content and that it was resistant to proteolysis. So they had enzymes which could break down proteins, but this new thing that he discovered he was unable to cut into smaller pieces; he was unable to break it down. So he was the first one to coin the term nuclein, and nuclein was something that he found inside the nuclei of human white blood cells. He took human white blood cells, did his experiments, and then found this strange kind of protein. Then in 1919 comes the next big discovery relating to DNA, done by Phoebus Levene. He proposed that this nuclein, this nucleic acid, was composed of a series of nucleotides, and that each nucleotide was in turn composed of just one of four nitrogen-containing bases, a sugar molecule and a phosphate group. The early chemists that looked at DNA didn't know anything about how DNA was organized or how it was structured, so they just did tests to see what this molecule looked like, because big molecules which you extract from white blood cells or other types of cells are of course interesting. And he discovered that there were four different bases, more or less, so four bases which are nitrogen-containing, a bit like the amino acids in a protein. 
He also found that there was a sugar molecule and that there was a phosphate group that tied these things together. Then we have to wait quite a long time, because only around 1950 do we see Erwin Chargaff, and he found out that the amount of adenine in this molecule is almost equal to the amount of thymine, and that the amount of guanine is approximately equal to the amount of cytosine. This is something that we nowadays know and always assume, because DNA is a double-stranded helix, so every A is coupled to a T and every C is coupled to a G. But this was something that Erwin Chargaff discovered, that these quantities were the same. They still didn't have a model of how DNA looked or what kind of structure it had, but they discovered more and more. The real discovery of the structure of DNA is credited to James Watson and Francis Crick, and this is based on Photo 51. Photo 51 is an X-ray diffraction image of a DNA molecule, and this diffraction was done by Rosalind Franklin and her PhD student Raymond Gosling at the time. James Watson and Francis Crick then developed their model, in which DNA takes a helical structure. So just for historical reasons, I want to show you this Photo 51, because this is one of the most telling photos of the 20th century, so from 1900 to 2000, and this was the photo from which the structure of DNA was determined. You can see that it's an X-ray diffraction, and what you can clearly see is this stepwise pattern, this X being formed in the diffraction pattern of the photo. So it's not a very clear photo, but this photo was combined with the knowledge that the number of A's equals the number of T's and the number of C's equals the number of G's. 
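Chargaff's rules follow directly from base pairing, and you can convince yourself of that with a tiny sketch: if you represent a double-stranded molecule by one strand plus its complement, the combined counts of A and T (and of G and C) must come out equal. This is an illustrative toy, not real data; the example strand is made up.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def base_counts_double_stranded(strand: str) -> Counter:
    """Count bases over both strands of a DNA duplex.

    Because every A pairs with a T and every G with a C, the totals
    must satisfy Chargaff's rules: #A == #T and #G == #C.
    """
    other = strand.translate(COMPLEMENT)  # the paired strand
    return Counter(strand) + Counter(other)

counts = base_counts_double_stranded("ATGCGGCTA")
assert counts["A"] == counts["T"]
assert counts["G"] == counts["C"]
print(counts)
```

Chargaff measured these ratios chemically, of course, without knowing about the helix; the equality only made structural sense once Watson and Crick proposed base pairing.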
The fact that there is a sugar part and a phosphate part, this was all the information that Watson and Crick had, and then they proposed their model of DNA, which is a double helical structure. So you have a sugar-phosphate backbone, and then you have the base pairs in the middle, with about 0.34 nanometers between each of the bases, and if you look at one turn of the helix, one helical turn is around 3.4 nanometers, so very small but very impactful. So Watson and Crick came up with their structure of how DNA is organized, based on this photo made by Rosalind Franklin. Of course, nowadays we know that there are more and different types of helical structure. Here in the middle you see the standard DNA helix as it was proposed by Watson and Crick, and this is the B form of DNA, the standard structure. Besides that, we also have A-DNA, which is the same type of helix but wound much tighter together. So here you can see the major groove and the minor groove of B-DNA, but A-DNA is a type of DNA which does not have a clear major and minor groove: it is tightly packed and actually has only a kind of minor groove on both sides. And of course this has major implications for how DNA can be transcribed, because most proteins work by binding and recognizing DNA using this major groove. And then nowadays we also know that there's something called Z-DNA, which is actually a left-handed helix. You can see that here there's a right-handed helix, so it turns in a right-handed fashion, but we now know that there's also DNA which is wound in a more extended way and turns the other way around: it turns as a left-handed helix. 
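The two numbers above already tell you quite a lot: a quick back-of-the-envelope calculation, assuming the canonical B-DNA values (0.34 nm rise per base pair, 3.4 nm per turn) and an illustrative genome size of 3.2 billion base pairs for a human.

```python
# Back-of-the-envelope numbers from the B-DNA helix geometry above.
# 3.2e9 bp is used here only as a rough illustrative human genome size.

RISE_NM = 0.34       # distance between adjacent base pairs
TURN_NM = 3.4        # length of one full helical turn

bp_per_turn = TURN_NM / RISE_NM
print(bp_per_turn)   # ~10 base pairs per helical turn

genome_bp = 3.2e9
length_m = genome_bp * RISE_NM * 1e-9  # nm -> m
print(length_m)      # ~1.1 metres of DNA in one cell
```

So roughly ten base pairs per turn, and about a metre of DNA packed into every nucleus, which is why the tight winding of A-DNA and the packaging of chromatin matter so much.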
All of these different types of DNA have different implications for genes, transcription, and how proteins interact with DNA. So B-DNA is the original DNA structure proposed by Watson and Crick. A-DNA is the same thing but wound much tighter, so it doesn't have a major and a minor groove, it just has a kind of minor-groove structure. And then of course there's Z-DNA, which turns the other way, so it's inverted: left-handed instead of right-handed. Right, so that's the discovery of DNA, and after the discovery of DNA, people wanted to determine the order of the base pairs, because at that point in time they knew that DNA was the carrier of genetic information from parents to offspring. So they had finally found the thing that people like Gregor Mendel were looking for, because Mendel had this beads-on-a-string theory, but he did not know what the carrier of genetic information from parents to offspring was. By the 1950s and 1960s they knew that this nucleic acid that they had found had to be this mysterious carrier of information from parents to offspring. Then in 1977 we have the first two sequencing methods being developed: the first one is Maxam-Gilbert sequencing, and the next one is the chain termination method, which was invented by Frederick Sanger, who is also credited for Sanger sequencing, and Sanger sequencing is still in use nowadays. So if you send in a little piece of DNA for sequencing, then generally these little pieces of DNA, up to like 500 base pairs, will be done by chain termination sequencing, also known as Sanger sequencing. 
In 1979, there's whole-genome shotgun sequencing: you take a big molecule of DNA and chop it up into smaller pieces, so that, like with a shotgun, you split it up into all kinds of little reads or little pieces of DNA, and then you do shotgun sequencing, which means that you sequence all the individual molecules that you get from breaking up the large molecule. In 1986, we had the first kind of semi-automated sequencer, where you could more or less just put in your sample and it would give you back a sequence. In 1998, we have a big improvement in the bioinformatics field, and this is when Phred quality scores were introduced. A Phred quality score is a way of writing down, as I told you, not only the DNA sequence but also the quality of each individual base pair, so how reliably each base pair was determined. In 2000, we see the advent of massively parallel sequencing, so massively parallel signature sequencing, and then the first really usable machine is the 454 machine, produced by 454 Life Sciences, which does parallel pyrosequencing. So we will go through all of the sequencing technologies, and I will highlight the ones that are important, how they work, and what is interesting. The original sequencing method is Maxam-Gilbert sequencing, and Maxam-Gilbert sequencing is based on chemical cleavage of the DNA. So you have a chemical treatment that generates breaks at a small proportion of one or two of the four nucleotide bases in each of four reactions. In theory, you take this little piece of DNA with its labeled phosphate group and treat it with chemicals: one chemical will cleave at the A or at the G, one will cleave only at the G, one will cleave only at the C, and one will cleave at the C or the T. So you use four different chemicals. 
So you have the same sequence, you treat it with the four chemicals, and then you use a basic sequencing gel, which is very similar to a PCR gel that you'd run nowadays, and on this sequencing gel you run the individual pieces. So when this sequence here is cleaved using the A plus G chemical, it will read G, C, T, and then there will be a cleavage because it cleaves at the A. Of course, when you do this, there will be a second fragment, and the second fragment will be here, because it's also cleaved at the G, so you get three fragments when you break the DNA at the A plus G positions. Another chemical is there to cleave based on just the G, so this will only generate a single fragment, cleaving here at this G. The first position, of course, doesn't give a visible fragment, because if you cleave at the first one you only get the phosphate group. If you cleave at the C, you get two different fragments, and when you cleave at C plus T, you get four different fragments, right? So you do these four individual reactions and then you put the output on a sequencing gel, and this is just a basic gel with a very high concentration of agarose, so that you can more or less see single to two base pair differences. It's a very interesting technique, how to make these gels, and you also need a very strong electric field to pull the fragments through the gel. What you then see is that there are bands, like in a normal PCR product, and based on these bands you can reconstruct the original sequence. So when we look at the bands created by the sequence above, you read the sequence from top to bottom in a backward fashion. 
So the first thing you see is a band for the A plus G cleavage, but you don't see a band for the G cleavage, meaning that you can deduce that this base pair, the last base pair in the sequence, is an A. The next fragment that you see is the long fragment here, and this fragment leads you to believe that there's a T, because it is observed with the C plus T cleavage but it's not there when you cleave only with the C chemical. And of course, when you see two bands, so a band in the A plus G lane and a band in the G lane, then you know that the base is a G. So the sequence that you read here is actually the sequence from the three prime to the five prime end. Because the agreement is that we write down DNA from five prime to three prime, the sequence here needs to be read from the bottom to the top. So when I show you a sequencing gel like this on the exam, remember to write down the sequence from five prime to three prime, which is the default way we write down DNA sequences: you have to read the gel from the bottom all the way to the top. Is that clear? I think this is one of the more difficult sequencing technologies, and in practice it's not a great method, because it's very expensive and you need heavy equipment, but I'm hoping that it's clear. If not, just say no in the chat, and if it is clear then just shout yes, and then we will move to the next slide, because we still have like five or six different sequencing technologies to do. So this is the original sequencing technology. It is based on chemical treatment, and it's based on this phosphate being 32P, which is radioactive, so it's a difficult sequencing technology to do, and we don't use it anymore nowadays, but it's the original sequencing technology, and that's why I wanted to show it to you. All right, so pyrosequencing is one of the ways we do it nowadays. 
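The lane logic above can be written down as a small decoder. This is a toy sketch, not a real gel-analysis tool: the gel below is a made-up example, represented as fragment length (band position, shortest fragment at the bottom of the gel) mapped to the set of lanes in which a band appears.

```python
# Toy Maxam-Gilbert gel decoder, following the lane logic above.

def call_base(lanes: set) -> str:
    if "G" in lanes:          # band in the G lane (and A+G) -> G
        return "G"
    if "A+G" in lanes:        # band only in the A+G lane -> A
        return "A"
    if "C" in lanes:          # band in the C lane (and C+T) -> C
        return "C"
    if "C+T" in lanes:        # band only in the C+T lane -> T
        return "T"
    raise ValueError("no band in any lane")

def read_gel(bands: dict) -> str:
    # Reading from the bottom of the gel (shortest fragment) upwards
    # gives the sequence five prime to three prime, as discussed above.
    return "".join(call_base(bands[length]) for length in sorted(bands))

gel = {1: {"A+G", "G"}, 2: {"C", "C+T"}, 3: {"C+T"}, 4: {"A+G"}}
print(read_gel(gel))  # GCTA
```

Sorting by fragment length is exactly what the gel does physically: the electric field separates the fragments by size, and the band order encodes the base order.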
So pyrosequencing is a method where you use a polymerase, and using this polymerase you add nucleotides in a known order. When you add the correct nucleotide, it is incorporated by the polymerase, and the incorporation triggers a luciferase assay, so you're kind of doing luciferase assays one after another, and based on the order of the DNA you sometimes see a peak and you sometimes don't see a peak. In this case, we are adding the nucleotides in the order A, G, T, C. We add the A nucleotide and see that it doesn't produce a flash, so A was not the base at the current position. When we add a G, we see a little flash of light. So when we add a G and it flashes, that means that at the point the polymerase had reached, the template actually contained a C, right? That's the difficulty here: when you add a G and it is incorporated, the original template sequence actually contains a C. So you just add these all in order, and in the end you look at the flashes of light, and also at the height of the flashes, because it might be that there are two T's in a row, and then when you add an A you get a flash that's twice as bright; you can have three or four times the height as well, depending on how many T's or A's or C's or G's there are in a row. And of course, the more there are in a row, the more difficult it becomes to determine the exact number. Nowadays this is still more or less how it works: when you do this kind of sequencing, you get a profile back which looks like this, just little peaks, you know the order in which the nucleotides were added, and based on that you can write down the sequence. The nice thing about this method is that it reads in the correct order. So here the original sequence would be C, T, G, C, T, T, A, A. No reading backwards, just the right order. 
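Decoding a pyrogram is mechanical once you adopt the convention just described: a flash after dispensing base X means the template held the complement of X, and the flash height counts how many of them sit in a row. A minimal sketch, with a made-up flow series rather than a real pyrogram:

```python
# Sketch of turning pyrosequencing flows into a template sequence.
# Dispensation order cycles A, G, T, C, as in the lecture example.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def decode_flows(flows):
    """flows: list of (dispensed_base, peak_height) in dispensation order."""
    template = []
    for base, height in flows:
        # height 0 = no incorporation; height n = homopolymer of length n
        template.append(COMPLEMENT[base] * height)
    return "".join(template)

flows = [("A", 0), ("G", 1), ("T", 0), ("C", 0),
         ("A", 1), ("G", 0), ("T", 0), ("C", 1)]
print(decode_flows(flows))  # CTG
```

Note how the homopolymer problem shows up directly in this model: a height of 3 versus 4 is a single noisy number, which is exactly why long runs of the same base are hard to call.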
All right, then Sanger sequencing is a little bit more involved. Sanger sequencing is kind of the normal way we do it now. Pyrosequencing is still used, but Sanger sequencing is a little bit better because you can do all four base pairs at the same time, instead of adding them one by one and waiting for the flash of light from the luciferase assay. What you do is you make a reaction mixture: you have your primers and your DNA template, you have your DNA polymerase, and then you use these ddNTPs. These are nucleotides which terminate the sequencing reaction, so if one of these ddNTPs is incorporated, the polymerase is stuck; it cannot move any further. That's why you first need to do an amplification of your DNA, so that you have many molecules. So you have normal dNTPs and you have ddNTPs, the terminating nucleotides, and then you use your primer and the DNA sequence that you want to extend. The polymerase starts extending the DNA, and sometimes it incorporates one of these stops. So normally it incorporates a normal nucleotide, but sometimes it incorporates one of these stop nucleotides, and these stop nucleotides carry a fluorophore. Here you see the colors that are generally used: if you have a T, you have a red color; if you have a C, a blue color; if you have an A, a green color; and if you have a G, kind of this pinkish color. So there are four different colors. You have the primer, the polymerase binds and starts extending from the primer, and as soon as it incorporates one of these ddNTPs, the extension stops. So in the end you get a whole bunch of fragments, and some of these fragments will be terminated directly after the primer, while other fragments will terminate much later on in the sequence. 
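The fragment picture above can be sketched in a few lines: every position in the extended strand can become the end of a fragment when a dye-labelled ddNTP lands there, and sorting the fragments by length, which is what the capillary does physically, then recovers the sequence from the terminal dye of each fragment. This is a toy illustration with a made-up template, not a simulation of real chemistry.

```python
# Toy model of chain-termination sequencing.

def terminated_fragments(extended: str):
    """All possible prefixes, each ending in a dye-labelled terminator."""
    return [extended[:i] for i in range(1, len(extended) + 1)]

def read_by_size(fragments):
    # Shortest fragments come off the capillary first; the dye on the
    # last base of each fragment gives one letter of the sequence.
    return "".join(frag[-1] for frag in sorted(fragments, key=len))

frags = terminated_fragments("GATTACA")
print(read_by_size(frags))  # GATTACA
```

The amplification step mentioned above is what guarantees that, across many molecules, termination actually happens at every position, so no prefix is missing from the fragment pool.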
What you then do is you take all of these fragments that you've generated and run them through a capillary gel. The smaller fragments go through the capillary gel quicker, so they will be the first to be scanned by the laser. You hit the fluorophore with the laser and then you get a chromatogram which looks like this, and here you can read the sequence based on the colors and the heights of the peaks. So it's a very, very good method, and it is still used nowadays. If you send in your sequence, 99% of the time they use Sanger sequencing if it's a small sequence: you send in your sequence, you send in a primer, and the company will do this for you. In theory, in our lab we also still have two of these capillary machines, so we can do it here in the lab as well. For small fragments this is more or less the default nowadays, although pyrosequencing is still used too, and each sequencing technology has its advantages and disadvantages. But this is based on primer elongation and the incorporation of these ddNTPs, nucleotides which, once they are incorporated, stop the reaction: the polymerase gets stuck and cannot extend anymore. All right, so those are more or less the classical ways of sequencing. Nowadays we talk about next generation sequencing, which often means that we do re-sequencing: we don't sequence a whole organism from scratch anymore; we already have a reference sequence, and we then sequence another individual to just find the differences, so the SNPs and the indels compared to the reference. That is called re-sequencing. Besides that, we do things like transcriptome sequencing or RNA-seq, where we use a reverse transcriptase step: we take the RNA, transcribe it into cDNA, and then sequence the cDNA. And then we have, for example, DNA-protein interaction sequencing, which is called ChIP-seq, where we look, for example, at where a protein binds on the genome. 
What we do is we take an antibody that targets the protein that we want to see, and we put the antibody on the protein which is bound to the DNA; then we cut the DNA into little pieces, pull out the antibody fraction, and wash away the rest. Then we have just the part of the DNA which was originally bound by the protein. And of course, there are now new methods which actually allow us to do epigenomic sequencing, so to determine, for example, whether a base pair is methylated or not, because there can be little things attached to the DNA, on the backbone or on the nucleotide itself. So you can have a CH3 group, a methyl group, on the DNA, and this of course gives you more information: it won't just give you A, C, G or T, but it will also tell you if, for example, the G was in a methylated or an unmethylated state, which also has an effect on the eventual transcription of the genome. So, next generation sequencing. Of course, there are many, many different next generation sequencing technologies, so I will only explain a couple which are interesting, or which I find interesting. The one that I am most excited about is actually single molecule real-time sequencing, and this is done by Pacific Biosciences; it is not the standard yet, but it is getting there. 
Nowadays you can kind of rent a little sequencer from them which is as big as a USB stick, so you get a little computer with a USB stick, you put the sample on the USB stick, and then you can read the DNA for like 10 to 15 kilobases, so 15,000 base pairs in one go. You get a very, very long read, and at least when I made this slide it could go up to a maximum of 40,000 base pairs, but I think they're already a lot better; I think that nowadays you can go up to 100,000 base pairs. That means that you get a single read, a single sequence of letters, which is 40,000 base pairs long. Compare that with the ddNTP-based Sanger sequencing method: with Sanger sequencing a single read can only go up to around 500 to 600 base pairs, and after that the quality drops off tremendously, because of course you need a terminator at every base pair. This is relatively cheap at the moment. The slide says that it's expensive equipment, but nowadays you can rent the equipment from the company, and it has a moderate throughput, so it won't allow you to sequence a whole human being, but you can sequence your little lizard, and you can do that directly while you're working in the rainforest, because you can just take this little computer with you and put the sample on the thing that looks like a USB stick. And it has the longest read length. So the advantage of this method is that you get very, very long reads, it has moderate throughput, so you can do a whole bunch of things in parallel, and the equipment is relatively expensive, although you can rent it nowadays. So how does this work? How does PacBio sequence? 
On the little device that you have, there are these little holes: you have a glass plate, around a hundred nanometers thick, with little holes in it, and inside each hole a polymerase is attached to the glass plate. What happens is that you put in your DNA strands, so a single DNA strand goes in, and the strand is then pulled base by base through the polymerase. This uses phospholinked hexaphosphate nucleotides, so special fluorescent nucleotides, and every time the polymerase extends, you get a fluorescent pulse when it incorporates the correct base. So if you look over time, you see, oh, the blue one got incorporated, because you see this flash of light. This technology is possible nowadays because we can measure time on an extremely fine scale, so you can watch the polymerase extending the DNA base pair by base pair, and every time it extends you get a light flash with the color of the base that has been incorporated. This is an amazing technology; it wasn't possible like 10 years ago, so it is a really, really novel sequencing technology, and you get really, really long reads. When you're working with alignments and sequencing, the longer the reads, the better you are able to find things like big insertions or big deletions, so this is a really, really good technology, and it's still improving: every couple of months they are either able to sequence more, or they are able to put in more of these little wells, so as to sequence more in parallel. It's a really, really interesting technology. All right, so I hope that's clear. One of the other next generation sequencing technologies which is still used is ion semiconductor sequencing, the Ion Torrent 
sequencing. Detectors of this kind are normally also used for protein detection. It has a relatively short read length, around 400 base pairs of DNA, and it is less expensive than the previous one, because the machine is not that expensive, and it is very fast. But the big issue with the Ion Torrent is that you get homopolymer errors: if you have a stretch of DNA which is like five or seven A's in a row, the machine cannot tell you how many A's there were; it can just tell you there was an A and that this A occurred multiple times, but it cannot determine how often. I'm not going to discuss how the Ion Torrent works; if you're interested, you can just Google it. It uses a quadrupole-style detection system, kind of like what you do in metabolomics, and metabolomics and proteomics use a lot of these machines, so we will get back to how you can use this kind of detection when we talk about proteins or metabolites. It is less expensive, and these machines have come down a lot in price: if you want to sequence a million base pairs, this will cost you around a dollar, which makes it far from the most expensive technology around. All right, so the next one is pyrosequencing, the original, more or less, next generation sequencing technology. This is just based on standard pyrosequencing, done in parallel, and this is really, really expensive. It also suffers from homopolymer errors, so again, if you have a bunch of A's in a row, it cannot really determine how many A's or T's or C's in a row you have. But it had a relatively long read length: you get reads which are around 500 to 700 base pairs long. You pay around $10 for a million base pairs, so it's around 10 times more expensive than the Ion Torrent and around 20 times more expensive than the new PacBio machine. So it has an 
advantage: the reads are relatively long, not as long as with the PacBio technology, but this one is really, really fast. So if you want to quickly do the sequence of a new animal, you can use pyrosequencing: you get relatively long reads, and you get them very fast, but it is relatively expensive. So how does pyrosequencing work? Well, we already explained pyrosequencing; it's the same again. You have the different dNTPs, which are added one by one, then you have the luciferase assay, which generates a pulse of light, and then you have apyrase, so it's a kind of cycle: you add a nucleotide, if it can be incorporated you get a flash of light, and then the luciferin and the ATP are broken down, so you can start adding the next one. Nowadays they add all four nucleotides at the same time, and they have different flashes, different colors of light, like the Sanger sequencing method. This is the explanation; you can read it. It just goes step by step in a loop: you add one of the four nucleotides; if it is incorporated, it releases pyrophosphate; the pyrophosphate then gives a little flash of light via a luciferase-mediated conversion; and the unincorporated nucleotides and the ATP are degraded by the apyrase, so the reaction can restart with another nucleotide. So very similar to how I explained pyrosequencing before, but of course next generation sequencing means that you don't do one DNA strand, you do many, many in parallel, and that makes it much quicker. Now, normally when people talk about next generation sequencing, they are talking about sequencing by synthesis, which is the Illumina technology. This has a read length of around 50 to 300 base pairs, and it is the cheapest way of sequencing nowadays: for a million base pairs you pay somewhere between 5 and 15 dollar cents. It is 
very cheap, and you get a very high sequence yield, so there are a lot of these little sequences that you get, but the equipment is very expensive. If you want to buy an Illumina sequencer nowadays, even the cheapest ones will cost like 300,000 to 400,000 euros, and if you want to buy one of the new high-end machines, like the X Ten, then prepare to pay like five to seven million for a single sequencer. The problem here is that you need relatively high DNA concentrations. It's not like the PacBio, where a single molecule of DNA is enough; here you definitely need a high concentration of DNA, so you need to extract a lot of DNA, which is just one of the drawbacks of this method. But nowadays sequencing by synthesis is kind of the default sequencing method. You also have sequencing by ligation, which is SOLiD sequencing; this is very similar to the sequencing that Illumina is offering, but it has its own drawbacks: the read length is much shorter, so you get 50 plus 35 or 50 plus 50 base pairs; it costs a little bit more, still very cheap, 13 cents per one million base pairs; but it is slow and it has issues with palindromes. So if you have sequences which read the same from left to right as from right to left, which actually happens a lot in DNA, because a lot of transcription regulation is based on palindromes, it doesn't work very well with these technologies. So, different sequencing technologies, and each sequencing technology has its own advantages and disadvantages. And then of course there's Sanger sequencing, which we already discussed. It can provide a read length of like 400 to 900 base pairs, but it is very, very expensive: if you want to do one million base pairs using Sanger sequencing, that will cost you around 2,400 euros. However, if you have small fragments, then the best way to sequence them is still Sanger sequencing; although it is really expensive and it is really 
impractical at scale, it is kind of still the default for small jobs: if you send in your 500 base pairs that you PCRed out, they will use Sanger sequencing, just because the other methods have bigger startup costs but a lower cost per base pair. The reads are relatively long, 400 to 900 base pairs, but it is very expensive. All right, so the key to how next generation sequencing technologies work is that they produce very short reads, but a lot of them. So what you do is you take all of these short reads together and assemble them into longer, and ultimately complete, genomes, which is called sequence assembly. And of course the raw sequencing reads are only the beginning of a detailed bioinformatics analysis; bioinformatics is the thing that makes next generation sequencing work. So when you do sequencing and you have your sequencing reads, what do you do with them? Well, you have to align them to the reference genome, then you have to look at how many reads align at a certain position, and then you want to call the differences between the sequence that you just did and the reference sequence. This is generally done using Unix or Linux based tools, and it means that you need a lot of computing time: generally, if you want to re-sequence a genome using short Illumina reads, it usually takes somewhere between 24 and 36 hours for a single sample. You need a lot of hard drive storage, you need a lot of random access memory, so you just need a very, very beefy computer to do that, and you need a lot of file management, because every step in the pipeline produces new files, and all these files need to be stored. So you get a hard drive full of data, and from this one hard drive of data you end up producing another hard drive of data, just in the alignment of these sequences against the reference genome. All right, so how does this work? 
Okay, so we talked about sample preparation: extracting the DNA, or extracting the RNA and then reverse transcribing the RNA into cDNA. Then you do your DNA sequencing, and your DNA sequencing gives you a file which is called a FASTQ file. A FASTQ file is very similar to a FASTA file; I don't know if people know how a FASTA file looks, I think I have an example of a FASTA file. But a FASTQ file is just a sequence: you have the name of a certain read, then you have the sequence that was read, then you have a separator line starting with a plus sign, and then below that you have the score, so the quality of the individual base pairs. That's called a FASTQ file. So what happens with the FASTQ file? Well, the first step in sequencing alignment is that you have to do read trimming, and we will go through all of these steps, but read trimming is the first step before you even start aligning the reads towards the reference genome. And this is just to get rid of the parts of the reads which have very bad quality at the beginning or at the end of your sequence. Sometimes you have things like an adapter that has to be cut off, and sometimes the sequencing quality drops down at the end, so that you have a long read of 150 base pairs but the last 70 base pairs are of such low quality that they are actually not reliable. After the read trimming we have the process of alignment: taking the reads and figuring out where on the reference genome these reads come from. Then we have the handling of duplicates. One of the things which happens with the sequencing technology, especially Illumina, is that you have something called optical duplicates. That means that, based on the way the optics work, you get the same read read multiple times, and of course you don't want these optical duplicates, because they are just copies of the original sequence that you are sequencing. So we will go through them all step by step, but alignments generally produce BAM files
or CRAM files nowadays. But a BAM file is nothing more than a file where all of these reads are listed, plus the location on the reference where each read fits the best. So we get optical duplicate handling, or PCR duplicate handling. Then we do an indel realignment step. The indel realignment step is something that you have to do because, when you align reads to a genome, it might be that people already know that there is a small insertion in some individuals, and if you know that some individuals have a small insertion, you want to correct for that when you align your read at a certain position; so it has to do with the quality scores. We then do base recalibration. Base recalibration is based on SNPs: if we know that there are SNPs in the sequence, we want to not penalize for known SNPs, and base recalibration takes care of that. And then at the end, when we do the whole bioinformatics pipeline, generally what we want to do is SNP calling: we want to find where in the genome these single nucleotide polymorphisms are, so where the differences are in our sample compared to the reference sample. All right, so let's go through. Here we see an example of why we should trim reads. So normally when you get a read from a sequencer, then here we have the quality score. This is a logarithmic (Phred) scale: a quality of 40 means roughly a 1 in 10 to the power of 4 chance that the base call is wrong, and a quality of 70 means a 1 in 10 to the power of 7 chance of an error, so the higher the score, the more reliable the base. Here we start at 40, but 0 is also possible; the read quality ranges from around 0 to around 70. But what you clearly can see, and what you see a lot when you do sequencing using Illumina or other technologies, is that the more you sequence, so the longer your read becomes, the lower the quality. So when the quality drops beneath a certain point, you want to cut that off, so instead of having a read which is like 150 base pairs long, you cut off the
last 30 base pairs because they are too unreliable. And then in the end what you want to see is that the reads kind of follow this pattern: of course the quality becomes worse, so the quality is not good across the whole sequence, but you don't want to have a massive drop like here at the end of the sequencing. And of course, why do we do that? We want to cut off the unreliable ends of these reads, and this is to improve the alignment, because when we are not certain that a base pair is the base pair that we think it is, then of course the alignment will suffer: when you try to align a 150 base pair read where the last 30 are not very reliable, and then you try to fit it to the reference, it just doesn't fit at the correct position. So read trimming is the first step that you do when you get reads from a sequencer, and it is an important step, because if you don't do it, then your alignment will look very crappy. All right, so then you have the alignment step. The alignment step is where we align the individual reads to the reference genome. Here at the bottom, this is the Integrative Genomics Viewer, and this is just one of the sequencing experiments that we did here. At the bottom you see the sequence colored using the normal color scheme, red being the T and so on; you have the standard DNA color code. And then what you see here are reads: you see that all of these reads are more or less similar in length, but of course some of them have been trimmed, so some of the reads are a little bit shorter and other reads are a little bit longer. And what happens is, you have hundreds of different programs that you can use to do the short read alignment, but in the end what they do is they just figure out where in the genome the read fits the reference genome. And of course it aligns the read, it does some mismatch detection, and it does something with the read length, because of course the longer the read,
the better, or the more uniquely, it will align to the genome. But there are hundreds of different programs: you can use the Burrows-Wheeler Aligner, you can use Bowtie, or Novoalign, or Stampy; there are literally hundreds of bioinformatics tools available, and all of these tools have certain advantages and certain disadvantages. Some of them are quicker but less accurate, some of them are slower and more accurate, so it's a continuous field of research: you can still do research on what is the best way of aligning the sequences that you have towards a reference genome. But in the end you get a picture which looks like this: at the bottom you have your reference sequence, then you have the individual reads on top aligning to the reference, and you see here with colors where there are mismatches, so where a read, or a base pair in a read, is different from the reference genome. All right, so then the next step is to look at duplicates. There are two ways that duplicates can occur. One of them is due to PCR: if you do PCR, you generate a lot of duplicate fragments, and this is not biology, it's just an artifact. And there are two ways of dealing with them in a bioinformatic analysis of sequencing data. One of the ways is that you can mark them and keep them in, right? So you can say, oh look, I find all of these reads and all of these reads have the exact same length, so I mark them as duplicate reads but I still keep them in the analysis. Or you can just remove them from the analysis altogether. So that's just the way that you deal with it. And of course, the chance of having 20 reads which all start at the exact same position and end at the exact same position is very small, because all of these methods are based on randomly cutting the DNA, so you would expect more or less a
pattern which looks like this, where every read starts at a slightly different position. Yeah, of course sometimes reads can start at the exact same position and end at the exact same position, but when you see a block of reads aligning at the exact same starting position with the exact same length, then often these blocks are caused by PCR or optical duplicates, which is not biology but just an artifact of the sequencing technology that you are using. So, I told you that there are SNP and indel realignments that you need to do. Of course, in real DNA, if you are looking at for example humans, there are a lot of these little insertions and deletions in human DNA. And what you have nowadays are overviews of which regions of the genome often are duplicated or often have insertions, so we know where they are, and then we have to account for that. Because if you have a lot of these little indels or SNPs in the real DNA, then what happens is that when you try to align a read to the reference genome, the read will not really fit there, right? Because the read has a deletion there and the reference genome does not have the deletion, so the read just doesn't fit exactly. But if you do a SNP or indel realignment, it will reconsider the reads that did not align in the first pass, and then see if, taking into account the known indels and SNPs, they can realign better. So if you look at an indel example: here you have the wild-type sequence, and you could have a three base pair deletion, so instead of having C A T A, what is deleted is A T A, so you just have C and then three A's; so this is a little deletion. And here you could have, for example, a little insertion, in orange. And of course, when you have a read, this read can come from either the wild-type sequence or the reference sequence; it can come from an individual which has just a C with three A's, or it can come from an individual which has a C
with the TG TG inserted in here. And then, of course, if you know that this exists in nature, then you want to compensate for that. Because if you're aligning your read, you're always aligning towards the reference genome, and that means that if you align a read here, then you would have like three mismatches, and these three mismatches would make the read be discarded by the aligner, because the aligner says: well, if it has three mismatches, that's too much, I'm throwing away the read. But by incorporating known information, you can now still keep the read, and you can still put the read at the location of the genome where it belongs. So how does this look? Here on the top we see one of our examples where we have not done the SNP and indel realignment. We see indeed that there's a SNP here, right? Here what we see is the coverage of the reference: these blocks tell us how many reads there were at a certain point. And here you see that there is an indel, right? So here, in the sequencing that we did, the reference sequence is here, and the individual that we were sequencing does not have this little piece of DNA, so it just continues: from this AA, the individual here continues with the CAG, so it just jumps; this individual just has this little part of the genome deleted. But when you don't do the SNP and indel realignment, what you see happening is that the aligner tries to put reads there and then insert these little insertions and deletions, and sometimes the whole read will be put there. But when you then do the next step and do the realignment, now it will know: okay, in other individuals we already saw that there are deletions there, so by taking that into account, I take all the reads that are overlapping the deletion and I realign them, and I update the quality score of these reads, because the reads
here would get very low quality scores because they are not matching exactly to the reference genome. But of course they are matching; it's just that you have to take into account that some individuals might have a relatively large deletion here. So it's a very important step, and it really improves the quality of your alignment, because instead of having reads with very low alignment scores, you now end up with reads which have a very, very good alignment. And of course you can see that not all of the reads have this deletion; sometimes the aligners are just not perfect either. But SNP and indel realignment is a very important step, and it upgrades the quality of your eventual alignment. All right, so then the next step is the base recalibration. The base recalibration step that you do is to take the quality scores in the fields that are reported in the BAM file. So you do your alignment, right? Every alignment gets a quality score: how well the read fits there. And every time that you have a mismatch, this penalizes the alignment score. So a perfect alignment score is, for example, 60: all of the 60 base pairs are matching the reference genome. But if you then have a read which has a single mismatch, then that single mismatch will bring the score down from 60 to 50. However, the same as what we did with the indel realignment: if we know that some individuals have a real single nucleotide polymorphism there, then we don't want to penalize for that. We don't want to give a penalty for the mismatch, because we know that this mismatch is a real mismatch, and that is called base recalibration. So you look at each base pair in the genome, you look to see if there is a known single nucleotide polymorphism, and if there is, we can recalibrate the score: instead of having the score being 50 for the alignment of a single read, we can now say, no, we
know that there was a SNP, so we update the score to be 60, to be perfect again, because it actually is a real biological phenomenon. All right, so those are more or less the steps. One of the things that I showed you a couple of times are these IGV overview plots. IGV is the Integrative Genomics Viewer; it's made by the Broad Institute, you can just download it for free, and it is the way to look at sequencing results. So when you have done sequencing, then after every step that you do, you look at the current alignment in IGV. The Integrative Genomics Viewer is a high-performance, easy-to-use, interactive tool for the visual exploration of genomic data. It supports flexible integration of all kinds of common data types and metadata, so it can also show you where the genes are and where the introns are; you can put all kinds of additional information into this genome viewer, combined with the information on how good a read is. So it's a very important tool, and it's one of these tools that, if you are doing bioinformatics, you kind of have installed on all of your computers, just because when you're working with sequencing data, you need to check every step of the way. You have this big pipeline that you go through, and after the alignment you check, then after you do the indel realignment you check again, and after the base recalibration you check again, just to make sure that the alignments are proper and are what you expect. All right, big step. So, we started off with individual reads, then you trim the edges of the reads because the quality drops off, then you align them to the genome, and then you do all kinds of additional steps to make your alignments as good as possible. And then the last step is the variant calling step. In the variant calling step you're now going to take the reads that you have, compare them to the reference genome, and see if any of these base pairs are different, or if there are small insertions and deletions
in your reads compared to the reference genome. So variant calling means finding the SNPs, the single nucleotide polymorphisms, and the indels, based on the DNA alignments that you do. And again, in this step there are many different tools that you can use, but you always end up with something called a VCF file. In this VCF file, each line is a SNP or an indel, and these files are then used in follow-up analyses; for example, you can compare multiple samples. So I think I put in a nice example of a VCF file that I created. This is one of the VCF files that I created this week. It's just a text file, so you can open it in any text editor; they're generally relatively big. I don't know if I can zoom in a little bit, or if it's readable. Can you guys read this at home? On my computer it already looks a little bit small; I think if you're looking at the stream on an iPad, then it might be too small. But the way that it works is that you have a whole bunch of header lines which tell the downstream program how the data is encoded in there. So if you can't read it: I will put this little piece of VCF file on the Moodle, so you can just get it from there and look at it in your own text editor, and then of course it's big enough. But the way that it works is that for each line, you have your chromosome. Here you have the header of the file. In this case I'm looking at certain mouse lines, and what you see here is the chromosome at which we found a little difference, the position at which the difference was found, the reference, so what does the reference genome tell us, like an A, and the alternative allele, so what did I find in some of my samples, in this case a C. And then when you look all the way at the back, you get the individual reads, and that's not even in this file here, because there's a lot of additional information in the middle, telling you how good the quality of this SNP is, and all of these things.
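The lecturer's mouse VCF file itself lives on the Moodle, but the column layout just described (chromosome, position, reference allele, alternative allele, with quality and other information after them) can be sketched in a few lines of Python. The record below is made up for illustration; it is not a line from the actual file:

```python
# A single made-up VCF data line: the fixed, tab-separated columns are
# CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO (sample columns would follow).
vcf_line = "11\t3092018\t.\tA\tC\t228.0\tPASS\tDP=54"

def parse_vcf_line(line):
    """Split one VCF data line (not a '#' header line) into its fixed fields."""
    chrom, pos, _id, ref, alt, qual, filt, info = line.rstrip("\n").split("\t")[:8]
    return {"chrom": chrom, "pos": int(pos), "ref": ref,
            "alt": alt, "qual": float(qual), "filter": filt, "info": info}

record = parse_vcf_line(vcf_line)
print(record["chrom"], record["pos"], record["ref"], ">", record["alt"])
```

This is all a downstream comparison of samples really needs to start with: skip the header lines beginning with `#`, then read each remaining line as one variant relative to the reference genome named in that header.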
But in the end you get a structure where you can read it in, and then you can look and see which animals have, for example, the A and which animals have the C, or, when we look at the second SNP, which animals have the C as the reference or the T as the alternative. We'll put this file online on the Moodle; I will just cut it to like the first 10 SNPs. But the variant call format is one of these standard formats in bioinformatics, and when we talk about file types and standards for analysis, we will get back to this file format and I will explain in much more detail how it exactly works. But it is the standard format to communicate what the differences are in your samples compared to the reference genome. So it's always relative to a reference genome, and the header actually tells you which genome was used; in this case I used GRCm38, which is the standard mouse genome, which is also used in, for example, the IMPC database. All right, so I've been talking for an hour, I don't know exactly. I think I have like 15-20 slides left, but I do think we should take a very short break, because I actually prepared some more animated GIFs for you for the second break. So we will do a quick break and I will be back at 4:10. Enjoy the animated GIFs and see you in 10 minutes then. All right, so