Welcome back, everyone, to part three of the bioinformatics lecture about DNA. Thanks to everyone who's still here. By the way, about VIP status: I generally give VIP status to current students. So if you're a current student and you say something in chat but you don't have a little diamond in front of your name, let me know, because I use it to see who's a student, who's not, and who's showing up multiple times. Let's just continue, because we still have 30 slides to go. I'll try to pick up the pace a little so that we can finish. Last week was also longer than I expected, so I don't want today to be as long. Good. So, like we saw, almost all current next-generation sequencing technologies produce very short, or relatively short, reads. The only one that produces really long reads is the one that uses the immobilized polymerase. The next couple of slides will tell you how people generally use these short reads: you take the short reads and you assemble them into things called contigs, which are contiguous sequences, and then these get assembled into complete genomes, so they get assigned to chromosome one, chromosome two, and so on. This whole process is called sequence assembly, and it is computationally relatively expensive if you want to create a new genome. So raw sequencing reads are only the beginning of a detailed bioinformatics analysis. Bioinformatical analysis, that's such a weird word; I shouldn't use a word like that. But remember that all of the tools that will be discussed in the next couple of slides are usually only available for UNIX or Linux, and that is because they generally run on cluster machines or servers. So machines which have a massive amount of hard drive space, a massive amount of random access memory, and generally 50, 60, perhaps even 2000 CPUs, right?
So these specialized machines don't run Windows, they don't run Mac, they are all based on Linux. Here you see the Linux mascot, which is called Tux. For all of these tools, like I said, you need a lot of computation time, a lot of hard drive space, and a lot of memory, and all of the files that are created also need to be managed, because in every step of this process you take the input files, you do something with them, and you create an output file. So the number of files just grows and grows and grows. In a normal sequencing analysis we have the sample preparation, which is not really part of bioinformatics, it's more biology, right? And then we have the DNA sequencing. DNA sequencing leads to something called a FASTQ file. After we have done the DNA sequencing, we get a FASTQ file from the sequencing machine that we are using, and then these are the steps which we do before we can actually discover single nucleotide polymorphisms, right? The first step is read trimming, then we do alignment, then we remove duplicates, then we do indel realignment, we do base recalibration, and only then can we do SNP calling. So this is kind of the standard approach that you should take, and there are a lot of quality control steps in there. The alignment is part of the process, so aligning the reads to the reference genome, but the read trimming is something that you need to do first, otherwise your alignment will be crap. So first things first: read trimming is necessary because, when you look at sequencing technologies, and all sequencing technologies suffer from this, the quality of the reads becomes lower towards the end of the read, right? And this just has to do with how chemistry works: the longer a certain chemical process runs, the poorer it works.
And especially if you're using things which involve, like, luciferase, which generates light flashes: at the beginning these flashes are very clear, and at a certain point things start getting out of sync and peaks start overlapping, so it becomes harder. So here, as an example, you see a sequencing run before trimming and after trimming. The quality here is written down as a Phred score, and in this case it ranges from 30 to 70. A Phred score Q means the chance that the base is wrong is 10 to the power of minus Q over 10, so a score of 65 is essentially perfect quality: there's a chance of about 10 to the power of minus 6.5 that the base is wrong. But you see that the longer we continue, the more bases we read, the lower the quality becomes. The average quality is not that much affected, but you do see that the bars of the plot go down all the way to below 40, which means that the error rate is going up and up and up the longer the reads are. So first things first: you need to cut off the unreliable ends of these reads, and this improves alignment. Generally we do this using Trimmomatic. There are other tools available to do it, but Trimmomatic, again, is one of these tools mainly used on Linux, and it just goes through each of the reads and says, well, if the quality is below 40, cut it and just throw away the rest. So you take your FASTQ file, which you have from the beginning, you do read trimming, and then it produces a new FASTQ file with shorter reads. And if you look at the quality across the whole run after you do the read trimming, you see that this whole big grayish area, where there were a lot of reads with low quality, is just gone, right? So now the quality of all of the different bases that we have is above, like, 55, which means that they are really reliable. So that's one of the first things that you do when you start analyzing sequencing data. Then the next step is alignment.
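As a rough illustration of the trimming rule just described (cut the read once the quality drops below a threshold), here is a minimal Python sketch. It is not how Trimmomatic itself is implemented; the function names, the example read, and the quality values are made up for illustration, and real tools use more refined rules such as sliding windows.

```python
def phred_to_error_probability(q):
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)


def trim_read(sequence, qualities, threshold):
    """Cut the read at the first base whose quality is below the threshold.

    Everything from that base onwards is thrown away (simplified rule
    from the lecture, not the actual Trimmomatic algorithm).
    """
    for i, q in enumerate(qualities):
        if q < threshold:
            return sequence[:i], qualities[:i]
    return sequence, qualities


# Hypothetical read whose quality decays towards the end.
seq = "ACGTACGTAC"
quals = [60, 62, 58, 55, 50, 45, 41, 38, 30, 20]

trimmed_seq, trimmed_quals = trim_read(seq, quals, threshold=40)
print(trimmed_seq, trimmed_quals)
print(phred_to_error_probability(30))  # one error in a thousand bases
```

The trimmed read is shorter, but every base that remains has a quality at or above the threshold, which is what makes the later alignment more reliable.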
So alignment is finding, for each read, where it maps on the reference genome. And there are literally hundreds and hundreds of programs to do short read alignment. Some of the most famous ones are BWA, which is the Burrows-Wheeler Aligner, you have Bowtie, you have Novoalign, you have RNA, you have Stampy. What these things do is align reads, count how many mismatches there are, and look at things like the read length. In the end, when you do your alignment, you get something which looks like this. Here you have the reference sequence, which is just represented by a single line. Here is the same reference sequence, but now colors are used for the different bases. And what you see here are the different reads. Here, for example, you see a relatively short read. The read has a kind of little arrow at the end, right? So the read starts here and ends here. And of course reads can go both ways, because you could have read the positive strand or the negative strand, but in this case you see that all of the reads point in the same direction, which means that we probably did some other type of sequencing where we filtered out reads from the negative strand. What happens is that all of these reads get assigned to the best position in the genome. What this produces is something which is called a BAM file. This BAM file has, for every read, the best location, and of course it also records any mismatches or any little deletions that you have to introduce in the read to make it fit, right? Because not every read fits perfectly, since sometimes there is a real single nucleotide polymorphism. Some people have an A, other people have a T. So when the read is aligned, there is one mismatch or no mismatch, depending on whether the person is equal at that position to Craig Venter or not, right? So that's what the alignment does.
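To make the idea of "assign each read to the best position and count the mismatches" concrete, here is a deliberately naive Python sketch. It simply tries every position in a tiny made-up reference; real aligners such as BWA do not work this way (they use an index over the genome and also handle gaps), so treat this only as an illustration of the scoring idea. All names and sequences are invented.

```python
def count_mismatches(read, reference, start):
    """Number of positions where the read differs from the reference."""
    return sum(1 for i, base in enumerate(read) if reference[start + i] != base)


def best_position(read, reference):
    """Try every start position and keep the one with the fewest mismatches."""
    best_start, best_mismatches = None, None
    for start in range(len(reference) - len(read) + 1):
        mismatches = count_mismatches(read, reference, start)
        if best_mismatches is None or mismatches < best_mismatches:
            best_start, best_mismatches = start, mismatches
    return best_start, best_mismatches


# Toy reference and a read that carries one difference (like a SNP).
reference = "GATTACAGATTTCCGGAACT"
read = "AGATATCC"

position, mismatches = best_position(read, reference)
print(position, mismatches)
```

The read still finds its place even though it does not match perfectly; the single mismatch is exactly the kind of difference that later steps have to classify as either an error or a real variant.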
And there are hundreds of different programs, and we don't even have the time to discuss all of them. The next step is removing duplicates. Duplicates are generally artifacts, and there are two types. One of them is PCR duplicates, because most next-generation sequencing technologies have a PCR step somewhere in the process where DNA is amplified. So that is not biology, it's just an artifact of the sequencing. Here you see all kinds of different reads. The color of the read is the quality. Green reads are aligned very well. Red reads don't align very well, so they have a couple of mismatches. But here you see this big block of reads which start and end at the exact same position, and this is a telltale sign that something is an artifact, right? Here the polymerase actually amplified this read, like, a couple of hundred times, which was not intended to happen. So we have to get rid of these duplicates. These duplicates can come from PCR, but there are also things like optical duplicates, where the detector of the machine detects two clusters that are exactly the same, which is not biology but just due to the fact that machines aren't perfect. There are two ways of dealing with duplicates. You can mark them and keep them, right? So you can say, well, this read was a duplicate. Or you can just throw them away. The approach here is up to you. If you want to remove them, you can just throw them away, but some people like keeping them in for the next steps so that you can, for example, use some of the information, because you have the same read sequenced, like, 20 or 30 times, and that is sometimes a good thing. So that's duplicate removal. After that, we need to do SNP and indel realignment, right, because we know that there are many real SNPs and many real insertions and deletions in real DNA.
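Going back to the duplicate step for a moment: the "same start, same end" telltale sign and the mark-versus-remove choice can be sketched in a few lines of Python. This is a simplified illustration and not the algorithm of any particular tool; the read records and the `mean_quality` field are hypothetical.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return one True/False flag per read: True means 'duplicate'.

    Reads that share chromosome, start, end and strand are treated as
    copies of the same fragment; the copy with the highest mean quality
    is kept as the representative and the others are flagged.
    """
    groups = defaultdict(list)
    for index, read in enumerate(reads):
        key = (read["chrom"], read["start"], read["end"], read["strand"])
        groups[key].append(index)

    flags = [False] * len(reads)
    for indices in groups.values():
        best = max(indices, key=lambda i: reads[i]["mean_quality"])
        for i in indices:
            if i != best:
                flags[i] = True
    return flags


# Hypothetical aligned reads: the first three start and end at the same place.
reads = [
    {"chrom": "chr18", "start": 100, "end": 150, "strand": "+", "mean_quality": 35},
    {"chrom": "chr18", "start": 100, "end": 150, "strand": "+", "mean_quality": 38},
    {"chrom": "chr18", "start": 100, "end": 150, "strand": "+", "mean_quality": 30},
    {"chrom": "chr18", "start": 120, "end": 170, "strand": "+", "mean_quality": 36},
]

flags = mark_duplicates(reads)                        # option 1: mark and keep
kept = [r for r, dup in zip(reads, flags) if not dup]  # option 2: throw away
print(flags, len(kept))
```

Marking keeps all the information in the file and lets later steps decide, while removing makes the file smaller; both start from the same grouping.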
Not everyone is exactly identical to Craig Venter, and nowadays we have done a lot of massive sequencing of different populations. So we know, for example, that if you are of Asian origin you might have a little gap at this point in your genome, while if you are of African-American origin you might have, for example, a single nucleotide polymorphism at this position, right? SNPs and indels can cause reads to fail to align, but because we know the locations of these SNPs and indels, we can use that knowledge. And if you're from Oceania, then you're a mix, right? But yeah, every population in the world has its own SNPs and indels. And we had the HapMap project. The HapMap project was a big project which took thousands of people across the world from different populations and tried to figure out what deletions and insertions we see, and these kinds of things. For example, here you have an indel example. This is the wild-type sequence, so for example Craig Venter, and then some people have a three base pair deletion, so the ATA is deleted. Other people actually have a TGTG insertion. So everyone is a little bit different, but you're not uniquely different, right? Because if you have a little insertion, then you got that from your father or your mother, and of course your father and your mother also got it from someone. So populations share indels and SNPs. By using this knowledge we can say, well, this read was placed here, but it got a very bad score because it has four mismatches to the reference, right? But if we know that these four mismatches occur more often in other people, then of course we can say, well, we're not going to penalize the read for that. So instead of giving it a lower quality score, because we know that this is real we give it a normal quality score, because it is not a mismatch. It's actually something which is biologically real.
So the indel and SNP realignment step is really important to get a good quality alignment of your reads to the reference sequence. So how does this look? For example, here we have an indel realignment, right? We have the reads as they were aligned at first, so before we did the realignment. And we see that the coverage of this region goes down a little bit, which means that there might be a deletion. And if we then know that in the population there is indeed a deletion, right? If we saw before, in the thousands of people that we sequenced, that half of them had a deletion here, then when we look at a single individual before realignment, we see that some reads actually get aligned across it. So the aligner actually finds the deletion in some reads. But because we know that in some people there really is a deletion, we can now say, well, incorporate this knowledge. What you now see is that this SNP is still there, but all of the reads get properly split across the deletion, right? Because normally you use penalizing scores to say, well, if a read doesn't match exactly, then the score gets lower. But what happens here is that, because we have knowledge that there is a real deletion, we can use this knowledge to realign the reads around this known deletion. This really improves your quality, and it also makes it much easier to detect these deletions, because we know that they are real, since we've seen them in hundreds of people before. Base recalibration is the step afterwards. So after we did the SNP and indel realignment, right? We realigned the reads, so we split reads when there's a deletion, and if there's an insertion we just say, well, okay, that's good. But we also have to deal with the SNPs, right?
The idea of base recalibration is that the quality scores in the quality field of each read in the output BAM become more accurate than the originally reported quality scores, that is, closer to the actual probability of mismatching the reference genome. Since we fixed some errors around indels, we need to recalibrate the quality scores. So we realign around the different indels and SNPs that we have, and then in the next step we look at each of the quality scores of each of the bases and fix them if they were realigned. Those are two separate steps. The first step is making sure that the reads know that there is an indel, and then in the second step we have the base recalibration, and there we just say, okay, this base got a very poor quality score, but now we know that it's not a mismatch, it actually aligns, like, five base pairs further along because there's a deletion. That means that we can now say that, no, the score of this base was actually quite good. So all of these pictures that I showed you, like these ones, come from the IGV, the Integrative Genomics Viewer. This is a high-performance, easy-to-use, interactive tool. You can also use it on Windows, which is quite surprising for DNA sequencing software, and it shows you where reads are aligning in the genome. It gives you these pictures where you see the genome on the top, and on the bottom you see, for example, the reference sequence and the genes, and you can visually inspect your alignment to make sure that it is proper and that it looks good, because if you see that there are hundreds and hundreds of mismatches with the reference genome, then you know that something went wrong. So it's very important, every time that you do a step, to check the results of the step, and for that almost everyone in the world uses the IGV.
The IGV is made by the Broad Institute, so you can just click the link here, install it, and look at some example BAM files that they have to see how these things look. It's a very interesting tool, because you can also integrate all kinds of different data sources into it. So it's not just for viewing reads, you can also overlay things like genes and known indels. Then the next step in the whole process is calling the variants, right? So after we've trimmed off the bad part of each of the reads, we've aligned the reads to the reference genome, we've taken into account known SNPs and indels, and at the end we've recalculated the quality scores, then we can actually do indel and SNP calling. So now we can see: does this individual that we sequenced have any SNPs which we did not know before, or does it have, for example, a high-impact mutation which knocks out a gene, or something like that? Again, there are many, many different tools available, but all of these tools will end up producing something which is called a Variant Call Format file, so a VCF file. Each line in this file is a SNP or an indel, and these files are used for follow-up analysis. For example, you can compare multiple samples and see if a certain SNP occurs more often in people who have a certain disease, or these kinds of things. So what does one of these files look like? Well, they look like this. They're just plain text files, they are tab-separated, they have several columns, and they also have a description. This is very poorly visible for me. Can you guys see this, or should I just show you a VCF file of my own? Do I have a VCF file? Yeah, I have a VCF file. I think I will just show you guys one. Let me get you a Notepad++ window so that it's more clear. So these are the answers. We could, for example, look at the Texas mice, but that's a big file.
Let me open up a smaller one. So first off, you have the header, right? This says that this is a VCF file in format 4.2. There's information on how this file was created, so what version of VCF tools was used and what filters were applied. It also tells you which reference file was used, so which genome we aligned to. Then it tells you all of the different contigs that there were, so chromosome one, chromosome 10, 11, and so on. And then of course you also have parts of the genome that were not part of any of the chromosomes. It tells you what the file format is, but in the end it looks more or less like this, right? So here it says: on chromosome 18, at this position, so at about 81.2 million base pairs, the reference allele was a C but we found a T. The quality that we found this with is 447. Then it gives you all kinds of additional information, like the depth, so there were 442 reads at this point. And then if we continue to the end, we see the last columns, and the last columns show you, for each individual that you called your SNPs on, whether this individual was zero slash zero, meaning homozygous reference, or one slash one, so homozygous for the alternative allele, or zero slash one, which is heterozygous. That does not occur in this file, because all of these individuals are inbred mice, so they're all homozygous. If we look at one of these lines in slightly more detail, then this is just additional information which we can kind of get rid of. So let me just get rid of some of this information here. It says here that every field gives you the genotype and then the PL, which is the genotype likelihood, stored Phred-scaled. And so the first individual here was homozygous reference.
The second individual was homozygous reference, and so on, until we find two individuals here at the end which are one slash one, so homozygous for the alternative allele, and here one slash one, so homozygous alternative as well. Each of the lines in these files shows you a single variant. You can see that in this case we just did a very, very small region of chromosome 18. We started at, like, 81 million base pairs, and if we go down we see that we ended up at 88 million base pairs, but we had something like 153,000 little changes between the individuals that we looked at. Not only does it encode SNPs, where the reference is a T and the alternative is a C, but you can also see here that there's a small insertion. The reference here says GTT, so the Black 6 mouse is GTT at this point, but some of the mice in our sample actually had four additional Ts. So this is an insertion of four Ts. Above this we see a deletion, right? The reference at this point was TGTTC, while some of the mice only had a T. So the GTTC part was deleted in some of these individuals, and it will tell you, okay, this is an indel. So this is a little bit more visible like this, I think, than on the slide. But that is what VCF files look like, right? It's just a big description, and then for each position in the genome you have the reference and the alternative, and then for each individual you have numbers saying whether they were zero slash zero, meaning homozygous reference, or one slash one, which is homozygous alternative. Good, so DNA sequencing is key to biology. There are many, many techniques to do sequencing, and there are still more to come. Currently a lot of these techniques are based on the polymerase chain reaction and amplification of DNA. There is a complex computational pipeline which you have to put at the end, and something that is already becoming possible is single-cell sequencing.
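Coming back to the VCF layout shown above: because these are plain tab-separated text files, a data line is easy to take apart by hand. The Python sketch below is only an illustration; the example lines are made up to resemble the ones described in the lecture (the exact position and the PL values are invented), and for real work an established VCF library is usually a better choice than hand-written parsing.

```python
def classify_variant(ref, alt):
    """Tell SNPs, insertions and deletions apart by allele length."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"


def parse_vcf_line(line):
    """Split one VCF data line into its fixed columns and per-sample genotypes."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, qual, _filter, info, fmt = fields[:9]
    keys = fmt.split(":")  # e.g. ["GT", "PL"]
    genotypes = [dict(zip(keys, sample.split(":")))["GT"] for sample in fields[9:]]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        "qual": float(qual),
        "info": info,
        "type": classify_variant(ref, alt),
        "genotypes": genotypes,
    }


# Made-up example line: chromosome 18, C -> T, three inbred mice.
line = "18\t81200000\t.\tC\tT\t447\t.\tDP=442\tGT:PL\t0/0:0,30,255\t0/0:0,27,255\t1/1:255,33,0"
record = parse_vcf_line(line)
print(record["type"], record["genotypes"])

print(classify_variant("GTT", "GTTTTTT"))  # four extra Ts
print(classify_variant("TGTTC", "T"))      # GTTC removed
```

Note that insertions and deletions are recognised purely from the lengths of the reference and alternative alleles, which is how the GTT and TGTTC examples from the file can be told apart.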
So you take only a single cell and sequence the genome of that. And nowadays single DNA molecule sequencing is also becoming more and more possible, where you just sequence a single molecule of DNA instead of sequencing a bunch of them at the same time. All right, so a few words about genes and gene structure, right? Because in the last lecture we talked a lot about what a gene is: a gene is a unit of inheritance. But in this lecture I want to show you what a gene is in molecular biology. A gene in molecular biology is defined as a union of genomic sequences encoding a coherent set of potentially overlapping functional products. So that's a very different definition from the unit-of-inheritance definition. I will try to explain what they actually mean by this by showing you some examples of different gene structures. So let's look at prokaryotes, right? Let's look at bacteria. Bacteria have something which is called a polycistronic operon. Here we see the DNA of a bacterium in the middle line, and you see all kinds of different colored blocks. What we see is that this gene in bacteria is actually coding for two different proteins, right? So we have protein number one and protein number two. So how does this look in bacteria? We have something which is called a polycistronic operon, or just an operon. In front of the operon there is generally a regulatory sequence, and after it there is also a regulatory sequence. This sequence determines when a gene will be transcribed. There are enhancers and silencers, and there are operators, like, if this other gene is transcribed, then you don't need to transcribe this one. So these regulate how much messenger RNA is produced. Then we have a section of the DNA which is called the five prime UTR, which is the section of the messenger RNA which does not get translated into protein.
We then have something which is called an open reading frame. This is the protein-coding sequence, right? So three base pairs encode one amino acid. And then, because it is a bacterium, a lot of the genes sit under one control region, right? There's one control, and if this thing says, well, you need to switch on and you need to make 10 copies, then 10 copies of the first protein are made and 10 copies of the second protein are made. In bacteria, what happens is that this whole part gets transcribed as messenger RNA. So everything from the five prime UTR all the way down to the three prime UTR, which is part of the regulatory sequence behind the gene, gets transcribed onto a single messenger RNA. In this case it's two proteins, but it can be up to 10 or 15. And the reason why bacteria do this is just because it's cheap: you only have to make one messenger RNA, and this one messenger RNA can produce, like, five or 10 different proteins. So the genome of bacteria generally groups together genes which are also needed together, right? If there is food and we need to move forward, then we don't need to produce one protein, we need to produce, like, 10 different proteins to start moving forward. By putting them all under a single control, we can transcribe them all in one go. Of course, this has the drawback that you cannot regulate the individual gene levels, but there are other ways that bacteria do this. So the RBS is the ribosome binding site. The ribosome is the machine that makes proteins: it binds messenger RNA and then starts making proteins. So there are multiple binding sites for the ribosome on a single messenger RNA. That's just how the gene structure works. And here we can see that it's multiple, overlapping, and coherent, so it forms a single unit. In bacteria, this whole thing is called the gene.
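The two ideas here, three bases per amino acid and several open reading frames on one messenger RNA, can be shown with a small Python sketch. It assumes the standard genetic code and works on a made-up toy sequence written in DNA letters; the scanning rule (start at ATG, stop at the next in-frame stop codon, then continue searching after it) is a simplification for illustration and not a real gene finder, which would also look at ribosome binding sites and both strands.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(orf):
    """Read the sequence three bases at a time until a stop codon ('*')."""
    protein = []
    for i in range(0, len(orf) - 2, 3):
        amino_acid = CODON_TABLE[orf[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def find_orfs(sequence):
    """Collect every ATG...stop stretch, resuming the search after each stop."""
    orfs = []
    i = 0
    while i <= len(sequence) - 3:
        if sequence[i:i + 3] == "ATG":
            for j in range(i, len(sequence) - 2, 3):
                if CODON_TABLE[sequence[j:j + 3]] == "*":
                    orfs.append(sequence[i:j + 3])
                    i = j + 3
                    break
            else:
                i += 1  # no in-frame stop found for this ATG
        else:
            i += 1
    return orfs


# Toy polycistronic transcript: two reading frames on a single molecule.
operon = "GGAGG" + "ATGAAATTTTAA" + "CCGGAGGCC" + "ATGGGCTGCTAG" + "TT"
orfs = find_orfs(operon)
proteins = [translate(orf) for orf in orfs]
print(orfs, proteins)
```

One transcript, two separate protein products: that is the polycistronic arrangement in miniature.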
So the whole polycistronic operon, the regulatory sequences, the 5-prime UTR, the open reading frames, everything here is part of the gene. So it's not a single unit of inheritance, it's actually a whole region of the genome which has a single function, or which codes for proteins that together perform a single function. Slightly different is the eukaryotic gene structure. The eukaryotic gene structure generally has no operons, so every gene comes with its own regulatory sequence at the start and its own regulatory sequence at the end. It has enhancers and promoters and silencers and, again, a 5-prime and a 3-prime UTR, so a 5-prime UTR with the ribosomal binding site. What happens in eukaryotes is that you again have something which is called the open reading frame, right? And the open reading frame here consists of introns and exons. Only the exons code for the protein, while the introns get removed. So the DNA is transcribed into messenger RNA, then it gets transported to these splicing speckles on the side of the nucleus, so on the inside of the cell nucleus, and there post-transcriptional modification is done to cut the introns out of this gene. Then the open reading frame is turned into the protein-coding region, and at this point a poly-A tail is also added to the gene structure. This thing is then transported into the cytoplasm, where the ribosomes start translating the mature messenger RNA into a single protein. The poly-A tail is there to modify how many protein copies are made. At this point we already made a messenger RNA, but the nucleus can still decide, well, I need 50 copies of this protein, or I need 500. So the length of the poly-A tail is a signal for the ribosome to make more copies of the same protein.
We have a five prime cap which protects the messenger RNA, because it needs to be transported, and that is not the case in bacteria. In bacteria, the whole machinery for the transcription of the mRNA and the translation into the protein is all attached to the DNA, because bacteria do not have a nucleus. So this whole machinery, all of these proteins, just binds around the DNA, and a strand of protein comes out which is then folded into the eventual protein, or even multiple proteins. But since eukaryotes have a nucleus and proteins are produced outside of the nucleus, we need a cap which allows for transport, and we need a poly-A tail so that the mature messenger RNA is not broken down during transport from the nucleus to the cytosol. So the exons code for the proteins. Why exactly we have introns is unknown, but there is some indication that introns are a counting mechanism for the cell, so that the introns are used to keep track of how many copies of a certain protein were produced, because every protein has its own unique intron, and the intron is actually a piece of RNA which can bind other, well, non-mature messenger RNAs. So there is a level of regulation in eukaryotes which is not found in prokaryotes. One of the things which can also happen is that not all of the exons are kept, right? You could imagine in this case that the exon here in the middle would also be cut out. So this is a way for a cell to have a single gene which can produce multiple different proteins. By mixing and matching different exons together, you can make different variants of the same protein. On average, I think every gene in the human genome can code for something like four to five proteins. So although humans only have about 20,000 genes in the genome, we can make up to 150,000 different proteins.
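The mixing and matching of exons can be sketched in Python as cutting the introns out of a transcript and joining the remaining pieces. The sequence, the exon coordinates, and the rule that the first and last exon are always kept are all assumptions made for this illustration; real splicing is regulated in far more detail.

```python
from itertools import combinations


def splice(pre_mrna, exons, keep=None):
    """Join the chosen exons; everything between them (the introns) is dropped.

    exons is a list of (start, end) positions, end not included;
    keep lists which exons to include (default: all of them).
    """
    if keep is None:
        keep = range(len(exons))
    return "".join(pre_mrna[exons[i][0]:exons[i][1]] for i in keep)


def isoforms(pre_mrna, exons):
    """All variants that keep the first and last exon and skip any
    subset of the internal exons (a simplifying assumption)."""
    internal = range(1, len(exons) - 1)
    variants = []
    for size in range(len(internal), -1, -1):
        for kept in combinations(internal, size):
            keep = [0, *kept, len(exons) - 1]
            variants.append(splice(pre_mrna, exons, keep))
    return variants


# Toy transcript: exons in upper case, introns in lower case.
pre_mrna = "AAAA" + "gtccag" + "CCCC" + "gtttag" + "GGGG"
exons = [(0, 4), (10, 14), (20, 24)]

mature = splice(pre_mrna, exons)
variants = isoforms(pre_mrna, exons)
print(mature, variants)
```

With one internal exon there are two possible products; every additional internal exon doubles that number under this simple rule, which gives a feel for how a limited number of genes can yield many more protein variants.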
And this is because at this post-transcriptional modification step, exons can be mixed and matched together to form different variants, right? You can imagine that this part might code for something which binds DNA, this might code for something that binds an iron molecule, and this part here codes for signal transduction. So then you could have something which binds DNA and then binds an iron molecule, or you can have something which binds DNA and has the signal transduction part. So exons can be mixed and matched together in eukaryotic cells, and that is because of post-transcriptional modification. We will talk a lot about post-transcriptional modification in the next lecture, when we start talking about RNA and messenger RNA and how this exactly works, because we know exactly how this process works. However, the gene structure of prokaryotes is different from that of eukaryotes. The main difference is that prokaryotes tend to code multiple proteins on the same messenger RNA, while eukaryotes almost always have one protein encoded by one messenger RNA, but a single messenger RNA can be modified to encode different variants of the same protein. And of course the intron-exon structure is something which is unique to eukaryotes. So there are many difficulties when defining a gene, right? There are even things like RNA-based inheritance, where RNA is passed on from father to offspring or from mother to offspring. Not just that, but there are things which are called gene fusions, where genes next to each other produce a single mRNA, leading to a fusion protein, right? So just like in bacteria, sometimes the polymerase doesn't stop where it needs to stop, but it just transcribes two genes onto a single messenger RNA, which then gives two proteins fused together.
Not just that, but there are also things like interchromosomal promoter regions, where the promoter part of the gene is actually not located anywhere near the gene but on a completely separate chromosome. There are also cases where exons are located on different chromosomes. So with this intron-exon structure, it might actually be that part of the gene is located on the autosomes and another part of the gene is located in the mitochondria. So genes are very hard to define. It is very hard to define biochemically what a gene is. In theory it's very easy: it's a piece of DNA which codes for a protein. But because of the way that DNA works, and all of the regulation that is there, and because you have the transcription and then the translation step before the protein is made, it is really hard to define exactly what a gene is. What we know is that a gene codes for multiple protein domains, and a protein domain can be an alpha helix or a beta sheet, a leucine zipper, it can be immunoglobulin-like, it can be a zinc finger. A lot of these protein domains can actually be detected in silico, so we can use bioinformatics to look at a DNA sequence and then predict: where will the gene start? What is the five prime UTR? Where will the translation of the protein start? Which parts are actually exons and which are introns? So a lot of bioinformatics work is looking at DNA and then figuring out how this DNA makes a certain protein and where these regions of the protein are located inside of the DNA. We can detect genes ab initio using known signals. For example, in prokaryotes we can generally look at known promoter sequences, but it is very difficult in eukaryotes: if I just give you a DNA sequence, it is very hard to say there's a gene, right? So normally what people use is things like machine learning.
So you show a computer a whole bunch of examples of how genes look and what their sequence is, and then of course these machine learning algorithms learn what a gene kind of looks like in general. If you then give them a new piece of DNA, they can do predictions and say, well, it is likely that a gene starts here and ends here, that this is the intron and that is the exon. One of the main things that we do is use comparative gene finding. So if we want to know if there is a gene in a certain part of DNA, then what we do is we take this piece of DNA and we compare it to other species, because for humans we know more or less where all of the genes are located. For mice it's the same, but if I am studying some kind of lizard that lives in the rainforest somewhere, then of course the chances that someone has the whole genome sequenced and has all of the annotation, so knows exactly where each gene is, are very small. And so comparative gene finding uses two different species and compares the genomes of these species, one of which is really well annotated, for example mouse or human, to see if we can figure out where in this rainforest lizard a certain gene starts and where a certain gene ends. To make it even more difficult, we have something which is called jumping genes. So jumping genes are amazing. I love them a lot. They are one of the main things that plants have. Well, humans have jumping genes as well, but in our genome they're generally kept silent, because you don't want pieces of DNA jumping around. But in plants they are very common, because a plant is very susceptible to inbreeding. A plant cannot simply move somewhere else, and because it cannot move somewhere else, it has a high likelihood of mating with one of its children or one of its parents. And inbreeding is bad because it reduces the amount of variation that you have.
So to kind of counteract this, plants have things which are called transposable elements, so jumping genes. So these are pieces of DNA which jump from one part of the genome to another part of the genome, and they can, for example, jump into a gene and then disrupt the host gene. And some transposons jump, some copy themselves, but jumping genes and transposable elements are one way for plants to counteract the inbreeding that goes on because they cannot move. So because they cannot move, they have a high chance of inbreeding, and to counteract that they have genes in their genome which are able to jump around and silence other genes and just generate variants in general, by inserting themselves at different positions or copying themselves. So they were discovered in the 1940s by Barbara McClintock. She's actually one of my favorite biologists, and she got a Nobel Prize in 1983 for this discovery. So again, the jumping genes themselves were discovered before we actually knew that DNA was a helical structure. Of course people knew about nuclein, they knew about the As, Cs, Ts and Gs, and the discovery of the transposable elements themselves is a very interesting story. So definitely check out Barbara McClintock if you wanna know more about this really weird phenomenon where DNA jumps around and copies itself, and how they figured out that it existed. All right, so we've been going for almost another hour. I have 11 slides left, so I am going to take another short break. I will definitely make sure that you guys have new animated GIFs and less holiday-like music, I think, because I just downloaded a whole bunch of music. So for now, for the people watching it on Moodle or the people watching it on YouTube, I will say goodbye and until next time.