Okay, so the introduction to variant analysis. We will start at a relatively low level, really at the beginning, probably with some refreshers from your introductory genetics course at university, and from there on it will become more specific to next generation sequencing.

So why would you study variants? In general there are usually two reasons for that, and they are related to each other. One of them is to find causes for phenotypic variation, and then mainly phenotypic variation that is caused by variants in the DNA, so inherited phenotypic variation. For example, one thing you can do is a genome-wide association study. You measure phenotypes in a population, and you also do a variant analysis on, for example, the whole genome or whole exome of that species, and then you try to find associations between the presence and absence of variants and the phenotypes. You find correlations between a variant and a phenotype, and if you are lucky, or if you have good coverage of your entire genome, you might actually find the causal variant.

The second reason, which is again related to the first and is mainly an application in, for example, ecology, is to understand relatedness between individuals or between species, and to answer questions about evolution or other ecological questions. So for example, with variant analysis you can make such a nice phylogenetic tree.

So variation has to come from somewhere, and usually, or actually always, variation in the DNA is caused by a mutation. A mutation is an event that occurred at some point and then ended up in the reproductive cells of an organism, and then it became a germline mutation, so an inherited mutation. Many mutations of course do not have an effect on the phenotype at all, because they are, for example, not in genes, or they do not affect the amino acid sequence encoded by the gene, but some of them do. Here's an example of a mutation that had an effect on the phenotype.
This is actually a picture from my previous work, where I worked on flowers. There's an important gene called CCD4a that turns yellow carotenoids into colorless apocarotenoids. If that gene is mutated, the conversion does not happen anymore and the flower turns yellow. So that's a mutation that has an effect on the phenotype, but many mutations of course do not.

So what causes mutations? One cause is repair mistakes during mitosis, where for example there was a break in the chromosome and it is repaired, but not exactly according to what the sequence of that chromosome had been before. You can have unbalanced cell division, and that often has bigger effects on the DNA of cells, so for example you can gain or lose entire chromosomes or chromosome arms. Other causes can be transposable elements. Transposable elements are also known as jumping genes; they can really move to different places in the genome and, for example, jump into a gene that has an effect on the phenotype, and then you see the effect of active transposable elements. The very famous example of transposable elements is the colored kernels in corn. And of course the occurrence of mutations can be increased by different environmental effects. We all know that UV, for example, can cause more mutations, probably because of its effect on these repair mistakes.

So there are basically two types of mutations. One of them will be the topic of this course: the germline mutation, which means that a mutation has ended up in the reproductive cells of an organism and then becomes inherited.
So of course that's also how evolution works, right? Mutations that increase fitness are more likely to be passed on to the next generation than others. We also have mutations, or actually I should say variants, that occur only in a particular part of an organism and do not end up in the reproductive cells. We call that somatic variation. But of course somatic variants can also become germline mutations if they end up in the reproductive cells at some point.

So I have a question for you, and for that we go to Vevox again. There we go. For the people who have just joined: you can go to vevox.app and type in the ID over here, or scan the little QR code on the left. The question is: if we look at that mutation in the flower, what kind of mutation has caused the flower to turn yellow? Or actually, a better question would be what kind of variant this is.

Okay, I think most of you have answered. Well, you generally kind of disagree with each other, let's put it like that. Most of you said it's a germline mutation, some of you said somatic, and some of you said both. I guess I would agree most with the people who answered both, because it is indeed a mutation that occurred only in a part of the organism. So it is of course a somatic variant, because part of the individual has the mutation and other parts do not. However, it is in the flower, which contains the reproductive organs of a plant, so it can be passed on to the next generation, and then it can become a germline variant. So I would agree most with the people who said both, but of course in principle all answers are correct. Do you have a question?
No, I was just curious: if it went to the germline of the plant, or rather a reproductive organ, wouldn't it then only be a germline mutation in the next generation, so once that becomes a plant? I'm just... Yeah, I guess it depends a bit on... I guess it's semantics, isn't it? Yeah. No, but I agree, that's a good point, because only if it is inherited does it become germline. You're right. So if this plant would not reproduce, then it hasn't been inherited, so I guess then it's not a germline mutation, right? Okay, back to the presentation.

So, some definitions. I think it's important to have some definitions. It might be a little bit boring, but then we know what we are talking about. I have been using mutation and variant a little bit interchangeably in the previous slides, but they are different things of course. A mutation is the change, so the event of change that has occurred. A variant is something different, because that is any difference that exists between DNA sequences. That's usually what we measure. So if we measure variants, we actually measure something that is there after the mutation has occurred; we are kind of doing forensics. If we measure variation, we know that it has been caused by mutation, but we are not measuring the actual mutation.

Then what a lot of people use is polymorphism, or for example single nucleotide polymorphism, SNP. Typically, especially in human genetics, a SNP, or a polymorphism in general, is a specific kind of variant, because it is variation that is common in a population. Typically you are then talking about an allele frequency of more than one percent, so more than one percent of the chromosomes in the population should carry that polymorphic allele.
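To make the one percent rule of thumb concrete, here is a minimal Python sketch. It is not part of the course material. It assumes diploid genotypes coded as the number of copies of the alternative allele (0, 1 or 2), and it assumes the threshold is applied to the rarer of the two alleles; the function names and the example population are made up for illustration.

```python
def allele_frequency(genotypes):
    """Return the alternative-allele frequency among all chromosomes.

    Each genotype is the number of alternative-allele copies (0, 1 or 2)
    carried by one diploid individual (an assumed coding for this sketch).
    """
    if not genotypes:
        raise ValueError("need at least one genotype")
    n_chromosomes = 2 * len(genotypes)  # diploid: two chromosomes per individual
    return sum(genotypes) / n_chromosomes


def is_polymorphism(genotypes, threshold=0.01):
    """Apply the 'more than one percent' rule of thumb to the rarer allele."""
    freq = allele_frequency(genotypes)
    minor = min(freq, 1 - freq)  # assumption: the threshold applies to the minor allele
    return minor > threshold


# Hypothetical population: 100 individuals, 3 of them heterozygous carriers,
# so 3 alternative alleles on 200 chromosomes.
population = [1] * 3 + [0] * 97
print(allele_frequency(population))
print(is_polymorphism(population))
```

The same variant could pass this check in one population and fail it in another, simply because the genotype list changes.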
However, variant versus polymorphism can be very problematic, even in human genetics, because it depends very much on the population. A variant can be a polymorphism in a European population but not in an African population, for example. Therefore we typically talk more and more about variants, because it is very difficult to define whether or not something is a polymorphism under this definition.

Okay, then some concepts that you have probably learned in your genetics 101 at university, but I think it's nice to just repeat them and make sure that everybody is on the same level. DNA occurs in a nucleus. We are talking about diploids here, and we will mainly focus on diploids during this course, but a lot of the same concepts hold for polyploids, and maybe a little bit less for haploids, because that is a simpler case. So let's say we have an organism with a nucleus and it has two chromosomes, and for each chromosome there is a pair, which of course makes it diploid. We have chromosome one, which is a smaller chromosome, with two homologs, A and B, and chromosome two with two homologs, A and B. These two homologs can pair during meiosis, and recombination, or crossing over, can occur between those homologous chromosomes.

Those homologous chromosomes have a very similar sequence, but very often not an identical one, and if there is a difference between them we call that a heterozygous variant. That would look like this. Of course it's very much enlarged. Let's say we have a variant between our two homologs, and we have an allele A and an allele B; allele A would be the blue one and allele B the grayish one. We call that a heterozygous variant because it is different between the two homologous chromosomes.
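As a small illustration of that definition, the sketch below compares two homologs position by position and reports where they differ. It is a toy example under strong assumptions: the two sequences are invented, they have the same length, and only substitutions are considered (no insertions or deletions). The function name is hypothetical.

```python
def heterozygous_sites(homolog_a, homolog_b):
    """Return (position, allele_a, allele_b) for every site where the homologs differ.

    Positions are 0-based. This sketch assumes both homologs have the same
    length, so it only finds single-base differences.
    """
    if len(homolog_a) != len(homolog_b):
        raise ValueError("this sketch only compares homologs of equal length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(homolog_a, homolog_b))
        if a != b
    ]


# Two made-up homologous sequences that differ at a single position
homolog_a = "ACGTTAGC"
homolog_b = "ACGTCAGC"
print(heterozygous_sites(homolog_a, homolog_b))  # [(4, 'T', 'C')]
```

In practice the two homologs are not available as separate sequences like this; they have to be inferred from sequencing reads.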
You can also have a homozygous variant, depending of course on what you call a variant. Most of the sequence is the same on the two homologous chromosomes, but if there is a variant that differs between individuals, we can say that we have alleles that differ between individuals. Those alleles can occur on both of the two homologous chromosomes, and then we call that a homozygous variant, because the same allele occurs on both homologs.

Then you can of course have many variants, and how many depends on the organism. For example, in this flower I worked on, we had a variant every 30 base pairs. In human I think it's one every 10 kb on average; correct me if I'm wrong, maybe somebody knows that number, but I don't know it by heart. So how many you have depends very much on the organism.

The different versions of a variant we call alleles. So we have two different alleles in the heterozygous variant, and only one type of allele in the homozygous variant. If we focus on alleles, we can also check whether or not they occur on the same chromosome, so on the same homolog. If they do not occur on the same homolog, we say those two alleles are in repulsion. If we focus on these two alleles instead and they are on the same chromosome, they are in coupling phase. Repulsion and coupling phase can be important if we focus on inheritance, because if two alleles are close to each other and in coupling phase, it is likely that they are inherited together. If they are in repulsion phase, it is unlikely that they are inherited together.

Now we look at blocks that are typically inherited together, so alleles that are close to each other on the same chromosome. This is of course a little bit enlarged, so in this picture it looks like we could have quite a few recombinations between the alleles, but let's say they're very close together. Then we
talk about haplotypes. A haplotype is a set of alleles that are in coupling phase, so on the same chromosome, and that are typically inherited as one block. Haplotypes can be very relevant for all kinds of genetic analyses.

That's it regarding those concepts. It might have been a little bit quick for some, and for others it might have been: okay, yes, I know that, because I had this at university. Are there any questions regarding this? If not, then I continue with a question, because if there are no questions you have understood everything, so let's check whether everybody understood it.

Based on this: if two alleles are in repulsion, in a single individual of course, so the two alleles are on different homologous chromosomes, can both alleles be inherited by the same individual, so end up in one cell? It's completely anonymous, so if you do not know, that's also perfectly fine; you can just give your best guess.

Okay, all of you have answered. That's right, most of you are correct, or at least I agree with most of you. The answer is yes, but only if there is a crossing over event between the two alleles. Let's go back to the presentation. There we go. So repulsion would be this, right? If we focus on these two alleles, they are in repulsion, and in principle, if there is no crossing over between these two during meiosis, then of course they will not be inherited together. It is very unlikely that there is a crossing over between two alleles if they are very close to each other, because the crossing over location needs to be at a very specific place in order to fall between these two alleles. But if they are further apart it is much more likely, and then a crossing over can actually cause these two alleles to end up on the same chromosome that is inherited by the cell. So yes, it's true, but only if there's a
crossing over. The second most common answer is no, because they are on different homologs and only one homolog will be inherited. Well, that is true, but if there is a crossing over they can still be inherited on the same chromosome. So basically genetics 101, just as a refresher, so we're all on the same page.

These variants are important, I guess, because otherwise you wouldn't be here, and therefore it's also important to detect them. Of course next generation sequencing hasn't been around for very long. Only since the 70s have we been able to sequence DNA at all, and of course not at very high throughput. So how did people do that before? Typically through phenotypic analysis. They saw that phenotypes were inherited, so certain characteristics were inherited in a certain way. A very important example is of course the experiments that Gregor Mendel did, where he came to the conclusion: okay, I see variation in these plants and it is inherited somehow, so there must be variation in the inherited material between those plants. So that's already the first evidence for genetic variants.

Well, a lot of people have worked in genetics since then, and basically what they worked on was phenotypic analysis. For example, what they also saw is that some phenotypes were inherited together more often than others, so you can say something about linkage, so whether two variants are on the same chromosome, for example. But genetic analysis and genetic research really got a boost after people could do actual molecular analysis, so actually visualize variants in the DNA. They started, for example once PCR was invented, with ways to cut the DNA only at places where a certain allele was present, and not cut when that allele wasn't there, so you could actually visualize variants. But of course nowadays people do not do a lot of variant analysis in that way anymore. If you're interested in variants, what you typically do is next generation sequencing,
so you actually sequence the DNA and do bioinformatics analysis in order to detect these variants, and that's what this course is about. The big advantage of next generation sequencing for detecting variants is that it is just very, very high throughput. You do not have to run, for example, a single gel for one mutation. In principle, with a single experiment, which is not too costly anymore, you can find and detect all the variants in an entire genome.

So what kind of variants do we have? A very important one, a very frequently used one, is the single nucleotide variant. It's also the easiest one to work with, because it is only a single change in the nucleotide sequence of the DNA. So let's say our reference DNA has an A and the alternative DNA has a T over here. I'm sure all of you have seen a single nucleotide variant before. Then we have the insertion or deletion, or the indel, where a part of the DNA sequence exists in one chromosome and does not exist in the other chromosome. These are all relatively small, and what they have in common is that they are relatively straightforward to detect with next generation sequencing, and with short reads. Of course they can also be bigger.

A bit of a disadvantage is that within a population these variants, especially the SNPs or SNVs, the single nucleotide variants, are mostly biallelic, meaning that there are only two alleles, two versions, of an SNV in a population. However, genetic variation is often multiallelic, meaning that, for example, you have multiple variants of a gene, and different gene variants can have a different effect on the phenotype. That is very difficult to capture with a single SNV. However, as we have just learned, we can also look at how alleles are phased on a chromosome. So let's say in a
population, if we focus on two variants and look at which combinations of alleles occur in the same phase, we could in theory find four different versions. Let's say we have an A/T variant at the first position and a T/C variant at the second, so in theory we can have four different haplotypes there, and then all of a sudden this becomes multiallelic instead of biallelic. That can really increase the power of your genetic analysis. Let's say those variants are in a single gene, for example, and let's say the reference is TC. Only if they are both mutated, so if we are looking at the minor allele for both, with AT, then that might cause major symptoms in a disease. And maybe if you only have the A/T variant mutated, you might have minor symptoms. That can have all kinds of implications for, for example, what kind of treatment you will use if you detect these variants, and so on. So focusing on these haplotypes can be very relevant, and the concept of haplotypes is also something we will look at during this course, because it is quite important. Is that clear? Great.

Then I have a question, and for that we go back to Vevox again. Okay, so if we have three biallelic SNPs or SNVs that are close to each other, approximately how many haplotypes can we possibly have in our population? In order to understand what I mean, I go back to the presentation. So basically we have this, right? We have three biallelic SNPs and they occur close together. How many different haplotypes can we have in a population? That can be a human population or any other population; which organism it is doesn't matter.

Okay, most of you have answered, so I stop the poll. There we go. It will be two to the power of three, so I would agree with most of you here. And why is that? Back to the slides. Let's say the easiest would be the haplotype, so the alleles that occur on a single
chromosome, which would be AGT. But another possibility would be AGC, and then we can have other combinations as well. If you know a little bit about calculating probabilities, you will know it will be two times two times two, so there are eight different possible haplotypes. Is that clear? Awesome, great, perfect. Let's continue.

In addition to short variants, which are relatively easy to measure, especially with next generation sequencing, there are also large variants. We know more and more about those now that we are able to sequence more genomes with long reads and get an idea of the pangenomic variation. Typically those variants involve changes of more than one kb, so more than a thousand base pairs. Among those is, for example, copy number variation, where a certain sequence occurs more often or less often in the genome; these can be genes or repeats. You can have a translocation, meaning that a part of the genome ends up somewhere else. An inversion means an entire part of the genome is inverted on a chromosome. And of course there are very big deletions and insertions. In principle they are not very different from the indels we can measure with next generation sequencing; the only difference is that they are just larger.

Even bigger, there can be chromosomal aberrations, meaning that some chromosomes occur more often than others. A very well-known example is trisomy of chromosome 21, related to Down syndrome. And actually, the mutation over here is also a chromosomal aberration, because this plant that we have seen in the earlier example is hexaploid, and that hexaploid plant, or at least this cultivar, has only one copy of the gene that converts the yellow compound into the colorless one. If that chromosome isn't there anymore, the plant is still completely fine, but it just turns out yellow.

So, to conclude this presentation: in this course we focus on
inherited small variants, so we won't be talking a lot about large variants or structural variation, and on detecting them with next generation sequencing. We focus mainly on Illumina sequencing, but we will of course also talk about long-read sequencing, with for example PacBio and Oxford Nanopore technology.

Basically, what it means is that we are in a situation where we have sequenced one or more individuals with Illumina. We align those reads to a reference genome, which is depicted over here, to find the most probable location of each read on the reference genome. Then we look at differences between our reads and the reference genome. Usually we sequence our genome multiple times, so we have a coverage; over here we see a coverage of about 20 reads. If a difference occurs in many of our reads, we say: okay, we are confident that this individual actually has that mutation. Then we call that a variant, and we use it for downstream analysis.

For example, over here it is very likely that we have a variant, because all of the reads are different from the reference: they all have a C where the reference has a different base. So this is probably a homozygous variant, because this is a diploid individual. Right next to it, it is very likely that we have a heterozygous variant, because some reads have a T, which is different from the reference, and some reads have a C. Those are not depicted in a different color, so they are the same as the reference. So we have a T/C SNP over here, and it is likely to be heterozygous. We can also have insertions and deletions, which are depicted over here. Over here there is a part of the reference that is skipped in the reads, and here a part of the reads has something extra compared to the reference.

We will detect those variants with the GATK pipeline. That of course starts with aligning reads to a reference genome, also called mapping. Then we do
some additional computations on those alignments to make them ready for variant calling. After that the actual variant calling takes place. There we look for places that are different from the reference and calculate likelihoods of how likely it is that an actual variant is there, and at the same time we also calculate the haplotypes. We combine all of the individual variants into a single VCF; you've probably heard of the VCF, the variant call format, before. Then we filter the variants based on how likely it is that a variant is true or not. After filtering we continue with variant annotation, in order to see whether we have a variant in there that might have an effect on, for example, the amino acid sequence of a protein, which in turn might have an effect on the phenotype.

You will see this image more often, especially at the end of slide decks, so I can show you where we actually are in this pipeline. Today we will align reads. We will learn how alignment works and how alignments are stored, what a BAM file is for example, and we will try to find duplicates in our reads, because usually you want to get rid of them. Why that is, you will learn later on. All right, this was the presentation.
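The read-counting idea from this presentation (look at what the aligned reads say at one position and compare that with the reference) can be sketched in a few lines of Python. This is a deliberately naive illustration and not how the pipeline used in the course works: a real variant caller calculates genotype likelihoods, whereas this sketch uses fixed fraction cutoffs. The function name, the cutoffs (0.2 and 0.8) and the minimum depth of 8 reads are all assumptions chosen for the example.

```python
from collections import Counter


def naive_genotype(ref_base, read_bases, min_depth=8, het_low=0.2, hom_high=0.8):
    """Guess a diploid genotype at one position from the bases seen in the reads.

    ref_base   : the reference base at this position
    read_bases : the bases of all reads covering this position, e.g. "CCCTTC"
    The thresholds are illustrative assumptions, not values from any real tool.
    """
    depth = len(read_bases)
    if depth < min_depth:
        return "no call"  # too few reads to say anything with confidence

    counts = Counter(read_bases)
    alt_counts = {base: n for base, n in counts.items() if base != ref_base}
    if not alt_counts:
        return "homozygous reference"

    # Take the most frequent non-reference base as the candidate alternative allele
    alt_base, alt_n = max(alt_counts.items(), key=lambda item: item[1])
    alt_fraction = alt_n / depth

    if alt_fraction >= hom_high:
        return f"homozygous variant ({ref_base}>{alt_base})"
    if alt_fraction >= het_low:
        return f"heterozygous variant ({ref_base}>{alt_base})"
    return "homozygous reference"  # a few mismatches are treated as sequencing errors


# Made-up pileups with a coverage of 20 reads
print(naive_genotype("T", "C" * 20))            # all reads differ from the reference
print(naive_genotype("C", "C" * 11 + "T" * 9))  # roughly half of the reads differ
print(naive_genotype("A", "A" * 19 + "G"))      # a single mismatching read
```

Fixed cutoffs like these break down at low coverage or with biased sequencing errors, which is one reason real callers work with likelihoods and then filter the resulting variants.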