Start recording. Ah, they switched the male homozygous with the female wild type. Okay, yeah, we will look at that in the box plot. Interesting. I did the assignments this morning, since I have to redo them every year because the database is updated, and I also found that there was something really hard to find which used to be very easy to find on the website. It's difficult, right? You make the assignments and I check them quickly beforehand, but sometimes the database has simply changed. We'll definitely look at that.
All right, let's start then. I hope everyone's here. Today in bioinformatics we will talk about DNA: everything you wanted to know about DNA and perhaps a little bit more. Let's make it an interesting lecture. If you have any questions, just throw them in chat.
So this is the idea of what we're going to do today. We'll do the assignments from last week, of course. Then we will do a little bit of the history of DNA and of sequencing. I will go into great detail about DNA sequencing, so things like alignment, but also all of the things that are required when you do DNA sequencing and which technologies there are. We'll talk about genes, since that's very closely related to DNA and how things are encoded on the DNA: the structure of genes and how this differs between bacteria and multicellular organisms, so the difference between transcription in prokaryotes and eukaryotes. I will talk a little bit about transposons. I really love transposons. Barbara McClintock, who discovered transposons, is one of my favorite scientists, and I just like talking about them. Unfortunately, they're not that interesting when you talk about animals, but when you talk about plants, transposons matter a lot.
So in plants, they're one of the main ways of generating variation in the DNA. We will talk about regulatory elements briefly. There's a whole bunch of them, so I won't bore you by going through all 50 different types of regulatory elements, but I want to highlight some which are important. We'll also talk about some other types of DNA, mostly the mitochondria and the chloroplast, since it's, well, plants and animals. As you can see, I've cleaned part of the board because I'm going to need it for the assignments today. And we'll talk about biomarkers, although I might have dropped that: I was working on the presentation and I think we had a few too many slides, so I decided to drop a couple of slides and move some around.
Karol in Cuckoo, welcome to the stream, however you want to pronounce your name. Thanks for watching.
So these are the topics for today. But first, as promised, we will have the answers to the previous assignments. We'll just go through them quickly, I hope. I think I should start with the phenotypic databases and then do the drawing of the different crosses, so the two-point cross and the three-point cross. That will take a little bit of time.
Who was it, testosaurus 3k, who said that there was an error in the IMPC database? Let's switch to the IMPC database and see if that is correct. So here we see the IMPC database. I'm on the about page. The first question was: how many protein-coding genes does a mouse have according to the IMPC database?
Olexander, welcome to the stream. Also hi to you.
Thanks for watching. So I was clicking around on the website and I found it this morning, and then I wanted to click on it again and show you where you can find it, but I couldn't. I think I did something like search for Pik3ca, because that's their gene of the month, and then you get one gene. I clicked on it, and I don't know exactly where I found the total number of genes in the IMPC database. The funny thing is that it actually changed since last year. Last year when we did this they said that there were 22,955 genes, and when I checked this morning there were only 22,924 genes. I don't know exactly where I found that. Let me click here. I think I found it somewhere in their total number of genes. I clicked around, but it's really hard to find, and it used to be very easy: you would just go to the IMPC home page, where it would say "search all of the knockout data", and the total number of genes was right below.
"You can just search for mouse gene." All right, let's search for mouse gene. Then I get 774 results. Oh, and the assembly. Yeah, but the assembly also didn't change from last year, so I found it really strange that they all of a sudden have 31 genes fewer than before. It used to be a very simple question, because you would go to data, and there they have the latest data release; you would click on the data release and it would give the total number of genes. But now they only provide the summary, so they only state that there are 7,022 genes that they knocked out.
Yeah, 22,924, commando. That's right.
That's also the exact same number that I got. Last year when I looked into the database there were 22,955, so 31 more than there are currently. Anyway, you click around and then you find the answer. Of course, if you click on this version, they might actually tell you how many genes there are; then you go to the JAX website. But it's interesting. Mutant mice... I know I found it this morning, when I was clicking around answering all the questions. It's a little bit hidden somewhere, but I hope everyone was able to find it. There might actually be release notes for one of the older versions, but I can't remember how I got there. It used to be right at the top, and they used to have an overview as well that you could fold out in a tree structure, but they don't provide that anymore. That's one of these things with databases.
"Just search without anything." Can you do that? Yeah, that's it. Okay, thank you, commando. Kudos to you, you found it. If you just press search, then indeed you get the total number of genes: 22,924. It used to be 31 more.
Question b was: how many of the protein-coding genes have been knocked out and have been completely phenotyped? That's just the number here: 7,022 knockouts. Those are all the knockouts they have for which phenotype data is available. You might be able to search for "phenotype data available" as well, but that doesn't really bring up anything. If you go to genes, it shows you here that there are 7,022. All right, so that's question b.
The next one is: how many genes are associated with respiratory diseases and have been completely phenotyped? This used to be an interesting question, because there were only a couple of them.
But of course databases update all the time. The idea here was just to search for respiratory. You go to phenotypes, then you search for respiratory, and you can see that there are 26 different respiratory conditions in the database. It used to be that there were four here and five there, so if you added them all up, that would be just a handful. But now there are a lot more: you can see 61 plus 123 plus 60 plus 51, and of course there's also some overlap between these. So how many genes are associated with respiratory diseases? Well, there are 26 different respiratory diseases, and you can click on the individual phenotypes to see which genes and how many. It already tells you that there are 35 genes which, when knocked out, cause abnormal breathing patterns. So it changed a little bit. If you look through the list, you can see that most of the phenotypes actually have no genes associated with them, and you can add the rest up by hand. I didn't do that because there were just way too many. I answered this question by saying there are 26 phenotypes or diseases, and a lot of genes in the database. Of course, this will change from day to day; they might update the database in a week. The question was there to get you to click around in the database and get familiar with how you can interact with it.
All right, then question d: what effect does the all 13 gene knockout have on the fat mass of a mouse? I actually got an email from someone asking how to look at that. I sent an email back, but I will just do it live for you. So we go to genes, then we search for all 13.
So you can see that for all 13 they produced ES cells, so embryonic stem cells, then they produced mice, and then they phenotyped these mice. You can click on the gene to get to a page which looks like this. The question was how it affects the body fat amount. So we go to abnormal lean... no, we go to the abnormal adipose tissue amount. When we click on that, it gives you an overview, and here you see the different groups: male wild type, male homozygous, and female wild type. And indeed, you're right. It might be worth sending them an email saying that their box plot is annotated wrongly, because if you look at the summary statistics of the data set, you see that there are male controls, female controls, and female homozygous knockouts.
"Go with the cursor on the box plot in the middle and on the right." You mean these ones? Yeah, so it says female wild type here, which is also wrong, because the female wild type is actually this one and this is female homozygous. What I'm thinking is that they just label them male wild type, male homozygous, female wild type, in that order. So there's something wrong, and you can see that the overview here is wrong as well: here they have male controls, which is okay, then they have male controls again, which should be male homozygous, and here you see female control and female homozygous. So there is something strange on the page. But if you scroll down a little bit, you can see that here it's again the table, right?
But now you have the summary statistics, and that is what we need, because we want to know whether the fat percentage goes up or down. We can see that in female controls the average fat is 0.21. With a homozygous knockout, females have 0.37, if I round up. So female homozygous knockout mice have more fat than female control mice. And in the control mice, you see that males have more fat than females.
Okay, another question from testosaurus: they only accept data when seven mutant females and seven mutant males were tested, so why did they accept the data for this mutant when only seven female mutants were tested? It might be that the knockout of this gene is lethal in males. If the knockout is lethal in males, that causes an issue, because no matter how hard you try, no males will be born. They will then still accept the data, because they tried seven females and seven males. If all of the males die during gestation, at birth, or shortly after birth, there's no way to get phenotype data for them. The mutation will then just be annotated as homozygous lethal in males and not lethal in females. It happens: some genes you can knock out in males and not in females, and the other way around. So I'm betting that they did make seven female mutants and seven male mutants, but that the male mutants died prematurely, before they could be phenotyped.
The way I answered the question was: all 13 is only tested in females, so seven females, and it shows an increase in total body fat when all 13 is knocked out.
So it means that it goes from 0.21 to 0.37. An interesting observation and an interesting gene. It might be lethal in males, but the page doesn't really tell you that. I will put it on my to-do list to send them an email about the summary statistics page, because it is strange that the categories are wrong, especially when you hover over them. This is male wild type, that's correct. Then here you see female wild type, so the thing that pops up is correct; it's just that the legend here is wrong for some reason. So: summary statistics, IMPC, send email. I think that's worth it.
All right, so that was the first quick look at the IMPC. The next question was: imagine that you have measured heart weight in a knockout mouse model. How many animals should you measure at the bare minimum before you can submit the data? We already had the answer to that: you have to phenotype seven males and seven females.
Then the b question was: is it easy to contribute your own information back into the database? It is not easy. The information on how to contribute is very well hidden on the website. You have to click through a lot of pages like "open" and "how do I join", and then you have to be a collaborator. It is getting harder and harder to join the IMPC project and to submit your own data. So the answer I had: no, it's not easy, the information is very hard to find, and currently you have to become a partner first, so you first have to submit all kinds of paperwork. And you need to follow the protocol that they provide. They have a very extensive phenotyping protocol.
So they have a whole pipeline for how you should test your mice, and on which day you can do which experiment. It's a very extensive thing. They have a lot of partners, and all of these partners have set up this phenotyping pipeline, but becoming a part of the IMPC is really hard. In a way that is a good thing, and it is also a bad thing. There are a lot of people making knockouts, so it would be better if everyone could submit their data. On the other hand, the data is really nice and clean, because they only accept data from very well vetted partners. If you want to know who the partners are, you can scroll down and see the different centers that are contributing when you look at the phenotype overview.
So those were a couple of questions for the IMPC. Remember, it's one of the databases that is out there. There's more or less the exact same database for yeast, if you're interested in yeast. The yeast people actually did all of the knockouts already, I think they knocked out 18,000 genes in yeast, and they are now doing double mutants. They take the individual knockouts that they made and cross them to get individuals with two genes knocked out, to see what the interactions between genes do. Very, very interesting.
All right, the next questions were about the OMIM database. OMIM is a really good database when you are interested in Mendelian diseases. So let's switch to the OMIM database.
So in this case the question was: what causes green color blindness? We can just search for green color blindness, like this. The first hit that comes up is color blindness, and it's called the deutan series. If you click on it, it actually says green color blindness. Green color blindness is caused by a certain gene, or a certain locus, and this gene is called OPN1MW. It's located on chromosome X, and you can click on it to see exactly where it is located. It has a certain inheritance pattern, of course, which is X-linked inheritance, and it has a certain OMIM number. Then you get a whole bunch of overview. The first time it was described was in 1968. That's what I like: they give you a lot of background information on your phenotype. It's just a very good database. And of course, if you just search for color blindness, you get a lot more. You get the other color blindness, being insensitive to red, so red color blindness, which is called protanopia, the protan series. So it's a good database to find things like color blindness, earwax type, and whatever Mendelian diseases there are. If you ever want, or need, to study a disease in the future, it's definitely worth throwing the disease into the OMIM database.
All right, question 3b was: has this gene been phenotyped in the IMPC database? I think it was not. Let me look at the answers that I wrote down. No, it hasn't been phenotyped yet.
So we can actually search for the gene. If we search for it, we see that it has not been studied yet, but embryonic stem cells have been produced. So they do have embryonic stem cells which carry the mutation. The next step for them would be to generate mice using these embryonic stem cells, and then, once seven males and seven females have been born, they will go through the phenotyping pipeline. Here you can also see that it's a green cone opsin, or green long-wavelength-sensitive cone, so they already know what this gene is annotated as. It's just that they haven't been able to knock it out in mice yet, and of course there might be other phenotypes associated with knocking out this gene. They are working on it, but currently only embryonic stem cells with the knockout have been produced. They still need to generate mice, and these mice still need to go through the phenotyping pipeline.
All right, so then back to the more interesting questions, about Mendelian maps and Mendelian phenotypes. I'm going to close Firefox, and then I'm just going to draw them, because that was the question. The first question is: draw the Mendelian inheritance diagram for two traits in an AaBb versus aabb cross. So the question here is, how do we write it down? Let's start with the easy one: we have a parent which is aabb. This parent can generate only one type of gamete, so every sperm cell or egg cell produced will have the ab genotype. I hope people can see that. I can actually zoom in a little bit. If I move back a little bit then it would be possible, right?
So if I have an individual which is aabb, then the only kind of sperm cell or egg cell which can be produced is ab, because for each gene you always get the small variant. The other parent is an individual which is Aa and Bb. Here there are multiple possibilities: we can produce big A big B, big A small b, small a big B, and small a small b.
Then we can just write down the inheritance diagram. In the first case, we would have big A small a, big B small b (AaBb). For the next one, we would have big A small a and two small b's (Aabb). Then we would have two small a's, one small b and one big B (aaBb). And in the last situation we would get the same as the homozygous parent back, so aa and bb (aabb).
The first thing we then want to do is identify the parental genotypes. Individuals that have this genotype are no different from this parent, and the other parental strain has this genotype. If we want to calculate the distance between these two genes, we count the number of offspring in the two remaining groups, the recombinants. Imagine that we produce 500 offspring and 100 of them show a recombination, so they have a non-parental genotype. That means that 100 out of 500 animals are recombinant, meaning that the distance is 1 in 5, which is 20 percent, which is 20 map units, which is 20 centimorgans. All right, is that clear? I think so, right?
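The arithmetic in this example can be written out as a small sketch. This is not something shown in the lecture; the function name `map_distance_cm` is made up for illustration, and it simply treats the recombination percentage as the distance, which is only a reasonable approximation for genes that lie fairly close together.

```python
def map_distance_cm(recombinant: int, total: int) -> float:
    """Recombination frequency as a percentage, read as map units (centimorgans).

    Hypothetical helper for illustration: it assumes the offspring of a
    test cross have already been sorted into parental and recombinant classes.
    """
    if total <= 0 or not 0 <= recombinant <= total:
        raise ValueError("need 0 <= recombinant <= total and total > 0")
    return 100.0 * recombinant / total


# The numbers from the example: 100 recombinants among 500 offspring.
distance = map_distance_cm(recombinant=100, total=500)
print(distance)  # 20.0, so 20 map units or 20 centimorgans
```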
This is just a very basic cross, so I hope that everyone is able to make it. You can see that there are four different possible offspring. Here is the aabb: this is the homozygous parent, and the homozygous parent can only provide one type of gamete, while the heterozygous parent can produce four different gametes. So now the question becomes: why do we use a homozygous parent? Question 1b was: why don't we take two heterozygous animals? The answer is that we would then have four possible gametes from this parent as well, so we would end up with 16 different situations. We would have to generate many more offspring to get the distance, because with 16 categories, many of them will look parental and some will not. The idea is to make the easiest crossing scheme which allows us to calculate the distance between these two genes. I hope that's clear. If we had four possibilities here as well, we would have 16 outcomes, and 500 individuals scattered across four categories is of course better than 500 individuals scattered across 16 categories, because you have more individuals in each category. All right, so that was the first one. This is a very easy two-point cross.
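To make the four-versus-sixteen argument concrete, here is a minimal sketch that enumerates the gametes of each parent and pairs them up. The helper names `gametes` and `gamete_combinations` are invented for this illustration, and genotypes are written as a list of allele pairs such as `["Aa", "Bb"]`. The 16 "situations" counted here are gamete combinations, as in the lecture, not distinct genotypes.

```python
from itertools import product


def gametes(genotype):
    """All distinct gametes for a genotype given as allele pairs, e.g. ["Aa", "Bb"]."""
    # product() picks one allele per gene; the set removes duplicates,
    # which is why a homozygous parent ends up with a single gamete type.
    return sorted({"".join(choice) for choice in product(*genotype)})


def gamete_combinations(parent1, parent2):
    """Every pairing of a gamete from parent1 with a gamete from parent2."""
    return [(g1, g2) for g1 in gametes(parent1) for g2 in gametes(parent2)]


heterozygous = ["Aa", "Bb"]
tester = ["aa", "bb"]  # the homozygous parent

test_cross = gamete_combinations(heterozygous, tester)
inter_cross = gamete_combinations(heterozygous, heterozygous)

print(len(test_cross), len(inter_cross))              # 4 16
print(500 / len(test_cross), 500 / len(inter_cross))  # 125.0 31.25
```

With 500 offspring, the test cross leaves on average 125 animals per category, against about 31 when both parents are heterozygous.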
Let me... I'm stuck with my microphone. All right, let me clear the board, and then we will do the other one. The other one is more interesting and a lot more difficult to draw, and I'm probably going to mess it up. Clean. We have the same cross as before, but now we have three genes that we're trying to track. I'm thinking that the best way to do it is to turn the diagram around, because the board is a little bit longer than it is wide.
All right, so we have one parent, which is aa bb cc. This parent again generates just one single gamete: small a, small b, small c. The other parent is big A small a, big B small b, big C small c. Now it becomes a little bit more difficult, and I like to start from the small ones.
So what you can inherit is small a, small b, small c. This is the first possibility, and we just get the original parent back: aa bb cc, all small. The next one is big A, small b, small c, so the offspring would look like this: Aa bb cc. The next one would be big A, big B, small c: Aa Bb cc. We can of course inherit big A, small b, big C: Aa bb Cc.
Now it's getting tricky, because I have to watch out for duplicates. I'm doing this from memory, and people always tell you to never do a live demo, so never do this live. The next possibility is small a, big B, small c, so the offspring would be aa Bb cc. Then the next one would be small a, big B, big C, so the offspring would be aa Bb Cc. All right, so we're almost there.
So we have one, two, three, four, five, six. The next one I'm thinking of is small a, small b, big C, and the offspring would look like this: aa bb Cc. And the last one is of course everything big: big A, big B, big C, and the offspring would be Aa Bb Cc. Then we identify the two parental genotypes. These animals have not recombined, because they look like a parent, and we have the same thing here for this one and this one.
All right, a question in the chat from super hamburger: do you think biotechnology should be used to greatly modify or enhance humans? Phew, that's a difficult question. I would say no. Biotechnologies have a lot of power and they can do a lot, but in general they should be applied in an ethical way. I don't know exactly what you're thinking of, cyberpunk or these kinds of things, but I think that biotechnology will make our lives a lot easier in the future, and it will help us deal with things like climate change. The more food you can produce from a square meter, and the less water you use, the more we can reduce our footprint on the world. But I'm very undecided on whether we should apply that to humans. And of course we already do, right? In some countries we already do prenatal screening. If you get pregnant nowadays and go to the doctor, the doctor can tell you whether your child will have Down syndrome or other inherited defects. So it will help us in a great way, but like every technology, you can use it for good and you can use it for bad. I think that at every step of the way you should stand still, realize what you've done, and see what the effects will be.
All right. Anyway, going back to the assignment, number two: draw the inheritance diagram.
So this is the nice inheritance diagram. I think it's a bit blurry; perhaps if I go out of frame, the camera will focus on the diagram. Then question number two: how many different gametes can be produced from the Aa Bb Cc parent? Those are one, two, three, four, five, six, seven, eight, which is two to the power of three. With four genes, you would have two to the power of four, so 16 different gametes. It follows a really nice computer-like strategy: computers also work with two to the power of something, and here it's the same thing.
All right, so those were the assignments for today, or actually from last week, since we're discussing them now. I hope everyone was able to draw the three-point inheritance diagram. A lot of times people forget a couple. The way I like to write them down is to go one by one: you write down the first one with all of them small, then make the first one big, then the first and second one big, then the first and third one big. So you kind of do binary counting on the answers.
All right, let's switch back to the lecture layout. Those were the answers to the previous assignment. I need to be able to click on that as well. Are there any more questions? If not, then let's continue.
All right, so the lecture for today. I'm just looking at the clock. I've been recording, yeah, that's okay. First, I like to start my lecture by having you think a little bit and put something in the chat about what you think. So the question here is: you're all biology students, right?
So you should be able to tell me what the four biochemical parts of life are. If I am building a living organism, there are four biochemical parts that I definitely need, and which are more or less essential for life as we know it. Just throw your answers in the chat, I will write them down, and then we will see what the answer is. Who wants to start? The four biochemical parts of life.
Oxidation. Hmm, I'm actually looking for something like... Nucleic acids, very good, testosaurus. Yeah, nucleic acids is definitely one. What do you mean by oxidation, by the way? Fats and proteins. All right: nucleic acids, fats, and proteins. Hydrolysis: no, I'm not really looking for that kind of thing. If you build a living creature, if you take a cell, then the cell is made up of different molecules, and these molecules fall into four biochemical classes. One of the things which is correct is nucleic acids: you need nucleic acids to build a living creature, be it DNA or RNA. Water? Yeah, you probably need that. Polysaccharides: yeah, that's the last one. Water is interesting. When you're talking about biomolecules, water is generally not considered a biomolecule. But you already have all of them.
So the four parts of life that you need are: proteins as effector molecules; lipids to make a distinction between the outside of the cell and the inside of the cell; polysaccharides, which are very important because they influence how proteins work, so they are more or less connected to proteins, and DNA also contains sugar in its backbone. And of course you need nucleic acid.
So DNA and RNA. Although, Olexander, I do like your idea of adding water to the list, because without water there's no real life. And for most of these things you also need metals like iron or copper. But generally the metals are not considered, because metals are not biochemistry; they fall under standard chemistry. The classical definition of the four biochemical parts of life is nucleic acids, lipids, proteins, and polysaccharides, so sugars.
All right, thank you for the participation and the different answers. Oxidation was an interesting one, because it's a biological process, but fire oxidizes as well in a way, so it's a little bit iffy. The standard classical definition of the four biochemical parts of life is proteins, lipids, sugars, and nucleic acids.
Today, of course, we will be talking about DNA, so the whole lecture will be about DNA. Next week we'll have a whole lecture about RNA, and the week after we will have a whole lecture about proteins. Not so much about what DNA, RNA, and proteins are, because I assume that you know what a DNA molecule is, but more about how bioinformatics relates to them. DNA sequencing will be one of the fundamental processes that we describe today.
The definition of DNA sequencing is that it is the process of determining the order of the nucleotides within a DNA molecule, so the four different nucleotides that we have in DNA: adenine, guanine, cytosine, and thymine. That's what we mean when we talk about DNA sequencing: just determining the order, so which comes first, which is second, which is third, and which is next. But before we can start talking about DNA sequencing: what do we use sequencing for?
Well, sequencing nowadays is used almost everywhere, and it is especially used a lot in things like diagnostics at the moment. So nowadays when you go to the hospital and, well, not so much there's something wrong with you, but you have a disease, then people draw blood, and based on your own DNA profile they might sequence your whole genome or they might sequence part of your genome to help them with diagnostics. So if you go to the hospital and you have breast cancer, then they will take a little biopsy from you, so they will take a little bit of blood or a little bit of tissue. They will sequence the tissue and then they will look to see if you have a mutated BRCA2 gene, because they know that mutations in BRCA2 are one of the main causes. I think in something like 40 percent of breast tumors BRCA2 is mutated. So if they know that this gene is mutated, then they can adjust your treatment based on that.
Biotechnology: there as well, sequencing is used a lot.
So if you're thinking about biotechnology, like making new plants or animals or these kinds of things, then of course sequencing is one of the fundamental techniques that people use.
Forensic biology: then I'm thinking about the police. When there's a dead body on the ground somewhere, then the police come in and they swab all of the different spots, or if they find a hammer next to the dead body, then they will try to extract DNA from that. Then they will make a DNA profile, and using this DNA profile they will try to find the perpetrator of the crime. So in forensic biology sequencing is used more and more. It used to be standard cutting enzymes, so they would cut the DNA using several cutting enzymes and then make a profile. But nowadays sequencing is getting more and more the norm in forensic biology as well. Using cutting enzymes is relatively quick and cheap, but sequencing is getting cheaper and cheaper every month, so it's used more and more, also in forensic biology.
And of course virology nowadays. I checked last week, and if you think about the SARS coronavirus, the novel coronavirus which is out there, then currently we have 38,000 genomes of this virus sequenced.
So one of the main things that happened, just after the outbreak in January, is that the genome sequence of this virus was published. But if you look from January, when the first sequence was published, up until now, so about 11 months later: 38,000 genomes. Not so much different versions, because they sequenced 38,000 extracts from patients all across the world to keep track of how the virus is mutating and how these different mutations are spreading throughout the population, so that you can do more or less a track and trace of how this virus is moving across the world and how it is mutating.
One of my favorite websites where they show this is the Nextstrain website. Nextstrain is showing these mutations, and I will just show you the website, since we're sitting here and I do love their website. So here you see all of the genetic epidemiology of the novel coronavirus. What they do is they build this tree of inheritance. Here you see the first reference sequence, and then all these little dots are individual virus extracts from patients that have been sequenced. So it is kind of a mixture with forensic biology in a way, because they're trying to track where the virus comes from. But the nice thing is also that they visualize how these things work. No, I don't want that. So if you would play this, hey, they would start at the beginning of the outbreak and then you see the virus moving around. It doesn't really show well here, but if you would zoom in, they would also have, let me see, normally little lines between the different sequenced genomes. Perhaps I shouldn't show it by region then. So don't color it by region, but color it by, for example, an amino acid. So let me see. It's a really nice website.
So when we look at the spike protein, and this is one of the main ones, so color it by that one. And then we now see that if you look at the strains, right here at the top, you see the original strain, and then you see this European mutation that everyone's talking about. So everyone's talking about the new mutation. This is not the mink mutation, which has recently been discovered, but this is the mutation that occurred in Bavaria slash Italy, more or less at the beginning of February. And then when you look at that and we reset the map and zoom out a little bit... oh, that's way too much. So here across the world map you can now see what the distribution is of the blue one, so the original virus, and you can see the new variant of the virus. So when you play, right, you see that originally we only had the blue virus, but then all of a sudden you see the new variant of the virus pop up in Finland and Germany and also in Italy. And you see that, hey, based on this single mutation you can more or less track and trace how the virus spread and where it comes from. So just something I wanted to show you. But a main thing in bioinformatics is making these databases, and of course sequencing is key to virology at the moment if you want to trace an outbreak. And then of course what you're doing is you're taking material from patients.
You're sequencing it, you're looking at individual mutations and you look to see how you can trace those back. And nowadays in virology you can, for example, also determine how a certain Ebola outbreak goes and how people spread the virus around. So very interesting. And it's becoming more and more the norm to sequence every virus that comes in. So for everyone who comes into the hospital, they take a nasal swab and they can then sequence the virus and look which mutations are there and how these mutations are spreading.
All right, so when we talk about DNA and DNA sequencing, we have to talk a little bit about terminology. The first thing I want to talk about is that in everything we do in sequencing, we talk about the reference sequence. So also when you look at SARS-CoV-2: the Chinese published the sequence at the beginning of January, and this sequence is the reference sequence. So when people sequence a new individual or a new virus, they are sequencing the individual, but then they always report sequences relative to the reference sequence. Because otherwise, hey, in humans we have around three billion base pairs, so you would have to write down three billion base pairs. But you could also write down just the differences, because generally the differences are not that many. So by writing down only the differences you save a lot of storage space. And these differences we call variants. DNA variants come in two different forms. One of them is a SNP, so a single nucleotide polymorphism, which you see depicted here. So imagine that this is the reference strain, and the reference strain at a certain position might have an A. Then you sequence two more individuals, and some individuals have a G at this position and other individuals have a T. So here we have a single nucleotide polymorphism.
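The idea of storing only the differences from a reference can be sketched in a few lines of Python. This is a minimal illustration, not how real variant callers work: the sequences are made-up toy strings, the function names `call_snps` and `apply_snps` are hypothetical, and the sketch assumes both sequences have the same length (so SNPs only, no insertions or deletions).

```python
# Toy sequences (hypothetical); a real genome would need alignment first.
reference = "ACGTTAGCCA"
individual_1 = "ACGTTGGCCA"  # has a G where the reference has an A
individual_2 = "ACGTTTGCCA"  # has a T at the same position


def call_snps(reference: str, sample: str) -> list:
    """Return (1-based position, reference base, sample base) per mismatch."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


def apply_snps(reference: str, snps: list) -> str:
    """Rebuild a sample sequence from the reference plus its stored SNPs."""
    bases = list(reference)
    for pos, ref_base, alt_base in snps:
        if bases[pos - 1] != ref_base:
            raise ValueError(f"reference base at position {pos} is not {ref_base}")
        bases[pos - 1] = alt_base
    return "".join(bases)


snps_1 = call_snps(reference, individual_1)
snps_2 = call_snps(reference, individual_2)
print(snps_1)  # [(6, 'A', 'G')]
print(snps_2)  # [(6, 'A', 'T')]

# The full sequence can be recovered from the reference and the variant list.
print(apply_snps(reference, snps_1) == individual_1)  # True
```

Each individual is then described by a short list of variants instead of the whole sequence, which is where the saving in storage comes from.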
So a SNP is a single base pair in the genome which is different between this individual and the reference sequence. Single nucleotide polymorphisms, the same as inserts and deletions, so indels (small parts of the genome which have been deleted, or where you have an insert of a couple of base pairs), are always annotated relative to the reference. And hey, there's a single human genome reference sequence, and the human that has been sequenced is Craig Venter. So if you get your genome sequenced, then in the end you get the positions in your genome which are different from the ones from Craig Venter, so the guy who was involved in the Human Genome Project. So everything is relative.
When we talk about DNA sequencing, we usually talk about reads. DNA sequencing reads are base pairs and the quality of each of these base pairs. So hey, the read could for example be all of the sequence here, so G C A G C G T T A G A, and that is the read. This is something that a sequencer might produce, but a sequencer also produces a quality score. So it will say: well, the chance of this G being wrong is one in a thousand, the chance of this C being wrong is one in 10,000, the chance of the A being wrong is one in a hundred. So when we talk about a DNA sequencing read, we talk about a string of letters, and associated with this string of letters are quality scores: how confident are we that these letters are correct?
And then we talk about DNA alignment a lot. Alignment is matching these reads to the reference sequence.
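The per-base error chances mentioned for a read are usually written on the Phred scale, Q = -10 * log10(p), and stored in FASTQ files as one ASCII character per base (Q + 33 in the common encoding). The lecture only gives the probabilities, so the Phred scale and the FASTQ encoding are brought in here as the usual convention, and the function names are hypothetical. A small sketch using the three example probabilities:

```python
import math


def phred_from_error(p: float) -> int:
    """Convert an error probability into a (rounded) Phred quality score."""
    return round(-10 * math.log10(p))


def error_from_phred(q: int) -> float:
    """Convert a Phred quality score back into an error probability."""
    return 10 ** (-q / 10)


# First three bases of the example read, with the error chances from the
# lecture: 1 in 1,000 for the G, 1 in 10,000 for the C, 1 in 100 for the A.
bases = "GCA"
error_probs = [1 / 1000, 1 / 10000, 1 / 100]

quals = [phred_from_error(p) for p in error_probs]
print(quals)  # [30, 40, 20]

# FASTQ-style quality string: one character per base, ASCII code = Q + 33.
quality_string = "".join(chr(q + 33) for q in quals)
print(bases)           # GCA
print(quality_string)  # ?I5
```

A higher score means more confidence: Q20 is a 1 in 100 chance of error, Q30 is 1 in 1,000, Q40 is 1 in 10,000.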
So what we do is we get a read from a sequencer and then we just scan across the reference genome to see where this read fits best. And of course, hey, we then take into account the quality of each base pair. If there's a one in five chance that a certain base pair is wrong, then we of course penalize a mismatch to the reference less compared to a base pair where there's a one in 10,000 chance of it being wrong. So reads are base pairs and the quality of these base pairs, and alignment is the process of matching reads to the reference genome.
All right, so I think we should take a short break. We've been going at it for like 55 minutes and I want to have a quick cigarette. So I will be back at 3:05 and we will continue then, and then we will talk about the history. So I will see you in around 10 minutes, and I will then stop the record
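The quality-aware alignment described before the break can be sketched as a brute-force scan. This is only an illustration under stated assumptions: the reference is a made-up toy string, the read is the example read with one deliberately wrong low-quality base, gaps (indels) are ignored, the penalty for a mismatch is simply the Phred score of that base, and `best_alignment` is a hypothetical name. Real aligners use indexed references and more refined scoring.

```python
def best_alignment(reference: str, read: str, quals: list) -> tuple:
    """Slide the read along the reference; return (0-based start, penalty).

    A mismatch costs the Phred quality of the read base, so a mismatch at a
    low-quality base (likely a sequencing error) is penalized less than a
    mismatch at a high-quality base.
    """
    best_start, best_penalty = -1, float("inf")
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        penalty = sum(q for ref_base, base, q in zip(window, read, quals)
                      if ref_base != base)
        if penalty < best_penalty:
            best_start, best_penalty = start, penalty
    return best_start, best_penalty


# Toy reference (hypothetical) that contains the example read GCAGCGTTAGA.
reference = "ACGTGCAGCGTTAGATTC"

# The read as sequenced: the 8th base is an A instead of a T, but that base
# has a 1 in 5 chance of being wrong (Phred 7); all others are Phred 40.
read = "GCAGCGTAAGA"
quals = [40] * len(read)
quals[7] = 7

print(best_alignment(reference, read, quals))  # (4, 7)
```

The read still lands at the right place, with a small penalty of 7, because the only mismatch sits on a base the sequencer itself flagged as unreliable. The same mismatch on a Phred 40 base would cost 40.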