So, welcome everyone, also if you're watching this via Moodle. Today we will be talking about sequence analysis. The idea behind sequence analysis is of course to analyze a sequence, find homology, and these kinds of things. First I wanted to show everyone the evaluation again. If you're a student, you can click on the link in the old presentation in Moodle. It's just a repeat of a slide I showed last week as well. But do fill in the evaluation, because that helps. It helps people, it helps me. If you have any comments, questions, or remarks about what we should do differently, then please fill in the evaluation and I will get an overview. It's totally anonymous unless you fill in your name somewhere, so I just get the feedback. I'm not going to spend too much time on it. Last time we waited, but I think we will just continue.
The exam is also a slide from the previous time. The exam is registered and it should be in Agnes. I didn't get any mails from anyone saying they could not register, so I'm guessing it's there. The big issue here is that I cannot look into Agnes. I just have an employee account from the HAU, and if you have an employee account, you can't even log into Agnes, which is a bit of a shame. So I can't check if it's really there. And of course, you have to register, at least two weeks beforehand. But since you never know when they will close it, make sure you register as soon as possible.
All right, the previous assignments. Let's start doing those. We can do a prediction here, I think, so let me check if I can do it. The name of the prediction is "Did Danny do his homework?" The outcome is yes or the outcome is no. I don't know how this works, so I'm just going to start a prediction. That's an interesting fishy emoticon.
Because the question for you guys is: did I do my homework or not? Or am I just going to do it on the fly for you? So let's start a prediction about that and see how it works. Okay, yeah, I got it. All right, there's a prediction started. You can click on yes or you can click on no, and I think you can put channel points on the answer. The people who have it correct can actually win channel points. I'm curious to see how this works, because I never used it before. So we'll just wait a little bit. Of course, this doesn't really help the people who are watching via Moodle. They just want to see the answers to the assignments. But we'll see, I can always cut it out.
So how does this work? How much time do we have left? All right, I can see that one person actually voted. I don't think it will really work if only one person votes, because then the only thing you can do is lose your points, since no one else is there to take the points. This is so interesting. I love all of these Twitch features, and every time I'm streaming, I'm discovering something new. So let's see: predict with channel points. I actually can't even see how it looks. I'm watching the stream via the dashboard, so I don't count as a viewer myself, but I have some more options, like moderator options. I think it doesn't work for mobile devices. That's interesting. I can see that you predicted yes, actually, as an additional icon in front of your name. It's good that you think I do my homework.
"I was too late." Hey, Commando, welcome to the stream as well. Yeah, you just missed the prediction. I think that I can choose the prediction outcome, and the prediction outcome was actually no, I didn't do my homework.
So I'm sorry for the 10 channel points that you lost. You predicted yes, which was wrong. Oh man, that's a bummer. Well, at least they are for free, right? You get them by just watching on Twitch. All right, let's continue with the real lecture and stop doing all of these fancy Twitch features.
So again, I need to have, in this case, the Notepad++ window. And this is not the window that I wanted to show you guys. I actually didn't do assignment one at all, and I did assignment two just 30 minutes ago, because I knew where to find all of the stuff. I didn't really do assignment three either. I just filled in some tips for myself. But let's go through the assignments and answer them one by one, and you can tell me the right answer.
So, PubMed. Use PubMed to find all my publications. Remember that I've only been publishing since 2010. Let me show you the Firefox window. We just search for PubMed, and it's the first hit. If you search PubMed for Danny Arends, you get a whole bunch of publications which I actually did publish. The first one here is a publication of mine, and this one, this one, systems genetics, and so on. But if you go down, there used to be some which I didn't write, like some publications from the 1960s or 70s, I think. You can also see that sometimes things are in there twice, for example the erratum. Where's the erratum? When you made a mistake in a paper, you publish an erratum, which corrects it. But actually, these all look pretty good. I think they cleaned up the older ones. So these are more or less all of my publications.
You can look at the timeline, and indeed they filtered out the older publications. So that's pretty good. So why are there so many false positives? Well, there aren't anymore. So for question B: there are no false positives, and just searching for my name is good enough. The idea was that you would get more wrong hits, because there are more people called Arends. If you search for "D Arends", you get a lot more, like vitamin D and all that, and this one is then published by Arends J. If you have a very common name, like Chinese authors generally have, since there are a lot of people called Wang or Li, then of course you get hundreds and hundreds of false positives, and it becomes really hard to figure out who is whom. Yeah, like Peter Arends. But if you just search for my full name, it looks pretty good. It finds 33 publications, which is slightly fewer than I really have. And if you look at the timeline, you see that it makes sense, because it only starts at 2010.
I do like the new interface that they have, actually. It looks a lot more like 2000-and-something, and the older version was more or less plain HTML. "Old version, what, 1996?" Well, no, up until about a year ago the old version was still online, and it looked very 1995 in a way. It looks much better now: it scrolls and you have a "show more" button, where it used to be just an HTML link to the next page.
So the next question: use a more complex query, either by using the query builder or search fields, to retrieve only my publications. You can say that you want "D Arends" to be the author, and that the publication date has to be after 2010. I can show you how to do the more complex query, because you have the query builder somewhere.
You could also just type something like "author" and then the name. What's going on here? My keyboard changed to the German layout, so let's change the keyboard back. So, Danny Arends as an author, something like this. Shoot, it returned zero entries. Text availability, publication date, custom range, additional filters. So why is it not called author anymore? Let me see: article type, species, language, sex, subject, age. Interesting. Maybe it needs capitals or something like that. No, I don't know. You have the user guide, so you can look it up. But it's not that important, because in the end the search itself was pretty good. It used to be worse. If you searched for D Arends or Danny Arends, it would come up with some other Danny Arends who published in the 60s and 70s. So they actually changed it a little bit, which is good. And I like the new look much more. It looks very slick in a way.
All right, the next one is UniProt. So let's go to UniProtKB. Let me just copy and paste this, and you can click on the first link that comes up. So this is UniProt. The first question was: find the names of supported query fields from the UniProt help. This is a big database which contains a lot of protein data. If you go to the help, you have the "getting started" and "text search" buttons, and you can also search there for the query fields. The query fields give you the ability to say: I want a certain accession number, or I want the obsolete entries to be shown or not. Here it's called "author" with small letters, and you also have citations that you can search for. So you have all of these different possibilities to filter down your list, so that you don't get too many results.
So the second question was: how many reviewed protein entries exist presently in UniProt for chicken? This is one that I actually did do, very recently, like half an hour ago. When you want to search for things like chicken, you first have to make sure that you search in UniProtKB. The query here says that I want "reviewed" to be "yes", because I only want the entries which are reviewed and manually validated. And then you want to search for organism 9031. How do you find the organism? Well, you can just search for chicken. If you search for chicken, you get the organism Gallus gallus (chicken). You can click on it, and it will tell you the taxonomic identifier. You can also use the word C-H-I-C-K, so "chick", but it's better to use the taxonomic identifier. That's why, when I search for something like this, I would just say: give me all of the reviewed entries from this organism. You press enter, it gives you a long list, and you can see that there are 2,296 chicken proteins validated in UniProt, so manually curated.
I made a little mistake before, and instead of using organism I used something else. I found that interesting, because they put a new one into the query fields, and I used that first: the host. The host is a way of filtering for all the viruses that infect a certain host. I thought it was very interesting that, instead of having to filter for the virus type or these kinds of things, they have an option to search by host: which viruses are able to infect chickens and have a protein validated in UniProt? You could do that using the host.
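The query just described is a set of field:value terms joined with AND. Here is a minimal Python sketch of how such a query string is put together. It assumes the field names as they are spoken in the lecture (`reviewed`, `organism`, `host`); `build_query` is a hypothetical helper, not part of any UniProt tool, and the current UniProt website may spell these fields differently, so check the help page before relying on them.

```python
def build_query(**fields):
    """Join field:value pairs with AND into one search string.

    The field names are whatever the caller passes in; nothing here
    checks them against the real list of UniProt query fields.
    """
    return " AND ".join(f"{name}:{value}" for name, value in fields.items())


# 9031 is the NCBI taxonomy identifier for Gallus gallus (chicken).
chicken_proteins = build_query(reviewed="yes", organism="9031")

# Same pattern with the host field: viral proteins whose host is chicken.
chicken_viruses = build_query(reviewed="yes", host="9031")

print(chicken_proteins)  # reviewed:yes AND organism:9031
print(chicken_viruses)   # reviewed:yes AND host:9031
```

The only difference between the two strings is the field name, which is exactly the mix-up between organism and host mentioned in the lecture.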
But in this case, you need to use the organism to retrieve all of the proteins actually produced by a certain organism. So the next question is: how many of them have been created since the first of September 2011? Again, you can look at the list of query fields that we found before. It shows that you have to specify "created", and then a range "to star". So we write down 20110901 to today, and we click search. Now you see that there are only 86 entries which have been added. So most of the entries in there are relatively old, because they are from before 2011. Do with it what you want, but at least it allows you to search for new entries and entries which have been updated in the database recently.
All right, next question: retrieve the reviewed entry of cattle myostatin. What is the alternative name? What is the accession number? The query that I built was this one: give me myostatin, reviewed yes, and organism 9913, which in this case is cow. Cows are Bos taurus, so you can just search for Bos taurus and it will again tell you the organism identifier. When I search for this, it actually comes up with three different genes. Why it comes up with follistatin and with the immunoglobulin here, I don't know exactly. But the thing that we are looking for is the myostatin. MSTN is the official gene symbol. If you click on it, it says that the protein is called growth/differentiation factor 8, the gene is called MSTN, it is in Bos taurus, and a little further down it tells you under protein names that the alternative name is myostatin. So the question is, what is the alternative name? The recommended name is growth/differentiation factor 8, and the alternative name is myostatin. And what is the accession number?
So the accession number in this case is O18836. That is the identifier that brings you directly to this protein. What are the gene names of this protein? That is a very similar question. The gene names in this case are MSTN, it is also called GDF8, and some people call it MH. That's one of these things which is still from the old days, when people would work on a gene for a long time and have their own name for a certain protein or gene. So in the literature, these three names are used interchangeably for myostatin. Of course, in recent years people have standardized on the name and are not using the synonyms anymore. So although the official name of the protein is growth/differentiation factor 8, the gene name is MSTN, for myostatin, and it is not advised to use the synonyms anymore.
All right, to which Ensembl identifier can the bovine myostatin gene be mapped? We are looking for the Ensembl gene ID or the Ensembl transcript ID. If we scroll down a little bit, we see that the Ensembl gene ID is ENS-BTAG-00011800. That is of course the gene ID. Since genes can produce multiple proteins, the gene ID might not be good enough to identify exactly which protein you are looking for. If you scroll down a little bit, you see the protein identifier in the STRING database. STRING is the database which has all of the interactions with other proteins, and here you see that this is the Ensembl protein identifier. And if you go further down, the phylogenomic databases section also lists the transcript. So if the myostatin gene happens to make six different transcripts, and of course these would be six different proteins, they will all be available under the same name.
But of course, you have to make sure that if you are working on a certain transcript, the transcript that you are looking for is the one that is listed there.
All right, the next one was to display this entry in raw text format. If we scroll all the way up, we have "format" at the top, and there we can go to the text format. You see that it flattens all of the data and information on the page. It has these two-letter codes in front of every line, and these two-letter codes can be used by a computer to figure out what is on each line. For example, there is a line with ID. ID of course means that this is the identifier of the entry. So the identifier is GDF8_BOVIN, it is a reviewed entry, and it is 375 amino acids long. Then we go to the next line, which is the AC line. The AC line is a list of accession numbers for this entry. The first one is the one from the current database, but you can see that it's also in the same database under different accession numbers. Being there under different numbers probably means that these are different splice variants of the original gene. Like I told you, a gene has multiple proteins, so it could be that there are multiple proteins there.
Then we have the DT lines, which give the date at which the entry was added or updated. You can see that it was originally posted on my birthday in 1998. That's actually why I took this gene, because the sequence was published on my birthday. And it's a really important gene, because it's the gene behind double muscling. If you knock out the myostatin gene, there is no brake on muscle growth. So Texel sheep or Belgian Blue cattle have this double-muscled phenotype. They have uncontrolled growth of their muscles, making them interesting for breeding purposes.
Because of course, muscle is very tasty, so it's a good phenotype. You can go through all of these codes. What does OS mean? OS is the description of the original species name, so in this case Bos taurus. Then you have the DR entries, which are further down. They are more or less free entries, but they generally point to all kinds of external databases and annotation for the gene. And then you have the FT lines. FT is more or less how the protein looks and how it is built. You see that there's a signal peptide, so the first 18 amino acids of myostatin code for the signal peptide, meaning that the protein is transported to a very specific location in the cell. Then you have a propeptide, then a chain, and then a site, which is a cleavage site. So the FT lines are nothing more than a description coupling the amino acid sequence to certain functional sites of the protein.
All right, so then we go back to the next question. 2h was to blast the myostatin gene against the UniProt database. We have a button here called BLAST, so I can just click it and say go. It will start a BLAST search for all proteins in the database which are similar to this one, which would probably allow you to infer some kind of evolutionary relationship between different genes. If you go to BLAST and then click "advanced", you have the option to set all of the different parameters. Why we are using the BLOSUM62 matrix, we will talk about in the lecture today: why BLOSUM matrices are used when you want to compare proteins, compared to other matrices, for example for DNA. This will take a little bit of time.
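The two-letter line codes of the text format are what make the entry easy to read with a program. Here is a minimal Python sketch that groups lines by their code. The sample record is a hand-abbreviated stand-in built only from values mentioned in the lecture (entry name, length, accession, gene names, species); it is not the full entry, and `group_by_line_code` is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-abbreviated sample: each line starts with a two-letter code,
# followed by spaces and then the content. Not the complete record.
SAMPLE = """\
ID   GDF8_BOVIN              Reviewed;         375 AA.
AC   O18836;
GN   Name=MSTN; Synonyms=GDF8, MH;
OS   Bos taurus (Bovine).
"""


def group_by_line_code(text):
    """Collect the content of every line under its two-letter code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 2:
            continue  # skip empty lines
        code, content = line[:2], line[2:].strip()
        fields[code].append(content)
    return dict(fields)


record = group_by_line_code(SAMPLE)
print(record["AC"])  # ['O18836;']
print(record["OS"])  # ['Bos taurus (Bovine).']
```

A code can occur on several lines (DR and FT usually do), which is why each code maps to a list rather than a single string.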
I actually should have saved the previous BLAST search that I did, so that I could show you the answers. But that was UniProt. We'll just keep this running and wait until we get the result.
All right, the next assignment, assignment three, was all based on biomaRt. Of course, we first needed to install biomaRt. I hope everyone was able to install and load the biomaRt library from online. It sometimes has some issues, especially if you're using an older version of R, because the script is outdated. "Yeah, it's for the older version." Yeah, I know. You now first have to install this BiocManager, and it tells you so on the website. I think the old way still works, though. Let me see if it still works. No, it actually tells you that you have to use the manager to install. That's fine. They recently changed that, and this is a change which was made when they launched R 4.0. In the older versions, like 3.6 and 3.7, you are still able to use this two-liner to install it, but for R 4.0 you have to do it like this. "It started from 3.5 or greater." All right, so you can install it like this. Let me show you my R window, and then we can have the Firefox window.
When the BLAST search finishes, the results are not that surprising. This gene from Bos taurus is very closely related to the same gene in a couple of other Bos species, like the cows from India, and some other species which are not really cows but look like cows, so they are called Bos as well. These have a very big overlap with each other.
All right, let me get my answers here. The first question, 3a, was: connect to Ensembl. We can just load the library biomaRt. I have it already installed, so I don't have to do that.
And I'm really hoping that biomaRt will work, because I have been having some issues last week. It just gives me "failure to connect" errors, and sometimes it switches to different mirrors. So the Ensembl site is not responding, and then it tries the Asian mirror and the one in the US. If all of them are down, then we'll just have to leave it like it is. It's been an issue with biomaRt for the last couple of weeks, and I don't know exactly why, because it used to be one of these services which was really responsive and used a lot. It seems that in the last couple of weeks they have been having some issues with their hosting or these kinds of things. It doesn't seem to work at all. It's not able to switch to the mirrors, which is a shame, because it's such a nice tool. Especially in R, if you want to annotate genes, it's the easiest way to do it. If it doesn't work, I can of course not show you how it's done. But I will put the script online so that you can run it yourselves and hope that it works for you. It should, it's probably just busy, but there might be some server issues. All right, this is not going to work, so I'm just going to leave it there.
Let's go back to Firefox quickly, because the BLAST search did finish. Let me zoom out a little bit. You see here that it's 100% identical to Bos indicus, Bos gaurus, and Bos taurus, which is logical, and then also to Bos mutus. These are of course all cow species. If you scroll down a little bit, you see that it's indeed similar to all of these others. For example, here we see that the myostatin which is found in cattle is also found with 95% identity in pigs. We also have Taurotragus derbianus. I have no idea what that is. Yeah, it's the giant eland, so kind of a giant elk.
So when you use BLAST, you can find relationships and see how different species are related to each other based on a single protein. Normally you would have your phylogenetic tree of life, which is built as a consensus tree from hundreds and hundreds of these proteins, to show how all species on earth are related to each other. And we normally take something like a 95% cutoff. You see that it's also in elephantus and some other species, even the lynx. So this gene is very well conserved. Even in cats, the gene is not that much different from the gene in cattle.
All right, so those were the assignments. Let me switch back to PowerPoint. I have to get my PowerPoint window myself. So those were the solutions. The biomaRt part I will put online so you can run it. You probably have to do it later. Early in the morning would probably be best, because later in the evening the Americans start using it, so then it's even busier. If you use it early in the morning, you're only bothered by the Chinese and Indians using biomaRt.
All right, the overview for today. What are we going to talk about? Today we're going to talk about genome annotation, because we're doing sequence analysis, and sequence analysis means that we have a genome sequence, or we are going to produce one, and then we want to annotate it. So where are the genes located? Where are things like microRNAs and tRNAs? All of this is done using the homology trick. I will tell you what the homology trick is, and I will show you how it is used. Of course, when you talk about sequence analysis, you also have to talk about sequence alignments, and sequence alignments are more or less the core of bioinformatics.
The reason why bioinformatics started was that people were generating sequences in the lab and needed to do something with them. And the first thing that you do when you have multiple sequences is look to see if there's an overlap, so whether the sequences are similar in a certain way. We can do pairwise alignments, where you have two sequences and you want to find the best match between them. We have multiple sequence alignments, and we have structural alignments, in which you don't really use the sequence, but the 3D structure that is produced by the sequence.
Since we're talking about sequence analysis, I also want to talk about DNA motifs, because they are a very useful tool to scan for things like transcription factor binding sites or microRNA binding sites in the genome. Generally, the matching is not 100% exact. When a protein binds the DNA and its recognition sequence is six or seven base pairs long, it will still recognize the sequence if one of these six or seven base pairs is different. A DNA motif is a computational way of storing this kind of uncertainty: sometimes there's an A here, sometimes there's a C here. You can then use these DNA motifs to predict where proteins will bind.
At the end, I have two or three slides about genome assembly using whole-genome sequencing. This is more or less about how we now do de novo assembly when we have no genome sequence available to compare against.
All right, so genome annotation and the homology trick are actually very easy, because we just infer the function from a homologous sequence with a known function. We have a lot of sequences in a database, right?
We can use UniProt or some other database which has protein sequences, or a database like Ensembl, which has DNA sequences. If we know what one sequence does, we can infer, or more or less assume, that all of the sequences which are very similar do more or less the same thing. For example, we know what the sequence for hemoglobin is, and we know what the sequence is for a tyrosine kinase in humans. When we sequence a new species, we take the predicted protein sequences and match them to the closest known sequence, for example using BLAST. When we find a good hit, we say: okay, I now have this predicted protein in the species for which I created a new genome sequence, and this protein is probably something like hemoglobin, or something like actin. This works because many species are very closely related, and the relatedness has the shape of a tree.
So why does this work? The reason was already explained by Charles Darwin in 1859, and the illustration here is actually drawn by Charles Darwin himself. He explained that species are related to each other, and because of this relatedness they share traits. All birds have wings, so if you see a species which has wings, there's a big chance that it's a bird. Of course this is not always true. There are other species which have wings, like bats, and they are not birds but mammals. But in general, if you see an animal which has wings, then 99% of the time it will be a bird. That's just the way these things work. And this homology trick works because there is one common ancestor from which all life more or less started.
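The homology trick described above can be sketched as a best-hit lookup. This is only a toy illustration in Python: the standard library's `difflib.SequenceMatcher` stands in for a real search tool like BLAST, the two "known" sequences are made-up strings rather than real proteins, and the 0.8 cutoff and the name `annotate` are assumptions chosen for the example.

```python
from difflib import SequenceMatcher

# Made-up toy sequences with made-up labels, NOT real protein sequences.
KNOWN = {
    "protein_A": "MKVLAAGIVG",
    "protein_B": "MSTNPQRWWC",
}


def annotate(query, known, min_similarity=0.8):
    """Return (name, similarity) of the best hit, or (None, similarity)
    when even the best hit is below the cutoff."""
    best_name, best_score = None, 0.0
    for name, sequence in known.items():
        score = SequenceMatcher(None, query, sequence, autojunk=False).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= min_similarity:
        return best_name, best_score
    return None, best_score


# One letter differs from protein_A, so the function is transferred.
print(annotate("MKVLAAGLVG", KNOWN))

# Nothing in the toy database is close enough, so no annotation is made.
print(annotate("WWWWWWWWWW", KNOWN))
```

The important design point is the cutoff: below it, the sketch refuses to transfer a function instead of reporting the least bad hit.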
And if you look at the different kingdoms, like plants or animals, then these also generally have one founder from which the entire kingdom sprang, meaning that at a certain point they were all similar. Of course the sequences drift apart and change, but in the end a hemoglobin sequence doesn't change into a ubiquitin sequence. These are individual sequences. So the homology trick works because we have a common ancestor.
Sequences change during the course of evolution through things like mutations, insertions, and deletions. We already talked a lot about these in the context of DNA. Then we have chromosomal rearrangements, which are bigger changes that occur during the course of evolution. One example is a duplication. Duplication means that a part of a chromosome or a whole chromosome is duplicated, and this duplication is then passed on to the children and is stable. Duplication occurs a lot, and genes tend to be duplicated during the course of evolution. We also have inversions. It sometimes happens, when the DNA is copied, that you have a double-stranded break twice, and the part in the middle is inserted the wrong way around, so a gene which was located on the positive strand ends up inverted. This creates all kinds of incompatibility problems and is probably one of the reasons why speciation occurs. Because if half of your chromosome 2 turns around, you are no longer able to mate with individuals who do not have that inversion. Besides that we have translocations. A translocation means that a gene moves from one chromosome to another, and this also happens.
Not very frequently, but it does happen. And if you have a sperm and an egg cell, the chromosomes have to be homologous for chromosomal pairing to occur. So things like duplications, inversions, and translocations can result in infertility, meaning that you cannot breed anymore with the species you came from. This is not really the case for mutations, insertions, and deletions, because they are really small. You have very small deletions and very small insertions. You also have bigger deletions, but in general these don't really break the homology between chromosomes, while duplications, inversions, and translocations do.
The correspondence between homologous sequences is of course not exact. That is why a lot of time was spent in the early days of bioinformatics on finding a method that can do inexact pattern matching between sequences, and in the end this became pairwise sequence alignment. For a computer, inexact pattern matching is relatively hard. A computer is very good at determining if two things are the same. If you have two strings which contain the exact same word, a computer can quickly match them. But if one or two letters in these words are different, the computer has to calculate something like a distance score, and then based on this distance score it needs to decide whether these two things count as equal or not. That is still an active field of research.
When we look at pairwise alignment, we have for example a gene sequence of interest and a gene sequence with a known function. I just wrote down two sequences here, and now the question to you is: what is the similarity between those sequences?
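To make the idea of a distance score concrete, here is a minimal Python sketch. The lecture does not say which score is meant, so this uses the Levenshtein edit distance as one common choice: the smallest number of single-letter substitutions, insertions, and deletions that turns one string into the other. The example sequences and the cutoff in `is_similar` are made up for illustration and are not the sequences on the slide.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, computed row by row."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def is_similar(a, b, max_distance=1):
    """Exact equality is a == b; this allows up to max_distance edits."""
    return edit_distance(a, b) <= max_distance


print(edit_distance("GATTACA", "GATTACA"))  # 0, identical
print(edit_distance("GATTACA", "GACTACA"))  # 1, one substitution
print(edit_distance("GATTACA", "GATACA"))   # 1, one deletion
print(is_similar("GATTACA", "CCCCCCC"))     # False
```

Note that the plain comparison `a == b` is a single cheap check, while the distance needs a table of size len(a) times len(b), which is one reason inexact matching is so much more work for the computer.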
If there is a similarity between the sequences, then we have more or less a starting hypothesis, using the homology trick. The hypothesis is that the protein which is being made by this DNA has a similar function. Of course for humans it's relatively easy to just look at a sequence and say, well, these things look relatively similar, but for a computer this is really, really hard. It's a more or less solved question, but it's still an active field of research, and you can still contribute to this field by making the algorithms go faster. The idea of matching two sequences together is more or less solved, but the computational part can of course always be made more efficient. So this is what we want to know in the case of pairwise alignment. Does anyone in chat have an idea already if these two sequences are similar or not? It depends on the similarity parameter? Yeah, yeah, but that's the thing, right, that's the hard part: how similar is similar, and how dissimilar is not? Now, they're not the same length of course, but that's because as sequences change there are little point mutations changing individual base pairs. Hey, you have little insertions and deletions which might not be.
But if you just look at these sequences with human eyes, then you would say that there's probably something similar, right, because they both end with ATTT, ATC, right? Yeah. So for you as a human it is relatively easy to see if two things are similar, but for a computer this is a very, very hard task, because a computer only works with ones and zeros internally. So it can say, well, a zero is equal to a zero and a one is equal to a one, but when you give it these strings of characters, then it becomes relatively hard for a computer to figure out if two things are similar within a certain similarity threshold. All right, so the computer could compare the single letters? Yes, it can compare the single letters, but if you have an insertion, right, so if one sequence is one longer than the other one, then how does the computer deal with it? Do you allow it to introduce gaps, and if you allow it to introduce gaps, where should it introduce them? Should it maximize the number of matches or should it minimize the number of mismatches, right? So there's a lot to think about when you are designing these kinds of algorithms. The thing is, for example, do we force these things to match at the beginning and the end? Do we force them to be equal there before we start matching? So computers can indeed compare every single letter, but there can be a shift, right?
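The shift problem is easy to see in a small sketch. The sequences below are made up for illustration, and `positional_identity` is a hypothetical helper that simply compares letter by letter without allowing gaps.

```python
def positional_identity(a: str, b: str) -> float:
    """Fraction of positions that match when two strings are compared
    letter by letter from the start, without any gaps."""
    length = min(len(a), len(b))
    if length == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / length


original = "ACGTACGTAC"
shifted = "G" + original  # the same sequence with one inserted letter in front

print(positional_identity(original, original))     # 1.0
print(positional_identity(original, shifted))      # 0.0, every position is off by one
print(positional_identity(original, shifted[1:]))  # 1.0 once the shift is undone
```

A single inserted letter takes the naive score from perfect to zero, which is why an alignment algorithm has to be allowed to place gaps.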
If you have one sequence and you just move all the letters by one, then of course the sequences are still similar, but the computer cannot simply match them one by one anymore. So there has to be an algorithm to figure out what the similarity is. And of course in DNA you have to remember that there is a difference between transitions and transversions. Changing an A to a G, which is a transition, is much more common than changing an A to a T, which is a transversion. So if you see an A-G mismatch, then this is a relatively weaker mismatch than an A-T mismatch, because of the way that DNA works, and the same kind of thing holds for proteins. So AI would help with this problem? Well, probably not so much. In a way it can, in a way it can't. If you think about AI, then you generally think about self-learning algorithms, but in this case we are the deciders, right? We as humans are very good at pattern matching. Just think about these little puzzles which you used to have in these little books, where you would have a square filled with letters and there were words hidden inside of this square and you had to find them, right? Humans are very, very good at that. You can look at the word and then you just scan through the matrix and you find these words, because our eyes and our brains, the whole way that we are built, is built for pattern matching. We are built to recognize faces of other humans, and that's why people see faces in their toast, right? They look at their toast and they see a photo of their favorite rock star on there. Then they take a photo of it, they post it online, and like 90 percent of people are thinking, well, I don't see anything in this toast. But people are very good at recognizing patterns, and that's one of our strengths, and that's because our whole body and brain is trained to look for patterns all the time. But if we talk about pairwise sequence alignment... Like the number 13? How do you mean, like the number 13?
Is that a movie you referenced that I'm unaware of? The number 13? I don't know, 23? I'm lost, I'm lost, Commando. But if you want to do pairwise alignment... you mean 42? No, 42 is the answer to life, the universe and everything. Not 23 and also not 13. Yeah, but we are the deciders in this case. So humans are very good at finding patterns, and we want to transfer this knowledge that we have about when things are similar or dissimilar to the computer, to be able to analyze millions and millions of these sequences. So if you think about just two sequences, right, and you want to do pairwise alignment, there are two fundamentally different ways of doing it. The first way is to do a global alignment. What the global alignment attempts is to align every residue in every sequence to each other, and global alignment can be used when sequences are more or less equally long. Right, so if we look at these, then we would say yes, this is something where you could use global alignment. Hey, these sequences are different in length, but they are not too different.
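As a rough illustration of what "align every residue to each other" means in practice, here is a minimal sketch of a global alignment using the classic Needleman-Wunsch dynamic programming approach. The lecture has not named an algorithm at this point, so this is a swapped-in standard technique, and the scores (match +1, mismatch -1, gap -1) and the example sequences are assumptions made for the sketch.

```python
def global_align(s1: str, s2: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Needleman-Wunsch sketch: returns (aligned_s1, aligned_s2, score)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per letter.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of diagonal (match/mismatch),
    # up (gap in s2) or left (gap in s1).
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best alignment.
    aligned1, aligned2 = [], []
    i, j = len(s1), len(s2)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            aligned1.append(s1[i - 1])
            aligned2.append(s2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned1.append(s1[i - 1])
            aligned2.append("-")
            i -= 1
        else:
            aligned1.append("-")
            aligned2.append(s2[j - 1])
            j -= 1
    return "".join(reversed(aligned1)), "".join(reversed(aligned2)), score[-1][-1]


top, bottom, total = global_align("ACGT", "AGT")
print(top)     # ACGT
print(bottom)  # A-GT
print(total)   # 2
```

Both sequences are used from the first letter to the last, and the shorter one is stretched with a gap so that the remaining letters line up.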
Right, then we can also see that, well, here it starts with a GT, and so that might match here at the beginning, and then we have a CT, which is this part here, right? So you would say that, well, there's a little AC insertion in the first sequence which is not found in the second sequence, right? And then we have CTG, which is an exact match here, and then we have this part, so there's another GA insertion, right? So these two sequences are completely similar except for the fact that this one has a G more at the beginning, it has an AC insert and it has a GA insert. So you only have to do three modifications to come from this sequence to that sequence, right? So three mutations more or less would be enough, meaning two insertions and one kind of point deletion. So these sequences are more or less suitable for global alignment. Global alignment is when you have two sequences which are of relatively similar length, and what you are saying is, well, I have a protein sequence of a hundred and forty amino acids, I have another protein sequence which is a hundred and twenty amino acids, align these as best as you can. Then the other possibility to do alignment is of course local alignment. Local alignment describes the most similar regions within the sequences to be aligned, and this is generally used when you have a very short sequence on the one hand and a very long sequence on the other hand. So think about it: I have a little RNA sequence that I found and I have a genome sequence, and the genome sequence is billions and billions of letters long. Of course I cannot take this RNA, start at the beginning and just compare it to every position, right? That would just take too much time. It would do billions and billions of comparisons just looking for an exact match, and then allowing for one or two mismatches would be even worse. Yes, it would do millions and
millions of comparisons. So global alignment is when sequences are about equally long, local alignment is when you are looking for a subsequence in a large sequence. So in theory, hey, if we have S1 and S2, which here are of course of similar length in a way, what a global alignment will do is try to match the whole strings, and by matching the whole strings it will start to insert gaps into S1 and S2, while a local alignment will match the optimal substring. So it will allow mismatches in the middle, but it will not necessarily align to the beginning or the end. So when we do a global alignment, we put extra weight on stuff which is matching at the beginning and at the end, and we more or less don't penalize for gaps in the middle, while here in local alignment we will just ignore the ends, because we're looking for the local optimum where we can fit S2 without having to modify S2 too much. So in global alignment we generally allow S2 to be kind of modified and chopped into pieces, while in local alignment we just want to find the most optimal substring: where does it fit best? So local alignment generally introduces few or no gaps, while global alignment introduces gaps in the shorter sequence. Of course there are many different ways of aligning, and in order to assess the quality of an alignment we have to have a scoring function. The most basic scoring function that you can come up with as a simple human or simple bioinformatician is just the number of matches that you have, right? And you express that as a percentage. So here we have two sequences which are aligned to each other, and in the first alignment... are they the same? Yeah, so why is this five out of seven? Oh, okay, because here we are missing the last two and here we have the last with a mismatch. But the
idea is that you align two sequences, and of course we now decide which one is better. Hey, is alignment one better than alignment two? No, alignment two is the better alignment, because here 11 out of 16 base pairs match, while in alignment one it's five out of 17, or five out of 16, depending on if you count the last two deletions here. I think these last two should not be here, and the first one here as well, so this would be five out of 15 and this would be 11 out of 15 base pairs matching, right? And this is the idea behind the most basic scoring algorithm that you can come up with, and of course this is useful, right, because now we can decide which one of the alignments we would prefer. But of course we can do a lot better, right? Doing a lot better means, for example, taking our scoring function that we had before, but now we are going to add a gap penalty, right? So for introducing a gap we want to penalize. So we do additive scoring with a linear gap penalty. That means that we look at the similarity at position one, at position two and so on, and then for every gap position that we introduce we add a negative score, right? So we go through each of the positions in S1 and S2, but when we introduce a gap we actually penalize for that. We just say, well, if there's a match between two bases we give a plus one, if there's a mismatch between two bases we give a negative one, and if we introduce a gap we just say negative one as well. And that's just to kind of deal with the gaps. The question here is: should you penalize gaps as much as mismatches, right? Is it worse to have a gap or is it worse to have a mismatch? And of course from biology, from the way that DNA works and that things mutate, we have kind of an idea what good fitting parameters are, right? And this is just a way of extending the scoring function to include the possibility to
kind of introduce gaps in one of the two sequences. You can actually do a little bit better, and doing a little bit better means that instead of a linear gap penalty, right, where for every missing base pair in one of the sequences you score minus one, you can say, well, opening a gap is expensive, but once you open a gap it's relatively cheap to extend the gap, right? And that comes from the biology. The idea is that if you have a deletion, hey, this deletion is generally bigger than like a couple of base pairs. So if this one sequence is actually originating from the other sequence but it just has a deletion in there, then of course you don't want to penalize it based on the size of the deletion. Because biologically speaking, or DNA-technically speaking, it doesn't matter if a deletion of five base pairs is introduced or a deletion of 10 base pairs. In both cases you just have one deletion in the DNA, right? So what an affine gap penalty means is that you have two different penalties. One is for opening a gap, which is generally high, and then you have another, which is the gap extension penalty. So when you extend the gap you don't penalize as much as when you open up the gap. Is it clear what the difference is with the linear gap penalty? So there we just look at the number of positions where there is a gap and multiply that by minus one, which is the same as for a mismatch, and here we just say, well, no, opening up the gap gives you for example a minus one, but extending the gap only gives you like minus 0.05. All right, I hope that that's clear. So that's the affine gap penalty. All right, you have to realize that alignments can be done at the DNA or at the protein level. Right, so an exercise for you guys. Here I have a little example, and I'm actually going to stop recording, because we need to take a break as well. Let me see, can I actually
Because if you take a break, I do want to show you guys the gifts, of course. I will at least stop recording, so...
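As a recap of the three scoring functions discussed above (percentage of matches, linear gap penalty, affine gap penalty), here is a minimal sketch that scores an alignment that has already been made. The numbers follow the lecture (match +1, mismatch -1, gap -1, gap open -1, gap extend -0.05), but the function names, the use of "-" as the gap character, the example alignment and the convention that the first gap position pays the opening penalty while each further position pays only the extension penalty are assumptions made for this sketch.

```python
def check_lengths(a1: str, a2: str) -> None:
    if len(a1) != len(a2):
        raise ValueError("aligned sequences must have the same length")


def percent_identity(a1: str, a2: str) -> float:
    """Simplest score: percentage of alignment columns that match."""
    check_lengths(a1, a2)
    matches = sum(1 for x, y in zip(a1, a2) if x == y and x != "-")
    return 100.0 * matches / len(a1)


def linear_score(a1: str, a2: str, match=1, mismatch=-1, gap=-1):
    """Additive score where every gap position costs the same."""
    check_lengths(a1, a2)
    total = 0
    for x, y in zip(a1, a2):
        if x == "-" or y == "-":
            total += gap
        else:
            total += match if x == y else mismatch
    return total


def affine_score(a1: str, a2: str, match=1.0, mismatch=-1.0,
                 gap_open=-1.0, gap_extend=-0.05):
    """Additive score where opening a gap is expensive and extending is cheap."""
    check_lengths(a1, a2)
    total = 0.0
    previous_gap = None  # which sequence had a gap in the previous column
    for x, y in zip(a1, a2):
        if x == "-" or y == "-":
            current_gap = 1 if x == "-" else 2
            # Continuing a gap in the same sequence is an extension.
            total += gap_extend if current_gap == previous_gap else gap_open
            previous_gap = current_gap
        else:
            total += match if x == y else mismatch
            previous_gap = None
    return total


s1 = "ACGTTTGA"
s2 = "AC---TGA"  # one deletion of three letters
print(percent_identity(s1, s2))        # 62.5
print(linear_score(s1, s2))            # 2
print(round(affine_score(s1, s2), 2))  # 3.9
```

The single three-letter deletion costs 3 under the linear penalty but only 1.1 under the affine penalty, which matches the biological argument that one deletion event should not be punished in proportion to its size.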