everyone to part number three of the sequence alignment lecture, or sequence analysis lecture. If you're watching this on YouTube, welcome back. If you're watching this on Twitch, thanks for still being here. I hope you enjoyed the alligators. Good, so let's continue.

So like I told you guys, ClustalW, or Clustal Omega, which is the updated algorithm, is the most popular alignment tool today. The W stands for weighted: different parts of the alignment are weighted differently, and that is because it uses the transition/transversion matrix or the protein substitution matrix. It's a three-step process: it constructs pairwise alignments, then it builds the guide tree using neighbor joining, and then it does progressive alignment guided by the tree. So sequences are aligned progressively according to the branching order in the guide tree.

So I have a little example for you guys on how this works. First you align each sequence against every other sequence, which gives you the similarity matrix. The similarity in ClustalW is defined as the number of exact matches divided by the sequence length. This is also known as the percentage identity; not the sequence identity, but the percentage identity. Right, so if you have four sequences that you align together, then .17 here means that v1 versus v2 showed 17% identity, so 17% of the positions matched.

Step two is creating the guide tree using the similarity matrix, and ClustalW uses neighbor joining instead of, for example, hierarchical clustering. The guide tree is thought to reflect the evolutionary relationships between the different sequences that you're looking at. So if we do the pairwise alignment of all of these sequences, we get this similarity matrix, and we see that v1 and v3 are the sequences which have the highest similarity, 0.87. So these cluster together.
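The similarity matrix described above can be sketched in a few lines of base R. This is only an illustration: the four toy sequences are made up (they are not the ones on the slide), they are assumed to be pre-aligned and of equal length, and the real ClustalW derives its scores from actual pairwise alignments rather than from a position-by-position comparison.

```r
# Percent identity between two already-aligned sequences of equal length:
# number of exact matches divided by the alignment length.
percent_identity <- function(a, b) {
  a_chars <- strsplit(a, "")[[1]]
  b_chars <- strsplit(b, "")[[1]]
  stopifnot(length(a_chars) == length(b_chars))
  sum(a_chars == b_chars) / length(a_chars)
}

# Hypothetical, pre-aligned toy sequences (not the ones from the slide).
seqs <- c(v1 = "ACGTACGTAC",
          v2 = "TTGTCAGTCA",
          v3 = "ACGTACGAAC",
          v4 = "ACGAACGATC")

# All-against-all similarity matrix.
n <- length(seqs)
sim <- matrix(0, n, n, dimnames = list(names(seqs), names(seqs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    sim[i, j] <- percent_identity(seqs[[i]], seqs[[j]])
  }
}

# The most similar pair (ignoring the diagonal) is the one aligned first.
off_diag <- sim
diag(off_diag) <- -Inf
best <- which(off_diag == max(off_diag), arr.ind = TRUE)[1, ]
first_pair <- names(seqs)[sort(best)]

print(round(sim, 2))
print(first_pair)
```

With these toy sequences, v1 and v3 come out as the most similar pair, so they would be the first two sequences to be aligned in the progressive step.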
So we first align v1 versus v3, and then the next sequence that gets put into the alignment is v4, because v4 is the sequence which is closest to v1 and v3, and then v2 is introduced last. So we start by aligning the two most similar sequences, then we follow the guide tree, and we insert gaps as necessary.

So how does this look? If we have these four sequences of four amino acids each, the first thing is to do the pairwise alignments and create the distance matrix; then from the distance matrix we create the neighbor-joining tree. The neighbor-joining tree is then used: we align s2 with s4, we align s1 with s3, and then we align these two groups together. And then we have our multiple alignment in the end, which looks like this, because we also needed to introduce some gaps. Of course, if you want to know exactly how ClustalW works, then just look up the original paper, where it's explained in much more detail.

But the results from ClustalW: we already saw one of these results. Let's scroll back a little bit to where we had the example. You see that underneath it gives you these codes: colon, star, colon and dot. These have a meaning, and this meaning corresponds to what we looked at here. In ClustalW, a star means that the column contains identical amino acids in all sequences, or identical bases if you are aligning DNA sequences. So a star means there is no variation at this point. A colon means that the column contains different but highly conserved amino acids, so this is where all of the residues are, for example, hydrophobic, or all hydrophilic. And a dot means that this column of the alignment contains different amino acids that are somewhat similar: they're not exactly identical, they're not all hydrophobic, but they still score as relatively similar in the BLOSUM matrix or the PAM matrix.
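To make the symbols concrete, here is a minimal base R sketch that assigns a star, colon, dot or blank to each column of an alignment. The residue groups are the "strong" and "weak" groups as they are commonly documented for Clustal; treat them as an assumption and check the documentation of the version you use. The toy alignment is hypothetical, and treating every gapped column as blank is a simplification.

```r
# Residue groups as commonly documented for Clustal's annotation line
# (an assumption here; verify against the documentation of your version).
strong_groups <- c("STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                   "MILV", "MILF", "HY", "FYW")
weak_groups <- c("CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
                 "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY")

# TRUE if all residues of the column fall inside at least one single group.
in_one_group <- function(residues, groups) {
  any(vapply(groups, function(g) {
    all(residues %in% strsplit(g, "")[[1]])
  }, logical(1)))
}

# Annotation symbol for a single alignment column.
column_symbol <- function(residues) {
  if ("-" %in% residues) return(" ")              # gaps: left blank
  if (length(unique(residues)) == 1) return("*")  # fully identical
  if (in_one_group(residues, strong_groups)) return(":")
  if (in_one_group(residues, weak_groups)) return(".")
  " "
}

# Annotation line for a whole alignment given as equal-length strings.
annotation_line <- function(aligned) {
  mat <- do.call(rbind, strsplit(aligned, ""))
  paste(apply(mat, 2, column_symbol), collapse = "")
}

# Hypothetical toy alignment, for illustration only.
aligned <- c("MKLSA-W",
             "MKISC-W",
             "MRVTAGF")
cat(aligned, annotation_line(aligned), sep = "\n")
```

The last printed line is the annotation row that sits underneath the alignment, one symbol per column.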
And when there's a blank, that means that the alignment there is very dissimilar or that there are gaps, so there's no homology in this region. So if we look here at, for example, human, mouse, rat, cow and chimp, at histone H1, then we see that there are many regions where there is no variation at all, so they get a star. Some regions are conserved, other regions are completely non-conserved, and there are some dots, so there are some semi-conserved regions where the amino acids are still relatively similar, but not exactly identical to each other. Good, so that's how you read a ClustalW multiple alignment.

Multiple sequence alignment has one major pitfall, and that is that it works under the assumption that you know what you are doing. It always produces an alignment, even if the things that you are aligning have nothing to do with each other. So when you do a multiple sequence alignment, make sure that the sequences that you give to the aligner are related to each other. If you have completely unrelated sequences and you just throw them into a multiple sequence alignment, it will produce an alignment, but the rule of thumb here is: if it looks wrong, it probably is.

It took me some time of searching, but this is one of these pitfall examples where the aligner will more or less go completely crazy. If you have "the fat cat", "Garfield the very fast cat", "Garfield the fast cat" and "Garfield the last fat cat", then it doesn't understand this. These sequences are so dissimilar from each other that when it does the first alignment, it will just align "cat" with "fat" and introduce a gap at the end. And why will it do that? Because the penalty for creating a gap at the end is lower than for creating a gap in the middle. So instead of aligning "cat" with "cat", it actually aligns "cat" with "fat". So it introduces a mismatch while there's actually a perfect match here.
It becomes even worse when you add "the very fast cat", because now it will align "cat" with "cat", but here it starts introducing gaps between "fat", "cat" and "fast". And of course, when you have "the" in there, it still doesn't align properly. So it's up to you to make sure that the sequences that you give to the multiple sequence aligner make sense.

Multiple sequence alignment is still an active field of research. ClustalW was published by Thompson in 1994, but before that and slightly after that there were also other people writing multiple sequence alignment tools. So this is just a very basic selection of the tools that are out there. ClustalW, or nowadays Clustal Omega, which is an updated version of the algorithm, is used the most, but a lot of people still use MUSCLE, for example. It's a good aligner and it has its advantages and disadvantages, but there is no optimal multiple alignment tool in the way that there is the Smith-Waterman algorithm. Smith-Waterman is a computational algorithm which is guaranteed to give you the optimal pairwise alignment. All of these tools will try to get close to the optimal alignment, but none of them is guaranteed to actually find the optimal multiple sequence alignment. So pairwise alignment is a solved problem in biology. Multiple sequence alignment is still an open question, and there's still research being done to find the best multiple sequence alignment algorithm. So keep that in mind if you want to do bioinformatics and you're looking for a problem that isn't solved yet: multiple sequence alignment is not a solved problem.

Nowadays, instead of multiple sequence alignment, people actually want to do structural alignment, because the function of a protein depends on the structure of the protein and not directly on the sequence of the protein.
Of course the structure has something to do with the sequence, but there's not a one-to-one relationship between the sequence and the structure. And we also talked about this during the protein lecture: there is no algorithm to which you can give the primary sequence of a protein and which will then predict the 3D structure of the protein. So again, one of these unsolved questions.

Structural alignment is based on the idea that you want to use information about the secondary and tertiary structure of the protein or RNA molecule to aid in aligning your sequences. If you know that it's important that there's a loop in the RNA at this point, then when you're aligning two RNAs you want both of them to have this loop, and to have the loop at the same point. Protein and RNA structure is more evolutionarily conserved than the sequence itself, so a structural alignment should in theory perform better, or give you more realistic answers about the evolutionary distance between two proteins, than just a multiple sequence alignment.

There are two tools available which I know of, and there are bound to be more, because again it's an active field of research; papers about structural alignment are published, well, not daily, but they're still being published. One of them is DALI, which is a fragment-based method for constructing structural alignments based on contact similarity patterns between hexapeptides in the query sequence. So they chop up the peptide sequence into hexamers, and then they try to find out whether the structure of each hexamer matches the structure of a hexamer in the database. And then you have SSAP, which tries to do the same thing but uses atom-to-atom vectors in structure space as comparison points. So they use the distances between atoms within proteins as a description of the structure.
But again, structural alignment is a very interesting field of research. It's not a solved thing and there's still active research ongoing. But you can use these tools, and the alignments that they produce are generally interesting compared to multiple sequence alignment. By interesting I mean that they are either better than or different from the multiple sequence alignment, so they give you another viewpoint on what the optimal alignment is.

So why do we do all of this alignment? I told you guys that one reason is homology: if you have an unknown genome, you might want to annotate sequences of that genome using a known database. But another area where alignment plays a big role is finding DNA motifs. A transcription factor which binds DNA will recognize the same DNA sequence in mouse as it does in humans, in giraffes and in crocodiles. So by doing multiple sequence alignment, not just of 10 sequences but of hundreds of sequences from hundreds of different species, we can start finding regions in the genome in front of genes which are highly conserved. And once we have found these highly conserved regions we can build a motif. So we can say: if you see TATAA, then this is generally where the TATA-box binding protein binds.

So, the representation of DNA motifs. A DNA motif is a piece of sequence which is found in many different species, not exactly identical but very similar, and which is believed to be the binding site of a protein on the DNA. There are three different ways in which DNA motifs are represented. There is the standard string representation. Then we have the matrix representation, which is the most commonly used representation nowadays. And then we have the representation with nucleotide dependencies, which again is one of these unsolved issues: how do you encode nucleotide dependencies within a DNA motif?
Right, so the string representation is very easy; it's just a character string like TATAA for the TATA box. It uses the symbols A, C, T and G, but it also uses all of the other symbols from the IUPAC list. So if you have a DNA sequence and you see a W, then the W actually means an A or a T. An S is strong, which means a C or a G. If you see an M, that means A or C. If you see a K, it means G or T. And these symbols are what build the string representation.

So TATAA is a good motif because there are no wildcards in there; there are no uncertainties. But it could also be that you have a motif which is TSSMCCG. Then you know that some positions are fixed, like the T at the start, but there are some positions where you have a choice, because for the protein that recognizes the DNA sequence it doesn't really matter whether there's a G or a C. You would code that as S, because the only thing that you require for this position is a strong binding, which means a C-G pair. And we use this IUPAC notation, so the International Union of Pure and Applied Chemistry notation, to write down these sequence motifs.

It is much more common to represent DNA motifs like this, and then you use position weight matrices, also called position-specific scoring matrices.
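The string representation with IUPAC codes maps directly onto regular expressions, which is one simple way to search for such a motif. The sketch below uses only base R; the function names and the example DNA string are made up for illustration.

```r
# IUPAC nucleotide codes mapped to regular-expression character classes.
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           R = "[AG]", Y = "[CT]", S = "[CG]", W = "[AT]",
           K = "[GT]", M = "[AC]",
           B = "[CGT]", D = "[AGT]", H = "[ACT]", V = "[ACG]",
           N = "[ACGT]")

# Translate a motif written in IUPAC codes into a regular expression.
motif_to_regex <- function(motif) {
  codes <- strsplit(toupper(motif), "")[[1]]
  stopifnot(all(codes %in% names(iupac)))
  paste(iupac[codes], collapse = "")
}

# Start positions (1-based) of all motif matches, found by testing every
# window of motif length, so overlapping matches are reported as well.
find_motif <- function(motif, dna) {
  pattern <- paste0("^", motif_to_regex(motif), "$")
  k <- nchar(motif)
  n <- nchar(dna)
  if (n < k) return(integer(0))
  starts <- seq_len(n - k + 1)
  windows <- substring(toupper(dna), starts, starts + k - 1)
  starts[grepl(pattern, windows)]
}

# Hypothetical DNA string, for illustration only.
dna <- "GGTATAAACCTGCACCGTT"
print(motif_to_regex("TSSMCCG"))
print(find_motif("TATAA", dna))
print(find_motif("TSSMCCG", dna))
```

A string search like this is all or nothing: a site either matches the pattern or it does not, which is one reason the matrix representation is preferred when you want a graded score.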
Right, so here we see the matrix representation that is used by the computer. At position one we had in total 31 aligned sequences, and we see that in 28 out of 31 sequences there is a C at this position. So we say that at position one of the motif there is almost always a C. In the second position of the motif you see that only 22 sequences have a T; some have a C and some have an A. So in the logo there's a T, but this T is smaller than the C because it doesn't occur as often. These matrices can be represented as a matrix when you load them into a computer, or they can be viewed as a sequence logo. So now you can see that the most common sequence bound by this protein is C, T, T, T, G, A and then an unknown position: an A might be the most common there, but the information content is low.

So how do you calculate one of these PWMs? Imagine that we have a multiple sequence alignment of 50 or 60 sequences; just take 50 because it's a nice round number. To create a position frequency matrix, we count for each of the positions the number of times that we see an A, C, T or G. Then we create a position probability matrix by dividing each of these numbers by 31, because here we have 31 observations, and here we also have 31, and here as well. So in this case you divide every column by 31, but it doesn't have to be that way: at some positions there might be a missing value, so that you divide by 30. So you create your frequency matrix, how often do I observe something, and then you create your probability matrix by dividing by the number of observations.

And now we do something special, because we have to compensate for the A, C, T, G content of the individual organism. If I look at bacteria, bacteria generally have a high GC content, so it is
much more likely for a bacterium to have a C in its genome than an A at that position, because the whole genome itself is something like 80% GC and only 20% AT. To compensate for that, for each entry in the table we calculate the log2 of the entry divided by b_i, where b_i is the occurrence of that base in random DNA. For example, if we look at a bacterium, then across the genome we can count and see: there is 50% C... no, 50% is impossible, because then it would also be 50% G. So you would have something like 35% C, 35% G, 15% A and 15% T, because that adds up to 100%. So we take the entries in this matrix, we divide by the number of observations, then we compensate for the A, C, T, G content by dividing by this b_i, the occurrence of the base in random DNA, and then we take the log2 to get a nice score. That's just how people roll: as a computer scientist you generally want to work with large, very visible numbers, and you don't want to deal with very small numbers. This is already a small number, and the log2 turns it into a score that is easier to work with.

So this is the most common way nowadays to build these position weight matrices and sequence logos, where you can look at what a certain protein binds. However, we know from biology that sometimes a base in the motif is not independent of the other bases. If you have an A at position one, then you have to have a G at position five, while if you have a T at position one, then you have to have a C at position five. So position one and position five are not independent of each other; there's a relationship, they depend on each other. And this is then called a scored position-specific pattern.
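As a reference point for the independence issue, here is a minimal base R sketch of the standard position weight matrix calculation described earlier: counts, then probabilities, then log2 odds against the background. The binding sites are invented for illustration, and the pseudocount is an addition that the lecture does not mention; it is only there to avoid taking the log of zero.

```r
# Hypothetical aligned binding sites (equal length, no gaps).
sites <- c("CTTTGA", "CTTTGA", "CTTTGC", "CATTGA",
           "CTTTGT", "TTTTGA", "CCTTGG", "CTTAGA")
bases <- c("A", "C", "G", "T")

# Position frequency matrix: count each base at each position
# (rows are bases, columns are motif positions).
chars <- do.call(rbind, strsplit(sites, ""))
pfm <- t(sapply(bases, function(b) colSums(chars == b)))

# Position probability matrix. The pseudocount (not part of the lecture)
# avoids log2(0) for bases that were never observed at a position.
pseudocount <- 0.5
ppm <- sweep(pfm + pseudocount, 2, colSums(pfm + pseudocount), "/")

# Background frequencies b_i, here the GC-rich example from the lecture.
background <- c(A = 0.15, C = 0.35, G = 0.35, T = 0.15)

# Position weight matrix: log2 odds of motif versus background.
pwm <- log2(ppm / background[rownames(ppm)])

# Score a candidate site by summing one entry per position. Summing treats
# the positions as independent of each other.
score_site <- function(site, pwm) {
  s <- strsplit(site, "")[[1]]
  sum(pwm[cbind(match(s, rownames(pwm)), seq_along(s))])
}

print(round(pwm, 2))
print(score_site("CTTTGA", pwm))
print(score_site("GAAACT", pwm))
```

Because the score is a plain sum over positions, a matrix like this cannot express a rule such as "an A at position one requires a G at position five". That is exactly the dependency that the more complex representations try to capture.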
That's the way that people try to represent this so that a computer can understand it, but again, this is very complex and it is not a solved problem at all. It's closer to the biology and it's a better model; we know this because it represents reality much better, since we know that there is dependence between these bases when a protein binds. But currently we're unable to write this down in such a way that a computer can understand it and figure it out.

Of course all of these things are only possible because of multiple sequence alignment. If we could not do multiple sequence alignment, we would not be able to create position weight matrices and we would not be able to make sequence logos. All of this is based on multiple sequence alignment.

So if you want to look at motifs, there are two online databases which have this motif information. The first is TRANSFAC. It is the largest repository of transcription factor binding sites. The big issue here is that it's commercial, so you have to pay to get access to the database. It used to be a free database, but then the guys behind it left academia and made a company out of it. There is still a small public database: at the moment that they left academia they had to leave the database as it was, because it wasn't theirs. Well, it was partly theirs but also partly the university's. So there's a small public database there, but it only contains the transcription factors known up until a certain point in time.

So nowadays the open-source database that most people use for motif information is JASPAR. For example, if I know that a certain protein binds DNA and I want to know what the sequence of the DNA is that it binds, then I can go to JASPAR. This is an open-source database and it provides an
API, and it is also directly available from R. So you can query the database directly from R and ask: what motif is bound by the estrogen receptor in mouse? And then it gives you this position weight matrix, which you can use to do your own predictions, so to see whether the estrogen receptor might bind somewhere in your DNA sequence.

So finding motifs is done using multiple sequence alignment. There is also another way of doing it, and that is phylogenetic footprinting. Binding sites tend to be conserved in evolution, so we can use DNA sequencing data from multiple species and align the upstream regions of homologous genes. Transcription factor binding sites then tend to show up as conserved patterns of 5 to 15 base pairs long. So this is how you find them, and again multiple sequence alignment is key. There are a lot of changes from one species to another, but these changes will mostly occur outside of the motif, because if a gene is regulated by a certain protein, then this protein regulates this gene not just in humans but also in mice and in cattle.

Again, there are many different tools available for motif finding. If you want to find new motifs, then here is a list of tools that you can use. If you want to search for transcription factor binding sites in your DNA or in your species of interest, then you can use these tools. They generally take a position weight matrix and use it to scan the DNA, and then tell you: there's a 90% probability that at this point in the DNA there is a binding site for the estrogen receptor, or at this position there's a 70% chance. So there are tools for finding motifs and there are tools for searching using motifs.

So multiple sequence alignments are key to many other fields. When you are assembling a genome you are also using multiple sequence alignment or pairwise
sequence alignment. When you do RNA sequencing, the alignment to a genome, then of course you have a sequenced genome... I wonder who's calling me. Let me just pick up the phone really quickly, guys. All right, sorry about that, guys. The IPv4 phone thing is still bugging me; I still get calls for other people who are apparently listed under my number.

All right, so for multiple sequence alignment or pairwise sequence alignment there are many different fields of application. There is genome assembly, where we want to assemble a new genome from all kinds of short reads that we got from a DNA sequencer. Then we have RNA sequencing, which is sequencing RNA molecules and then aligning them to a reference genome or to an already known genome. We do this to locate genes, like where the introns and the exons are, but it can also give us information about alternative splicing, so about which different transcripts are being produced from a gene. Multiple sequence alignment is also used when you do population-level SNP detection.

So I wanted to show you quickly where multiple sequence alignment is used when you do de novo genome assembly. De novo genome assembly is the process of determining the DNA sequence of an organism. Generally you use whole-genome short-read sequencing, which we talked about a lot. The other way of doing it is using bacterial artificial chromosomes, in which you take a little piece of the sequence, put it into a bacterial chromosome, duplicate it a lot of times and then use PCR to figure it out. Genome assembly is the computational step that follows genome sequencing, with the objective of reconstructing the genome from its reads.

So how does this work? We have overlap between short reads. If we have, for example, different reads that came out of the sequencer, like the green read, the red read and the blue read, then we need to find where these
reads are overlapping each other. We can do pairwise alignments to couple reads together and say that this part of one read is the same as that part of the other read, and the same thing here between the red and the green. What we then do is assemble these into a contig. Some reads can of course be positioned in many places, so this is a read for which there's no single match. But when you take the individual whole-genome sequencing reads and align them pairwise, and you find that the end of one read overlaps the beginning of another read, then you combine these two, and you create this sequence which is called the contig. The contig is shown here in black.

After we've assembled contigs, because you never get the whole genome in one go, we have to use scaffolding to figure out how far these contigs are from each other. A scaffold links a non-contiguous series of genomic sequences into one sequence, separated by gaps of known length. These scaffolds are made like this: when we do paired-end sequencing, we get a read from one side of the fragment and a read from the other side of the fragment, and we know more or less that there will always be between, say, 300 and 700 base pairs between read number one and read number two. So now we do pairwise alignment of read number one against all the contigs that we have, and pairwise alignment of read number two against all the contigs that we have. If we find that the forward read, so read one, is in one contig and read number two is in another contig, then we know that these sequences are very close together on the genome, and we know that there's a gap of around five to seven hundred base pairs. These gaps are then filled with Ns.

Moderator, can you ban the spammer in chat? Thank you. All right, so when one read falls in one contig and we have
another read falling in another contig, then we know that these two pieces of DNA are very close together and that there are around 500 base pairs in between.

Good, so now we have talked a lot about multiple sequence alignment, what you can do with it, why it is so important, and that it is key to bioinformatics. So how do you guys do multiple sequence alignment at home, in R? Well, multiple sequence alignment can be done directly from R. It is provided by the msa package, which is distributed through Bioconductor, so you install it with BiocManager::install("msa") rather than with a plain install.packages call. This will take a little while; I think it's a relatively big package, it will download something like four or five MB and then install. Once you've installed the package you have to load the library, so you say library(msa), and from then on you can do multiple sequence alignments in R.

So how do you do multiple sequence alignments in R? First things first, you need to create a vector with sequences. These are the sequences that we used as an example to find which parts were conserved, and you see here this region at the end which is conserved. This can be DNA or protein. So what I'm doing is creating a new variable called sequences, which is a combination of all of these little strings. Before we can do the multiple sequence alignment we have to convert this to an AAStringSet or a DNAStringSet. Why do we make this difference, and why don't we use the strings themselves directly? This is because of the transition/transversion scoring, but also because of the fact that we
have the IUPAC coding for DNA. From a sequence like this you cannot directly know whether it is DNA or amino acids. So what we do is convert it to an AAStringSet in our case, because these are amino acid codes; we could also use DNAStringSet when we want to align DNA. So we say AAStringSet of our sequences, and I'm just calling the result myAAseqs, for my amino acid sequences.

Then we can perform the alignment using, for example, ClustalW. The msa function actually supports three different aligners: it has ClustalW, it has MUSCLE, and I think it has a third one as well which I never used. It's a very easy call: you just say do a multiple sequence alignment of my amino acid sequences, you specify which tool you want to use, and you store the result in a new variable, for example called myAlignment.

If I then look at what myAlignment looks like, it tells you that it used ClustalW 2.1, that you made this object using this call, and that we have a multiple sequence alignment of amino acids with 8 rows and 27 columns. So there were eight sequences that we started off with, and there are 27 columns after aligning the sequences together. Besides that, it shows you each of the sequences with the gaps that were introduced, and it also shows you Con, which is the consensus sequence. So we have a wildcard, a wildcard, but at the third position there's always an L, at the fifth position there's always a C, at the seventh and eighth positions there's a V and an S, and so forth. So basically the consensus sequence is what is conserved between all of the different sequences that we put in.

Of course we generally want to see the distance between sequences, and we generally want to build something like a phylogenetic tree: a tree which we can visualize so that people can look at the tree and say, oh,
so these species belong together, like cows are very closely related to whales, and whales are very distantly related to ants. So you want to build a tree. If you want to see the distances, we need to use a different library, which is called seqinr. We say library(seqinr) to load this library; of course we have to install it first, so install.packages("seqinr"). Then we can take the alignment that we just produced and use msaConvert to convert it to a seqinr alignment. That is because the dist.alignment function, which computes the distances between the aligned sequences, is not available in the msa package, but it is available in the seqinr package. So we go from one representation to another.

So we have the alignment, and then we ask for the distances of the alignment based on the similarity, and we get a distance matrix. This distance matrix is nothing more than a matrix which in this case has eight rows and eight columns, and each value inside the matrix tells you how similar two sequences were. Then we do clustering: we take the distance matrix, make a clustering of it, and then we just plot the clustering that we get. In our case it looks like this. It doesn't look very pretty, but you can do a very quick examination.

In the beginning, when we defined our sequences, we didn't give these sequences names, but of course we could have used the names function on sequences and assigned names, saying this is cow, this is human, this is pig, or whatever. When you give names to the sequences before you make it an AAStringSet or a DNAStringSet, the plot will also show the names and not just 1, 2, 3, 4, 5, 6, 7, 8. But here we can actually see that sequence one and sequence two are related to each
other, and sequence three is also relatively closely related to one and two, but sequences one and two are very different from sequences four and five, and sequences four and five are very different from sequences six and seven. Although, no, I clustered it by similarity, so it's the other way around; these two are very dissimilar. Anyway, you can figure it out. So more meaningful names are good, because then they will show in the plot.

If you want to make a really nice plot, you can use the ape library. When you load the ape library, it gives you more options, not so much for clustering but for showing dendrograms. Here we see a standard dendrogram, a tree which just shows the distances between the different sequences. If we want to make it look a little bit better, we can transform the clustering into a dendrogram object, so we go from the cluster object that we have to the dendrogram representation, which is a tree representation. Then we go from the tree representation to the phylogenetic tree, and then we can plot the phylogenetic tree in many different ways. So we can say we want a radial phylogenetic tree, and I think one of the lectures in the R course actually shows all of the different options that you can use. Radial is the one that I like to look at most, but there are many different ways of visualizing trees.

In this plot function you can also add colors. You can say: make sequences 1, 2, 3 and 8 red because they all come from mammals, make sequences 6 and 7 blue because they are reptiles, and make 4 and 5 yellow because they are marsupials. So for each of the tips of the tree you can give a color, and that is done using the tip.color parameter. So you say plot the phylogenetic tree, comma, tip.color,
then you have to give it eight colors, one for each sequence, and then you say type is radial.
So I did this a couple of years ago, almost two years already, and I did this based on the spike protein of the coronavirus, because I was really interested in it. So I looked at it and I did all of these steps. Yes, so I used biomaRt to automatically download all of the known coronavirus spike protein sequences, then I used multiple sequence alignment to make a dendrogram, and then I used the ape library to make it look pretty. And not only that, but I went through the literature and looked up, for every known coronavirus, what the clade is. So is it an alphacoronavirus, a beta, or a sarbecovirus? Yes, so there are different types of coronaviruses. And in the end it didn't cost me that much more code, but you see that you get a really pretty-looking tree.
And of course from this tree we can actually learn some things. Right, if you look here, then we see that in the yellow group there are three sequences which have been wrongly annotated. They have been annotated as belonging to the purple group, but they actually don't belong to the purple group, because they fall into the yellow group. And the same thing here: this coronavirus, this purple one, is wrongly annotated, because it falls into the blue group. So this is probably of the blue type, so it's probably an alpha, while in the database it's annotated as a beta. Right, so we can see directly which things match up very well, and we can see which things do not match up. But the radial plot here shows the protein distance between the different spikes found on coronaviruses. And with all of the tools that we've discussed in the last couple of weeks, you can make these plots as well, because I showed you guys how to use biomaRt to automatically download the sequences, you now know how to do multiple sequence alignment, you know how to then make a dendrogram, and then using the ape library you can make these
fantastic-looking radial trees, which wouldn't look out of place in a nice publication, but it takes a little bit of work making stuff look pretty. Right, the standard R representation looks like this, and of course going from this to this takes a little bit of work, finding the right colors and doing the annotation properly.
Good, so, 10 minutes. That was it, we're done. We're done 10 minutes before four, so we actually did a good job, we kept it within three hours. A little bit of a shame that there weren't any more questions, but that's up to you guys. If you don't have any questions, then that's fine with me.
So today I told you guys about genome annotation. I told you about the homology trick, right, that when we have two sequences which are homologous, we can assume that the first sequence and the second sequence have the same function. I talked almost two hours about sequence alignment, so I told you what pairwise alignment is and what substitution probabilities are: that transitions are more common than transversions in DNA, and also that certain amino acids are more commonly swapped out for other amino acids because they have very similar side chains, so they have more or less the same biomolecular activity. We talked a lot about multiple sequence alignment, and I said a few words about structural alignments. Furthermore, I showed you that multiple sequence alignment is essential in finding motifs, so finding where proteins bind DNA. I said a few words about how multiple sequence alignment and pairwise alignment are also core when you do whole genome sequencing: to assemble new genomes, but also to align reads to the genome, you need to have a scoring of what is similar and what is dissimilar. And I showed you guys how to do multiple sequence alignment in R, which is very easy, because you just use the msa function, which is provided by the msa library.
All right, so that was it from me for you guys for today. If there are any questions, then I'm here for you guys to answer any
questions that you have. So for YouTube: if you're watching this on YouTube, super, you made it to the end, fantastic. Thank you for watching, and I will see you guys in the next lecture. The next lecture will be about... ooh, that's a difficult one. I think I will do the standards for analysis or the literature management lecture. Since we only have a couple of lectures left before the exams, I'm having to pick and choose.
What I forgot to tell you guys is also that I invited a guest. So we will have one of my previous students. She did a master project, no, she did a bachelor project in our group, and she's now doing her master's in bioinformatics, and she has a really interesting topic that she's working on: she's doing machine learning to identify animals which are caught using these camera traps. So I invited her to give an hour-long talk, and we will have that, I think, in two weeks. I think it's the 27th or something like that, but I will let you guys know; I will send around an email as well. It will be nice to have as many students show up as possible, so that she has a little bit of attention as well.
Good, so for me that's everything. People on YouTube, thank you for staying till the end, and see you next time.
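
For anyone who wants to try the colored radial tree described in the lecture, here is a minimal sketch in R. It is not the code shown on the slides. It assumes the ape package is installed, it uses a made-up toy distance matrix instead of a real multiple sequence alignment, and the sequence names `seq1` to `seq8` are hypothetical. It also converts the `hclust` result straight to a phylogenetic tree with `as.phylo`, rather than going through the intermediate dendrogram object mentioned in the lecture.

```r
library(ape)  # assumed to be installed, e.g. via install.packages("ape")

# Hypothetical group labels for eight sequences, following the lecture's example:
# 1, 2, 3 and 8 are mammals, 4 and 5 are marsupials, 6 and 7 are reptiles.
group <- c(seq1 = "mammal", seq2 = "mammal", seq3 = "mammal",
           seq4 = "marsupial", seq5 = "marsupial",
           seq6 = "reptile", seq7 = "reptile", seq8 = "mammal")

# Toy distance matrix (made up, not derived from a real alignment):
# small distances within a group, large distances between groups.
d <- outer(group, group, function(a, b) ifelse(a == b, 0.1, 0.6))
diag(d) <- 0
dimnames(d) <- list(names(group), names(group))

# Hierarchical clustering, then conversion to an ape "phylo" tree.
hc <- hclust(as.dist(d), method = "average")
tree <- as.phylo(hc)

# One color per tip, looked up by tip label so the order always matches the tree.
palette_by_group <- c(mammal = "red", reptile = "blue", marsupial = "yellow")
tip_cols <- unname(palette_by_group[group[tree$tip.label]])

# Radial tree with colored tip labels.
plot(tree, type = "radial", tip.color = tip_cols)
```

With real data you would replace the toy matrix `d` by distances computed from your own alignment. Looking the colors up by `tree$tip.label` avoids relying on the tips being in the same order as your input sequences.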