All right, so welcome back, everyone, including those watching this later on Moodle because for some reason you couldn't join the stream.

So ClustalW is, like I told you, the most popular multiple alignment tool today. Multiple alignment, unlike pairwise alignment, is not a solved problem. New multiple alignment tools are still being created, and every year tools come out which are slightly better than ClustalW. One which I really like nowadays is MUSCLE, because it is very, very fast. So if you want to do multiple alignments, do look around for what the latest state of the art is. But ClustalW, because it was the first one that was really out there, is nice and easy to describe, because it actually only has three steps.

All right, so how does it work? Step number one is constructing the pairwise alignments. You align each sequence to every other sequence, which gives you a similarity matrix. The similarity can be described as, for example, the number of exact matches divided by the sequence length. Normally you would use a more proper algorithm, with things like gap opening and gap extension penalties (affine gap scoring), but the idea is the same. When you have four sequences, sequences one to four, the sequences are of course always a hundred percent similar to themselves, so the diagonal of the matrix doesn't matter. Here we have, for example, a matrix in which .17 means that two sequences are 17 percent identical. So step one is that we just do this pairwise alignment between all of the sequences. That is computationally expensive, though less so if you use something like BLAST or some other optimized algorithm.
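To make step one concrete, here is a minimal sketch in Python of building the pairwise similarity matrix. It assumes the simplest similarity mentioned above (exact matches divided by sequence length) and skips the real alignment step entirely, so there are no gaps and no substitution matrix. The four sequences and the function names are made up for illustration.

```python
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Exact matches, position by position, divided by the longer length.

    Assumption: no real alignment is done here (no gaps, no scoring matrix).
    """
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def similarity_matrix(seqs: list[str]) -> list[list[float]]:
    """Symmetric matrix of pairwise identities; the diagonal is 1.0."""
    n = len(seqs)
    matrix = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = identity(seqs[i], seqs[j])
    return matrix


# Hypothetical toy sequences, not the ones from the slide.
seqs = ["ACGTACGT", "ACGTTCGT", "TTGTACGA", "ACGAACGT"]
matrix = similarity_matrix(seqs)
for row in matrix:
    print(" ".join(f"{value:.2f}" for value in row))
```

With n sequences this does n(n-1)/2 comparisons, which is why the step gets expensive for large inputs.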
This is perfectly doable and it doesn't matter.

What you then do in the next step is use this matrix to create a guide tree. The guide tree is based on the similarity matrix: you use the neighbor-joining method, and the guide tree roughly reflects evolutionary relationships. For this little matrix that we have here, v1 and v3 are highly similar, so they are branched together, and then we have v4, which is very similar to v1 and v3. The way it does this is that it first finds the two which are most similar, so v1 and v3. It makes the alignment of v1 and v3, and this is called v13. So we create a new sequence which is more or less the consensus sequence of v1 and v3, and then we align the consensus sequence against all of the other ones in the matrix, so the matrix becomes smaller. Do I have a step for that? Yeah, I have an example. The consensus sequence is then aligned against v4, creating a new sequence which is built up of 1, 3 and 4, and then you align this new sequence against v2, the last and most distant sequence, and then you get your alignment. So you start by aligning the two most similar sequences, then you follow the guide tree, adding the next sequence to the existing alignment, and you insert gaps as necessary to make things work.

So, the example. Imagine that these are the four sequences that we have, just made of letters, so we're not dealing with proteins or anything. First we do the pairwise alignment: sequence one versus sequence one, sequence one versus sequence two, and so on, so we create the matrix of the pairwise alignments. Why is this zero? Why are these nine, four and seven?
I don't know. There was a rationale when I made this example, so let's just assume that I calculated the scores correctly. Then we build a neighbor-joining tree, which shows you that sequence one and sequence three are most similar. Here I'm actually doing it the other way around, with distances: a distance of nine means that two sequences are highly dissimilar, and a distance of four means that they are very similar. So we create a neighbor-joining tree, which for the guide tree is rooted: it starts with a root, and then based on the distances between the sequences we cluster things together. From this we learn that s2 and s4 are very similar to each other, and s1 and s3 are very similar to each other, but these two groups are very dissimilar from each other, which you can more or less see in the alignment as well. Then you do your alignment steps. You first align s1 with s3, then you align s2 with s4, then you align the two branches of the tree together, and because of this you end up with your multiple sequence alignment. It's just an example to show that you calculate the distances, you make a tree, and then with the tree you start doing the alignment pairwise.

All right. When ClustalW gives you the output, it has an additional line underneath where it tells you what is going on. If the line underneath has a star, that means that this column of the alignment contains
identical amino acid residues in all sequences, or identical bases if you do DNA sequence alignment. If you have a colon, that means that the column contains different but highly conserved, very similar amino acids. A dot means that there's somewhat of a conservation going on, and when there's nothing it means that there are dissimilarities or gaps there.

If you would, for example, look at histone H1 and take mouse, human, rat, cow and chimp, you would see that there's a whole bunch of stars. The stars are the columns which are identical, so the sequence here is highly conserved. There's a high likelihood that the identical pieces carry some kind of function, that this function is shared, and that without it the protein cannot function as a histone. Then you see that there are colons. Here you see an S, an A, an A, an A and an S, and these amino acids are very similar to each other, so this is then deemed conserved. Here you see leucine and then phenylalanines, which are very different amino acids, so this part of the protein is not conserved, and you see a part here which is also not conserved. If you do large alignments of similar proteins across different species, you can figure out which part of the protein is probably something like the active site, or where you can target a drug against this protein.

This is more or less the same thing that you can do with things like viruses. If you have multiple flu viruses, you take the influenza virus, you take for example the hemagglutinin spike, you take this spike across different versions, and then you can see that parts of the spike are very conserved.
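The conservation line under a ClustalW alignment, as described above, can be sketched roughly like this in Python. This is only an illustration: the list of "strong" similarity groups is an assumption written down from memory of the Clustal documentation and should be checked before real use, the weaker "dot" category is left out to keep the sketch short, and the three aligned sequences are made up.

```python
# Assumption: groups of amino acids treated as "highly similar" (colon).
# Recalled from the Clustal documentation; verify before relying on them.
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]


def conservation_line(aligned: list[str]) -> str:
    """One symbol per alignment column: '*' identical, ':' similar, ' ' otherwise."""
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        if "-" in residues:
            marks.append(" ")  # a gap anywhere means no conservation mark
        elif len(residues) == 1:
            marks.append("*")  # the same residue in every sequence
        elif any(residues.issubset(group) for group in STRONG_GROUPS):
            marks.append(":")  # different residues, all from one similar group
        else:
            marks.append(" ")
    return "".join(marks)


# Hypothetical, already aligned toy sequences.
aligned = ["MKSAW", "MKAAG", "MKSA-"]
for seq in aligned:
    print(seq)
print(conservation_line(aligned))
```

In this toy input the third column mixes S and A, which fall into one of the assumed groups and so get a colon, matching the S and A example from the histone slide.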
So these parts are functionally necessary for the spike of the influenza virus to work.

There is of course a pitfall when you do multiple sequence alignment which is not really there when you do local alignment or BLAST searches. When you do a ClustalW run, the multiple sequence alignment always works on the assumption that you are giving it related sequences, so it will align anything, even sequences that have nothing to do with each other. If I take two completely different proteins from two completely different species and throw them into an MSA algorithm, the algorithm will give me an alignment even if these sequences are completely unrelated. It will try to do its best, because the assumption is that you know what you are doing as a biologist: you want to align these sequences together, so you probably have a good reason to do this. The rule of thumb is that if it looks wrong, it probably is, because multiple sequence alignments, unlike pairwise alignments, don't give you something like an E-value to evaluate how good the alignment is. So when you do multiple sequence alignment, always be aware that if you are aligning trash with trash, or things which are not related to each other, the result will be meaningless, although you will get a result. That is different for many other algorithms. For example, a pairwise search will give you an alignment and then tell you that the E-value of this alignment is one, so this alignment is just random. Multiple sequence alignment doesn't do that. It doesn't give you an E-value, so you have no idea how good the alignments are. Of course, if you see something like this, where there are massive conserved regions, then you can think: yeah, this is a pretty good alignment. But it's very difficult. And so, for example, one of the pitfalls which occurs a lot
is that if you have to align "the fat cat", "Garfield the very fast cat", "Garfield the fast cat" and "Garfield the last fat cat", the algorithm will try to do its best, but you can see that in this case the alignment goes completely wrong. Placing the word "cat" last would give three positions which are conserved. But because of the way the alignment works, aligning the sequences which are most similar first, it will actually align "cat" with "fat", because that introduces only one mismatch. At the end it then has three gaps, but these three gaps at the end are not as influential, so for that pair it is simply the better alignment. But because you start with the wrong alignment in the first step, with these two sequences, it also goes wrong when it starts aligning the other ones. So in the end you get an alignment which is not really representative of the best multiple sequence alignment that you could get. Be aware that this is always one of the issues with multiple sequence alignment.

There are a lot of different multiple sequence alignment tools out there. You have ClustalW, which was invented in 1994. You have HMMT, which uses HMM (hidden Markov model) technology and was developed in 1995. You have PRRP, which I never used. I have used MUSCLE a lot and I'm still using it a lot. It was developed about 10 years after ClustalW, but it's more optimized and relatively fast compared to ClustalW. And of course you have the new version of ClustalW, which came out in 2011, when they renamed it Clustal Omega and improved the underlying algorithm. But just to give you an overview: every year new multiple sequence alignment tools are produced.
So when you start doing multiple sequence alignment, it really pays off to read through the literature to see what people are currently using, and generally what people are using en masse is the one which is currently best. All of these algorithms were invented and then people continuously update them, so the best alignment tool changes every year. That's just the way it is.

All right, so structural alignment. It's the third way in which you can align. It uses information about the secondary and tertiary structure of a protein or RNA molecule to aid in aligning the sequences. Protein and RNA structure is more evolutionarily conserved than the sequence, and we already saw something like that in the little example we did before, where you look at conservation at the DNA level and at the protein level. Often proteins are much more conserved: the amino acids are more conserved than the DNA coding for these amino acids. This has to do with the third base of the codon, the wobble base, which in many cases is free, or relatively free, to change. Mutations in the third base of a codon are not as impactful, because they generally don't change the protein. So protein sequence is more conserved than DNA sequence. But not only that: the structure, the way the protein folds, is even more conserved, because the function of a protein is based on how it folds, where the active site is, and which amino acids make it up. Since the structure is more conserved, you nowadays have tools like DALI and SSAP, which use this information about structure to inform the alignment tool which alignment is to be preferred.
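The point about the wobble base can be checked with a small Python sketch. It builds the standard genetic code (NCBI translation table 1, written in the usual TCAG order) and counts, for each codon position, what fraction of single-base changes leave the amino acid unchanged. The function name and the choice to skip stop codons as starting points are assumptions made for this illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_fraction(position: int) -> float:
    """Fraction of single-base changes at `position` (0, 1 or 2) that keep
    the encoded amino acid, taken over all codons that are not stop codons."""
    same = total = 0
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # assumption: do not start from stop codons
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            same += CODON_TABLE[mutant] == aa
    return same / total


for pos in range(3):
    print(f"codon position {pos + 1}: {synonymous_fraction(pos):.2f}")
```

The third position comes out far ahead of the first two, which is why the protein sequence tends to be more conserved than the DNA that codes for it.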
So DALI is a fragment-based method for constructing structural alignments based on contact similarity patterns between successive hexapeptides in the query sequences. This hexapeptide approach is more or less like the k-merization which we saw at the DNA level when you do BLAST, but these hexapeptides, which are six amino acids in a row, are assigned a kind of structural similarity score. Each gets a structural score, and if two of them have very similar structural scores, they will be aligned preferentially. Then you have SSAP, which is more or less the same, but it uses atom-to-atom vectors in structure space as comparison points. This is for when you have a protein sequence for which the structure is more or less solved, so you know the distances between the individual atoms, and that is used when you query your new sequence for alignment. These are relatively new tools, and they perform slightly better than standard pairwise alignment using something like a BLOSUM matrix. But be aware that this is a very active field of research, and it is something where, if you are interested in bioinformatics, you can contribute a lot in the next five to 10 years.

All right, so then the almost last topic of today (there's one more topic after this), which is motifs. Motifs are a way for a computer to find things like transcription factor binding sites in the DNA. There are different representations of DNA motifs.
You have the string representation, you have the matrix representation, and then you have a representation with nucleotide dependency. The third one is the best one, but also the most complex one to implement, so there are not a lot of tools out there which use the representation with nucleotide dependency. We will go through the different ways of storing DNA motifs and how to use them.

The string representation is basically just a string of characters. For example the TATA box, which is something like TATAA, can be represented using a string motif, because you are always looking for more or less T, A, T, A and then an A. You can use wildcard symbols to allow a choice at a certain position. So a string representation of a TATA box would look somewhat like this: first a T, then an A, then a T, then an A, and the last A is not always an A; it could be an A or a C, for example. So you make this position a wildcard meaning A or C. Do I have the IUPAC coding? Yes. In IUPAC coding, an A or a C is called an M. So TATAM would then be the search string that you use to search for a TATA box: it always has TATA, that is guaranteed, but at the last position an A matches and a C matches as well. So it's just a string, and you use the IUPAC coding.

The IUPAC coding has a letter for each of the different combinations of bases at a position. For example, A or G are both purines, so they use the letter R for that. C or T are both pyrimidines, so you use the Y for that. And then you have strong, S, which is the C or the G.
Both the C and the G have three hydrogen bonds, so that's why they're called strong. Then you have weak, W, which is A or T, because they have two hydrogen bonds holding them together. Then you have G or T, which are the keto bases, K, and you have A or C, which are the amino bases, M. And then of course you have not A (B), not C (D), not G (H) and not T (V). So you can use this coding to build up a search string to search through the DNA and figure out where a certain protein might be able to bind. You can use wildcards wherever you have a choice of symbols at a particular position. This works very well for DNA. For proteins this is not available at the moment, as far as I am aware.

The matrix representation, the position weight matrix, is more or less what many people nowadays are used to. Position weight matrices, PWMs, also called position-specific scoring matrices, PSSMs, assume independence between the positions in the pattern. They are represented by a matrix or viewed as a sequence logo. And so here you have a position frequency matrix, right?
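Going back to the string representation for a moment: the IUPAC search string described above can be turned into a regular expression with a few lines of Python. This is a minimal sketch, not how any particular motif tool does it. The function name `find_motif` and the example DNA string are made up, and only the forward strand is searched.

```python
import re

# IUPAC nucleotide codes: each letter stands for a set of allowed bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_motif(motif: str, dna: str) -> list[int]:
    """Start positions (0-based) where the IUPAC motif matches the DNA.

    A lookahead is used so that overlapping matches are reported as well.
    """
    body = "".join(f"[{IUPAC[symbol]}]" for symbol in motif.upper())
    return [m.start() for m in re.finditer(f"(?=({body}))", dna.upper())]


# Hypothetical sequence containing TATAA and TATAC.
dna = "GGTATAAGCCTATACTT"
hits = find_motif("TATAM", dna)
print(hits)
```

Here TATAM matches both TATAA and TATAC, because M stands for A or C, while something like TATAG would not match.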
So you have, for example, observed 31 bindings of a protein to a piece of DNA. Out of these 31 bindings, 28 times there was a C at the first position, two times there was a T, and one time there was an A. So you can see that C is the conserved base at the first position, although it's not always a C. The same thing holds for the second, third, fourth, fifth, sixth and seventh positions.

You get these counts from, for example, a pull-down experiment. You make an antibody which targets a DNA-binding protein. The protein is bound to the DNA, you digest the DNA which is not bound by the protein, and then you pull down the bound pieces of DNA, which you sequence. In the end you would have 31 different sequences, because this protein bound at 31 positions in the genome. From this you can build the sequence logo, and in the logo you can see that the first position is almost always a C, the fourth position is almost always a T, the second and third positions are predominantly a T, and the fifth position is predominantly a G.

So this is how these things are built, and they are very useful when you want to ask, for example, does this transcription factor bind near my gene, or does it have the ability to bind? A computer can use these position frequency matrices to scan the DNA and say: here there's a high match to the frequency matrix that you gave me, so there's a high likelihood of the protein being able to bind here. And of course different proteins have different binding motifs.
The TATA box, the one which is bound at the start site of the polymerase, has its own binding motif, while other proteins that bind DNA have different motifs.

So how do you do that? When you create a position frequency matrix (PFM), you count for each position the number of A's, C's, T's and G's that you found, for example during a pull-down experiment. From this you generally create a position probability matrix (PPM): you divide each column of the PFM by the number of observations. Here that just means that the first column gets divided by 31, the second column also by 31, and this one by 31 as well. So instead of the frequencies, how often you observed each base, you have the probability of observing a certain base.

Then you want to create the position weight matrix. For every entry in the table you calculate the log2 of the PPM entry divided by b_i, where b_i is the occurrence of that base in random DNA. Because if you are looking at a certain bacterium which has an 80 percent GC content, right?
So at every base pair position you have an 80 percent chance of finding a C or a G, and of course you want to compensate for that when you are scanning with your position weight matrix. That's just the way it works. The b_i is the occurrence of the base in random DNA, and you calculate the log2 for each position. In this case the probability is 28 divided by 31. You divide that by the background frequency of C, which follows from the GC content, and then you take the log2 of the result. So if you have a high GC content, you penalize finding a C there and increase the score of things like A's and T's. This is done because every organism has a different GC content, and you have to compensate for the GC content when building these matrices, otherwise they go more or less haywire.

All right, so if you look at this, suppose that if there is a C at the first position, there has to be a G at the fifth position. This is not captured by these position frequency or position weight matrices at all, because these weight matrices assume that every position is independent, right?
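The steps from counts to probabilities to weights described above can be sketched in Python as follows. The first column uses the counts from the lecture (28 C, 2 T, 1 A out of 31), the second column is made up, and the background is the 80 percent GC example split evenly, so C = G = 0.4 and A = T = 0.1. The pseudocount is an assumption added here so that a base that was never observed does not lead to log2 of zero. The lecture does not mention one.

```python
import math

# Position frequency matrix: one list of counts per base, one entry per position.
pfm = {
    "A": [1, 10],
    "C": [28, 5],
    "G": [0, 1],
    "T": [2, 15],
}

GC_RICH = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}  # 80 percent GC genome
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def pfm_to_pwm(pfm, background, pseudocount=0.5):
    """PWM entry = log2(probability of the base at this position / background).

    Assumption: `pseudocount` is added to every count to avoid log2(0).
    """
    n_positions = len(next(iter(pfm.values())))
    pwm = {base: [] for base in pfm}
    for i in range(n_positions):
        total = sum(pfm[base][i] for base in pfm) + pseudocount * len(pfm)
        for base in pfm:
            probability = (pfm[base][i] + pseudocount) / total
            pwm[base].append(math.log2(probability / background[base]))
    return pwm


def score(pwm, site):
    """Sum of the per-position weights; positions are treated as independent."""
    return sum(pwm[base][i] for i, base in enumerate(site))


pwm_gc = pfm_to_pwm(pfm, GC_RICH)
pwm_uniform = pfm_to_pwm(pfm, UNIFORM)
print("C at position 1, uniform background:", round(pwm_uniform["C"][0], 2))
print("C at position 1, GC-rich background:", round(pwm_gc["C"][0], 2))
```

With the GC-rich background a C at the first position earns a lower weight than with a uniform background, which is the compensation for GC content. The `score` function simply adds up the columns, which is exactly the independence assumption.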
But here there might be some kind of dependence, because we see that if there is a T here, then there is very likely also a T here, and if there's a C at this position, then we also see five observations of a C here. So it might be that these two locations are actually linked together: when the protein binds a T at the second position, it also needs to bind a T at the next position, and when it binds a C at the second position, it needs to bind a C at the third position as well. You can't catch this using a PWM.

So nowadays you have these nucleotide dependency representations, where people assume that the bases in the motif are not independent of each other. For example, if you have an A at position one, then you need a G at position five. For that you use these scored position-specific patterns. This is again very complex and not a solved issue. It's again something where, in the coming 10 years or so, someone smart will figure out how to best represent this on computers, and that will be an enormous boost in how efficiently we can find positions at which the DNA can be bound by different proteins. There are a lot of smart people working on these scored position-specific patterns, but it's not a solved issue at all.

All right, so if you are interested in DNA motifs, for example you have a gene and you want to know whether this gene is regulated by growth hormone or by another transcription factor, then there are two major databases with motifs. In our group we always used to use TRANSFAC. It is the largest repository of transcription factor binding sites, but it is commercial and it only has a small public database. It has only been commercial for about seven or eight years.
Before that, everyone would use TRANSFAC. But since they wanted to make money out of TRANSFAC, they made it commercial, so you have to buy a license, and this license is relatively expensive. Because of that, after TRANSFAC went commercial, people started an open-source database called JASPAR, and JASPAR is a really, really good database. It contains transcription factor binding sites, so you can download position weight matrices there for things like growth hormone or all kinds of other things that bind DNA. It also provides an API for R, so from R you can directly query the JASPAR database and load in the position weight matrices and these kinds of things.

Finding motifs is done by phylogenetic footprinting, because binding sites tend to be conserved in evolution. You take DNA sequences from multiple species and align them with multiple sequence alignment. If you have a homologous gene, for example myostatin, you take the 5,000 base pairs upstream of myostatin in 10 or 20 different animals, and then you look to see if there's any conserved pattern which is 5 to 15 bases long. Why 5 to 15 bases? A full turn of the DNA helix is about 10.5 base pairs, and most transcription factors contact somewhere between roughly half a helical turn and one and a half turns in one go. So when you are looking for transcription factor binding sites, you take the upstream region of, for example, myostatin, you align it across as many species as you can find that have myostatin, you do the multiple sequence alignment, and you look to see if there's any region 5 to 15 base pairs long which is conserved between species. This works because something like myostatin is regulated by hormones which are targeting
muscle development, and of course these hormones are shared between many different species. Just like proteins and protein functions are conserved, you can use this homology trick as well to find motifs which are conserved.

There are many, many different tools available. If you want to find motifs in sequences, there are tools like MEME, AlignACE, MotifVoter and the Gibbs sampler, all of which you can use to find motifs in upstream or downstream regions of genes. And if you have a motif, then you can do transcription factor binding site searches, and of course there are a lot of databases and tools out there which can do that for you as well. So this is just an overview with links. If you ever need to find a transcription factor binding site, just come to this slide, click a link and go to the database.

All right, so some other fields of application where we use sequence analysis. Sequence analysis is used a lot in genome assembly. Also when we do RNA sequencing we need to align sequences to the genome, and when we do, for example, population SNP discovery, so single nucleotide polymorphisms in a population, we also use sequence alignment. So sequence alignment is really fundamental in all of the things that we do in bioinformatics. Here I only want to talk in detail about genome assembly. When you do RNA sequencing, you sequence RNA molecules and then you use a tool like BLAST to align the RNA sequences back to the genome. This allows you to locate where the exons are, the protein-coding parts of the gene, and where the introns are, the parts which are spliced out and do not code for protein. You also get information about things like alternative splicing, so you get an idea of how many different proteins are produced from a single gene. When you do population
level SNP discovery, you do sequence alignment as well. Since you're looking at a population, for example humans, you can see that all humans have more or less the same sequence, but at this position sometimes there's an A and sometimes there's a T.

But I wanted to talk a little bit more about genome assembly. De novo genome assembly is more or less a non-solved problem as well, but it is a very interesting problem, since there are very good approaches to it. It is the process of determining the DNA sequence composition of an organism, and there are two ways of doing this. Well, actually nowadays there's a third one, but I don't intend to talk about it. You can use whole genome shotgun sequencing, which is just short-read sequencing: we chop up the genome into very small pieces and we sequence all of these little fragments that we get. This is the newer way of doing it. When people were doing the Human Genome Project, at the universities they actually used bacterial artificial chromosomes. What you would do is take a large piece, like 15,000 base pairs of the DNA, right?
So you would use restriction enzymes to cut part of the DNA out, then you would use primers to amplify it, and then you would clone this piece into a bacterial vector. The bacterial vector replicates and makes a lot of copies, and then you would use something like Sanger sequencing to sequence this bacterial artificial chromosome. We don't do that anymore, because it's an old technique and it's very expensive compared to whole genome sequencing.

But if we talk about genome assembly, we're not talking about genome sequencing. Genome assembly is the computational step that follows genome sequencing, with the objective of reconstructing the genome from its reads. So how do we do that? We have all of these short reads, like the differently colored reads here, this one in red, one in green and one in blue, and we look to see if there's any overlap between the short reads. When there are overlapping reads, we merge them into something called a contig, which is the black read here, right?
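Merging overlapping reads into contigs can be sketched with a simple greedy approach in Python. This is a toy illustration under strong assumptions: the reads are error-free, all on the same strand, and only exact suffix-to-prefix overlaps of at least `min_len` bases count. Real assemblers are far more involved. The reads and function names are made up.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing overlaps any more
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads: three overlap in a chain, one overlaps with nothing.
reads = ["ATGGCACATC", "CACATCGGTA", "GGTATTCA", "CCCGGGAAA"]
contigs = greedy_assemble(reads)
print(contigs)
```

The three overlapping reads end up in one contig, and the read without any overlap stays on its own, which is the situation where you need scaffolding to say how separate contigs relate to each other.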
So you can see here that the C, the A, the C, the A, the T and the C match. So we know that this read and this read belong together, because they are identical at these positions, and we know that after this C we can use this read to just continue. The same thing holds here: this part of the green read is exactly identical to the red read. So we merge these into something which is called a contig. A contig is a contiguous sequence built from small sequences.

Multiple contigs are usually connected together using scaffolds. A scaffold links a non-contiguous series of genomic sequences together, separated by gaps of a known length. The sequences that are linked are typically contigs, contiguous sequences built up from overlapping reads. So in the first step you look at all your reads to see which ones are overlapping, and you build sequences which are as big as possible. The next step is scaffolding: figuring out how these different contigs, which have no overlap with each other, are linked to each other in the genome.

So how do you do that? Scaffolds can be created when you do paired-end sequencing, because paired-end sequencing uses pieces of DNA which are around seven to eight hundred base pairs long. You read the first 150 base pairs from the five-prime end in the forward direction, and then from the same fragment you read from the five-prime end in the reverse direction, right?
How this is done is more or less as follows. In the sequencer you make a library, and you have a spot on your sequencer where these fragments of DNA sit like little loops. What happens is that the loop gets cut open, and then it is read from bottom to top on one side and from bottom to top on the other side, so you get two sequences. Well, that's really small, actually. When you do sequencing you have these spots. Let me see. Yeah, that's kind of visible. You attach the DNA like this, then this gets cut open, and then you start sequencing from here and you start sequencing from here. So we get 150 base pairs from this end and 150 base pairs from that end, and we now know that in between there need to be around 500 base pairs which we did not sequence, because we cannot sequence them. Still really small.

What you then do is use the fact that, because of this paired-end sequencing, you know that the two reads belong together. So you have, for example, contig one, built out of overlapping reads, right? Here we see two reads which overlap each other.
So we build a single contig out of them. Here we have another set of overlapping reads, where we can again use the overlaps to build a contig. And then we have a paired-end fragment, and we are lucky, because the first read of the fragment falls in contig one and the second read of the fragment falls in contig two. So we now know that these two contigs are more or less x base pairs apart. We do not know the sequence in between, but we do know where the contigs sit relative to each other on the genome. This is how you go from having multiple, or sometimes thousands of, little contigs to a genome which is built up of scaffolds: a scaffold for chromosome one, a scaffold for chromosome two.

All right, so that was everything I wanted to tell you today. I told you about genome annotation. I told you about the homology trick and why it works, and that this is something we have known since Charles Darwin. I told you about sequence alignment: pairwise alignment and substitution probabilities, so that transversions and transitions in DNA make some base changes more likely than others, and that the same thing happens when you look at proteins. That's why we have something like a BLOSUM matrix, which scores how likely it is for one amino acid to be replaced by another. I told you about multiple sequence alignment, so that you can use ClustalW or MUSCLE, and roughly how these algorithms work: they use pairwise alignments, followed by a guide tree in which the different sequences are clustered together. And I told you that when you do multiple sequence alignment you always have to be very careful when you look at the results, because if you take, say, six proteins which have nothing to do with each other and you start a multiple sequence alignment,
then it can be that you get an alignment which does not mean anything. I told you a little bit about structural alignment. That's more or less the new step in sequence alignment, where you use information about the 3D structure, or the secondary or tertiary structure, of RNA, DNA or proteins to guide the sequence alignment, so that you have more information. I told you about DNA motifs, and that you can use the JASPAR database for free and get motifs for a lot of known DNA-binding proteins there.

And I told you a little bit about genome assembly using whole genome sequencing, just so that you know what people mean when they talk about a contig. A contig is a contiguous sequence built from multiple little reads which all overlap. And what is a scaffold? A scaffold is when you combine multiple contigs using this paired-end sequencing technology, where you sequence from both ends of a fragment and you know that there are something like 500 or 600 base pairs between the two reads, so that you can use that to couple these contigs together into scaffolds, which can then be used to reconstruct the whole genome.

All right. So for me, that's it for today. If there are any questions, please let me know, either by email or you can throw them in the chat directly. And for the people watching on Moodle, this is going to be the end of the lecture. So I will stop