Welcome, everyone, to this seminar series. Today we have the pleasure of hosting Maria Anisimova from the Applied Computational Genomics Team at the Zurich University of Applied Sciences and at the SIB. Maria studied Mathematics and Information Technology at the Pedagogical State University in Nagorno-Russia. She then taught in this field at the Better Bees College London in the UK before she received a Master of Research in Modelling Biological Complexity at University College London in 2000. Then, in 2003, Maria received a PhD in Statistical Genomics, also from UCL. For the following four years she held postdoctoral positions at the University of Montpellier and back at UCL. From 2007 to 2014 she worked as a senior scientist, research fellow and lecturer at the ETH Zurich. Since 2014 she has led the Applied Computational Genomics Team at the Zurich University of Applied Sciences, and she also became a group leader at the Swiss Institute of Bioinformatics in 2015. Her research interests are in applied bioinformatics and computational evolutionary genomics. Her group focuses on theoretical and computational aspects of modeling the process of genome evolution and adaptive change. The goal of the group is to bring new bioinformatics methods to real applications, ranging from biotechnology to biomedical research, ecology and agriculture, in order to enable a wide range of scientists to analyze patterns of evolution and natural selection in large genomics and omics data. Today Maria will tell us how to disentangle tandem repeats and their evolutionary history using statistical predictions. Maria, thank you again for accepting this invitation; the floor is yours.

Thank you. Thank you for the introduction and for the invitation, and thank you for your interest in this topic.
So today I will talk about tandem repeats in genomic sequences, and particularly about a series of methodological developments that took place in our group. First, let's talk about what tandem repeats are and why they are interesting. I guess many of you are already aware how complex they can be and how they can complicate things. Essentially, these are segments of genomic sequence that occur next to each other, in tandem, through mechanisms that are not so well described. These repeats can propagate, expand and shrink. And of course, with time they diverge, accumulating mutations and indels, and are possibly also affected by recombination events, et cetera. What's interesting about tandem repeats is that they are actually very abundant. According to our recent estimates, over 60% of human proteins contain tandem repeats, which is a much greater amount than has been described previously. The last census of tandem repeats in proteins was done in 1999 by Marcotte et al. At that time the estimation was done with rather simplistic methods, with one type of tandem repeat prediction method, and Swiss-Prot was much smaller. Already then, over 30% of proteins were predicted to contain tandem repeats, and now we are putting these estimates at a much higher rate. Of course, we all know that tandem repeats cause problems for all sorts of things, starting with assembly and propagating to any downstream analysis, including alignment, looking for positive selection, inferring trees and so on. They are also very interesting genomic sequence features. In proteins, for example, they often offer enhanced binding properties. They form rigid scaffolds that help protein-protein interactions.
They also have an interesting, enigmatic connection with intrinsically disordered proteins, that is, proteins that don't have a stable folded structure. This phenomenon is actually not very well described, but many publications have pointed it out. So there are many interesting and mysterious things about tandem repeats. And of course, they have been noticed because of their associations with diseases and with genetics. So some four or five years ago we applied for a grant to study how these genomic features evolve and what their functional roles are. Let's start with some examples. One very common, well-known example is collagen. You can see that this is essentially a triple helix. Collagen repeats can be very similar. They can be identical, like in the sequence you can see at the top. But they can also be very diverged, and there are different types of collagen sequences. So this one is much more diverged, but nevertheless you can clearly see some very strongly fixed positions. This is an important protein, the main component of connective tissues. It's the most abundant protein in mammals. It's mostly found in animals but, interestingly, it is also found in viruses, for example. There are also many collagen-associated diseases, which again points towards the importance of studying such proteins. Collagen, of course, is very easy to see; we can see its structure very easily. If we look at other types of repeats, they can be very much divergent. We won't be able to spot them, not only by eye but not even with good prediction algorithms. For example, this gene-detective contains two types of repeats. You can see in the structure that the two different types of repeats are colored in different colors. Of course, when you look at the structure, the repetitive nature of this protein is quite apparent. However, on the sequence level, the signal is very vague...
You cannot spot this type of repetition. None of the prediction methods are able to identify this kind of repetitive structure. So it's very difficult. Identical repeats are comparatively easy to identify. The more divergence there is, the more difficult it becomes, especially when we don't know what the unit should look like or how long it should be. So the complexity of detecting tandem repeat units correctly and annotating tandem repeat regions is quite high. When I first looked at this problem, I thought that quite a lot of work must have been done on it, and indeed it had. In 2011, when I started working on repeats, about 50 or more different tandem repeat predictors had been published. And yet they predicted different things. This is what we found out at the beginning of our studies: the discrepancies that we see between different predictors are very high. So let's look at this graph. Each graph corresponds to a prediction of tandem repeats by a different predictor. So here is HHrepID, this one is T-REKS, then TRUST and XSTREAM; these four, for example. Each graph is essentially a two-dimensional representation of the distribution of tandem repeats. The first characteristic is the length of the predicted tandem repeat unit, the minimum unit that is repeated. Units can be very short and they can be very long. The second characteristic is how many units we see in a protein; that's the y-axis here. This has all been done on the human proteins from Ensembl. We can see homorepeats, so only one amino acid repeated, in this column. They can be seen in many, many repetitions. As the size of the unit increases, these long units are not repeated as many times. But there are, of course, also interesting exceptions.
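The two-dimensional representation described here can be tallied in a few lines. The following is only a minimal sketch: the `predictions` list is made-up illustrative data, and `repeat_distribution` is a hypothetical helper, not part of any of the predictors mentioned.

```python
from collections import Counter

def repeat_distribution(predictions):
    """Count predicted repeats per (unit_length, n_units) cell.

    `predictions` is an iterable of (unit_length, n_units) pairs, one per
    predicted tandem repeat region (hypothetical input format).
    """
    return Counter((unit_len, n_units) for unit_len, n_units in predictions)

# Made-up example: two homorepeats of 12 residues, some longer units.
predictions = [(1, 12), (1, 12), (3, 8), (28, 9), (28, 9), (40, 7)]
hist = repeat_distribution(predictions)

# Each key is one cell of the unit-length by unit-count plane.
for (unit_len, n_units), count in sorted(hist.items()):
    print(f"unit length {unit_len:>3}, units {n_units:>3}: {count}")
```

Computing one such table per predictor on the same proteins gives a direct way to compare the shapes of the distributions.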
So, without looking at the really fine details, one thing we can see is that the four graphs are clearly different. Even the shape of the distribution changes. Why is that? It appears that the properties of these four predictors are different. It seems that XSTREAM, for example, is quite good at predicting this space here: short repeats, many of them. Whereas TRUST, a sequence self-alignment method, is good at predicting long repeats. HHrepID seems to be a little bit similar, and there is some kind of threshold here, so that all the homorepeats and short repeats get omitted. And T-REKS seems to be covering this area better than XSTREAM. So there are differences, and these differences essentially stem from the assumptions and the methodology of each predictor. We can also see that, for these four predictors, the predicted tandem repeat count changes as the divergence grows. For example, this graph compares the distribution densities for these predictors. We can't see the colors very well, but HHrepID is this pink distribution, I believe. This is a predictor that seems to be pretty good at covering more or less most of the divergences. Its tail spreads out the furthest of all the predictors, so it's able to predict quite divergent repeats compared to all the others. In contrast, you can see the other distribution here that's really peaked. The tandem repeat predictor that corresponds to this distribution is really good at predicting the most similar repeats, but not good at all at predicting divergent repeats. So this is essentially the conclusion from this exercise: we get different predictions, and maybe they are all right. There might be different levels of confidence associated with different kinds of predictors, which we also somehow want to take into account. So how do we reconcile these conflicting predictions?
What we decided to do is take a very clean statistical approach: evaluate the false positives for each of the predictors, and also compare their power on simulated data. So let's look at this graph, and we will focus first on the top part. These graphs correspond to the false positive rate, so these are the predictions that we see on null data, data that should not contain any tandem repeats. What we are prepared to tolerate is a certain amount of false positives, for example 5%, and we want to make sure of that. Again, without going into too many details, there are cases where the predictor is working on the DNA level or on the amino acid level, and again for different types of predictors the shape of the distribution is different. The dark color corresponds to higher density, the yellow color to lower densities. It seems that some predictors, like T-ray for example, don't make any mistakes. That's not necessarily a good thing, because it makes the power of the method very low. It's always a balance of power and accuracy. If you look now at the true positive rate, which is the power of the method, for the same predictor, T-ray, the power is also very bad. So that can't be a good thing. On the other hand, we look at XSTREAM, and it seems that for some areas of the space the false positive rate is a bit too high, but that means the power can be much higher in contrast to T-ray. Of course, high power doesn't mean anything if the false positive rate is too high. So we always have to make sure that the false positive rate is controlled at a particular level. Another characteristic one could evaluate is the greediness of the approach, and this is something that nobody before us had done. Typically, the evaluation of predictors had been done on a binary level, yes or no: has the method been able to detect a repeat or not?
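The balance between a controlled false positive rate and power can be sketched as follows. This is an illustrative outline only, assuming each predictor returns a numeric score per sequence; the function names and the toy scores are hypothetical and are not taken from the study.

```python
import math

def calibrate_threshold(null_scores, alpha=0.05):
    """Pick a score cutoff so that at most a fraction `alpha` of the
    null sequences (which contain no repeats) score strictly above it."""
    ranked = sorted(null_scores)
    allowed = math.floor(alpha * len(ranked))  # tolerated false positives
    return ranked[len(ranked) - allowed - 1]

def call_rate(scores, threshold):
    """Fraction of sequences in which a repeat is called."""
    return sum(score > threshold for score in scores) / len(scores)

# Toy data: scores on repeat-free null sequences and on simulated repeats.
null_scores = list(range(20))
simulated_scores = [10, 19, 25, 30]

threshold = calibrate_threshold(null_scores, alpha=0.05)
false_positive_rate = call_rate(null_scores, threshold)
power = call_rate(simulated_scores, threshold)
print(threshold, false_positive_rate, power)
```

A predictor that never calls anything on null data trivially satisfies the first constraint, which is why the power on simulated repeats has to be reported alongside it.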
What we have done is look at the details: did the method detect the unit length, and did the method detect the whole length of the region, or less, or more? This is what the greediness measure tells us. So how greedy is the method? Does it tend to detect tandem repeat regions that are too long, or too short? The black color in these graphs corresponds to just about the correct level, and you can see that the top graphs here contain much more black compared to the other colors. As for the other colors, green is too greedy, meaning the repeat region is too long, and purple is too short. So it seems you tend to get the best results for this row. This row corresponds to about 40 PAM, which is essentially the measure of divergence. So the more similar the repeats, the easier it is to detect them, and as you go down to 120 PAM and then also introduce gaps, or indels, it becomes more difficult. This is something we expect, and now we have an understanding of how it happens. The next thing is: now we know about the situation, what can we do? Because the discrepancy can be so high that we essentially end up with a prediction like this. This is an example of predictions from these different methods on the sequence of BCAR, the breast cancer resistance gene, and the colors here essentially correspond to predicted units. We can see the discrepancies are crazy. So what do we do? Some of them must be false positives; some might simply be regions that are too short and should be extended further. How do we evaluate this? We came up with a model-based test that allows us to test whether the potential repeat units that we can align have come from a common ancestor, and this is our definition: essentially, the repeat units must have arisen from a common ancestor. So each column has an ancestor here.
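One way to make the common-ancestor idea concrete is a likelihood ratio on the aligned units. The sketch below is not the published test; it assumes ungapped units of equal length, an equal-rates substitution model over k states, a star tree in which every unit is at the same distance t from the column ancestor, and a crude grid search over t. As t grows without bound the units become independent draws, which is the "no common ancestry" case.

```python
import math
from collections import Counter

def column_loglik(column, t, k=20):
    """Log-likelihood of one alignment column under a star tree with
    branch length t and an equal-rates k-state model (assumption)."""
    decay = math.exp(-k * t / (k - 1))
    p_same = 1 / k + (k - 1) / k * decay   # state unchanged after time t
    p_diff = 1 / k - decay / k             # change to one specific other state
    n = len(column)
    counts = Counter(column)
    # Sum over possible ancestral states, each with prior 1/k.
    total = sum(p_same ** c * p_diff ** (n - c) for c in counts.values())
    total += (k - len(counts)) * p_diff ** n  # ancestral states not observed
    return math.log(total / k)

def lrt_statistic(units, k=20, grid=None):
    """2 * (best finite-t log-likelihood - independence log-likelihood)."""
    grid = grid or [0.01 * 1.3 ** i for i in range(40)]
    columns = list(zip(*units))
    null = -len(columns) * len(units) * math.log(k)  # t -> infinity
    best = max(sum(column_loglik(col, t, k) for col in columns) for t in grid)
    return max(0.0, 2 * (best - null))

similar = lrt_statistic(["ACDEFGHIK"] * 4)
unrelated = lrt_statistic(["ACDEF", "GHIKL", "MNPQR", "STVWY"])
print(similar, unrelated)
```

Turning the statistic into a p-value needs care, because the independence case sits on the boundary of the parameter space; a simulated null distribution is one option.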
This is something that we can model by a tree structure, however simplistic it is, and contrast two hypotheses. The null hypothesis suggests that these units could actually just be there independently, by random chance, and are not connected by any ancestry, so the time to the common ancestor in this case is infinite. This is in contrast to the alternative hypothesis, where the time is finite. If we can reject the null, we can essentially say that the tandem repeat candidate is a significant one and we believe it's a true repeat. If we cannot reject the null, then we do not have any evidence to back it up. After testing this type of method, and it's all in this publication, we can show that by using this test we can get rid of the false positives, and we can incorporate this method into a range of different pipelines that seek to annotate tandem repeats. It can be done in many different ways, and we have developed a library that includes all the scripts that allow you to create such pipelines. It's called TRAL, a Python library, and it's been published just recently in Bioinformatics. One of the most obvious algorithms could be essentially taking a genome sequence and then deciding how to annotate repeats. If you have some idea of which tandem repeats you are interested in, and you have a model in mind, which can be represented as a hidden Markov model, this model can then be used to search for repeats using one of the tools that exist. Alternatively, you can say, well, I don't know what that unit might be, I'm interested in any kind of repeats, so you search de novo using an ensemble of the tools that are available. By using different tools, what you ensure is that you use the advantages of the different methods, so you're not missing out on any of the space and you do as well as possible. However, this approach of course means that you might also generate many false positives, so there you have to be careful and make sure that there is a filtering step, so we
have the filtering step here, including the test for significance of the tandem repeats that I've just explained to you on the previous slide. So essentially the output of such an algorithm would be tandem repeat annotations. If you are interested in annotating tandem repeats in a homologous set of genes, for example, and in being consistent about the models of tandem repeats that you use, you could also include a refinement step: after annotating tandem repeats, you again build a multiple sequence alignment of the tandem repeat units and build a new, refined hidden Markov model, then use this hidden Markov model to go again through the steps that we've just mentioned. Of course, this hidden Markov model can simply come from a public database such as Pfam, which is very famous; for each entry, for each protein family, it has a hidden Markov model described. Pfam is also good because it includes a lot of domains that actually occur in tandem, and for our subsequent analysis we use Pfam quite a lot, as I will show you in a minute. The Python library, TRAL, is available on this website here. Now, if you apply this meta-pipeline for tandem repeat annotation to the human proteome, you will see a picture like this. This is in contrast to what we saw just before, where the different pictures corresponded to different predictors. So now we're kind of unifying these predictions and getting rid of false positives as much as possible, to the best of our knowledge, and clearly this space is much more full now. We can see that both this area and the tail are filled in much better, and we see some clear, very common tandem repeats. The zinc finger is very common: this spike, if you wondered what it was on the previous slide. Of course, in Pfam it occurs in two different entries; one of the Pfam entries is a double-unit zinc finger, which is that other spike here. That's interesting. We also see the abundant leucine-rich repeats here. WD40 repeats are very common; they occur essentially in
combination with many other domains and serve as a rigid scaffold for many different types of very important functions. The other thing you might want to ask here is: why do we actually have this spike, what does this spike represent? If you look at it, the y-axis is the number of tandem repeat units, which means that the zinc finger units occur in many different numbers in different proteins. They might be combined with different protein domains and occur in different numbers. So does this diversity actually play any kind of functional role, and how is it conserved in nature? This was our next question; we were really interested in the evolution. You can also look at the picture more generally, not only on the human proteome but also across the kingdoms. For example, this is the Swiss-Prot distribution for bacteria. The colors are a bit different here compared to the previous graph, but again red represents the higher densities. We see fewer repeats for bacteria, but nevertheless quite a lot of them, and many very famous domains here. All the outliers are very interesting cases, so these kinds of large-scale studies bring out lists of really interesting, peculiar examples that each deserve further consideration. Overall, if you look across the three domains as currently seen in Swiss-Prot, this is the distribution which we observe. The red color essentially means that we don't find any kind of repeat in the protein. In eukaryota, just under half of the proteins contain tandem repeats, and some of the proteins contain two or three or more different types of repeat units: the yellow color corresponds to two different types of tandem repeat, green to three and blue to four or more. Whereas in bacteria and archaea we actually see fewer repeats in general, a smaller fraction of proteins with tandem repeats, and a smaller proportion of proteins with two or three different tandem repeat units than in eukaryota.
Now, of course, what do all these tandem repeats do? How did they get fixed, or are they fixed, or are they variable? That was our next question, because in non-coding DNA these repeats are very variable and prone to slippage events, and we observe a lot of variation. Do we observe the same picture in proteins, or what happens? So we decided to look at the protein tandem repeats, at the variation in those units, the conservation of units and the conservation of the order of units, by defining different evolutionary modes. What's interesting is that evolutionary modes can be defined by looking at bi-species tandem repeat unit phylogenies. What does it mean, this little concept that we invented? If you look at two proteins from species A and B that are orthologs, they each contain the tandem repeat region, and you might observe a picture like this. So there seems to be some kind of conservation of the tandem repeat units: the order remains the same in both species. Essentially, what it means is that if we color the phylogeny that explains the relationship between these tandem repeat units with two colors, pink and blue, then you would see bi-colored cherries as a result. This is called a cherry here. Essentially, since the speciation of these two species, we've seen strict conservation, and we describe this mode of evolution as conserved. If you compute the probability of observing something like this, and you can do it exactly, the probability is very, very low. For example, if we only have four units, like in this picture, then the probability is 2.9 × 10⁻⁴. That's a very low probability of seeing it by chance. If you increase the number of units, the probability drops already to 7.4 × 10⁻⁶, and lower and lower the more units you add. You can also distinguish a less strict mode of evolution, where you have some deviations from these conserved scenarios but overall it looks conserved; you can say, okay, we also
accept this type of conservation, or you can define different thresholds. But let's consider the ideal cases and see how many of such ideal cases we see. In contrast, you can also define the separated mode of evolution, which looks like this, where essentially since the speciation event the tandem repeats in each species have been evolving independently, and because of that, on a bi-species tandem repeat unit phylogeny we see the clustering of the colors in two different monophyletic clades. Again we can compute the probability of observing this by chance. We derived the formula, and these values are slightly higher than for the previous configuration, but nevertheless they're still very low. This kind of scenario could essentially correspond to a case where some adaptation happened and then diversification occurred in each clade. Now, once we have defined these modes of unit evolution, we can apply this methodology to, let's say, Ensembl genomes or proteomes. We've taken all the 61 eukaryotic species available at the time and, for each orthologous set of proteins, evaluated how many conserved and separated scenarios we observed for tandem repeat units. What we obtain is this. It's quite a loaded graph, so let's ignore some lines for the moment and concentrate first on this blue line. This thick blue line corresponds to the conserved mode of evolution, the perfectly conserved mode that I showed you on the slide here; it's exactly that picture, conserved with no mistakes. The other, dotted lines have a more relaxed restriction, so some mistakes are allowed. The red color, however, corresponds to the perfectly separated proteins. Now, what is the x-axis? The x-axis is essentially the different species that we had. The set started with human, then adding several primates, monkeys here, 10 primates we had, and so on, going away in time down to all the eukaryotes; here we've got yeast, but
essentially, what we are doing here, remember, is looking at bi-species phylogenies. That means we are always comparing two sequences, two orthologous sequences, and the reference sequence in this comparison is the human one. So we look at all the pairs of protein sequences that are orthologous to a human protein, and we are looking at patterns of conservation of tandem repeats. For example, if you look at the mammals here, we can see that about 61% of tandem repeat units are conserved among all the orthologous proteins that we find between each mammal and human. And that is quite a long divergence time, 300 million years ago if we go to the split of mammals. So there are huge numbers of conserved tandem repeat arrangements: not only the number of units but also the order of the units is conserved, which seems to point to some kind of functional significance. It seems to be some kind of arrangement optimized by evolution that is kept for the protein to stay functional. Even if you go as far away as down here, to human and yeast, overall we found 52 orthologous proteins, which is 13% of all the orthologous proteins that we found between human and yeast, that are conserved since the split of human and yeast, which is 1 billion years. So some serious functional constraints seem to be operating on tandem repeat units and conserving some kind of interesting, useful, optimal configurations of tandem repeat units. On the other hand, you can see that the amount of separated repeats is very, very low. Between human and the primates we see no such case, so this is zero here, no separated scenario. However, if you look at other cases, and we looked at plants and so on, you do find such cases, and if you go further back in time you also find such cases of separation, which point to certain adaptation events, and although they are rare they are very interesting to consider further. So how is this
configuration serving the changes in the function of proteins, to adapt presumably either to a new environment or to a new interaction partner? Therefore we had a look at the functions. We were interested in the diversity of functions of proteins that contain tandem repeats of different types. Again, there are lots of different colors here, so let's look first at what it means. We have two parts to this figure. The first part, the largest one, represents the diversity of tandem repeat units that are strongly conserved; this is the scenario where you see the bi-species phylogeny with cherries of two colors. The lower part of the graph corresponds to the diversity of the strongly separated tandem repeats, where the cherries are of one color, not two. For the strongly separated tandem repeats, it seems that there is some predominance of this green color and not as many functional categories appearing on the list. Of course, there are fewer proteins as well. This green color corresponds to the zinc finger, and you can also see some other examples of tandem repeat units that are typically seen in proteins related to either immunity or resistance. That's a very interesting observation that's consistent with our interpretation that separated patterns of evolution seem to point to some kind of adaptive events and could therefore be interesting to explore. For the conserved mode of evolution, of course, we see a huge variety of functions for all three types of GO classification (molecular function, biological process and so on), and of course many more colors, so many more different types of tandem repeats. You can see the leucine-rich repeat here, the blue one; light blue is WD40, a very famous one; and this light green, the zinc finger, is also found in the conserved mode. Interestingly, we see that WD40, for example, is always very strictly conserved, and so is the leucine-rich repeat. So that's for eukaryotes: we don't see any separated repeats of the blue colors, light or
dark, in this picture. But if we look at plants, the situation is slightly different, for the leucine-rich repeat for example. Overall, we've done the same kind of analysis for plants, except we decided that it's not fair to have one reference genome. For eukaryotes we had the human sequence as the reference; here, what would we pick, Arabidopsis maybe? But then we said, okay, let's be fair to all the plants, and so what we've done is a pairwise comparison, all against all. By doing this you maybe analyze fewer proteins overall, but you can trace the evolutionary patterns down to each node on the whole phylogeny, not only in a pairwise fashion, which is more informative. What we've plotted on this graph is also from Ensembl Plants as available at that time. The red color here corresponds to the amount of strongly separated repeats out of all the ones that were found orthologous in the clade, for example for this node; so here we're looking at all the pairwise comparisons in this clade. The blue color corresponds to the conserved mode of evolution. Now we've got two more colors, grey and yellow: grey means no significance, whereas yellow means that sometimes we see it separated and sometimes conserved, so mixed. What's interesting is that if we go further back again, here 150 million years ago, we see a lot of separated repeats, so the majority of them are separated, but we still see quite an interesting fraction that is conserved. And there is still quite a significant fraction of separated repeats even if we go down to more similar species. Here, for example, we see rice varieties and wheat, and some proteins have a separated pattern already at that level, which is very close similarity. These proteins are interesting to look at because they could be responsible for certain adaptations to the environment or to a newly emerging pathogen or something like that. So now, of course, the leucine-rich repeats are very interesting then in
this distribution, because leucine-rich repeats in plants are actually found abundantly in resistance genes, R genes, and each plant species has maybe over 200 or 250 different resistance genes, the majority of which contain leucine-rich repeats. So it's a sort of adaptive immune system that plants are using. Indeed, if you look at the GO functions, you will find a lot of leucine-rich repeats. Some of them are conserved, this blue color again, and some of them, a lot of them actually, are separated. The PPS can also be related to resistance functions, but here we see them mostly conserved. Now, how about we look at the same graph but just for leucine-rich repeats? If you replot it, that's what happens. So this is only leucine-rich repeats, and straight away you see much more red color. So there are more adaptive events that you can spot as you go along down to the tips, including the closely related varieties here. All the genes that are involved are very interesting candidates to be investigated if we are looking, for example, for adaptation to the environment and pathogens, particularly in relation to the agricultural importance of plant species. So essentially the conclusion from this exercise is that, despite the fact that tandem repeat units are quite short, the bi-species phylogenies of tandem repeat units are very informative about evolutionary history, and you can show that they are pretty accurate representations even if you go down as low as 15 amino acids. Of course, if your repeat unit is shorter than 15 amino acids, then you have to come up with a different methodology. One of the possibilities, for example for collagen, which is interesting, could be to look at multiples of several repeat units: instead of looking at one collagen unit at a time, you could look at seven of them, bringing the units you are working with to 21, so that
could be one of the possibilities for applying this kind of methodology to study other types of repeat. But then, of course, those repeat units have to occur in relatively large numbers. So we found that the observation of conservation and separation, as we define them, is interesting in itself, and it points towards conservation of function or changes in function. And we can actually use this information from tandem repeat unit gain and loss analysis to pinpoint the functional changes, because we can go down to the node of the topology and see what this event could correspond to. With respect to this study of unit gains and losses, we have also developed another tool, which is a multiple sequence alignment tool called ProGraphMSA+TR; the +TR means essentially that it helps you with tandem repeats. What it does to tackle tandem repeats is use a graph structure, so it's a graph-based alignment, which allows it to penalize repeat unit gains and losses adequately rather than counting them as one big gap. This program also essentially uses the same algorithm for phylogeny-aware gap placement as PRANK, so it's phylogeny-aware and penalizes gaps correctly. The implementation is really fast; the person who was working on this software was really good at coding, so it's super fast. It can also align sequences based on codons, so you can run your alignment using a codon model directly. It also uses context-specific profiles, and because of this it does much better with this type of divergence. By context-specific profiles we mean that it relies on generated libraries of contexts of 3 to 5 different residues occurring together, and that essentially allows it to cater for this type of divergence. Somebody who was working on intrinsic protein disorder in Sweden wrote to us saying that they had finally found software that actually aligned intrinsically disordered regions and repeats, because they were interested in this connection, and we never thought about intrinsic
disorder when we were developing this program, but apparently the two things somehow go together. So again, this kind of graph structure allows defining the repeat unit boundaries, because, due to slippage events, your repeat might not start at position one. The graph structure naturally allows you to account for this; there is no problem there. Alternative splicing is also no problem with this kind of program.
In our paper here, we were looking at the properties of other aligners, but the reviewers also wanted an example. We first refused, saying that it's cherry picking, because essentially you can always find a counterexample, and you should look at the statistics, right? But no, they still wanted a visual case, and so we show this visual case where our alignment does well. So this is the reference alignment; this is the true alignment, with a really complex graph structure. And if you use MAFFT, for example, or one of the other standard aligners, you get an alignment like that: it's messed up, it's not reconstructed properly. Whereas our program essentially builds exactly the same alignment. I think there is one little discrepancy here, I don't remember where it is, there is something here, but overall it provides a perfect result.
In addition, it's got a nice by-product. Because we are looking at repeat gains and losses and recording them, what we can essentially do is map the gains and losses on the tree that relates the different tandem repeat units and see where many losses or gains happened on the tree. This is an example of my favorite tandem repeat unit, which is a leucine-rich repeat. This one is coming from bacteria; it is essentially a type III effector which attacks the plant immune system. These tandem repeat units seem to serve as some kind of binding part of the effector, either in direct contact with the proteins that are being attacked in the plant host or via some intermediate proteins. So yes, the protein is called the GALA protein. So overall, these are
useful algorithms that we hope can also be used in other studies where you find tandem repeats.
And I'd like to thank my collaborators. Elke Schaper was my PhD student; she's done a lot and a lot of work on repeats and still can't get enough of it, so we are still collaborating quite a lot. Olivier Gascuel was also working with us on defining the conservation and separation modes. Adam was the computer science geek who wanted to make the program really, really fast. Andrey Kajava was my collaborator in Montpellier, who is a structural biologist with whom we started, and whose fault it is that I'm actually working on tandem repeats; he made me interested in tandem repeats. And finally, Julia was a master's student and [inaudible] I supervised; both contributed to the TRAL library together with Elke.
So now, of course, the floor is open for questions.