So, I would like to welcome you all to this first SIB virtual computational biology seminar of the 2017-2018 season. Today, we have the pleasure to host online Karsten Borgwardt, who is full professor of Data Mining and heads the Machine Learning and Computational Biology Lab in the Department of Biosystems Science and Engineering of ETH Zurich. I will give you a bit of information about his background and education before I give him the floor. Karsten studied computer science at the Ludwig Maximilian University of Munich and biology at the University of Oxford. In 2007, he completed his PhD on graph kernels at LMU Munich. From September 2007 to August 2008, he was a postdoctoral research associate in the machine learning group at the University of Cambridge in the UK. Then, in September 2008, Karsten moved to Tübingen to head a newly established joint research group for Machine Learning and Computational Biology of the Max Planck Institute for Biological Cybernetics and the Max Planck Institute for Developmental Biology. In 2010, his position and his group were tenured, and in 2011 his group joined the newly established Max Planck Institute for Intelligent Systems. In September 2011, he joined the University of Tübingen as professor of Data Mining in the Life Sciences in the Computer Science Department. He then moved to ETH Zurich as an associate professor of data mining and joined its Department of Biosystems Science and Engineering in June 2014, and in March 2017 he was promoted to the rank of full professor at ETH Zurich. He is also one of our many SIB Swiss Institute of Bioinformatics group leaders.

Just a few words about his lab: big data analysis and biomedical research meet in Karsten's lab. They develop novel data mining algorithms to detect patterns and statistical dependencies in large data sets from biology and medicine. I will now give the floor to Karsten. He will present a solution to the problems of combinatorial association mapping through significant pattern mining, and he will explain in more detail what this technique does. I want to thank you again, Karsten, for accepting our invitation. To the participants online: if you have any questions, please write them into the chat, and I will take care of your questions at the end of the talk; Karsten will be happy to answer all of them. So thank you, Karsten, and the floor is yours.

Thank you for the very kind introduction and also for the very kind invitation to present my research here. As you have heard, I am a data miner, a computer scientist by background, and as you have also heard, we are working on developing new tools for biomedical research in the field of data mining. One very important line of work in my lab is about combinatorial association mapping, and I want to tell you more about this topic in the following.

So my lab is, among its research interests, focusing on developing new techniques for genome-wide association studies, and I would like to start by summarizing what a genome-wide association study is. In a genome-wide association study, we are given a pool of individuals, as you can see here in this example, which only includes five individuals. For these individuals we have phenotypic information; in a medical or clinical setting, this is often the disease status of an individual, whether the person is a case or a control, shown here in red and blue.
And we have genetic information about these individuals. The most common form of genetic properties that we know about these patients are so-called SNPs, single nucleotide polymorphisms, single bases that differ between the genomes of different individuals. So we represent patients, or individuals to put it more generally, as a vector of these single bases that differ between individuals. A genome-wide association study is then a genome-wide search for correlations between variation at one of these positions in the genome, at one of the SNPs, and the phenotype that we observe. So if we are observing half a million SNPs and one phenotype, then we eventually compute half a million correlation scores, or more broadly association scores, to quantify which position in the genome is most associated with the phenotype, the disease of interest.

Using this approach, more than 2,000 new disease loci have been found in human genetics since 2001, the year of the publication of the first draft of the human genome sequence. This is the successful part of the story. More of a failure, however, is the attempt to actually predict phenotypes using these associated loci. The correlations, while being significant, are often still very weak, and the phenotypic variance that we can explain through these significant associations is disappointingly low. When we look at typical examples of complex diseases, for instance cancer, diabetes or autism, then the phenotypic variance that we can explain with these associated SNPs is typically less than 20% of the overall phenotypic variance, and it is much lower than our heritability estimates of how heritable these phenotypes should be across generations. So there is a gap, and the genetics community refers to this gap between the heritability that we expect and the heritability that we can explain by the term missing heritability. There is a phenomenon of missing heritability: with the SNPs that have been found through genome-wide association studies, we cannot explain a large fraction of the phenotypic variance in heritable phenotypes.

Now, there are many reasons, potential reasons I should say, for this that are being discussed in the genetics community. One of them that I would like to focus on today is that the standard methods, just like the genome-wide association mapping approach that I just presented, ignore interactive effects between loci. Many approaches just look at one SNP at a time and correlate it with the phenotype, or they look at additive effects of several SNPs. However, there is hardly any method that looks at interactive, multiplicative effects between different loci. Why is this a shortcoming? Well, there are a lot of hints and indicators that it is actually worthwhile and very interesting to study such interactive effects between genetic loci, and it is very likely, or at least plausible, that such interactions may have a non-negligible effect on phenotypic variation. An excellent review in this field is the one that I have posted here, "Why epistasis is important for tackling complex human disease genetics" by Mackay and Moore. They list several examples in this review of why this is an interesting problem to study, in particular in model organisms. We have many examples of genetic interactions between different loci in the genomes of these model organisms. One well-known example are these large screens for synthetic lethal interactions.
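To make the univariate scan just described concrete, here is a minimal Python sketch that computes one association score per SNP. The genotype matrix X (individuals in rows, SNPs in columns, for example minor-allele counts) and the binary phenotype vector y are hypothetical stand-ins, and monomorphic SNPs are assumed to have been filtered out; this is only an illustration, not the speaker's actual pipeline.

import numpy as np

def univariate_scores(X, y):
    # One Pearson correlation score per SNP, computed in a vectorised way.
    # Assumes every SNP column varies in the sample (otherwise the denominator is zero).
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return (Xc * yc[:, None]).sum(axis=0) / denom

# Example usage: scores = univariate_scores(X, y); the SNP with the largest absolute
# score is the position most associated with the phenotype in this simple sense.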
In these synthetic lethal screens, knocking out two genes at a time kills an organism, but knocking out a single one does not have this lethal effect. So this is an excellent review that I would recommend in this area.

Now, with all of these genetic interactions playing an important role in model organisms, it is more than plausible to study the same phenomenon also in human genetics, and that is what we would like to do. So why can't we do it directly or easily? We can't do it easily because this form of searching genome-wide for interactions is an extremely challenging problem, both from the computational and from the statistical side. If we refer to this approach of finding combinations of SNPs as combinatorial association mapping, then we have to realize that, first, there is a huge computational challenge, because we face a combinatorial explosion of the number of candidate sets of SNPs that could be associated with the phenotype. There is also a huge statistical challenge: a combinatorial explosion of the number of association tests that we perform.

I'll give you a simple example. It is absolutely realistic to work with a data set where we have one million SNPs per individual. Now, if we only consider pairs of SNPs and do combinatorial association mapping with pairs of SNPs, then in an exhaustive search we have to consider on the order of 10 to the 12 SNP pairs for association with the phenotype. It is easy to see that this is a huge search space of 10 to the 12 candidate pairs, and each of them represents one hypothesis that we are testing. So there is also a huge problem of multiple testing correction that we have to deal with in combinatorial association mapping, and the example I just gave is only the example for pairs of SNPs, so everything gets much worse if you look at higher-order interactions of SNPs. If we consider associations of groups of C SNPs, where C is larger than 2, then our multiple testing problem gets even worse: any set of C SNPs would correspond to a hypothesis that is tested, and the number of hypotheses K would be of the order of d to the C, where d is the number of SNPs, so it grows exponentially in the size of the SNP sets that we are considering.

Now, if you ignore multiple testing in this setup, then alpha percent of all SNP sets that you are testing might be deemed statistically significantly associated, even in a data set where there is actually no real dependence between genotype and phenotype at all. Alpha here is our significance threshold, which in most applications is set to 5%. So if you ignore multiple testing, then 5% of 10 to the 12 SNP pairs would be false positive discoveries, even if there is no dependence between genotype and phenotype in our data set. And 5% of 10 to the 12, you can imagine how many that is, a really huge number that you would have to struggle with. It is therefore imperative, as stated here on this slide, to control for multiple testing, for example to control the family-wise error rate, which is the probability of having at least one false positive among your significant associations. However, there is a second problem in this correction for multiple testing. If we choose one of the popular approaches such as Bonferroni correction or false discovery rate control, then in Bonferroni correction we have to divide the significance threshold alpha, as stated here, by the number of tests K that we perform. But K, as I said before, is, even in simple examples, of the order of 10 to the 12.
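The orders of magnitude quoted above can be checked with a few lines of Python; the numbers below are just the arithmetic of the example (one million SNPs, pairs of SNPs, a significance level of 5%), not results from any real study.

from math import comb

d = 1_000_000                 # SNPs per individual
alpha = 0.05                  # significance level
k_pairs = comb(d, 2)          # number of candidate SNP pairs: 499,999,500,000, order 10^12
print(k_pairs)
print(alpha * k_pairs)        # expected false positives without correction: ~2.5 x 10^10
print(alpha / k_pairs)        # Bonferroni-corrected per-test threshold: ~1 x 10^-13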
So the Bonferroni-corrected significance threshold that you get will be so small that no SNP set will pass this extremely stringent significance threshold anymore. We will lose all statistical detection power for true associations. Doing the same with false discovery rate control will not save the day: you will still get extremely stringent significance thresholds. In fact, in many domains, both in genetics but also in data mining and machine learning, this was long considered an unsolvable dilemma: either you suffer from extremely many false positives, or you lose all statistical power when properly correcting for multiple hypothesis testing. However, recently there have been a number of advances in the field of significant pattern mining, mainly by the group of Koji Tsuda at the University of Tokyo in Japan and by my lab, which are now bringing us closer to a solution of this problem and, in some very interesting settings, actually solve it.

Now, in order to explain how this solution works, I first have to define some terms, namely, first of all, feature selection. This may be known to everyone: feature selection is the problem of finding features that let you distinguish different classes of objects. What I described in the genome-wide association mapping setting is a feature selection problem: you try to find the SNPs that are correlated with disease status. The method that I am going to talk about belongs to the field of pattern mining. So what is pattern mining? Pattern mining is an instance of feature selection in which we assume that our features are binary, so they take the states 0 and 1, and we try to find higher-order combinations of these binary features, so-called patterns, that distinguish one class from another.

Now let's have a closer look at the example that is shown here. We have again a toy example of two classes of individuals, plants that differ in their phenotype, yellow and blue. We have highlighted here in blue three SNPs, and they are summarized in a pattern here on the right. The pattern is just the product of the numerical values representing these SNPs. So only if your features are one at each of these three positions in the genome is the pattern one as well, because the pattern represents the product of these different SNPs or positions in the genome. This is one example of a higher-order combination of binary features that you might be interested in. These are the patterns that we will refer to in the following.

Now, what is the common way of testing an association between a pattern and a given output variable, in our case a phenotype? Let's refer to this pattern as S and to the phenotype as Y. What you commonly do is create a contingency table. Here in the columns we have counts for individuals that contain the pattern: on the previous slide, in our example, the first three plants would be examples of individuals that contain this pattern, where the product of the three blue SNPs is one. The second column represents the counts of those individuals that do not contain the pattern. And the rows tell us how many of these pattern carriers fall into class one and into class two, and how many of the non-carriers of the pattern fall into class one and class two, respectively. After you have represented the data in such a contingency table, what you often, or in most cases, do is apply Fisher's exact test to test whether S, the pattern, is over-represented in one of the two classes, Y equals one or Y equals two.
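As a minimal sketch of the test just described, the code below builds the contingency table for one candidate pattern and applies Fisher's exact test. X, y and snp_idx (the indices of the SNPs forming the pattern) are hypothetical, and the class labels are assumed to be 1 and 2 as on the slide.

import numpy as np
from scipy.stats import fisher_exact

def pattern_pvalue(X, y, snp_idx):
    # The pattern is the product of the selected binary SNPs: it is 1 only if all of them are 1.
    pattern = X[:, snp_idx].prod(axis=1)
    a = np.sum((pattern == 1) & (y == 1))   # carriers of the pattern in class 1
    b = np.sum((pattern == 0) & (y == 1))   # non-carriers in class 1
    c = np.sum((pattern == 1) & (y == 2))   # carriers in class 2
    d = np.sum((pattern == 0) & (y == 2))   # non-carriers in class 2
    _, p = fisher_exact([[a, b], [c, d]])   # is the pattern over-represented in one class?
    return p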
So the question is whether the members of one of the two phenotypic classes are more likely to carry the pattern S than the other group. And the common way to compute p-values for this Fisher's exact test is based on the hypergeometric distribution. We don't have to go into mathematical details here, but you have to be aware of one fact, which is that this hypergeometric distribution for computing p-values for Fisher's exact test assumes fixed marginals x, n1 and n. x is the total frequency of our pattern in the data set, so the total number of individuals that carry the pattern, n1 is the number of members of class one, and n is the total size of the data set. This is all assumed to be fixed in the calculation of the p-value.

Now, the lab of Tsuda rediscovered, four years ago, a very important piece of work from statistics that was published by Tarone in 1990, and I would like to explain the insight that Tarone had 27 years ago here in the following. Tarone noted that when working with discrete test statistics, for instance Fisher's exact test as I just described, there is a minimum p-value that a pattern can achieve. I would like to illustrate this in a bit more detail. What do we mean by such a minimum p-value? You have a particular pattern S and it appears x times in your data set. Now, what are the most extreme configurations of the data set that you can imagine? Well, the two most extreme configurations are that the carriers of this pattern S all belong to phenotypic class one, Y equals one, or they all belong to phenotypic class two, Y equals two. In these two cases we have the most extreme configuration of the contingency table, and you get the lowest, most significant p-value that you can imagine given this frequency of the pattern. So with every frequency of a pattern there is an associated minimum p-value that this pattern can achieve if it follows one of the two extreme configurations of the contingency table.

But what Tarone realized is that this minimum p-value for a given pattern S with frequency x is often larger than the Bonferroni-corrected significance threshold, which is shown here, alpha divided by k. So in short, there are patterns whose minimum p-value is larger than the Bonferroni-corrected significance threshold, which means nothing less than that these patterns can never become statistically significant. Tarone refers to them as the so-called untestable hypotheses, those hypotheses that can never become significant. And he shows in his paper that in Bonferroni correction one should only correct for the testable hypotheses rather than for both testable and untestable hypotheses. So if we are correcting for k hypotheses and only m(k) of them are testable, then it suffices to correct for m(k) hypotheses in multiple testing correction rather than for k hypotheses. And as the number of testable hypotheses is often much, much smaller than the total number of tests that we perform, this greatly improves our statistical detection power for true associations. This is a fundamental trick that now, 27 years later, becomes extremely relevant in data mining and in genetics.

So as a first step in the research of my lab, we tried out empirically how well this trick works, and we used it on a feature selection problem from graph mining. I don't have to go into the details here. Of major interest is the plot here on the left. It shows the so-called correction factor, that is, the number of tests that we have to correct for in this application.
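Tarone's minimum attainable p-value can also be made concrete in a short sketch. For Fisher's exact test with fixed marginals x, n1 and n, the most extreme table puts all x carriers into the smaller class, and its hypergeometric probability is the smallest p-value the pattern can ever reach; the sketch below is written for the one-sided test and for x at most the size of the smaller class, and the helper names are illustrative assumptions rather than the speaker's code.

from scipy.stats import hypergeom

def min_pvalue(x, n1, n):
    # Smallest attainable one-sided Fisher p-value for a pattern of frequency x,
    # with n1 the size of the smaller phenotype class and n the total sample size.
    assert x <= min(n1, n - n1), "this sketch only covers the left branch of the curve"
    return hypergeom.pmf(x, n, min(n1, n - n1), x)

def is_testable(x, n1, n, threshold):
    # A hypothesis can only ever become significant if its minimum p-value passes
    # the corrected threshold, e.g. alpha / k.
    return min_pvalue(x, n1, n) <= threshold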
You see here in red the classic standard Bonferroni correction, and on the x-axis you see the cardinality of the feature sets that we are looking at. So if we go for very large feature sets, then we have to correct for at least hundreds of millions of tests that we perform. However, if we compute Tarone's reduced correction factor for multiple testing correction, then it suffices to correct for 10,000 hypotheses in this setting, even if we have no upper bound on the size of the feature sets that we are considering. This drastic improvement in statistical power then allowed us to find new interesting feature sets, patterns, on this actually very well studied data set. This impressed me very much, and therefore four years ago I decided to further explore this topic of applying Tarone's trick in data mining and in statistical genetics.

Now, how do you actually compute the correction factor in Tarone's approach? Let me define some variables in order to answer that question. Assume k is the number of tests that we correct for, and m(k) is the number of hypotheses that are testable if our significance level, our multiple-testing-corrected significance threshold, is alpha divided by k. Obviously m(k) is a function of k; that means the number of testable hypotheses depends on the significance level and on the number of tests that we correct for. And we must require that k is equal to or larger than m(k), which means that we correct at least for all testable hypotheses; otherwise we are not correcting for all hypotheses that could become significant. So the optimization problem that we solve is: we want to find the minimum number of tests k that we have to correct for, such that this number of tests is larger than or equal to the number of testable hypotheses. Tarone proposes to solve this optimization problem by initializing the number of tests that we correct for to one and then iteratively increasing k by one until k reaches or exceeds m(k). That is the brute-force approach to finding the number of tests that we have to correct for.

However, there is one important question: how do we efficiently compute m(k) for a given value of k without running through all possible hypotheses in our huge space, which grows exponentially in the size of the feature sets that we are considering? The brute-force approach to finding out how many patterns are testable would be to run through all these patterns and compute their minimum p-values. Unfortunately, we cannot afford this in the extremely large spaces that we are facing here in combinatorial association mapping. And now the insight that the Tsuda lab, and also the Sese lab I should mention, two labs in Tokyo, had in 2013 is that the minimum p-value of a pattern is determined by the frequency of that pattern in the data set. As I said before, this is the number of individuals in the data set that carry the pattern. Now, finding the frequency of a pattern in a given data set is in fact a classic problem from data mining for which many algorithms have been proposed, and one can exploit this fact in solving the problem of combinatorial association mapping.
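Tarone's brute-force procedure described above fits in a few lines. The callback count_testable is hypothetical: it returns m(k), the number of hypotheses whose minimum attainable p-value is at most the given threshold, and making that count feasible is exactly what the frequent pattern mining machinery discussed next is for.

def tarone_correction_factor(alpha, count_testable):
    # Find the smallest k such that correcting for k tests covers all testable hypotheses.
    k = 1
    while True:
        m_k = count_testable(alpha / k)   # m(k): testable hypotheses at threshold alpha / k
        if k >= m_k:                      # we already correct for at least all testable hypotheses
            return k
        k += 1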
How this works is illustrated in the following plot that I would like to explain in a bit more detail. On the x-axis we have the frequency of a pattern, x, which ranges from 0 to n, the total size of the data set. On the y-axis we have the logarithm of the minimum p-value of that pattern, so if we go down on the y-axis, the minimum p-value gets more significant. And what we see in this plot is that if we go from frequency 0 to frequency n1, which is the size of the smaller of the two phenotypic classes, then the minimum p-value decreases with the frequency of the pattern. So in this range here, from 0 to n1, the more frequent your pattern, the smaller its minimum p-value. This is an extremely interesting insight, because it means that we can find the testable hypotheses, those that have a minimum p-value smaller than a predefined threshold, by performing frequent pattern mining, that is, by finding patterns whose frequency is above a certain threshold.

And this is exactly what Terada et al. then proposed in 2013. They proposed to determine the number of testable hypotheses m(k) by doing frequent itemset mining, frequent pattern mining, on the data set with a frequency threshold theta, which is a function of our significance level. Let me illustrate this on the previous slide. You see here the dashed bar in the middle; this is one possible significance threshold that we have to pass. All patterns that are testable should have a minimum p-value that is below this dashed bar, and we can see on the x-axis that they have to have a minimum frequency of what we call here x_min. That means that in this particular example we would run frequent pattern mining with a frequency threshold of x_min in order to find the testable hypotheses, and that is exactly what Terada et al. proposed. However, if we start with very small k's, so with very small correction factors, then the corresponding frequency threshold theta of alpha divided by k will be very small. If you have ever used frequent itemset mining or frequent pattern mining, then you will know that frequent pattern mining with a very low threshold is a nightmare: your runtime will absolutely degenerate, because there are so many different patterns that pass a very low frequency threshold.

And this is where the work of my group started to kick in. First of all, and I only summarize these contributions here, we answered the question of how to efficiently find the optimal k, the optimal correction factor, without running frequent pattern mining with very low frequency thresholds. This was solved in 2015, and I have the full list of references at the end if you are interested in the details of these methods. The second point is that the patterns we are looking at are in subset-superset relationships; just imagine sets of features of different cardinality. How can we account for this dependence between the tests that we are performing? In all that I presented so far, we do not account for any dependence, any subset-superset relationship, between different patterns. Then, in the life sciences it is extremely important in such tests to correct for covariates such as age and gender of the patient. So how can we retain our computational efficiency and statistical power if we account for these categorical covariates? In 2016 we found a solution to this very difficult problem. And the fourth and final question is: can we develop new association mapping approaches based on Tarone's trick? The answer to this is yes, and I will very quickly describe one new association mapping approach that we developed based on Tarone's trick.
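Returning for a moment to the frequency threshold: because the minimum p-value decreases with the pattern frequency between 0 and n1, a significance threshold such as alpha divided by k can be translated into a minimum support x_min for frequent pattern mining. The sketch below does this translation using the hypothetical min_pvalue helper from before; it illustrates the idea only and is not the actual algorithm of Terada et al.

def support_threshold(threshold, n1, n):
    # Smallest pattern frequency x_min whose minimum attainable p-value passes the threshold;
    # frequent pattern mining with this support threshold then enumerates the testable patterns.
    for x in range(0, n1 + 1):
        if min_pvalue(x, n1, n) <= threshold:
            return x
    return None   # no pattern can be testable at this threshold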
This new approach is what we call genetic heterogeneity discovery, and I will jump straight to the toy example. Genetic heterogeneity, in the classic clinical genetics meaning of the word, means that a disease may be caused by more than one locus or allele in the genome. This is illustrated here in this toy example. We have again our six plants here, and we have their SNPs represented as bits. Think of the ones as rare variants; zero would then be the non-rare variant. What we are looking for in genetic heterogeneity is groups of positions in the genome such that having a rare variant at any of these positions gives rise to the phenotype of interest, or is at least correlated with the phenotype of interest. So in this example we look at contiguous intervals, contiguous sequences of SNPs, a block of length four, and we check whether there is at least one rare variant within this block. If that is the case, then the state of that block, of that pattern, is one; here on the slide it is called a meta-SNP. And we do this for all individuals: we check whether they contain a rare variant in this interval, and then we represent this interval by one number, depending on whether there is at least one rare variant in this interval.

Now, similar approaches, called burden tests, have been proposed in the genetics literature, but they suffer from one shortcoming: they have to precisely define the intervals in the genome that they want to test, and typically they define these to be all the genes in the genome. What we would like to do is to scan the entire genome for such intervals. These intervals may lie in intergenic regions, they may overlap with parts of a gene, they may overlap with several genes; we want to allow for arbitrary start and end points of these intervals. Of course, if we allow for such flexible start and end points, then we are performing more tests than the classical burden tests, so we have to correct for the inherent multiple testing problem, and we want to do this in a computationally efficient way and with high statistical power.

To solve this problem, we model it as a pattern mining problem: given an interval, an individual contains the pattern if it has at least one minor allele in this interval. This is what I just described: having one rare variant, one minor allele, in a particular interval makes you a carrier of the pattern. And then we can greatly prune the search space to make the search for those intervals that are associated with a phenotype of interest very efficient and statistically powerful. What is obvious is the statement here on this slide, which is that the longer such an interval is, the more likely it is that it contains at least one rare variant, that it contains at least one 1. This sounds like a basic fact, but it helps us to prune the search space tremendously, namely through two pruning criteria that I summarize in the following.
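The meta-SNP encoding described above is easy to write down: an interval of SNPs is collapsed to one binary feature per individual that is 1 if the individual carries at least one minor allele anywhere in the interval. X is again a hypothetical binary genotype matrix, and the resulting meta-SNP can then be tested for association like a single SNP, for example with the Fisher's exact test sketch shown earlier.

import numpy as np

def meta_snp(X, start, end):
    # 1 if the individual has at least one rare variant (a 1) in the interval [start, end), else 0.
    return (X[:, start:end].sum(axis=1) > 0).astype(int)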
Pruning criterion one: if too many individuals in our data set have a particular pattern, then the corresponding interval is not testable. You see this again on the plot of the minimum p-value versus the frequency of a pattern: if the pattern gets extremely frequent, shown here by the dotted blue line, then the minimum p-value goes up again, so if a pattern is extremely frequent, it is not testable anymore. That means that if we look at very long intervals, it is very likely that most of the patients will have at least one rare variant in such a long interval, and therefore very likely that it will not be testable and that we can prune it away. Pruning criterion two: if we have found such an interval, such a pattern, that is too frequent to be testable, then none of its super-intervals is testable either. So we can prune away all intervals that include this non-testable interval, and thereby we can reduce the search space very much.

How much, we can see on the following slides; I will keep this very short. We show here the length of the sequence that we are considering, so the length of the genome, versus the time taken to search for significantly associated intervals. Our approach, called FAIS here, scales only linearly in the number of SNPs in the genome, although the number of tests that we would have to consider in a brute-force approach is quadratic in the length of the sequence. So we can prune away so many candidate intervals that the runtime is still linear in the length of the sequence, which is a very good outcome. Our power is also, as shown here in simulations, higher than that of competing approaches; I will keep this very short as well.

Now, if you are a critical reviewer, you could ask: with these interval-finding techniques, do you just find intervals that include one strong single-SNP hit plus some neighboring SNPs that add noise to this association? Well, we tested this hypothesis on 21 binary phenotypes from Arabidopsis thaliana, the plant model organism that has also frequently been used for genome-wide association studies. We looked at all the intervals that we found and checked whether they overlapped with, or were at least in the neighborhood of, a single-SNP association that we would find with a standard genome-wide association method, like a linear mixed model or a univariate Fisher's exact test. It turns out that 70% of the intervals that we find are not located in a 10 kb window around these single-SNP hits, so 70% are genuinely novel, and the other 30% are located in a 10 kb window around the hits found by these other standard techniques.

So, to conclude, with this approach, which we call FAIS, fast automatic interval search, we can efficiently search for intervals that exhibit genetic heterogeneity. We do not have to pre-define the boundaries of the intervals; we can search genome-wide. We can properly correct for multiple testing at the same time, and in the meantime we have also solved the problem of how to account for covariates like age and gender. We are currently working on the problem of how to extend this from intervals, contiguous sequences along the genome, to networks and subgraphs of SNPs, if you represent SNPs or genes in the form of an interaction network. This is our current work.

To summarize the entire talk: combinatorial association mapping allows one to study epistasis, SNP-SNP interactions, one important potential reason for missing heritability. The high dimensionality of the problem leads to an enormous computational and statistical challenge, and I hope I have convinced you of that. Solving both problems at the same time was largely unachieved in the past, and we have developed several significant pattern mining approaches that achieve both goals. Now, if you want to read up more on this, download our papers, presentations and the code that is online, then please go to Significant-Patterns.org.
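To make the two pruning criteria concrete, here is a minimal sketch of a genome-wide interval search that uses them. It is only an illustration under assumptions, not the actual FAIS implementation: it reuses the hypothetical meta_snp helper from above and assumes that the testable support range, x_min to x_max, has been precomputed from the minimum p-value curve.

def search_testable_intervals(X, x_min, x_max):
    # Enumerate intervals [start, end) and keep those whose support lies in the testable range.
    d = X.shape[1]
    kept = []
    for start in range(d):
        for end in range(start + 1, d + 1):
            support = int(meta_snp(X, start, end).sum())
            if support > x_max:
                # Criterion 1: an overly frequent interval is untestable, and
                # Criterion 2: every longer interval starting here is at least as frequent,
                # so we can stop extending this start position altogether.
                break
            if support >= x_min:
                kept.append((start, end, support))   # testable: worth an actual association test
    return kept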
As some final pointers, I would like to point out some resources that my group has developed for the Swiss bioinformatics community. We have developed the platform EasyGWAS, easygwas.org, a machine learning platform for geneticists. It allows you to perform genome-wide association studies in the cloud. It is now a few months later than the date, May 9, shown here, and we are approaching 1,000 users for this tool. If you are interested in running genome-wide association studies in the cloud, then please visit easygwas.org. And if you know my earlier work on network comparison, or if you are interested in the problem of comparing networks in biology to each other, then a very useful set of tools is available from graph-kernels.org. We had an application note accepted in Bioinformatics yesterday, which includes an R and a Python package for all the graph kernels, the graph comparison methods, that were developed in my group. Now, this concludes my talk. I would like to thank my group, I would like to thank our sponsors for supporting our research, and I would like to thank the SIB for inviting me.