So welcome everyone, and thank you very much. I'm deeply honored to be here today and that our work on the identification of mutations in single cells was selected for this prize, so thank you very, very much. I did this work during my time as a PhD student in the Computational Biology Group at ETH, and in the next, let's say, 15 minutes I'd like to give you some more insight into what we've done and why we've done it. And with that, I'd actually like to thank Professor Beerenwinkel right here and, I think over there, Dr. Jack Kuipers and Dr. Katharina Jahn for the great support and their hard work and contributions to that project, so thanks a lot. And with that I'd like to give you a very short biological introduction. So we've learned in the flash talks that we have more than 30 trillion cells in our body, and every day we produce billions of cells. In order to do so, we need to have them dividing, so basically you just create a copy. But if things can go wrong, they will go wrong, so in this case it means we sometimes see a mutation happening, and if this mutation provides a proliferative advantage, that means that the affected cells divide more often and quicker than the others. And over time, these cells might acquire additional mutations. This is problematic because, well, the whole system gets out of control, but it's also problematic from a treatment point of view, because it is basically enough if one subpopulation contains a resistance mutation; then we will see a relapse. So we really need to find out the mechanisms underlying these processes. In order to do so, we can now look at single-cell data. What we'd like to have is something like this: a table telling us, for all the cells here, which mutations are present. And we can acquire this information by looking at reads.
So here you see a lot of reads for each cell, and you see some of them show the mutation, and from this nice data set here we would very easily identify the correct mutation-to-cell assignment. However, in practice, this is not the case. In practice what we see is something like this, where we get very noisy data, so in many cases we don't have much support for a mutation to be called. But we can make use of our knowledge that all these cells had a common ancestor. And if we can combine this phylogenetic structure with the mutation calling, then we might be able to overcome the problems with the noisy data. And this is what we set out to do. In order to understand why we need to do it, I'd like to tell you why the data is as noisy as it is. The problem is that we look at single cells, so we have very, very little material to start with. This needs to be amplified, which is usually done with multiple displacement amplification. So we have some hexamers, the green dots right there, and we have polymerases, and these basically just make a copy of the DNA. And whenever these polymerases encounter an already synthesized strand, they just displace it; hence the name multiple displacement amplification. And these new stretches can now serve as new templates for amplification. And if we do this process over and over again, we get lots of material to sequence. However, this is kind of noisy in the sense that some stretches are amplified very much and others almost not at all. In mathematical terms, you can think of it like a Pólya urn model. So you start with an urn here, which has two chromosomes in the beginning; you take a chromosome, you make a copy, and you put both back. And you do this process over and over again until you reach the coverage. Okay, in some cases, because of technical problems, for example, you start with a single chromosome only.
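The urn picture just described can be sketched in a few lines of Python. This is only an illustration of a Pólya urn, not the amplification model used in the actual tool; the function name `polya_urn_vaf`, the allele labels, and the number of draws are assumptions made for the example.

```python
import random

def polya_urn_vaf(n_draws, start=("ref", "alt"), seed=None):
    """Simulate a Polya urn and return the final fraction of mutant copies.

    `start` lists the chromosome copies present before amplification:
    ("ref", "alt") for a heterozygous site, ("alt",) or ("ref",) when
    only a single chromosome made it into the reaction.
    """
    rng = random.Random(seed)
    counts = {"ref": 0, "alt": 0}
    for allele in start:
        counts[allele] += 1
    for _ in range(n_draws):
        total = counts["ref"] + counts["alt"]
        # Pick a copy with probability proportional to its abundance,
        # then return it to the urn together with one new copy.
        if rng.random() < counts["alt"] / total:
            counts["alt"] += 1
        else:
            counts["ref"] += 1
    return counts["alt"] / (counts["ref"] + counts["alt"])

# Hypothetical usage: mutant fractions across 1000 simulated heterozygous cells.
fractions = [polya_urn_vaf(200, seed=i) for i in range(1000)]
```

Starting from one copy of each allele, the fractions spread out broadly between 0 and 1 rather than clustering at 0.5, while a single starting chromosome always yields exactly 0 or 1.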
And in this case, you just have one copy to make all the time, which gives you a result like this. So this is actually real data. What we can see here on the x-axis are the mutation frequencies observed in a cell, and here on the y-axis, how many cells with that frequency were observed. You can see that there are lots of cells with no mutational signal, there are lots of cells with only mutational signal, and then there's a rather flat mix of cells which show the mutation to some extent. We modeled this using beta-binomial distributions; in fact, we have three beta-binomials. So we start with the one representing our mixed cells, and we combine it with one for the cells which don't show any sign of the mutation and one for the cells which show only the mutation. And we end up with a mixture looking like this, which represents our initial observation quite well. So now that I've told you how we model the nucleotides, I'd like to say how we integrate that into our phylogenetic structure. In order to do so, we have a tree. This tree has cells as leaves, so it's a cell lineage tree, and it has inner nodes where we can place mutations. And there are of course a couple of parameters. So in total, our model consists of a tree, an attachment of mutations to these inner nodes, and the model parameters of our beta-binomials. In addition, we have to make some quite strong assumptions. We make the infinite sites assumption, which basically says that whenever you gain a mutation, you're not going to lose it, and that a single mutation is not going to happen twice independently. Beyond that, we assume the errors are independent. So now that we know what our model looks like, we need to say how we would like to score these trees; that is, which tree is a good representation of our data.
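To make the three-component beta-binomial mixture mentioned above a bit more concrete, here is a minimal sketch using only the Python standard library. The weights and shape parameters are made-up illustrative values, and the function names are assumptions; the actual tool learns its parameters during the tree search and may parameterize the model differently.

```python
from math import exp, lgamma

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binom_pmf(k, n, a, b):
    """Probability of k mutant reads out of n under a beta-binomial(a, b)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))

def mixture_pmf(k, n, weights, params):
    """Weighted sum of beta-binomial components."""
    return sum(w * beta_binom_pmf(k, n, a, b)
               for w, (a, b) in zip(weights, params))

# Illustrative values only: one component near frequency 0 (no mutant reads),
# one flat component for the mixed cells, one near frequency 1.
WEIGHTS = (0.25, 0.50, 0.25)
PARAMS = ((1.0, 50.0), (1.0, 1.0), (50.0, 1.0))

# Probability of observing 0, 1, ..., 20 mutant reads at coverage 20.
profile = [mixture_pmf(k, 20, WEIGHTS, PARAMS) for k in range(21)]
```

With these values the profile has peaks at both ends and a flat middle, which is the qualitative shape described for the real data.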
And in order to do so, we basically take our mutations, place them onto these nodes here, and compute the score of the tree. So the score is basically just this: if I place a mutation here, then I would require cells one and two to show it, and three and four not. If I place it up here, they would also need to show it. If you do it naively, it's quite computationally intensive. However, you can do it in a smarter way. You can start by placing the mutations at the leaves. So you just place them here and compute the score for all of them. And then you go to the inner nodes, where you can make use of the results you have already computed. So overall, it's quite efficient: it's linear in the number of nodes you've got. In the end, all you have to do is sum up all the scores you computed to get your final score for the whole tree. This is the score for one tree, but unfortunately there are many trees, super-exponentially many. So this is why we cannot just compute all of them. But what we can do is use an MCMC approach, where we start with a random tree and then move through tree space by changing the tree structure, with a prune-and-reattach approach, or by changing the parameters of our beta-binomials. So we do that many times, 100,000 to millions of times actually, for the burn-in phase. And once we've done that and are quite sure that we have converged, we start sampling from the posterior. This sampling is basically just counting how many times a certain mutation was assigned to a cell, which in the end gives us the probability that the mutation is present in that cell. Okay, so here's a very short summary and overview of our approach. We start by looking at the genomic information for all these cells. We compute the probability that at least two cells are mutated. Why two cells?
Well, if you allow mutations to be present in a single cell only, it's very, very challenging to distinguish that signal from an MDA error that happens early on in the amplification process. So we took all the sites where we compute that at least two cells are mutated. With this candidate set of positions, we then went into the tree learning process. And once we'd done that, we started sampling to get our final mutation-to-cell assignment. And here are some benchmark results on simulated data. So we simulated 100 mutations, usually in 25 cells, with a dropout rate of 20% and 50 repetitions. And we compared ourselves to Monovar, the first single-cell mutation caller that was able to use the information of all the cells to compute the probability of a mutation in a single cell. And as you can see, using the phylogenetic structure actually helps. So we do have a performance gain in the F1 score, and that's almost independent of the number of cells. You can see there's a slight increase in performance with more cells, but not so much, and that is true for basically both tools. But what we can see is that if we increase the dropout rate, our approach doesn't suffer as much. The reason is that it can use information from cells in the same clade to still make an assignment that there should be a mutation in the cell, even though we didn't sequence it. We also looked at other measures. For example, we increased the homozygosity rate, so basically the number of homozygous mutations, and we can see that the performance of both approaches increases. We also wanted to check copy number changes, because in cancer you usually have them and our model doesn't explicitly account for them. But the performance isn't too bad. So yes, with an increasing amount of copy number changes the performance decreases, but not as badly as one could have expected. This is all simulated data.
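As a rough illustration of the bottom-up tree scoring described earlier, and of how a clade can carry a cell with weak read support, here is a small Python sketch. It assumes each cell already has a log-likelihood ratio (mutated versus not mutated) for one mutation, treats every node as a possible attachment point with a uniform prior, and uses hypothetical names throughout; it is not the implementation from the tool.

```python
from math import exp, log

def attachment_scores(children, root, leaf_llr):
    """Score every node as an attachment point in one post-order pass.

    The score of a node is the sum of the leaf log-likelihood ratios
    below it, so each inner node reuses the scores of its children.
    """
    scores = {}
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        kids = children.get(node, ())
        if not kids:                       # leaf: score is its own ratio
            scores[node] = leaf_llr[node]
        elif expanded:                     # children done: reuse their scores
            scores[node] = sum(scores[k] for k in kids)
        else:                              # visit children first
            stack.append((node, True))
            stack.extend((k, False) for k in kids)
    return scores

def mutation_log_score(children, root, leaf_llr):
    """Log of the average likelihood ratio over all attachment points."""
    scores = attachment_scores(children, root, leaf_llr)
    top = max(scores.values())
    return top + log(sum(exp(s - top) for s in scores.values())) - log(len(scores))

# Hypothetical four-cell tree: cells 1 and 2 form one clade, 3 and 4 the other.
children = {"root": ("a", "b"), "a": ("c1", "c2"), "b": ("c3", "c4")}
# Made-up ratios: c1 clearly mutated, c2 barely covered, c3 and c4 clearly not.
leaf_llr = {"c1": 4.0, "c2": 0.5, "c3": -3.0, "c4": -2.5}

scores = attachment_scores(children, "root", leaf_llr)
best = max(scores, key=scores.get)
```

Here `best` is the inner node `a`, so the weakly covered cell `c2` is still assigned the mutation because it sits in the same clade as `c1`. Summing `mutation_log_score` over all mutations would give one score for the whole tree under these assumptions.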
Now I'd like to talk about real data a little bit. So here we're talking about a cancer data set where we had 16 single cells, and they gave us quite a nice branching tree here. Most of the mutations are in the root, but there are quite a few other mutations, and some of them are very specific to a certain set of cells. This data set, as I said, consists of 16 cells. We had a matched normal to account for germline mutations, and we're talking about an exome sequencing data set here. So these are the results if you project the tree I've just shown you onto a 2D landscape. Here are the probabilities from SCIPhI for the different cells, and here are the probabilities provided by Monovar for the same set of mutations. As you can see, we have a very clean picture here, while this picture is rather noisy, simply because Monovar doesn't deal with missing information very well. And also, in some cases, it gets the call wrong because it just didn't capture that this was a dropout event rather than the mutation being absent. Of course, we also did multiple experiments, so we tested on a panel data set as well. Here we had 255 cells, and you can see basically the same picture: we get a much cleaner assignment of mutations to cells compared to these results. But now you could argue that, well, we just took our structure and imposed it on the data. What would happen if we took another approach, for example hierarchical clustering? So we did that, and as you can see, with hierarchical clustering you get clusters, quite nice clusters actually, but the noise doesn't go away. So that gave us confidence that our approach is actually doing what it's supposed to do. And with that, I'd like to come to a summary. So for the first time now, we can actually measure single-cell DNA and use it to call mutations. And using the phylogenetic structure in cancer data sets, or tumors basically, we are able to overcome problems which are caused simply by technical issues.
Actually, we are even able to call mutations in settings where we don't have any reads, or where we don't have any reads supporting our hypothesis that there is a mutation. You can find our approach here on the GitHub page. And with that, I'd like to thank you again, and I'm happy to take questions.