Good morning everyone in the room and everyone across the world. Today Julio Caravane from the USFPS is starting a second lecture on the topic of cancer evolution, or rather on the computational modeling of tumor evolution. Well, thanks for coming to the second lecture, it means that the first one was good. I think maybe we can close this one, no worries. Yeah, I'll go full screen. Cool. So in this lecture we go a little bit further and build on the things we discussed last week. The main ideas were that we want to model the process of tumor evolution, we want to understand it, and we want to use sequencing data to get information about what is going on in the process. If you remember, the main concepts from the previous lecture were that we have a process that starts from a single cell and is driven by cell division, so there is continuous growth of a population of cells. In the process of dividing, cells randomly acquire somatic mutations, and whenever a mutation is acquired in a cell, that very same mutation is present in all the progeny of that cell. So there is a cumulative character to the growth process. From an evolutionary point of view, we imagine expansions of cellular populations that become more and more complex in terms of genotypes: the older the cells, the more mutations they carry. We also have multiple populations coming up one inside the other and competing for survival, in the sense that we imagine the process as governed by a latent, hidden fitness landscape, where this population has a certain fitness value and that other population has another. If one population has some kind of advantage over the others, a higher fitness, then in the long run we expect that population to become completely dominant in the tumor; it wins over, it takes over, it sweeps the other populations. That is the effect of positive selection of that population over the others. We said things are more complicated: there is positive selection, there is negative selection, there is neutral evolution. And we are going to build on top of these concepts. The most important thing we said last time concerns the level of measurement in our data; let me just finish this picture to make it consistent, and then we will discuss it a while later. At the level of sequencing, we have a reference genome here, and a technology that gives us reads that we align over this reference genome. What we usually do is look at particular positions of the genome that we care about (or all of them), where we expect to find a certain nucleotide, and in some of these reads we find a different nucleotide. By piling up reads and dividing the number of reads with the variant by the number of reads with the variant plus the number of reads with the reference, which together we call the depth of sequencing, we obtain the variant allele frequency. And because we do this across the full genome, we get a spectrum.
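As a minimal sketch of that last point, assuming some hypothetical read counts (NV = reads carrying the variant, NR = reads carrying the reference), the variant allele frequency is just a ratio:

```r
# Toy read counts at three hypothetical genomic positions.
NV <- c(48, 27, 9)        # reads carrying the variant
NR <- c(52, 81, 95)       # reads carrying the reference allele

DP  <- NV + NR            # depth of sequencing at each position
VAF <- NV / DP            # variant allele frequency

round(VAF, 3)             # e.g. ~0.48, ~0.25, ~0.09
```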
So we get something that looks like that, okay? That was the setup. Our job is to think about how this is informative of the actual growth process, based on the shape of this distribution. I've heard some of you work on things like copy numbers and other particular types of genetic lesions that are important in the context of the disease. I must admit that I'm hiding a lot of these complex things under the rug, because I wanted to stay at the simpler level of explanation. So I'm talking about mutations in the simplest form, but the truth is that there are a lot of more complex events, and the frequency spectrum is affected by all of them. So one possibility is that for tomorrow's lecture we focus on these other kinds of complex genetic events; if you want, we can talk about copy numbers and the way they affect the site frequency spectrum, the variant allele frequency distribution, or we can do something else. I would like to discuss this with you after the lecture. So, are we all on the same page? Cool. So let's start thinking about... sorry, yes, there's a question; I can repeat it for the recording. That's a very good question. The question is: if a process like this exponential growth happened in a completely arbitrary way across the human body, then we wouldn't be what we are. And that's actually a good observation. Cancers break down a number of control mechanisms that normal cells have. Some of them are related to proliferation: for instance, normal cells have a lot of signaling mechanisms that inhibit proliferation. As a physicist you might think of this as a system in homeostasis, which has reached a certain type of equilibrium, and cancer is an evasion from that equilibrium. So your observation is correct: you don't have these kinds of expansions in a normal tissue. There are some forms of expansion, like clonal hematopoiesis or similar phenomena, but it's not exactly the same thing; otherwise we would be monstrous in some way. In this context there is uncontrolled growth, so all the mechanisms that normally regulate cell growth are broken, also by the presence of these driver mutations, which are the ones that give this proliferative power to these cells. So let's think a little bit about what our data look like, because our job is to make inferences from the data, so we have to build a statistical model for the data. The first thing I would do, if I were you, is to think about what the process is going to look like. If I understand a little bit about the process, then I can make some educated guesses about what the data should look like. So let's make a very simple kind of reasoning. We said that most mutations are completely neutral: they don't do anything. But some are very important; some are the ones that actually trigger the start of these expansions. So they're really important mutations.
Just imagine that we have these mutations in these populations: the red, the yellow, and the green one. These populations are born one after the other; the colors just distinguish different expansions. One is associated with mutation A, one starts because of mutation B, the other one starts because of mutation C. If we think about the genotype of these cells, the red cells have only mutation A, the yellow cells have mutation A plus mutation B, and the green ones have A, B, and C, because there is this continuous process of accumulation over time. Then, when we slice the tumor at this point and take a sample of the process, we get, say, 50% red cells, 30% yellow, and 20% green, so a mixture of these populations in these proportions. And our job is: if we look at the site frequency spectrum, the variant allele frequency, can we understand anything about these populations, for instance their composition? Think of your reference genome here and your reads aligned to the genome. When you align these reads, some will align on top of mutation A; the ones I drew here with an X are the reads that carry the mutation. Because this mutation is present in all the cells you have sequenced (it's in the red, the yellow, and the green populations), you expect to read it from all the cells. But cells in a normal scenario have two copies of every chromosome, so you can picture a cell with one copy and another copy, and the mutation sits on only one of the two copies. By picking one copy or the other at random, you expect half the reads to carry the mutation and half not to, so your expected frequency is 50%. In fact, in this very simple cartoon, you have 10 reads and five of them carry the mutation, so you observe 50%. Then you go to mutation B. That mutation is detected in only two out of three populations, because the red population is wild type for B; it does not carry the mutation. So every read coming from a red cell at position B is going to be wild type. Therefore your expectation for the variant allele frequency is lower, and it comes out to 25%, because B is present in 50% of the cells. And if you go to mutation C, you will find an even lower allele frequency. So this suggests that the temporal ordering of the process, the ordering of A, B, C, is somehow reflected in the frequencies at which I see the mutations, which makes sense: as populations grow over time and become dominant, they get overrepresented, and because they get overrepresented their mutations rise in frequency. Does it make sense?
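To make the arithmetic explicit, here is a minimal sketch using the hypothetical proportions from the cartoon (50% red, 30% yellow, 20% green), and assuming diploid cells, heterozygous mutations, and 100% tumor purity:

```r
# Fraction of tumor cells in each population (toy numbers from the cartoon).
clone_fraction <- c(red = 0.5, yellow = 0.3, green = 0.2)

# Which populations carry which mutation (red: A; yellow: A,B; green: A,B,C).
carries <- rbind(
  A = c(red = 1, yellow = 1, green = 1),
  B = c(red = 0, yellow = 1, green = 1),
  C = c(red = 0, yellow = 0, green = 1)
)

# Fraction of cells carrying each mutation, and expected VAF on a diploid
# genome (heterozygous mutation -> divide by 2).
cell_fraction <- carries %*% clone_fraction
expected_vaf  <- cell_fraction / 2
expected_vaf   # A = 0.50, B = 0.25, C = 0.10
```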
So this also suggests how the site frequency spectrum, the variant allele frequency distribution, should look: it's going to be centered at values that depend on the sizes of these populations, on the relative composition of the populations in my sample, plus there will be some noise due to the sequencing itself, an observational noise. Do we see this? Cool. This is pretty much our job; this is what we do for a living: look at data and think in terms of these kinds of numbers. The most natural way of modeling this type of process is a mixture model, which is always written as a summation over component likelihoods, each with its own parameters, and some mixing proportions, a stochastic vector that sums to one, which tells you that your signal is a combination of signals in different proportions. In this case, the proportions are exactly the things we were discussing before. So we have a nice way of representing the signal, because you have a summation over k groups, k clusters if you want. That's why I said this is all about clustering in some way; this is just a classical way of framing a clustering problem. In this example you can see that the density comes up with numbers that match the composition from the previous slide. What I'm saying is that our component likelihood has to be suitable to describe the process we're looking at. But what is this process? It's a process in which I'm making a lot of draws. How many? That depends on the coverage at this nucleotide, so on the configuration of my sequencing assay. And every time, I either find a read with the mutant allele or a read without it. So we can think of each read as a Bernoulli trial, because a read has two states, mutant or wild type, and we repeat the process many times, approximately independently, because each of these fragments comes from a different DNA molecule in my assay, at least to some extent. So the most natural distribution to use is one that works on discrete read-count data and can account for the coverage and for a success probability, which is the expected allele frequency of the mutation we're discussing. The most straightforward choice is a binomial distribution, where you predict the number of reads with the variant conditioned on how many reads you have and on the success probability of the trials. What you want at the end of the day is to learn the success probability p, which corresponds to the peaks in the frequency spectrum. Does it make sense? This is the probability for a mutation that is present in 100% of your cells: the expected frequency for a mutation present in all the tumor cells is 50%, because that mutation is present in 100% of the cells, if you look back here. You need to look at this the other way around: mutations come up over time like this, so a mutation that is present in all the cells has an expected frequency of 50%, and therefore it sits on this peak here.
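A small sketch of this read-count model, with made-up parameters: each mutation's variant reads are drawn as NV ~ Binomial(DP, p), and the overall VAF signal is a mixture over clusters with proportions pi and success probabilities p:

```r
set.seed(7)

# Hypothetical mixture: three clusters of mutations (the A/B/C peaks),
# with mixing proportions pi and expected VAFs p.
pi <- c(0.4, 0.35, 0.25)
p  <- c(0.50, 0.25, 0.10)

# Simulate read counts for 3000 mutations at ~100x coverage.
n   <- 3000
z   <- sample(seq_along(pi), n, replace = TRUE, prob = pi)  # latent cluster
DP  <- rpois(n, lambda = 100)                               # depth per site
NV  <- rbinom(n, size = DP, prob = p[z])                    # variant reads
VAF <- NV / DP

hist(VAF, breaks = 60, main = "Simulated VAF spectrum", xlab = "VAF")

# Mixture likelihood of a single observation (NV, DP):
mix_lik <- function(nv, dp, pi, p) sum(pi * dbinom(nv, size = dp, prob = p))
mix_lik(NV[1], DP[1], pi, p)
```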
All these mutations are present in all the tumor cells. Why? Well, another way of representing this type of process, which I find a bit more intuitive, is this. Imagine this is your process starting from a single cell, and these cells here are the ones that end up in your sequencing, the ones you get DNA from. For every pair of cells in the human body you can go back to their common ancestor, crawl back through their history, the phylogenetic tree; this is called coalescence. So you can go back to this ancestor here. All the mutations accumulated during this process, from whatever the embryo is up to what we call the most recent common ancestor of all these cells, are present in all of them. So all these mutations give a peak at 50% allele frequency, which will look like this; mutations present in, say, half of the cells give a peak at a lower frequency, intuitively. Does it make sense? So what is the latent structure of your population? The latent structure is the set of colors and the proportions of the colors, because that tells you how many distinct populations are actually inside the tumor. That kind of information is mirrored in the site frequency spectrum, because it tells you to expect certain bumps in the frequency distribution depending on the latent structure of the population. So you can infer the latent structure of the population by looking at the site frequency spectrum. If we want to do inference, we will have to capture this type of process, because it is part of the signal in the data. Does it make sense? Yes, sorry. The question is whether we are considering statistics of the alignment quality or things like that. In general, no: we are assuming you have done your alignment, you have taken all the reads and aligned them at a certain position, and you may decide, for instance, to keep only reads above a certain mapping quality score. This story does not really depend on the alignment; it's a characteristic of the process. Of course, in my experience, for mutations at high frequency, standard alignment procedures work very well. Things get more complicated at very low frequency, because then you have only a handful of reads containing the mutation, so the alignment starts to matter, and some regions are more difficult to align than others, and so on. But to understand the principle linking the frequency to the population structure, you don't have to worry about alignment quality; just use GATK best practices or whatever you feel comfortable with. So, to infer the population structure, you need to build the VAFs and the mixture that is proposed? Yes. Essentially, if you see this as a learning problem, which we're going to do, there are two degrees of freedom here.
One is the success probability, the theta of each mixture component, which in this case becomes the p parameter of the binomial distribution. The other is the number of populations, k. Those are the two things you don't know. Essentially, parameter learning is the process of learning the values of p; model selection is the problem of deciding the value of k, how many distinct things you have. It's obvious in this histogram that there are three; the human eye is a good predictor in this case. One of the good things about these problems is that you can look at the data and make reasonable guesses, because it's not a super high dimensional problem where you couldn't make any educated guess about the data distribution. It's just a classical setting: if some of you have studied machine learning and mixtures of multivariate Gaussians, this is exactly the same kind of problem. For a Gaussian mixture you want to decide how many Gaussians to use and then learn the mean of each Gaussian, maybe the covariance, which is exactly analogous to learning this p here. It's just simpler in this case because it's a different type of distribution, one designed to model discrete read counts, because what we're modeling is a discrete process. But this is not the only type of signal we have in our data. There is another type of signal we need to think about, and it's very important, because so far we have the idea that we need some mixture density related to these bumps in the data; fine, we buy it, we understand it. But maybe there is something else, and this is actually the subtle part of the process, because it's important and it's something people realized only recently, I would say over the last few years. If you remember, when I discussed the principles of the evolutionary process, I asked: what happens when nothing happens? What happens inside each one of these things? This is the reason I drew it on the blackboard. What happens inside is that nothing special happens, in the sense that you have a continuous accumulation of variation over time, but there is no change in the frequency of these populations, because all these mutations are neutral. They change the genotype of the cells when they happen, because they are mutations, but nothing special happens as a consequence. Whereas these mutations here start an expansion, so this cell has a proliferative advantage over these other cells. So what we have done so far is model each one of these bumps by thinking of the growth process, but we also need to think about what this neutral process looks like in terms of our frequency distribution, because it's there. Actually, most of the time mutations are neutral. You don't have thousands of expansions; you have some expansions due to the growth of the tumor, but most mutations are neutral. So most of the time is spent acquiring more genetic variation that makes cells different from one another, while leaving them at the same fitness level, just making them a little bit different.
That's what you were seeing in the video I showed in the other lecture, right? When those bacteria were hitting the antibiotic, they were becoming more and more different from one another, until maybe one of them got the particular mutation that let it outgrow the antibiotic barrier. But they were diverging over time; if you had a clock measuring how different they were, it would be increasing over time. That is a non-functional type of difference. This is called intratumor heterogeneity; it's not functional, it's related to the accumulation of neutral mutations. So we need to think about that process. If you remember the kind of tree I was drawing, I was drawing a lot of divisions with the same color, in blue. What I mean is that all these cells are the same kind of blue; they are different from one another, and we need to model the growth inside each one of these populations. We can think of this process in a very simple way, just by simulating it. Imagine we start from one such cell, this one here, and that cell has one mutation, the blue one. Every time it divides, new mutations come up at a certain mutation rate. They come up on the two branches, because evolution is always branching, because one cell always makes two daughter cells, and the new mutations segregate onto the left and right lineages. This process continues over time in a recursive fashion, and we get more mutations: on the progeny of each of these cells there are new mutations. Then when I sequence, when I make this kind of cut, what I'm actually taking is the leaves of this tree, the final four cells. What I can do is look at the frequency of these mutations, reasoning about frequencies the same way as in the previous part. If I do this, I see that certain mutations, like the blue ones, are present in 100% of my cells, so they will have an expected frequency of 50% if the cells are diploid. And each cell division corresponds to a halving of the expected allele frequency: the green and the azure mutations are found in 50% of the cells, because they branched out at the first cell division, so they have an expected frequency of 25%; the ones that happen afterwards branch out at the second cell division and have an expectation of 12.5%; and if I go further I get 0.0625, and so on and so forth. Every time, I scale my expected frequency by half. This is the expected allele frequency structure inside each one of these expansions: inside each expansion there is a site frequency spectrum that gets shaped just by normal, canonical cell growth in this way. And what you see here is a stochastic process model in which you take one cell and you have two types of events: the simplest one is cell death, the other one is cell division. Whenever there is a cell division, you change the color of the cells, meaning you put more mutations on them; the yellow cell becomes green and orange. What I'm doing here is simulating two such populations, exactly the same thing I have on the blackboard, where the first population is blue, this one, and the second population is purple, this one here. But before going from blue to purple, a number of these neutral mutations accumulate.
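A tiny sketch of that halving, under the idealized assumptions of synchronous binary division, no cell death, diploid cells, and heterozygous mutations (all numbers hypothetical):

```r
# Mutations acquired at generation g after the founder are present in
# 1 / 2^g of the final cells, so their expected VAF is 0.5 / 2^g.
g            <- 0:5
cell_frac    <- 1 / 2^g          # fraction of cells carrying those mutations
expected_vaf <- 0.5 * cell_frac  # 0.5, 0.25, 0.125, 0.0625, ...

# The number of branches at generation g doubles each time, so
# low-frequency mutations vastly outnumber high-frequency ones.
n_branches <- 2^g
data.frame(generation = g, expected_vaf, n_branches)
```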
And what you can see is that I simulated the sequencing process, the variant calling, the alignment, all those things, and took snapshots of the variant allele frequency over time. Let's comment on this. At the beginning, as cell divisions start, a signal comes up, this first bump here, which is the bump of mutations accumulated from the embryo to this first cell. Then, as time goes on, a shape that looks like the power-law shape I was discussing in the previous slide comes up in the data from the simulation of the stochastic process. Then at some point we put a special mutation into one of these daughter cells. In this simulator there is a certain fitness value for this population, which you can think of as the probability of a successful cell division, and another one for the new population, and the second one is much more fit than the first. So what happens is that over time this one takes over the other population. And as you can see, from the lowest part of the site frequency spectrum, which is only limited by my ability to sequence, so by the money I put on the table for my experiment, something pops up at low frequency and slowly climbs to higher frequency. So now it starts: it starts at very low frequency and grows; this set of mutations goes to higher frequency, up to the point where, at the end of the simulation, the pool of mutations I see at high frequency is the joint set of both the blue and the purple ones. And what I also see is another power-law type of distribution, which is the same kind of frequency spectrum originated inside this second expansion. So whenever one of these expansions brings its own signal up to high frequency, it also brings up the neutral signal of its own clonal expansion. Each one of these things is a clone. I would also like you to notice that if we sequence this tumor at this point here, and we say this expansion is associated with one important mutation A, as before, and this one with mutation B, then I will be seeing both A and B at high frequency, and I will not be able to tell which one happened first. So when evolution resolves, in the sense that one population wins over the other (the AB mutant actually won over the A mutant), evolution looks linear, in the sense that I only see the survivor of the expansion. I don't know whether there was a third population here that was competing; it has been completely swept away. Does it make sense? So evolution is always branching because of this process of growth, but when evolution resolves, it looks linear. And for you the difficult thing is that you're going to find maybe several such high-frequency mutations that are important, but you won't be able to say, looking back in time, which one came first. Do you see this? Whereas if, when I sequence my tumor, I find something like this, then I can definitely set up a clock and say that this thing came before that thing, because I'm catching evolution in the act. That's the main message. And this is exactly the thing I was discussing before: the bump of the binomial component, plus the tail. We agree on this.
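A deterministic toy version of that simulation, just to show the subclonal cluster climbing in VAF; the growth rates, the birth time of the fitter clone, and the time points are all made-up numbers:

```r
# Two exponentially growing clones: the ancestral (blue) clone and a fitter
# (purple) subclone born at generation 10; deterministic sketch, diploid,
# 100% purity.
gens     <- 0:25
n_blue   <- 1.3 ^ gens                               # ancestral clone size
n_purple <- ifelse(gens < 10, 0, 1.8 ^ (gens - 10))  # fitter subclone size

# Truncal (blue) mutations sit in every cell -> expected VAF stays at 0.5.
# Mutations private to the purple subclone rise in VAF as it takes over.
vaf_truncal  <- rep(0.5, length(gens))
vaf_subclone <- 0.5 * n_purple / (n_blue + n_purple)

round(data.frame(gen = gens, vaf_truncal, vaf_subclone)[c(11, 16, 21, 26), ], 3)
# gen 10: ~0.03, gen 15: ~0.13, gen 20: ~0.33, gen 25: ~0.45 -> the sweep
```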
This is something that applies to cancer, but really it holds for every evolutionary process. Yes — we will see that it comes up as one over f squared. Well, think of this process; the question is, let me repeat it, can I clarify this a bit. If I move forward in time, because this population is more fit than that population, at some point all the offspring I will be seeing come from this population. That is what we mean when we say there is fixation of that population. And when that happens, my site frequency spectrum is going to look like this again. So essentially, if I sequence here, my site frequency spectrum looks like this, and it's blue. If I sequence here, like now, it looks like this, because I have a bit of both. If I sequence once this clone has taken over, maybe here, then the site frequency spectrum again looks like the first one, but with different colors. So whenever somebody takes over, evolution looks neutral again. And of course there is also a practical problem, my limitation in seeing things that are small: at the beginning this population is very small, and if it is that small you won't be able to detect it. So you can only catch the evolutionary process during the act of sweeping, depending on your coverage and on how big the growing population is. But if we could theoretically monitor and sequence the process without perturbing it, we would see something like this. Does it make sense? The next question is: what if it doesn't overtake the blue completely? Yes, the picture still holds; look at this, let me do another drawing. This is the blue one, and this is the purple one; these are my cells. I have the MRCA for this population, the MRCA for that population, and this one is the MRCA of all these cells together. So the mutations I see at high frequency in the purple cells are the ones that happened here, from this cell down to this other cell. The leftover blue things you see over there are these ones here; the purple mutations at high frequency are these; and the other purplish ones, which will shape the lower-frequency part, are these ones here. We call the high-frequency mutations clonal, meaning they are present in all the cells, and we call the other ones, these here, subclonal, because they are present in a subset of our cells. Does it make sense? Okay. Then, technically speaking, the question would be: is there, in theory, a minimum number of cells needed to start a clonal expansion? Technically it's one, because the expansion starts because at some point, in a cell division from this cell to this one, a particular driver mutation, or epigenetic event, whatever you think is important, was acquired. So technically it is one single event that starts the expansion. Then, to see the expansion, you need a minimum growth of this population relative to the other populations.
What I mean is that if we look at this point here, the population was probably already born; it's just too small to be seen, but it is expanding. Expanding from one cell to a billion takes a bit of time, but if you grow twice as fast, you can actually do the math for two exponential growth processes with different rates and determine how long it takes for one to become visible over the other. Does that make sense? I don't know if this answers your question. The next question is whether we can say anything about the size or the number of expansions based on the number of mutations, because one of the confounders, of course, is how big the biopsy you take is. Let me make a very simple drawing. You have the same expansion here, and in one case you sequence all these cells, so your MRCA goes up here, and this is the number of clonal mutations. But say you make a sequencing assay where you take only a small subset of them; then your MRCA moves down here in time, and you are going to have an extra set of mutations that look clonal. So a confounder is the amount of DNA you collected; only if you could make that uniform could you compare. Let me just go back, yep. No, you don't see the colors: your statistical problem is putting colors on top of one of these snapshots. Here I'm simulating the process, so I know the colors, but our job, as I said in the first lecture, is to work out how many colors there are and which ones they are. When evolution has resolved, you effectively see only one color: if you remember this slide, if you were to sequence only the yellow population, you would guess it was born out of some red population, but you would only be seeing A and B at high frequency and that would be it; you can't tell which one came first. So this makes sense. And yes, these are single nucleotide variants: it's not a gene frequency, it's the frequency of a point mutation in this particular sample, okay? The next question is specific to a certain type of mutation, insertions and deletions, which are much more complicated to align, and therefore the noise around the allele frequency of an indel is much higher. I would suggest that if you want to do these analyses, you work with single nucleotide variants. You can map indels a posteriori onto the site frequency spectrum using the assignment probabilities of a clustering model, but I wouldn't use them directly. Anyway, I want to tell you that people have been thinking about the mathematics of this kind of process, and there are beautiful papers by Kessler and Levine in the context of modeling Luria-Delbrück evolution, which is exactly the kind of process I was describing. Luria and Delbrück got a Nobel Prize for this work on the growth of resistance in bacteria. The beautiful thing is that they were the first to do the experiment and work out the mathematics for the population genetics of the process, which is the same mathematics we can use to understand and describe how the variant allele frequencies evolve over time. And there is this beautiful paper in PNAS.
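A back-of-the-envelope version of the "how long until you can see it" calculation mentioned above, with made-up growth rates, population size, and detection limit:

```r
# Ancestral clone and fitter subclone as two exponential growth processes.
r1 <- 0.30          # per-unit-time growth rate of the ancestral clone
r2 <- 0.60          # growth rate of the subclone (grows twice as fast)
N1 <- 1e6           # ancestral cells at the moment the subclone is born (1 cell)

# Fraction of tumor cells belonging to the subclone at time t after its birth.
subclone_fraction <- function(t) exp(r2 * t) / (N1 * exp(r1 * t) + exp(r2 * t))

# Time until the subclone is detectable, say when its mutations reach VAF 0.05,
# i.e. the subclone is ~10% of the cells (diploid, heterozygous mutations).
t_detect <- uniroot(function(t) subclone_fraction(t) - 0.10,
                    interval = c(0, 200))$root
t_detect   # in the same (arbitrary) time units as r1, r2
```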
What Kessler and Levine essentially show is the following. You assume that your population grows with a deterministic exponential growth rate, and you write a discrete-time Markov chain model where you ask: what is the probability of having m mutants at step n+1? That probability comes from having one fewer mutant at the previous step and, with a certain mutation rate, one of the remaining wild-type cells acquiring a new mutation. It's a simple combinatorial way of writing the process, from which you can derive the master equation. Doing some complex mathematics, you arrive at the probability P(m), the probability of having m mutants, by summing over the number of mutational events, with terms that look a bit like Poisson processes (because that's where they come from), and integrating over the possible times at which they occur. They show in the analytical derivation that the distribution comes out as a Landau-type distribution, which Guido should be very happy about, and the asymptotic behavior of this distribution follows the one-over-f-squared behavior, which is exactly the power-law frequency spectrum I was discussing before. That's the theory; the distribution differs only in regimes you cannot really sequence anyway, because the frequencies are too low. At the end of the day we don't do much more than 100x when we look at a whole genome today; maybe tomorrow we'll get to 500x or 1000x, but that's still very far from very small frequencies. And there is a similar derivation, a bit easier, at the level of deterministic growth, which takes a slightly different perspective but is actually similar: you are interested in the number of cancer cells at a certain time point, and you write a differential equation for the expected number of mutations per time interval, which depends essentially on the mutation rate, on the ploidy of the genome (the number of copies of the genome, normally two), and on the cell division rate. This is exactly what I was saying before: cells divide, they have a certain genome configuration, and they acquire mutations at a certain mutation rate. Integrating this, and plugging in an exponential growth model, you find that the cumulative number of mutations at a given frequency follows a one-over-f type of distribution, whose density is one over f squared. So you get to the same kind of modeling result, the same expectation about the variant allele frequency distribution, either by working out the stochastic process or the differential equation, whichever you prefer. And there is a huge amount of research on the mathematical modeling of evolutionary processes, in which people derive this type of steady-state equation for different types of processes, usually called type-0, type-1 processes. Rick Durrett's books are all about deriving solutions for these types of processes under the assumption of exponential growth, logistic growth, and a number of scenarios that may apply to different settings.
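A small numerical illustration of that result, under the simplifying assumptions behind this kind of derivation (exponential growth, constant mutation rate per division, no cell death; the rates below are arbitrary). One common way to write it is that the cumulative number of neutral mutations above frequency f grows like 1/f, so the density goes like 1/f²:

```r
# Expected cumulative number of neutral mutations with VAF >= f,
# M(f) = (mu / beta) * (1/f - 1/f_max), for an exponentially growing,
# diploid tumor (mu = mutations per division, beta = growth parameter;
# both numbers are made up here).
mu    <- 10
beta  <- 1
f_max <- 0.5                      # clonal mutations sit at VAF ~0.5

M <- function(f) (mu / beta) * (1 / f - 1 / f_max)

f <- seq(0.05, 0.4, by = 0.05)
data.frame(f = f, cumulative_mutations = round(M(f)))

# Plotting M(f) against 1/f gives a straight line, the classic visual
# check for a neutral 1/f^2 tail.
plot(1 / f, M(f), xlab = "1/f", ylab = "M(f)")
```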
For instance, going back to those growth scenarios: if you were to model stem cell dynamics, as you were saying before, instead of an exponential growth model you would use a logistic type of growth model, because there is saturation in the growth; those populations don't just blow up like that. A lot of these things have already been solved. It's the classical situation where you say "I want to invent this" and then you find a paper from 1945 that already solved the model and gives you all the nice equations; it has happened to us several times at this point. So spend time searching the literature, because in some Russian journal you can pretty much find the solutions to all these complicated equations. The good thing about all of this is that now that we have the theory, we can actually do the inference, because we know how the distributions should look. So what we can do, the way I talk about it, is model-based inference: I don't just take any clustering model and plug it in, I create a clustering model that resembles the type of signal I expect to see, taking into account two facts: the bumps we discussed at the beginning and the neutral evolutionary part. As far as I understand, what I'm going to show you now is the only model in the field these days that takes into account both the standard machine learning for clustering and the population genetics part of the process, combining them into a unified mixture. It's still a mixture, a finite Dirichlet-type mixture, and we need to decide what type of distributions to use. This is something I did a couple of years ago; we are improving it now, but I don't have time to show you the new version, which is still in preparation. The key point is that you need to select the mixture components. The distributions I selected a couple of years ago are distributions suitable to describe the site frequency spectrum: instead of modeling the bumps with a binomial, I used a Beta distribution, because the Beta is constrained to lie between zero and one, so it's a good distribution for these bumps, the pink parts of the distribution here. That's one of the components we need to put in the model. And then we also selected a Pareto Type I distribution to model the bluish part of the process, the neutral tail of the distribution. Does it make sense? These are the two design choices you make as a data scientist: because you have an analytical solution describing the process, you know what the distributions will look like, and therefore you decide to use those distributions. Notice that some people don't do that; they just say, oh, let's do a mixture of Gaussian distributions, and so on, which is a bit of a black-box way of treating the data; there is no reasoning about the data. Here, instead, we spend time reasoning about the process, the cellular growth model, and then we derive the machine learning as a consequence of that, which to me is a more interesting data science approach, because you really need to think about your data; it's not a one-size-fits-all solution. There are some assumptions which are implicit in the model. First of all, the mutation rate is constant; under that assumption this becomes tractable, and let me clarify why.
As I said at the beginning, each one of the expansions comes up with a bump and its own tail, but that is quite difficult to deconvolute from the data; I couldn't find any reasonable way of doing it at the time. So we made the assumption that the mutation rate of these populations is the same, which is not necessarily the case, because there are cases in which the mutation rate of one clone is far higher or lower than another's, but most of the time it is the same. Then you can write a single tail for all the clonal expansions. So we came up with this model, which we call MOBSTER, which stands for model-based clustering. It's model-based because the evolutionary model is what makes you design the mixture in this way. The likelihood is essentially just the mixture definition: the points, the mutations, are independent (that's the product over there), and then you have k plus one mixture components, where k are the Beta bumps and the extra one is a special component with a likelihood that models the Pareto-type distribution. So you have a k-plus-one mixture: a number of bumps in the site frequency spectrum captured by Beta distributions, plus the neutral tail. It's a finite Dirichlet mixture, a standard parametric model in machine learning. In theory, each Beta component can capture an expansion associated with positive selection, while you control for the confounder of neutral evolution by including the neutral tail. And the reason we did it like that is also that, before this modeling approach, people were just using mixtures of binomial distributions. As you can imagine, if you have a monoclonal type of signal, one clonal expansion plus its neutral mutations, and you fit it with a mixture of binomial distributions without a model for the neutral tail, you will most likely capture the bump, and then, because of the finite, constrained variance of the binomial, you will end up adding several more binomial components to soak up the signal of the neutral tail. If you then conclude that each of those binomial components is an expansion, you will overestimate the number of expansions: you get this rather silly cartoon with one, two, three, four, five expansions, whereas you would like to cluster this data as one expansion plus its neutral tail, which is a completely different story about the architecture of the tumor, because one says extreme polyclonality while the other says completely monoclonal. So there is a lot of discussion in the field now about how we should analyze this kind of signal. As far as I understand, this is the only clustering model that takes into account both the theory for the bumps and the neutral process at the same time. Yes — because these mutations are neutral: they make these cells different from one another, but the change in fitness happens at the start of this expansion and at the start of that one. So all these cells have the same proliferative capacity, s2, which is greater than s1, the proliferative capacity of all these other cells. They differ from one another, but they are alike in terms of fitness; they have the same fitness value. Okay. Yeah, this will be a density.
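A minimal sketch of what such a density could look like — not the actual MOBSTER implementation, just the k-plus-one mixture idea with made-up parameters (one Pareto Type I tail plus two Beta bumps):

```r
# Density of a Pareto Type I distribution on VAF space (scale = minimum VAF);
# shape = 1 gives the 1/f^2 neutral tail discussed above.
dpareto <- function(x, scale, shape) {
  ifelse(x >= scale, shape * scale^shape / x^(shape + 1), 0)
}

# k + 1 component mixture: one neutral tail plus k Beta "bumps".
dmix <- function(x, pi_tail, scale, shape, pi_beta, a, b) {
  dens <- pi_tail * dpareto(x, scale, shape)
  for (j in seq_along(pi_beta)) {
    dens <- dens + pi_beta[j] * dbeta(x, a[j], b[j])
  }
  dens
}

# Toy parameters: a tail starting at VAF 0.05 and two Beta peaks (~0.45, ~0.2).
x <- seq(0.05, 1, by = 0.005)
y <- dmix(x,
          pi_tail = 0.5, scale = 0.05, shape = 1,
          pi_beta = c(0.3, 0.2), a = c(45, 20), b = c(55, 80))
plot(x, y, type = "l", xlab = "VAF", ylab = "mixture density")
```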
So yes, there will be a number of mutations; yes, this is a density. This is how we put colors on top of the site frequency spectrum I ended the previous lecture with. These are the colors. If the question is how many colors are in the data, here you would say two: one major expansion, which could be rich in important mutations, because it may be the result of several clones that won one after the other, and here I'm also seeing one ongoing expansion associated with these subclonal mutations; maybe there is something important inside here that justifies it, especially if it is a genetically driven expansion. Does it make sense? So this is a type of deconvolution where you catch the evolutionary process in the act. Coming back to how we actually learn the parameters of this model: the way we solved it at the time was a maximum likelihood formulation, a variation of the classical expectation-maximization algorithm. We had to make a few changes because some of the estimators are not analytical: we have a numerical approximation of the maximum likelihood estimators for the Beta, or alternatively we use moment matching to capture the mean and the variance of the Beta, but it doesn't really matter. At the end of the day, what you want to optimize, which I think is more important, is the number of components: in this particular case we optimize the number of Betas. And honestly, we also optimize over whether we can detect the low-frequency side of the spectrum, the tail, because, as we will see in the practical tomorrow, sometimes the signal is not good enough to understand what is happening at low frequency. And what do you think is the variable associated with that? It's the coverage: to have a clean signal at low frequency you need a lot of coverage, so again, money. The number of reads you have determines your ability to look at low frequencies, which is kind of expected. If you go to the classical way of selecting the best parameters, of doing model selection, there are of course standard approaches based on regularization such as the Bayesian Information Criterion, where you compute your negative log-likelihood and then penalize for model complexity, multiplying the number of parameters by the log of the number of samples. In this case the parameters are the k plus one mixing proportions, plus two parameters for each Beta, because each Beta has a and b as parameters (or a reparameterization, but still two parameters per distribution), and two parameters for the power law: the power law technically has a scale and a shape, and one of the two has a trivial maximum likelihood estimate, which is the minimum frequency in your data, so that one is trivial to learn, while the other has to be learned by inference. But we were not very satisfied with the performance of BIC in selecting the number of components, because it was finding too many clusters based on our interpretation of the data.
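A sketch of that model-selection score; the exact parameter count can be debated, so the counting below (free mixing proportions, two parameters per Beta, one free Pareto parameter) is an assumption following the description above, and the log-likelihood values are hypothetical:

```r
# BIC = -2 * logLik + n_params * log(n), for a fit with K Beta components
# plus one Pareto tail, fitted on n mutations.
bic_score <- function(loglik, n, K, with_tail = TRUE) {
  n_mixing <- (K + with_tail) - 1      # free mixing proportions
  n_beta   <- 2 * K                    # (a, b) for each Beta component
  n_tail   <- as.numeric(with_tail)    # Pareto shape (its scale is trivial)
  n_params <- n_mixing + n_beta + n_tail
  -2 * loglik + n_params * log(n)
}

# Hypothetical log-likelihoods for fits with K = 1, 2, 3 Betas on 5000 mutations:
loglik <- c(4100, 4180, 4186)
sapply(1:3, function(K) bic_score(loglik[K], n = 5000, K = K))
# pick the K with the lowest BIC (here K = 2 would win)
```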
So what I did is take something called the ICL, the integrated classification likelihood, which is an extension of the classical BIC in which we further penalize the solution for a given k by trying to induce sparsity in the clustering. If you have a clustering problem, people do this kind of thing: in the classical machine learning textbook with a mixture of Gaussians, you put one Gaussian here and one Gaussian there, fine. But what if you have a lot of signal here and you end up with a lot of overlapping Gaussians? You might not want that type of solution; you might want your clustering to be well separated, so you try to maximize the separation between the groups. One simple way to do that, since we work with a finite Dirichlet mixture, is to think about the posterior responsibilities of the clustering, what are usually called the z_nk, the latent variables of the clustering: the probability of assigning point n to mixture component k, usually computed as proportional to the mixing proportion of cluster k times the likelihood of component k, normalized by the sum over the row of this matrix. This is the classical probability of assigning a point to a mixture component. So, if I have this type of data, how can I penalize non-sparsity? Think of the points at the overlap of these distributions: all these points have some non-zero probability of being assigned to both this component and the other one. The simplest way of capturing that is the entropy of the latent variables, because the maximum entropy distribution is the flat one: if the responsibilities are flat, a point has equal probability of going to one component or the other. So one thing you can do, and this is something that can help you if you have clustering issues in your own problems, is to add the entropy of the latent variables on top of the BIC; in this case that entropy is driven by the overlap of the distributions. In the Gaussian case it would be the overlap over a two-dimensional space; in this one-dimensional model it is the overlap of these Beta distributions. So you penalize solutions with overlapping Betas you don't like, and you promote solutions with well-separated Beta distributions together with the tail. Does it make sense? It's just a design choice, again, the reason being that when I talk about subclones I want to be very confident that what I see is a real, good subclonal signal. Yeah. But there is no cell here, because we're not really working at the single-cell level: your sequencing experiment is essentially a vector of variant allele frequency values for many mutations; we don't really have single cells. And this is an example of a real inference on some data we collected in the paper at the time. As you can see, the model picks up one, two, three signals with the Betas, so three Beta distributions plus the neutral tail at the very low end. At this point it's trivial to map this information onto the conceptual picture, the clonal evolution model that we started with in the first lecture, where mutations accumulate over time.
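Going back to the ICL idea for a moment, here is a sketch of that entropy penalty on top of the BIC, with a hypothetical responsibility matrix z (rows = mutations, columns = mixture components); this is the generic ICL idea, not the package's exact implementation:

```r
# Entropy of the posterior responsibilities: ~0 when every point is assigned
# to a single component with probability 1, large when assignments are fuzzy.
responsibility_entropy <- function(z) {
  z <- pmax(z, 1e-12)              # avoid log(0)
  -sum(z * log(z))
}

# ICL = BIC + entropy of the latent assignments (more overlap between
# components = higher entropy = a worse, less sparse clustering).
icl_score <- function(bic, z) bic + responsibility_entropy(z)

# Toy example: a crisp assignment vs a fuzzy one for 3 mutations, 2 clusters.
z_crisp <- rbind(c(1, 0), c(0, 1), c(1, 0))
z_fuzzy <- rbind(c(0.6, 0.4), c(0.5, 0.5), c(0.55, 0.45))
responsibility_entropy(z_crisp)   # ~0
responsibility_entropy(z_fuzzy)   # > 0, so the fuzzy fit gets penalized
```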
It should be a bit clearer now that the different outgrowth sizes in this cartoon are just mimicking the fact that this population is growing faster than that population, and that one faster than this one, and so on and so forth. Here we have captured a case in which we see three sets of populations: two subclones, one nested inside the other most likely, or branching, we don't actually know. There is a way of working that out, but essentially the kind of evolutionary model we can associate with this particular patient is reported here: either a linear model, where one comes after the other, or it may well be that the two subclones are coexisting and competing. And if I had sequenced this tumor a little later in time, most likely, if the red population is more fit than the blue population, I would have seen only that population. Does it make sense? Notice that because this is a maximum likelihood formulation of the problem, I haven't said anything about the confidence of the estimates, which is definitely important. In this case the simplest thing to do is a non-parametric (or even parametric) bootstrap of your data, which means you resample your data using your data as the data distribution: you resample the mutations and redo the inference, and in this way you can compute, by bootstrapping, a probability distribution over the number of Beta components in your data, or even over the parameters of those Beta components. Does it make sense? It's a very simple, frequentist way of obtaining a distribution around what would otherwise be single point estimates. The inference is quite fast, a matter of seconds, I would say. And this is what we're going to do in the practical in the next lecture. Practically, what we're going to see tomorrow is this type of data; this is glioblastoma data. I'm going to give you other data too; actually, I can give you data for a number of cases, so you can take it home and use it in your projects if you want, these are public data. We're going to go through some of these examples, and if everything goes right, the results should look like this. These are actual inferences from the model on whole genome sequencing data. I'm going to show you how we take this data, how we load and process it, how we make the inference, and so on. The data is also matched, meaning we have primary and metastatic samples of the same patient. And it looks like this: the thing you see on the left is the result of a fit that clearly describes a monoclonal tumor, because you only find one bump plus a neutral tail, while the thing you see on the right, the relapse sample, shows a very clear signal of a subclonal expansion inside the relapse. So inside the relapse of this patient there are two populations, one outgrowing the other. Does it make sense? We understand that because we find this low-frequency bump, called cluster C1 by the model, this one here, which is definitely a subclone. In this other sample we see something different: still a monoclonal tumor, the only difference being the shape of the tail. I didn't tell you this yet, but that comes about because these tumors can have different mutation rates.
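Before moving on, a generic sketch of the non-parametric bootstrap mentioned a moment ago; the "fit" below is a crude placeholder standing in for the real model fit, and all numbers are made up for illustration:

```r
# Non-parametric bootstrap: resample mutations with replacement, refit,
# and collect a statistic of interest (here, as a stand-in, the location
# of the clonal peak estimated from high-VAF mutations).
set.seed(42)
vaf <- c(rbeta(300, 45, 55), rbeta(150, 10, 60))   # fake VAF data: peak + tail

peak_estimate <- function(x) median(x[x > 0.3])    # crude placeholder "fit"

boot_reps <- replicate(200, peak_estimate(sample(vaf, replace = TRUE)))
quantile(boot_reps, c(0.025, 0.975))               # bootstrap confidence interval
```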
Related to those mutation rates: there is also a way of learning a posterior estimate of the mutation rate, the rate at which mutations accumulate at every cell division, by looking at the parameters of this neutral power-law distribution. This other sample here is instead a little more complicated, because, you see, at low frequency you don't have a clear neutral tail. This could be for the very simple reason that here the tail is mixed with the signal of a subclone that is coming up but is sitting right at the detection limit of your sequencing assay. That means that in here there will be mutations that are actually part of the tail, which you cannot deconvolute from the subclone because they are at exactly the same frequency. This is why the model also has the possibility of choosing not to use the neutral tail and putting a Beta distribution at low frequency instead. This is a very practical example where you cannot capture neutral evolution in the process, so for this kind of sample you can't actually get the mutation rate from the tail, because you don't have the tail. And you will also notice something else here; it looks a little less than ideal, right? What do you think this is? This fit looks very smooth, the density fitted over the data, but the ones on the right look a bit less smooth. What do you think it is? No, it's not money; high frequency doesn't need that much money, money matters for the low-frequency stuff. No, this is a subclone which is very large, maybe 85-90% of the cells carry it, so it's very close to the clonal cluster. Take this and push it towards the right: as it moves towards the right, it creates a kind of asymmetric distribution, until it merges in completely. If it has already swept, it just merges in and that's fine; but if it is about to sweep, almost there, it looks like this: you lose the symmetry of the distribution. In fact, if you compare mutations between this and the other sample, you will see that this lower thing here is actually another subclone, so technically there are three subclones. The model is not able, if you wish, to decouple them, and the reason is that we use a continuous type of distribution that has the flexibility of acquiring a large variance, so it just merges them. But if I take this set of mutations and rerun a standard clustering — let me just turn off the volume — if I take these mutations and run a standard binomial mixture model over them, it will split them apart. Thanks Guido, we can raise the volume again. And this is the result of analyzing data from over 2,000 samples. I can pass it to you if you want: data from a large consortium called the Pan-Cancer Analysis of Whole Genomes, so data from 2,700-something primary tumors. It's public data; you can take it with you and use it for anything you want, including this type of analysis. Each point is one sample. On the x-axis you find the purity of the assay. This is something we didn't discuss, but for those of you who know how things work, we said we sometimes sequence a bit of normal cells too.
So the purity is a measurement of how many such cells we have sequenced. If the world were perfect, purity would always be 100%, but it's not; in fact, sometimes it can be as low as 20%. So think of it like this: you put money into the assay and then you sequence mostly normal cells, so you actually need even more money than usual. And on the y-axis you have the median coverage of the assay, another dimension of money if you want. Each one of these colors, blue or red, refers to whether the model thinks you should be using a neutral tail to explain the signal in the data or not, because, as I said, that gets selected by a statistical argument over the quality of the fit. And what you can see is that the concentration of red points pretty much shrinks towards the top-right corner of the quadrant, which is the corner where you have the highest purity and the highest coverage. And that means that if you reach high-quality data, you will definitely start seeing these types of signals in the data; if you stick to lower-resolution data, you will have more difficulty picking up that type of signal. Which is also one reason these kinds of computational innovations came out at a time when sequencing quality was high enough to start seeing these types of signals in the data.

So this concludes today's lecture. What I'd like you to do before tomorrow, please, if you have RStudio, unless you have already done this for your own work, is to install devtools as a package, because we're going to use it to install packages from GitHub. Does it make sense? Do you know what this is? Everybody knows it? The package is called, I'm going to write it here, devtools, which essentially contains functions to install packages directly from GitHub, because I don't put my packages on CRAN or Bioconductor, I leave them on GitHub. I'm asking you to do this beforehand because it takes maybe five or ten minutes, it compiles on your computer, and I don't want to waste time tomorrow. Then tomorrow we install the packages themselves, which takes a few minutes, no problem, but do this part before.

What we're going to do tomorrow is take some of this data and actually run this analysis. You're going to do this with me on some samples. I'm going to show you how to inspect the quality of the fits, how we decide whether a fit is reasonable or not, and so on, basically the processing. And then I'm going to leave you some data so you can do some analysis yourself and see if you find anything you think is interesting. Does it make sense? Questions? Yes. Can you speak up, because... well, yeah. So, having a population that grows faster than another means that in the same time span you get more cells, right? So the tumor grows bigger. So you can imagine the consequences: a tumor gets diagnosed when it starts impairing the basic functions of the individual, right? So if you grow faster, you are basically going to reach a larger size in a shorter time. Say it again, sorry, I'm not sure. Well, yeah. Yeah. So, one of the things... the cancer evolution field is this beautiful field that started a few years ago in which we look at sequencing data from the point of view of evolution, right? That's what we're trying to do here, right? Which is interesting at some level because you understand the process better.
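As a reminder for the setup, the installation looks roughly like the sketch below; the GitHub repository name is only a placeholder, use whichever repository is given in tomorrow's practical.

```r
# Run once before tomorrow: devtools is needed to install packages
# directly from GitHub (it compiles on your machine, ~5-10 minutes).
install.packages("devtools")

# Tomorrow we will install the analysis package itself, along the lines of
# (the "user/repository" name here is a placeholder, not the real one):
# devtools::install_github("user/repository")
```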
So there is the intrinsic curiosity of understanding the process. But you could also ask yourself what this can actually predict in terms of clinical covariates for my patient. One of the things we usually want to understand is, for instance, whether you can predict the survival probability of a patient under treatment, at baseline condition, and so on and so forth. So far, people have mostly done that based on the presence of particular mutations: if you have mutation A you have this risk, if you have B you have that, if you have A and B, and so on. But one of the things you could naturally imagine is to use some surrogate statistic of this process as the predictor: the number of subclonal populations, the complexity of the architecture, or the mutation rate of the process. A lot of these phenomenological features of the cancer are potential predictors of clinical outcome. So we are at the point where we are trying to dissect the best methods, to understand how we really get precise estimates of the evolutionary process, because that information, once we have large cohorts, could be used to predict outcome and maybe develop biomarkers of some sort. At least some people want to do it like that. So there is always a translational output to what we do. But this would also make sense if you were just curious about understanding the basic functioning of cell growth; if you see, this is really about reasoning on how cells grow and how variation emerges over time. No, well, variation is mutations, and as I said in the first lecture, variation is the source of life in some way, guys, so it's an important mechanism of evolution. We evolved like this, on a different time scale, because in cancer this happens over 15, 20 years, but we are what we are because we evolved through a process like that. We underwent selection in the past, right? So we swept over some other populations, most likely. Okay, thanks.