Thank you very much for being here. The last lecture of the day will be given by Cristian Tomasetti, from the Center for Cancer Prevention and Early Detection at City of Hope. The title of the talk is Mathematical Modeling on Cancer. Thank you. Grazie, thank you. OK. All right, so today we'll talk about this, but let's go back, because we didn't finish yesterday. Just as a recap, remember, we talked about NMF and how NMF is used in what I would say is today considered the gold standard for cancer etiology, which is this methodology called mutational signatures. One reminder: in NMF you take the original matrix and you factorize it into two matrices, both still non-negative, whose product is as close as possible, depending on the metric that you pick, to the original matrix, right? That's the idea. But one issue is this. Say the original matrix has dimensions N times M. When you look at the two matrices, W and H (feature matrix and coefficient matrix is one way to call them), you have to decide a dimension, yes? The P dimension. And what that dimension is going to be will significantly affect the results that you get. And just to be clear, maybe I could have added a second slide on this: in the original matrix, what is an entry? Here every row is a patient, OK? And every column is one of those 96 mutation types that we discussed: the point mutation, one single-nucleotide change, plus the two flanking bases. So we have 96 triplets. Here you have the counts of those triplets for one patient, and then across patients. So each row is one patient.
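As a rough illustration of the factorization just recalled, here is a minimal sketch in plain Python. It uses the classic multiplicative update rules for the squared (Frobenius) error, which is only one possible choice of metric. The toy matrix `V`, the function names, and the choice `p=2` are all assumptions made for this example. With patients as rows, the patterns appear as the rows of `H`; rescaling each of them to sum to one moves the scale into `W`.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Sum of squared differences between V and the product W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, p, n_iter=200, seed=0, eps=1e-12):
    """Factorize a non-negative V (n x m) as W (n x p) times H (p x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(p)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(p)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(p)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(p)]
             for i in range(n)]
    return w, h

# Toy data (invented): 4 patients x 6 mutation classes.
V = [[30, 2, 1, 5, 1, 1],
     [28, 3, 2, 4, 2, 1],
     [2, 1, 1, 40, 3, 2],
     [3, 2, 1, 35, 2, 1]]

W, H = nmf(V, p=2)  # p has to be chosen by the user

# Rescale so each pattern (row of H) sums to one; the scale goes into W.
row_sums = [sum(row) for row in H]
signatures = [[x / s for x in row] for row, s in zip(H, row_sums)]
intensities = [[W[i][k] * row_sums[k] for k in range(len(H))]
               for i in range(len(W))]
```

Changing `p` changes both outputs, which is the dimension problem described above.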
So now you split into two matrices where on one side you have the signatures, OK? And on the other side you have the intensities for those signatures. As I was saying, I think this picture is OK for showing that, if you remember, the mutational signatures are probability distributions on these triplets. Here for simplicity I have just shown the six major types, but then there are subtypes, right? So these are just probability distributions, and one matrix just has those as columns. In fact, in the signature matrix the columns add up to one, because each is a probability distribution. And as I said, the best way to think about it is this. In the presence of smoking, or any carcinogen really, a cell is going to get a mutation in its DNA that was not present before. What type of mutation, among the 96 possibilities, is it going to be? It's like rolling a die or flipping a coin: it goes according to this probability distribution, which is a column of the signature matrix, OK? So the columns add to one. And then the other matrix in the factorization holds the intensities of each one of the signatures in each patient. Sorry, I keep going back and forth so that you see both the mathematical representation and this maybe more intuitive one. It tells you how strong each one of those signatures is in an individual patient, OK? So the fundamental problem, as I was saying, is that you have to choose this third dimension, right? Because when you have a product of two matrices, the dimensions are three: each matrix has two, and one has to coincide, so there are three. And that one has to be picked either by you or by some algorithm that optimizes the choice. And what I was discussing yesterday is how this looks to me, and this was the origin of this paper. And by the way, let me say, I know I said it yesterday, but I've been in schools like this one.
And one thing I really hope that you get from this is just an example of how to think about what one can do with respect to what is out there, right? So here, for example, you look at this type of work and you say, OK, this is interesting. I know about NMF, or I study and review what NMF is. And then you say, OK, what are the potential pitfalls of this technique, which is the gold standard in this field? Where is potentially the problem? One of the problems, as I said, is that you have to choose this dimension, and if it's not the correct number of signatures, well, it's just a mathematical technique. There is no magic in it. This thing just takes a matrix and makes two matrices out of it whose product gets very close to the original one. But it's no magic. So if you pick 20 dimensions and there are four, it's going to be a very messy result, where the truth is spread out across many more signatures than there really are. So here, and I think it's more than my opinion, I think everyone would agree that it's certainly not the case that every lung cancer patient is a smoker, or that every breast cancer patient has BRCA mutations. But again, this technique is forced to split things. And in my opinion, one obvious limit besides the dimensionality is that, in a sense, NMF is going to distribute a little bit of each signature to everyone. Does that make sense intuitively? And that's exactly what you find. Now, you can think, well, OK, you're making a big deal out of this, I can just pick a threshold. When I see things that are very little present in a patient, I say, let's just ignore them, it's just NMF, which cannot be perfect, right? But the reality is different, and that's exactly why I'm showing you these two examples.
Because in fact, and it's true across all cancer types, for BRCA, which is the orange signature here, the signal is huge. But there are actually very few women that really have the BRCA mutation. So it's not that you can say, I'll set a threshold, like I said, so that if something is 5% or below I'll just force it to be 0. Yes?
Cristian, do we already have these signatures and use this matrix factorization to find out the strengths of the signatures? Or do we do this matrix factorization and interpret the rows as the signatures?
So they do both of those things, OK? This is a very good question. On one side, this is just data, right? This is genomic data. I get, I don't know, 10,000 patients, and for each one of them I write down the counts of each mutation type, how many there are. And then I get these two matrices. So on one side I can look at the signatures and ask what they are. Biologists are very interested in looking at this matrix and saying, what is this particular signature right here? Cancer researchers are interested in that, but are also interested in the intensity of each signature in a given patient, because that tells you about the etiology of a given cancer in a given patient. For example, if in a patient I see a lot of the smoking signature, I say, OK, it looks like smoking is what caused your cancer. So both of those aspects are of interest. And of course, one of the problems is that because this is unsupervised, when you come up with a list of signatures, you don't know which one is smoking, which one is sun exposure, which one is age, right? What you can do is say a posteriori, well, this particular signature, for example what they call signature one, is present even more in older people, so this is probably aging. Or this signature seems to be present in smokers.
So that was the whole point of what we thought to do. And yesterday I talked about it, so I won't repeat that. We saw that, in effect, much of each signature is just noise and is not really very helpful, even for the few where we know what they are, OK? You can look at the peak, and that has some meaning, but the rest is just noise. So I won't repeat that part. But what we thought to do is, first of all, if we know of an exposure, why not train the algorithm, whatever algorithm you want, by teaching it that that patient was exposed to smoking or alcohol or whatever it is? So a supervised approach. Supervised learning should always beat an unsupervised approach in terms of performance, if done properly. So that was the whole idea. And the other point, and I think that was the last thing I mentioned yesterday, was that in the previous approach they assumed that each mutational process was the same across all cancer types, right? You get one signature, signature five, say. Maybe I didn't say that, so let me say it better today. When you come up with this matrix made out of signatures, obviously the more data you have, the better the job you are probably doing. So what is typically done is that people take cancer data from, say, 30 different cancer types, put them all together, and then say to NMF, give me mutational signatures, please. The problem is that you are mixing different cancer types, OK? And so the assumption they are making is that mutational signature one is going to be a given pattern no matter which tissue you're looking at. And one question I had was, well, that doesn't necessarily have to be true, right? Because I know that, say, smoking has the effect of increasing lung cancer, and it actually does also increase pancreatic cancer.
But the presence of the smoke in the lung I would expect to have a very different effect, in terms of carcinogenicity, than the much more indirect effect in the pancreas, right? And in fact, in the lungs the risk goes up like 20-fold or more. In the pancreas, a smoker has about twice the risk of a non-smoker. So that's why we thought of this. Oh, and also remember, talking about the random signatures: if you remember, what we did here is we said, well, let's just create random signatures, pick one that has the peak at C to T, which is the aging signature, and let's see how it performs in predicting old versus young people, OK? And we showed that besides the peak, the C to T mutation that was already known to be due to aging, there was not much else. But the thing is, we know these peaks only for a few carcinogens. If you are trying to do discovery, basically you have a long list of signatures with some peaks. By the way, not always peaks, because if you remember the plot I showed you, look at the signatures, right? Sometimes there are peaks, like in signature one. But one of the most important, in my opinion, which is an R signature as much as signature one is, is signature five. And I think even the authors agree today that five has an important role in aging too. Look how pretty flat it is. It's pretty uniformly distributed across the board, right? So it's a little bit difficult to interpret, in my opinion, and to know exactly what to do with it. It looks cool, but how useful is it really? And sometimes, as you know, methodologies become trendy, like anything else. There are trends and fashions in science too. So everyone is doing it because, well, because everyone is doing it. And then you pretend not to see the problems with it, and you go on. And then at some point, hopefully, someone improves it. So anyway, these are some of the concerns that motivated the study. And so here is what I thought.
What is it that we are interested in? We want to understand signatures for cancer etiology. And we want to be able to predict, in a patient where, say, we don't know about the exposure, how strong that exposure was. I think that is important for practical purposes. So let's use a metric for that. Let's look at prediction, for example AUC, which is what we use, and use that as the metric to develop a methodology that is supervised. So let's take advantage of the fact that we know who is a smoker in the sequencing data that we use. I don't know if you are familiar with it: it's called TCGA, The Cancer Genome Atlas. It's publicly available. And what we did was the standard approach, in the sense of having a training set and a test set and doing different folds. I think it was five rounds of three-fold cross-validation. But the main approach was the following, and I'll go through the steps, at least one of them in some detail. The first one was to decide what our features are, OK? We call it "context matters", and you'll see why. I would say this is the feature engineering step. And then we had a feature selection step, all right? And then, based on the selected features, we were doing prediction and seeing how it performs and so on. And this gave us the signatures. So let me go over this in some detail. But before doing that, let me give you some tips. Maybe they're completely worthless, but you get from me what I think is important, and you'll see it again later today. I think many people focus too little on the feature engineering part. In general, I mean: I'm talking about machine learning, I'm not talking about mutational signatures, OK? We know that you have features, and we have textbooks talking about the important problem of feature selection, right?
So you have many features. How do you pick the ones that you really want? And then we talk about learning and prediction and all kinds of metrics to decide how we are doing and what to minimize. I saw yesterday squared loss and things like that. OK, that's all cool. But you know what, if what you put in is no good, then no matter how good the method is, what comes out is not going to be very good, right? So in my opinion, even for people working in machine learning, my recommendation, which I hope you remember, is that feature engineering is the key to the whole approach, always, OK? I mean, unless you are doing something that has already been done a million times, so everyone knows what to do there. But if you are in a relatively new field and you are trying to learn new biology, feature engineering is where I spend 90% of my time. In fact, you will see today that at the end of the day the methodology was very simple, OK? So how do we do feature engineering here? Let me first start describing the tree, this tree here on the left. We take any patient, OK? We do sequencing, whole genome or whole exome sequencing, using any mutation caller you want. For now, it doesn't really matter. And this is bulk sequencing, OK? Now, when we do that, I think you have learned by now, or you probably already knew way before coming to the school, that you come out with a list of somatic or germline mutations. In this case they are somatic mutations, all right? So mutations that have accumulated in a patient after birth, specifically in this case in the cancer of that patient. So there is a total number of those, whatever the algorithm gives us at the end of the whole process of sequencing. Now, as we discussed yesterday, you always have C paired with G and T paired with A.
So when we think about the point mutations of a single nucleotide that can occur, there are only six types, OK? We can think about C's and T's, and then the other way about G's and A's by symmetry. But essentially, it's always a C becoming either an A, a G, or a T, or a T becoming an A, a C, or a G. So this is, if you want, the first level, and it gives you some information on the mutation type. But then you can say, well, I want to add some context, right? Maybe it matters what the flanking bases are. So how about the C to T mutation? When I look at the C that became a T, I can ask, what is the next nucleotide to the right? Well, that can be an A, a C, a G, or a T. Those are the four, because there are four letters: T, C, G, and A. Similarly, I can say, forget the flanking base on the right, what about the one on the left? Well, it can be, again, an A, a C, a G, or a T. So this gives me a bit more context about the mutation. And now I can play this game again and say, well, let's consider this C that becomes a T and is followed by an A. What about the letter on the left? It can be an A, a C, a G, or a T. So now you have the triplets. This gives you the 96 types that we saw already yesterday, the total. But this methodology is actually flexible. We can go down further: we can ask for two bases on each side, or three bases on each side, and sometimes it really matters. OK. And so this is, in a sense, the space of all potential features. And then you have to decide which ones you are going to consider. Well, what we did here, again, is pretty simple. So, for example, let's say a C is changed to something else, right?
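The levels of the context tree described above can be enumerated in a few lines. This is only a sketch: the function name and the bracket notation such as `A[C>T]G` are choices made for this example (the notation is a common way of writing these types), not something taken from the slides.

```python
from itertools import product

BASES = "ACGT"
# The six substitution classes, written from the C or T side.
SUBSTITUTIONS = [("C", "A"), ("C", "G"), ("C", "T"),
                 ("T", "A"), ("T", "C"), ("T", "G")]

def context_features(n_left, n_right):
    """All mutation types with n_left bases of context on the left
    and n_right bases of context on the right of the mutated base."""
    feats = []
    for ref, alt in SUBSTITUTIONS:
        for left in product(BASES, repeat=n_left):
            for right in product(BASES, repeat=n_right):
                feats.append(f"{''.join(left)}[{ref}>{alt}]{''.join(right)}")
    return feats

level0 = context_features(0, 0)      # the six basic types
right_only = context_features(0, 1)  # add the base on the right
left_only = context_features(1, 0)   # add the base on the left
triplets = context_features(1, 1)    # one flanking base on each side
wider = context_features(2, 2)       # two flanking bases on each side
```

Going one level deeper multiplies the number of candidate features by four for each added base, which is why a selection step is needed afterwards.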
And you say, well, let's assume for simplicity here that everything is equally probable, which, by the way, is basically true. So the C becomes an A, a G, or a T, and there will be a one-third probability that this happens, one third that this happens, and one third that this happens. This will be the null, right? If this is a completely random phenomenon, this is what you would expect. If I take a bunch of mutations where a C changed to something else, I would have a third, a third, and a third. And similarly here for T. So the null is that you take the total mutation count in that patient, you look at whatever the frequency of C is, and then that product times a third gives me what I expect for, say, the C to T mutation, yeah? This is the null hypothesis, under purely random conditions. So then you can do a simple test, in this case a one-sided binomial test, where you ask, do I observe this with a frequency that is higher than what would be expected by pure chance? And if I do, I select it. These are still not the features that we're going to use for the signatures. This is just the first step. But this thing is then considered important and goes to the next step, right? If instead, say, the C to G is observed exactly as you would expect under completely random conditions, and we are looking at smoking, then it means that smoking doesn't seem to have anything to say about that particular mutation. Is it clear? Ask me if it's not. All right, so then you can say, well, let's play this game again. Now, given that there is a C to T mutation... I wonder about my microphone, because sometimes I hear it, sometimes I don't. I apologize. OK, so I'll repeat this one more time.
So we are trying to figure out what, say, smoking, or any carcinogen really, or just any factor related to cancer, does in terms of mutation frequencies, and to which types of mutations. From a statistical point of view, the normal way to think about it would be, well, let's think about what is expected by pure chance, and that would be my null hypothesis. If that is what I see, then there is nothing special about that particular event, right? But if I see some relatively large deviation from that, then it means that there is something that smoking, for example, is doing to that particular mutation type. And the simple way to do that, for example to look at C to T's given the total number of mutations, is to ask, how many C to T's would you expect? Say a patient has 150 mutations in total, OK? And I have six mutation types. If they were equally probable, I would expect to have 25 in each bucket, in each of these types, right? So I take the total number of mutations and I look at the normal frequency. For example, say C's are one-half of all the mutations under pure chance. The base that's going to be mutated is either a C or a T, and the proportion of C's and T's in the human genome is about one-half and one-half, so the normal frequency of C would be one-half. So the total, 150 mutations, times one-half is 75. And then, because the C can mutate and become an A, a G, or a T, meaning there are three possibilities, if those are equally probable I take one-third, one-third, and one-third as the probabilities for those three cases. So the expected number of C to T mutations is given by this formula right here: it's the total mutation count, times the normal frequency of C, times one-third. And that gives me the count under the null. That's the frequency there, OK?
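Written out, the null expectation just described looks as follows. The symbols are notation chosen for this note, not taken from the slides: N is the patient's total mutation count and f_C is the background frequency of C among the mutated bases.

```latex
\[
  \mathbb{E}\bigl[n_{C \to T}\bigr] \;=\; N \cdot f_C \cdot \tfrac{1}{3},
  \qquad\text{e.g.}\qquad
  150 \cdot \tfrac{1}{2} \cdot \tfrac{1}{3} \;=\; 25 .
\]
```

Equivalently, under the null each mutation is a C to T with probability f_C / 3 = 1/6, which is the success probability that enters the binomial test.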
So now I can run a binomial test for the number of mutations that I actually observe, the C to T mutations that I observe in a patient, right? For example, if I observe 150 mutations in a patient, but this patient has 100 C to T's, I say, look, this is definitely not by chance. There is something special about C to T. It should be about 25, and I observe 100. Is that clear? OK, so then I can repeat this and go to the next level and calculate what you expect, again under pure chance conditions. And then you can run a binomial test to see how far you are from that. And anything that is significantly greater than expected, you keep. You say, OK, I want to keep this guy under observation. So doing this, we go, as I say, down the tree. I think it's clear why. Now notice that when we get to the second level, I have to run two tests. For this mutation, C becoming a T followed by an A, I want to test whether it's significantly different from what I would expect given the number of C to T's that I have, as well as given the total number of mutations. And I keep doing this. So if I go down to the third level, I test it against the parents and the grandparents, all the relevant ones. And why do I do this as I go down the tree? Why don't I just test this against the total number of mutations? Well, because let's say I already decided that C to T is special. If C to T is special, this one may also be much higher than normal just because it's a C to T. There would be nothing special about the fact that it's followed by an A, right? So what I'm testing here is, given, for example, that there were a lot of C to T's, so we know C to T is special, is there something even more special about C to T followed by an A?
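The first-level test and the two tests at the second level can be sketched as below. This is an illustration under stated assumptions rather than the procedure as implemented in the paper: the tail probability is computed directly from the binomial formula, the four possible flanking bases are taken as equally probable under the null, and the count of 30 "C to T followed by A" mutations is invented for the example.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """One-sided p-value P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

total = 150          # all mutations in the patient
n_ct = 100           # observed C to T mutations
n_ct_then_a = 30     # observed C to T followed by an A (invented number)

p_c = 0.5            # background frequency of C, as in the lecture example
p_ct = p_c * (1 / 3)           # null probability of C to T: 1/6
p_flank = 1 / 4                # assumption: flanking bases equally probable

# Level 1: is C to T more frequent than chance, given the total count?
pval_ct = binom_upper_tail(n_ct, total, p_ct)

# Level 2, test against the grandparent (total number of mutations).
pval_child_vs_total = binom_upper_tail(n_ct_then_a, total, p_ct * p_flank)

# Level 2, test against the parent (number of C to T mutations).
pval_child_vs_parent = binom_upper_tail(n_ct_then_a, n_ct, p_flank)

alpha = 0.05  # significance threshold, an assumption for this sketch
keep_ct = pval_ct < alpha
keep_child = pval_child_vs_total < alpha and pval_child_vs_parent < alpha
```

With these numbers the child looks extreme against the total count, but not once the excess of C to T is taken into account, so only the parent would be kept under observation.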
Or is it that, once we control for the fact that C to T is special, that already accounts for why I see a lot more of them followed by an A? Do you see the point? Is that clear? So I do this, and as I said, we go down the tree. And then we do the opposite. Once we have selected, we test going up the tree. Because here we are testing whether this guy is special given that we already know that the parent of this guy was special, right? And we do that process. But once we have reached the bottom, whatever the bottom is, we want to ask, was that parent really special, or was it just because of the child of that parent that the parent seemed to be special? You see what I mean? Was this guy really the special one that brought a lot of mutations in, or was it the child that did that, and the parent looked good just because the child was doing all the work for that parent? So you need to go up the tree and do the test also in that direction, OK? So this is how we selected the features. And then the rest was actually very simple, because all we did is rank the features based on AUC. What we were predicting depended on the factor: smoking, age, and so on. We were looking at AUC as the metric, OK? So then we ranked the features, and that allowed us, in training, to select the features that were actually used to do the final prediction. And once you have the selected ones, you can build the probability distribution, which is going to be the signature, right? Once you know which ones are the key features, you look just at them and see what the probability of each of those features is.
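The ranking step and the construction of a signature from the selected features can be sketched as follows. The data are invented, and two details are assumptions made only for this example: each feature is scored on its own by using its count as the predictor, and the signature is obtained by normalizing the counts of the selected features in the exposed patients. The exact choices in the paper may differ.

```python
def auc(scores, labels):
    """AUC as the probability that a random exposed patient scores higher
    than a random unexposed one (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rank_features(counts, labels):
    """Sort feature names by their individual AUC, best first."""
    return sorted(counts, key=lambda f: auc(counts[f], labels), reverse=True)

def signature_from_features(counts, labels, selected):
    """Probability distribution over the selected features only,
    estimated from the exposed (label 1) patients."""
    totals = {f: sum(c for c, y in zip(counts[f], labels) if y == 1)
              for f in selected}
    grand_total = sum(totals.values())
    return {f: totals[f] / grand_total for f in selected}

# Invented toy data: per-patient counts for three candidate features,
# with labels 1 = exposed (e.g. smoker) and 0 = unexposed.
labels = [1, 1, 1, 0, 0, 0]
counts = {
    "[C>A]": [20, 25, 18, 3, 4, 2],
    "[C>T]": [10, 12, 9, 11, 10, 12],
    "[T>C]": [5, 4, 6, 5, 4, 6],
}

ranking = rank_features(counts, labels)
top_k = 2  # in practice chosen in training, by looking at prediction
signature = signature_from_features(counts, labels, ranking[:top_k])
```

Because only the top-ranked features are kept, the resulting signature puts mass on a handful of mutation types instead of spreading it over all 96.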
Notice one key difference between this approach and the previous one: in the previous one, the distribution is over all 96 types, OK? Here we thought, there is so much noise in this data that it's kind of nonsense to try to give a value to all 96 types in a signature. Within a signature there is probably one peak or two, in general, or often. So let's focus on the key features, the ones that are really bringing the signal, and forget about the rest, because it's just noise and it makes things worse, not better. So I'll show you some examples of what we ended up with. For example, on the left side, these are all signatures for age. As you can see, in general it's few peaks, even just one in breast cancer, OK? But always few peaks. Even in the case where we have the most features, we are talking about seven of them. By the way, the number of features selected at the end was determined in training, right? You come up with the optimal n just by looking at how they were performing in terms of prediction. So that's for age. And what is interesting about this is that in the previous work there is one signature for age, but what I'm showing you here is that different tissues have different signatures. So tissues like to make different mistakes under regular conditions, just from aging, OK? And the same holds when we look at other factors. Here you see, for example, a signature for alcohol in a tissue, and BRCA, and so on. But look at smoking. This is the smoking signature in lung. It's all green because it's all C to A type mutations. By the way, we knew that smoking likes to make C to A mutations. And then look at head and neck. Very different, right? I would say blood and head and neck are much more similar to each other than to lung. OK.
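One standard way to put a number on how similar or different two signatures are is cosine similarity between them viewed as vectors. Here is a minimal sketch; the three signatures below are invented purely for illustration and are not the ones on the slide.

```python
from math import sqrt

def cosine_similarity(sig_a, sig_b):
    """Cosine of the angle between two signatures given as
    dicts mapping mutation type -> probability (missing types count as 0)."""
    keys = set(sig_a) | set(sig_b)
    dot = sum(sig_a.get(k, 0.0) * sig_b.get(k, 0.0) for k in keys)
    norm_a = sqrt(sum(v * v for v in sig_a.values()))
    norm_b = sqrt(sum(v * v for v in sig_b.values()))
    return dot / (norm_a * norm_b)

# Invented sparse signatures for one exposure in three tissues.
tissue_1 = {"[C>A]": 1.0}
tissue_2 = {"[C>T]": 0.6, "[C>G]": 0.4}
tissue_3 = {"[C>T]": 0.7, "[C>G]": 0.3}

sim_1_2 = cosine_similarity(tissue_1, tissue_2)  # no shared peaks
sim_2_3 = cosine_similarity(tissue_2, tissue_3)  # same peaks, similar weights
```

A value near 1 means the two signatures put their mass on the same mutation types in similar proportions; a value of 0 means they share no peaks at all.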
And then you can use the signatures, as you were asking, to assess how much of each one of the signatures is present in a patient, OK? Once you have the signatures, if you think about the matrix of signatures, then you basically can do non-negative linear regression, and that spits out your prediction. Doing that, we made an estimate, and this is a completely different methodology from what I showed you yesterday. We came out with a very similar answer to what we had yesterday, which is that about 70% of all the mutations found across cancer patients of all cancer types are attributable to basically just R, the aging component, OK? So that's cool, because now we can look at a given patient and say, here is what I see for you, right? And of course, as I was saying before... well, let me ask. How many in this room are interested in biology? Because I spent a lot of time talking about biology. Maybe I'm killing the others, OK? So maybe just for you, and maybe for some of those online, let me mention one very interesting thing that came out of this work. When we look at the signatures and how close they are to each other, and you can think about cosine similarity among the vectors and things like that, what we discovered is this. There were exceptions: for example, smoking in lung cancer has a very specific signature. But there, the smoke is right next to the tissue, OK? In general, even for smoking, when you look at tissues that are not in direct contact with the exposure, what happens is that the effect of the environmental factor has a signature similar to what the R factor, the aging signature of that tissue, has normally. So, translating that into simple words, the suggestion is that often a carcinogen is just inducing inflammation and cell death. So it will kill the cell. Now a new cell needs to be produced.
And in that cell division, the mistakes made, the mutations that accumulate, look like what that tissue always does: the same type of signature. It's just the signature in that tissue for the aging of that tissue, OK? So often there is nothing too special. It's just that the environmental factor causes cell death and therefore more of those mutations. So there are lots of similarities with the tissue of origin. That's why we think it's fundamental, when you look at mutational signatures, to think in terms of tissue-specific signatures. And finally, we found a mutational signature for obesity, OK? Which was very important at the time, because many people thought, and in many ways still think today, that obesity has nothing to do with DNA. And instead, in some tissues, we could predict an obese person just based on the DNA, all right? Now someone may say, OK, who cares that you can predict an obese person from a signature in the DNA? I just put that person on the scale and I can tell you right away. I may not even need to put them on the scale, OK? That's fine, and maybe true. But think now about many other factors, even very simple factors like smoking. If I call a patient and I say, are you a smoker? If that patient is honest, they'll say yes, right? But now I start asking, how many packs per day? How many cigarettes? It already gets very sketchy, because now this person has to remember over the years. So here what you have is something that ideally, when it's working properly, is actually looking at what the smoking caused in terms of what is recorded on the DNA, right? So I have a quantification of the effect of smoking rather than a memory of how much I smoked. Some of us smoked a lot and were maybe lucky, and the DNA was not damaged that much, right?
So even for factors that we know a person has been exposed to, it's still actually quite important if we can look at what happened to the DNA of that patient, OK? Yes. Oh, yes, the microphone.
I did not quite understand how you associate each of the signatures with an etiology. Do you, at the beginning, divide your subjects into smokers and non-smokers and then do the tests?
Perfect, yes. That's very important. Sorry, I skipped saying that because, by saying that this is a supervised methodology, that's basically what it means, right? I'm training the algorithm by creating labels. So if I want a smoking signature, I'm going to have a set of smokers and a set of non-smokers, and I learn what is different. That's how we do it. Oh, you have another question?
It's related. So two questions. I'm wondering how this was done in the unsupervised way, which was initially the case. And whether, using the supervised approach, you could overcome the pitfall of all patients having a smoking signature.
Perfect. Great question. And I'll show you now. So for the first one, here you have a comparison of the performance, in terms of AUC, of the supervised methodology versus the unsupervised one, OK? And as expected, the supervised methodology does better than the unsupervised one. And for the other question, yes. Thanks to this feature engineering approach, we drastically reduced the space of the features to only a few peaks, sometimes even just one or two. So we essentially eliminated the problem of finding every signature in every patient in some quantity, as I showed you. So yes, that was the key to achieving that. And let me say one more thing. This is a package, and if anyone is interested, it's very easy to use.
And the data is there, so you can play with that. I highly recommend that, of course. But let me just say one thing: you know, there is only one limitation of the supervised approach. So if I had to play the devil's advocate, I would say, OK, this is very cool. But you know what, OK, if I'm not a very good lawyer, I would say, well, what about, you know, annotation is often not very good. And so things are mislabeled and all of that. OK, to that, I'll just say, look, moving forward, right, data quality will become better and better, technology is going to become better and better. I don't think that's a reason not to go the supervised way. I think actually the supervised approach should just improve even further with better quality of data. But I think the most serious criticism is the following. What if you don't have a training set, right? So, as you were asking, to train this methodology I need to have a set of smokers and a set of non-smokers. But what if I don't? What if I don't know? I may not even know what this exposure is, right? The unsupervised approach has the advantage that, in theory, it gives you something. You may not know what it is, but you get something, OK? And actually, I agree with that. That is a limitation of any supervised approach. To do supervision, you have to have labels. But what we show in this paper, and I don't have time here to show you, but let me just mention it so that if you're interested, you can look at it, is that even in that case you are better off by first learning the supervised part. OK, say, for example, I have a set of patients. We'll assign, for each patient, how much of an effect aging had, how much of an effect smoking had, OK, and so on. And we remove those effects for which we had supervision, which we were able to learn with labels, OK? Take those out of the total load of mutations.
Now on the leftover, do the unsupervised approach and see if you see anything else, OK? So we propose this, basically, a mixture of two methodologies. I don't remember if we call this semi-supervised, right? Which basically allows you to take care of, and by the way, when I say labels, the supervised method, once trained, can be applied to a patient about whom we don't know anything, right? So now a new patient comes in, and we have signatures for, say, smoking, alcohol, and obesity. We have more, like, I think 50, but let's say we have those three, right? So let's evaluate those three and see how much they account for out of the total found in that patient. And then, if we want to see whether anything else is possibly there, take those out and now apply the unsupervised approach to the leftover. And we show that when you do that, on the unsupervised part, you get a much clearer signal, OK? What we did is we pretended that we didn't know about one of the carcinogens for which we have labels. And we said, OK, let's pretend we didn't know, right? And let's just go with NMF and try to find the peak, versus, let's first take age and smoking out, and now see if we can find this. I don't remember what it was, if it was sunlight or something, OK? And indeed, once you remove them, the leftover becomes a lot cleaner, OK? And sorry, I know that, in a sense, this may require a lot more slides to go through all the details, but I just wanted to show you several examples. OK, so actually, we're even later today than I thought. So let me try to go faster. Yeah, please, please.
Very soon from the last approach to remembering other things. Yes, thank you. One idea comes to mind that I want to share with you. After you factorize this matrix, you have the signatures. How would it be if we treat these signatures as random variables and try to build a joint distribution based on data? So, for example, a Bayesian network.
And then, with enough data, you learn this dependency between the different signatures. So at least it is not under the assumption that one signature is related to one cancer. So we can find the different signatures. And then, when we have a new patient, we can use that model to predict. And a probability would be better than a 0 or 1. So we get rid of this hard engineering and also range.
Yeah, yeah. So we thought about the dependency and we left it aside to finish our work. But my answer is certainly yes, it's a very good thing to do. Because, as I understand these signatures, there are dependencies among them, right? Or we expect them to be related to each other in some ways. So instead of treating them, well, I guess NMF does that in some way. If we're talking about the NMF approach, right? In some way, it's considering everything together and then breaking it down into different parts. But I think an approach where you start from the beginning, right, in a more mathematical way, by assuming random variables for which you learn the joint distribution, lets you learn the dependencies among these effects in a more proper way. So yes, definitely a very good idea, and the work has not been done yet. So OK, I'll switch now. Tomorrow we'll do, in a sense, from a practical point of view, yeah, from a patient point of view, maybe the most important thing to work on right now, in my opinion. But at the base, you always have to have a model to understand how cancer occurs. And so I'll show you here some work that we've done. And maybe let me, I think, let me see, OK, let me just, I'll try to be quick. And again, my suggestion is don't worry too much about the details. This is really to give you some mental pictures and some ideas of what you can do, OK, with the various tools that I'm showing.
So modeling cancer or tumor evolution means understanding how you go from a healthy tissue to cancer, OK, and in fact, even from when a healthy tissue is developing all the way to cancer. And one thing that was never really done, and that's why we call it multi-phase, is this. Often modelers focus on the point where you have the first cancer cell, and you model the explosion of this cancer growing and being detected and maybe metastasizing. But the question is, what about the phase before, where you have the healthy tissue that's accumulating these dangerous mutations all the way to cancer? And in fact, even before birth, as I said, OK. And then you want to do this not just for one cell, of course, but you want to do it for an organ, so for the whole set of cells. And so initially, we started with this paper, which came out in 2020. The simplest way to do this was with simulations, because it was too complicated at that time to do it with formulas. Well, it wasn't too complicated, but it was work that we started in parallel and it was taking longer. Actually, Sophie is here, and I'll show you in a second what we did in terms of analytical formulas. Simulations are faster to deal with. So we wanted to get some results and some intuitions. So what we assume is you have a cell that is alive, and a cell can die, of course, with some rate, or divide. And if a cell divides, we assume three types of division. I'll show you in a second what those are. And then different types of mutations, which I'll also show you in a second. And now we want to do this for every cell in an organ. We're talking, say, about a billion cells, and across the lifetime of a person. And so just to show you, I talked about different types of division. Stem cells, you can think about stem cells as the cells that are the engine of a tissue. They are the only cells that can make a copy of themselves.
All the other cells, when they divide, the daughter cells are more differentiated; basically they are getting closer and closer to the final function that the cell has to provide. The stem cells are multipotent. They can become, in a sense, anything they want. And they are very long-lived. In fact, because they can make a copy of themselves, their progeny, their lineage, is always alive in us, right? In general, unless they die. The differentiated cells, instead, are short-lived: they provide the function and then they are gone. So we focus on stem cells. And here we are showing the different events. The first one is, so with some division rate b, the birth rate, cells are born. And with probability p, these divisions produce a stem cell and a more differentiated one, or progenitor. Okay, so we call this asymmetric division because the two daughters are not the same. And then, with probability one minus p, we obtain this other type of division where the two daughters are both stem cells, okay? And we call this symmetric self-renewal. And then you have the possibility of a stem cell producing two differentiated cells. And this is actually equivalent to a death for a stem cell, because the stem cell is replaced by two cells, neither of which is a stem cell. And then the last case, of course, is death, okay? So this is what the cells can do. And then mutations can occur and, I know you know about driver mutations, these are the bad guys. And what we did for the first time in this paper is the following. Usually in mathematical modeling, you just assume that there is a bad mutation and there is some probability of being hit by a bad mutation. But we wanted to do a little bit better than that and said, well, you know, the mutations can be of all kinds, but there is some consensus that there are three major classes of bad mutations, okay? And here are the classes, and it's very easy to cast them in mathematical terms.
There is one class, which is called cell survival, which increases the speed of cell division, okay? So you get hit by one of these and cells start dividing faster, okay? So the rate of division, b, goes up, or the death rate goes down, okay? So this is a class that we call cell survival or CS mutations. And then there are other types of mutations that belong to a group which is called cell fate. And what those do, referring back to the figure I just showed you, is increase the probability of this happening versus that, okay? So as you can imagine, you want to maintain things in balance, and stem cells, as I told you, are the engine of a tissue. You want this event to happen, in general, only when there is a death, right? So think about the group of stem cells. If I want to keep things in equilibrium, I stay like that until one of us, one of the stem cells, dies. At that point you want one stem cell to create two daughter stem cells, so you want that. But only when you have a death event do you want one of these, right? In equilibrium. So now imagine that something like a driver mutation increases, sorry, decreases the probability p. So basically it increases this event beyond what is normal. Now you start having uncontrolled growth, right? So this is not about frequency of division. You know, say for example, instead of dividing every month, now it's dividing every 20 days. It has nothing to do with that. This is about what the division produces, okay? That's why it's called cell fate. So that's the second class. And the final class, the GM type of mutation, genome maintenance, are mutations that hit the repair mechanism genes, basically. So these are mutations that cause the duplication of DNA to go wrong more easily, okay? Or to be repaired not as well, okay? So you can put driver mutations in one of these three buckets, basically. And we want to take into account the effect, okay?
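To make the three buckets concrete, here is a minimal sketch of how each class could act on the stem-cell parameters. It is only an illustration, not the model from the paper: the names `StemCellParams` and `apply_driver` and the effect sizes are hypothetical, and it assumes that symmetric differentiation is folded into the death rate d, so that the net stem-cell growth rate is b(1 - p) - d.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StemCellParams:
    b: float  # division (birth) rate per stem cell
    d: float  # death rate (assumed here to also absorb symmetric differentiation)
    p: float  # probability that a division is asymmetric
    u: float  # probability of acquiring a driver mutation per division

    def net_growth_rate(self) -> float:
        # Only symmetric self-renewal (probability 1 - p) adds a stem cell.
        return self.b * (1.0 - self.p) - self.d


def apply_driver(params: StemCellParams, driver_class: str,
                 b_factor: float = 1.5, p_factor: float = 0.9,
                 u_factor: float = 10.0) -> StemCellParams:
    """Return new parameters after one driver hit (effect sizes are made up)."""
    if driver_class == "CS":   # cell survival: divide faster (delta b > 0)
        return replace(params, b=params.b * b_factor)
    if driver_class == "CF":   # cell fate: fewer asymmetric divisions (delta p < 0)
        return replace(params, p=params.p * p_factor)
    if driver_class == "GM":   # genome maintenance: more mutations (delta u > 0)
        return replace(params, u=params.u * u_factor)
    raise ValueError(f"unknown driver class: {driver_class}")


# A tissue in equilibrium: symmetric self-renewal exactly balances death.
healthy = StemCellParams(b=1.0, d=0.1, p=0.9, u=1e-6)
for cls in ("CS", "CF", "GM"):
    hit = apply_driver(healthy, cls)
    print(cls, round(hit.net_growth_rate(), 6), hit.u)
```

In this sketch a CS or CF hit makes the net growth rate positive, while a GM hit leaves growth unchanged and only raises the chance of the next driver.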
And also we wanted to take into account that there is a carrying capacity. You will be shocked by how much of the mathematical modeling field, modeling tumor evolution, assumes exponential growth, okay? Which is completely ridiculous, right? But in math, the exponential function is easy to deal with, to do integration, right? But okay, you understand that obviously that cannot be the case. So now again, I did it too. And if you're modeling a cancer that's exploding, that may be okay. If you say, well, I'm modeling it as exponential growth just because I care only up to when it becomes so big that the cancer is detected, then it's going to be fine. But if you're modeling, for example, a growth that is pre-cancer, okay? Then it's definitely not realistic, right? So it always depends on what you are trying to do. Okay, so let me skip this because we don't have time. Let me just say that with that model, which I don't discuss here, though I'll now give you more details of an analytical model that does the same thing, we were able to reproduce the cancer incidence of different cancer types. So what we did is, we did simulations. So we simulated, right, each cell, billions of cells in an organ of an individual, and then we simulated, say, 10,000 individuals and just went through life with them to see how much cancer we would get. And we did that for colon and we fit the data, okay? And the fit is, by the way, not fantastic, I would say, because I actually didn't care for the fitting to be perfect. That was not the point, okay? As you all know, it's easy to fit stuff, okay? So there's nothing too impressive in fitting an incidence curve with a mathematical model. But what was important, what we wanted to see, is, let's say now we change only two or three parameters. And instead of modeling colon, now we want to model, say, pancreas, okay?
So what this means is, in pancreas, what we changed was, well, I'll give you some sense of what we changed. For example, we changed the total number of cells, okay? Pancreas has a different number of stem cells than colon. And then we changed how often the cells divide. In colon, the cells divide every four days. It's completely renewed when you're young, okay? In pancreas, if I remember correctly, it's on the order of every eight months to a year, okay? So much slower. And also in colon, because of this crazy amount of division, it's almost always an asymmetric division. So p is huge in colon, okay? In pancreas, because pancreas does not divide that much, actually pretty often it's because of a stem cell dying, okay? So p is not as high. In pancreas, I think it's 0.5, 0.6, versus in colon it's like, I don't know, 95, 98%. Okay, so what was impressive, in my opinion, in this work was that when we did that, we were able to reproduce the pancreatic cancer incidence. And then we did it again with leukemia. And again, we were able to fit it. And then we did that with Lynch syndrome, where these are colorectal cancer patients who have a mutation rate that's 10 times higher than normal, and we were able to qualitatively fit their incidence. So what that means is that the model is capturing some basic ingredients that are general enough to give you at least a ballpark incidence curve. I mean, an incidence curve that's pretty close qualitatively to the real one. So here I'm giving you some sense of what the model did. And then I'll show you some of the math behind these types of models. But the other thing that was, in my opinion, striking in that model is the following. In general, in previous work, when you look at the events that take a person to cancer, they are presented as, and this is a technical term in the field, accelerating waves.
Actually, you are a physicist, so you can understand this obviously very well, right? And it's intuitive. I get hit by a driver mutation. Now I start having a clonal expansion that should not be there. So there is this uncontrolled growth. Then within that sub-population, I'm hit by a second driver mutation. Now this grows even faster, right? Because of the second acceleration. And so if you consider the slope of the line here, of this cone, as the speed, right? The speed is increasing. It's getting faster and faster. And so in, I would say, essentially all the previous work, if you asked when these events occur, you would say, well, if here is the time of cancer, a lot of them occurred pretty recently. But you know what? This was a result of assuming exponential growth. If you assume exponential growth and you keep increasing the lambda, you're going to explode in terms of growth, right? Once you have a carrying capacity, what we found is actually different. So say before it was thought that the whole process took maybe 10 years, of which the last two events occurred in the last two or three years before the patient was diagnosed with cancer. What we found is that in many cases, we estimated that the first of the three mutations that took a person to colorectal cancer happened when this person was 15 years old. And the second, 20 years later or so, okay? So a much flatter distribution, no accelerating wave, right? Okay. So here is another example of how, by improving the assumptions of the model, you get a completely different result, right? And really different. Okay, and this, by the way, is why we did it in terms of simulations at the beginning, because it was complicated. And we wanted just to say, well, okay, let's just put into the model the ingredients that we know are important, that are necessary. And let's see what we observe, right?
Rather than say, I know how to solve this particular equation, let me just use it and go with that. Okay, so what we did is, I'll skip this one since we're late, what we did is we then, of course, decided to do it mathematically, because no mathematician likes simulations really, okay? If you have to, you have to, but if you can avoid it, it's great. So, okay, let me show you again what the basic idea here is, right? We have time. This actually was just published, about two weeks ago. So consider driver mutations hitting in time, right? Hitting the organ of a person. And let's say that, in this case, I'll use three driver mutations as the number of events that you need to get to cancer, okay? And so you have these times at which the mutations have occurred and have survived an initial growth that's very stochastic and may end up in extinction, okay? So do you want to take a look at them? And let's assume that we need three of them for simplicity. And as you now know, let's assume that the four events we can have for the cells are these four, which I already described. So symmetric self-renewal, asymmetric division, differentiation, and death. And here is a very important part of the model, which is to assume a carrying capacity C, okay? And now we didn't want to lose the stochastic part of the model, which is important, especially at the beginning of the clone, right? When you have few cells. So initially we model the process as a branching process, a birth-and-death process, which could go extinct. So you have a cell hit by a driver mutation, which gives some proliferation advantage, some fitness advantage. And now this can cause a growth, and the growth can then decline and the clone can go to extinction, or survive, okay? And we assume for simplicity that there is a survival size. Basically, once you reach a certain population size, the probability of going to extinction is almost zero. Okay?
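The survival idea above can be checked with a small sketch. It assumes the simplest case, a linear birth-and-death process with constant per-cell rates, which is a simplification and not necessarily the exact process in the paper; the function names and the numbers are hypothetical. For such a process, the chance that a clone started by one cell reaches a given size before dying out is the classical gambler's-ruin formula, and a direct simulation of the jump chain should agree with it.

```python
import random


def survival_probability(birth: float, death: float, threshold: int) -> float:
    """P(a clone started by one cell reaches `threshold` cells before extinction)."""
    if birth == death:
        return 1.0 / threshold
    ratio = death / birth
    return (1.0 - ratio) / (1.0 - ratio ** threshold)


def simulate_survival(birth: float, death: float, threshold: int,
                      trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    # In a linear birth-death process the next event is a birth with
    # probability birth / (birth + death), whatever the clone size is.
    p_up = birth / (birth + death)
    survived = 0
    for _ in range(trials):
        n = 1
        while 0 < n < threshold:
            n += 1 if rng.random() < p_up else -1
        survived += (n == threshold)
    return survived / trials


# Hypothetical clone whose effective birth rate is twice its death rate.
print(survival_probability(2.0, 1.0, 50))
print(simulate_survival(2.0, 1.0, 50))
```

With a fitness advantage (birth larger than death), the probability levels off quickly as the threshold grows, which is the sense in which a clone past the survival size is very unlikely to go extinct.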
So from that point on, and actually that was part of the trick, from that point on you can model it deterministically, with a standard differential equation, right? And there we use standard logistic growth, assuming a given carrying capacity, okay? And this threshold for survival was the carrying capacity times some small epsilon. Yes. If you have a question, yes.
Does the carrying capacity need to be the same for all clones?
Oh, that's a very good question. So first of all, carrying capacity depends on, or can depend on, the tissue, right? Different tissues may have different carrying capacities. But also for a particular cancer type, the carrying capacity can be affected by the drivers, okay? So it can actually be a function of them. Now, I'm showing here a simplified version of the model, because if I had to show the full model, we would have to stay probably a good five, six hours, I would say, of which the first 30 minutes would be just definitions, to make really sure that we are putting it down right. Okay, but all the details are in the paper, by the way, and Sophie is the first author, and Amaury Lambert, I'll show you again how it's written. So what is the, yeah, here is, oh, no, I thought we had, yeah, so as it's written there, Sophie Pénisson, who is in the room, and Amaury Lambert, who is, you know, a very well-known probabilist and applied mathematician in Paris. Yeah, in Paris, I think Paris, no, where is Amaury? I mean, Collège de France, Normale also, yeah. Okay, yes, so, okay, and then as I already discussed, we wanted to include the effect of the different types of fitness, okay? So cell survival, based on, you know, the letters here, cell survival increases the proliferation rate, right? So delta b is positive, and b increases. Cell fate increases, sorry, decreases p, right? So delta p is negative. Genome maintenance increases the probability of mutation. So delta u is positive, okay? So given that, we get this, let me see, how much time do we have?
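Going back to the deterministic phase mentioned above, here is a minimal sketch of standard logistic growth started at the survival threshold epsilon times C. It assumes the textbook form dn/dt = r n (1 - n/C) with a constant net growth rate r; the symbol r, the function names, and the numerical values are assumptions for illustration and are not taken from the paper.

```python
import math


def logistic_clone_size(t: float, r: float, C: float, eps: float) -> float:
    """Closed-form solution of dn/dt = r*n*(1 - n/C) with n(0) = eps*C."""
    a = (1.0 - eps) / eps
    return C / (1.0 + a * math.exp(-r * t))


def time_to_fraction(f: float, r: float, eps: float) -> float:
    """Time for the clone to grow from eps*C to a fraction f of C (eps < f < 1)."""
    return math.log((f / (1.0 - f)) * ((1.0 - eps) / eps)) / r


# Hypothetical values: 1e6 stem cells, threshold at 0.1% of capacity,
# net growth rate 0.5 per year.
C, eps, r = 1e6, 1e-3, 0.5
t_half = time_to_fraction(0.5, r, eps)
print(t_half, logistic_clone_size(t_half, r, C, eps))
```

Unlike exponential growth, the clone size here saturates at C, so a further increase in r shortens the time to fill the tissue but cannot make the clone grow without bound.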
Trying to decide if, yeah, I think we can, maybe let me just show the first slide here and then we'll start from there tomorrow. We're actually not too far from the end, so I think tomorrow we can finish. So let's go slowly now, because if I go fast, I just get you lost and then it's completely useless. And we tried to have here really just the kind of big-picture formulas that I think are actually pretty easy to understand once you think about them. Okay, so we are interested in understanding time to cancer, right? And we are modeling from conception. Okay, so this is even before birth. And, okay, well, let's start from here. First, we start with N, the number of required driver mutations. So I told you before, typically we pick N to be three, you know, three major events need to occur, but of course that number is variable. Usually it's between one and four. And when does cancer occur? Cancer occurs when one of the surviving clones, now why do I say surviving? Because if you remember in the previous figure, the first thing that has to happen for a clone to get to cancer is it has to reach this size epsilon C. If you don't reach it, you're going to go to extinction. Okay, so first you have to survive. That's why we talk about surviving clones. So the surviving clone has to have the form, you know, v1 to vN, where N is the number of drivers. So let's say it's three, right? We have v1, v2, v3, where each v just belongs to one of the three types of driver mutation I told you about, okay? So the mutations are either increasing cell survival, the CS mutations in the previous slides, or affecting cell fate, you know, switching toward more self-renewal, or genome maintenance, increasing the production of mutations. So each v belongs to one of these, and to get to cancer, right, if you need three of those, you can think about all the possible combinations that can yield cancer. And here, by the way, you can put conditions on which ones, right?
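The set of possible clone types v1 to vN can be enumerated directly. The sketch below assumes that the order of the drivers matters (an assumption for illustration) and uses a hypothetical helper, `driver_sequences`, with an optional condition that certain classes must appear at least once.

```python
from itertools import product

DRIVER_CLASSES = ("CS", "CF", "GM")  # cell survival, cell fate, genome maintenance


def driver_sequences(n: int, required: tuple = ()) -> list:
    """All ordered sequences (v1, ..., vn) of driver classes that contain
    every class listed in `required` at least once."""
    return [seq for seq in product(DRIVER_CLASSES, repeat=n)
            if all(cls in seq for cls in required)]


all_paths = driver_sequences(3)
constrained = driver_sequences(3, required=("CS", "CF"))
print(len(all_paths), len(constrained))
for seq in constrained:
    print(" -> ".join(seq))
```

With three drivers there are 27 ordered sequences in total, and 12 of them contain at least one cell-survival and one cell-fate driver, so a condition of that kind removes more than half of the possible paths.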
For example, in our case, we said for colon you need three, and you have to have at least one S and one F, okay? Why? Maybe because we know from data that that is the typical situation, okay? Or you may not force it, you may want to see, you know, what happens. For example, in the simulation model, I forgot to say something that was very cool. One thing that happened is in colon, yeah, actually, sorry, let me go back because that's an important thing. So in colon, do I have APC here? Yes, okay, okay. So in colon, when you look at the driver mutations, actually, I showed you some papers, I don't know if you saw Bert Vogelstein as a coauthor with me. You know, Bert Vogelstein, if you are in cancer biology, I mean, he's probably one of the top three cancer researchers in the world. He is the one that discovered, I would say, that cancer occurs as an accumulation of driver mutations, okay? And he made fundamental discoveries for APC and TP53, okay? In textbooks, they have the Vogelgram, which is basically a fancy word for the Vogelstein figure that shows how you go from healthy tissue to colorectal cancer, okay? And the standard way it happens, it doesn't always have to be like that, but the most typical is that you first get hit by an APC mutation, okay? Which is a cell fate mutation, and then you get hit by a KRAS mutation, which is a cell survival mutation, okay? Guess what? We simulated, I told you about the simulations, we simulated colorectal cancer. We didn't keep track of the sub, you know, the different types of mutations within a group, we just kept track of the class. And in fact, in colon, when we look at colorectal cancer patients in our simulations, and we ask what was the first mutation, okay? The first mutations were cell fate, okay? I told you that we did pancreas.
When we looked at pancreatic cancer patients, one thing that's kind of surprising is that in pancreas, the first mutation is actually always KRAS, or very often is KRAS, right? So you may ask, why, right? I mean, they are cancers, okay, in two different tissues, but why would it be like that? And we provided at least a piece of explanation, which is this. Now think about it, you're a physicist, right? I think physicists are the best, you know, better than mathematicians, definitely, at thinking about modeling and models, because that's a lot of what you do. So think you have a tissue, like colon, where, well, let me start with pancreas. Think we have a tissue, like pancreas, where cells divide once a year, okay? Well, guess what? Either I increase the heck out of that division rate, or I'm not going to get to cancer if I need three driver mutations to get to cancer, okay? So requirement number one for a tissue that divides very slowly is that you speed up the division rate. I don't care what kind of division, just give me more divisions, right? Well, that's the green group here, cell survival. And in fact, that's exactly what we see when we look at pancreatic cancer in the simulations: it would happen starting with that group. It had to be this, okay? You need to bring it down to divisions every few weeks, or the probability of getting these driver mutations is so small that you're not going to get to cancer in your lifetime in pancreas. But guess what? Colon actually has plenty of divisions, okay? The problem for colon is that the division is essentially always this one. Why? Because colon is organized in crypts, and at the base of the crypt there are the stem cells. And the stem cells push up the differentiated cells that provide the function of the lining of the colon. And just to give you a terrible picture, right? Here, this is the lining of the colon.
So here is where the food goes through. Well, not really the food, but you know what I mean. Okay, so here is the lining. Here are the differentiated cells. At the base here are the stem cells, okay? And every time they do the division with the green cell, they push a green cell up, okay? And the next time another green cell, so this one goes up and they push each other all the way here and then they die, they are done. Because, being exposed to the acid that's there, every four days, as I said, they're replaced. So plenty of division of this type; almost always, almost exclusively, it's this one, okay? Well, for cancer, all right? If I have a lot of divisions, then I don't care so much about a green, you know, the cell survival type of mutation. The most important is give me this. This is really going to make a difference, okay? So basically tissues with different dynamics need to be hit in different ways to get there, okay? All right, you know what? I'm realizing that's time, so I won't go over, and I think we just have a little bit more. So we'll continue tomorrow. Thank you.
Thank you very much. I think we have a few minutes for questions. Do we have any?
I think they, I like this class because they ask me questions during the lecture.
Right. That's certainly a pro. And the day has been long and tough for sure. So maybe we can conclude with a very naive question from my side. The model that you showed last, does it have any resemblance or connection with the compartment models for pandemic evolution? Like the SIR model, is there a relationship between them?
Yes, so the answer is, you know, when you said that, I remembered this textbook which I have on my shelf, which was one of my first textbooks on this material. You know, these birth-and-death processes, these stochastic processes, are at the base of both epidemiological models like the one you just mentioned and branching processes. And for the survival phase of the model that I showed, until you get to this epsilon C, essentially you are using, you know, a very similar model. Yes. Yeah. So the answer is yes.
It's a very good thing, because I think the cross-fertilization between the two fields might be helpful in making use of models that have been developed on one side and that you can then use in a different context, I guess. Okay, so thank you very much again and congratulations.
Thank you.