Thank you. Okay, I hope the microphone is on. Yes, I can hear myself, so it's working. All right, so we need to finish the modeling of cancer evolution. I see that this morning you did some modeling of the evolution of cancer in sequencing data. So that's good. MRCA, that's good. So Julio showed you some of that.

I had started going over something that we just published, with Maria and Sophie. Sophie is first author, and she's sitting in the room. I think I described the basic idea of what we model every time you get hit by a driver mutation. The idea in the model is to consider a stochastic phase until the clone hits a size epsilon C, where C is the carrying capacity. So there is some threshold, epsilon C, and if the clone gets to at least that size, then you know that its probability of survival is extremely high. From there you can model the rest of the growth in a deterministic way, for simplicity. So: stochastic for the first phase, deterministic once the clone goes over epsilon C. The deterministic phase is logistic growth. The stochastic one is a birth-death process. Okay. And, as I said, I don't want to do epsilons and deltas today. This is mainly to give you a flavor of what you can do and what the models look like.

I also talked to you about the three different types of driver mutations, with three different effects on cell division and mutation rates. So let me give you a little bit more detail. The purpose is just to give you a sense of the modeling. If you consider the occurrence of cancer, we can assume that there is a number n of required driver mutations that you have to acquire in order to get to cancer.
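As a rough illustration of the two-phase idea described above (stochastic growth until the clone reaches epsilon C, deterministic logistic growth afterwards), here is a minimal Python sketch. It is not the model from the paper: the function names, the choice of a simple linear birth-death process for the stochastic phase, and all parameter values are assumptions made only for illustration.

```python
import math
import random


def stochastic_phase(birth, death, threshold, rng, max_events=1_000_000):
    """Simulate a linear birth-death process started from a single cell.

    Returns the time at which the clone first reaches `threshold` cells,
    or None if it goes extinct (or the event budget runs out) first.
    """
    size, t = 1, 0.0
    for _ in range(max_events):
        if size >= threshold:
            return t
        if size == 0:
            return None
        # Each cell divides at rate `birth` and dies at rate `death`,
        # so the waiting time to the next event is exponential.
        t += rng.expovariate(size * (birth + death))
        if rng.random() < birth / (birth + death):
            size += 1
        else:
            size -= 1
    return None


def logistic_size(t, x0, rate, capacity):
    """Closed-form logistic growth starting from size x0 at time 0."""
    return capacity / (1.0 + (capacity / x0 - 1.0) * math.exp(-rate * t))


# Illustrative (made-up) parameters: carrying capacity C and threshold epsilon * C.
rng = random.Random(0)
capacity, epsilon = 10_000, 0.01
threshold = int(epsilon * capacity)
birth, death = 1.0, 0.9

t_hit = stochastic_phase(birth, death, threshold, rng)
if t_hit is None:
    print("clone went extinct during the stochastic phase")
else:
    # Past the threshold, switch to the deterministic description.
    later = logistic_size(20.0, x0=threshold, rate=birth - death, capacity=capacity)
    print(f"reached {threshold} cells at t={t_hit:.2f}; size 20 time units later: {later:.0f}")
```

Most simulated clones die out early, which is the reason for treating the first phase stochastically; once a clone is past the threshold, the deterministic curve is a reasonable simplification.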
I told you that typically this number is between one and four; often it is three for solid tumors and two for liquid tumors. Now we take this requirement, so in a cancer you need to get these clones, surviving clones, where by surviving I mean hitting that epsilon C size. Each clone is made of cells that contain a driver mutation belonging to one of three types, S, F, or M. These are the three types I showed you: the mutation is either affecting cell survival, or cell fate, or increasing the mutation rate, so genome maintenance.

Now look at the timeline: from conception, then birth, and then at some point, at some time s, you get this surviving clone. Then you look all the way to time t, and you ask: what is the probability that the time at which you get to cancer, which is the time at which you have all the required driver mutations, so the time by which all these multiple hits have occurred, happens by time lowercase t? So by some time t. All right. So T sub n is the time of cancer; lowercase t is just time.

You can show that this can be expressed mathematically. You can go through the derivation and write out the expression if you want to. It becomes one minus the exponential of minus this integral, which I think makes sense if you have seen any type of stochastic modeling. If you have any exponentially distributed type of event, this would be the rate, but here instead of a simple constant rate we have an integral, where the key part is this lambda function. And what is the lambda function? Lambda is lambda of s, where s is the time at which you get the surviving clone.
Okay, so there are really two key parts in this expression. One is the rate at which these surviving clones appear. We call this nu. So nu of v1 at s means that at time s a surviving clone v1 appeared. Here v1 can be a clone of any of these three key driver mutation types. That's not enough, of course, to get to cancer. You just got that surviving clone, but we said that you need, for example, three driver mutations to get to cancer. So if you want, you can think of n equal to three here.

So once you have the surviving clone, the next thing you need is that that clone, which we now know is going to survive, acquires the other mutations. Given that that clone appeared, what is the probability that it will acquire all those extra mutations in time? Well, how much time do you have left? If you want the probability of getting cancer by time t, and the surviving clone appears at time s, you have t minus s left for this clone to acquire all the remaining ingredients required to get to cancer. That's this part of the formula. And now you sum over all the possible clones. That's why you have a sum over all the v1, where the requirement is that these clones are created by mutations of type S, F, or M. So it's actually pretty simple, I would say.

Just to give you a little bit more detail, let's focus for a second on this appearance rate for a clone. How does a surviving clone appear? What is the rate of that? And sorry, for some reason when we transferred the slides the colors got a little bit messed up, but it's okay.
So, first of all, you need to get a mutation. The appearance of a mutation is a function of the population size at that time, so that's X. The number of cells, basically. Then b is the division rate. So it's a function of how many cells you have and how often they divide. And then for every division, what is the probability that you get the mutation at that division? This is the probability per cell division of that particular mutation, if you're modeling a particular mutation type. Okay.

And then here is two minus p. That comes from the type of division. If you have this type of division, you get two cells that matter, so this is what you get there. But if you get this other type of division, only one of the daughter cells is going to be important to us, to keep track of. So here you get two, here you get one. When you put the two together and do the arithmetic, you get this two minus p. Okay, so that's that part.

And then, if you get hit by a mutation, which mutation type is it? That is what pi is there for. As I said, you have three types, S, F, and M. Now that a particular mutation type has hit the cell, you look at the probability, and this rho is the probability that this clone will survive, that it will reach size epsilon C. Okay. And you see there is a z here: this is the population size at time z, which is before s. That's where the mutation occurs in this figure. So there is this latency period from z to s, which is the time it takes for the mutation to create a clone that reaches the surviving size. Okay. So that's just the explanation of these terms.
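For readers without the slides, the pieces described so far can be collected schematically as follows. This is a reconstruction from the spoken description only, not the exact formula from the paper (which has more detail in its supplement). The symbols follow the lecture: T_n is the time of cancer, lambda the rate function, nu the appearance rate of a surviving clone, X the population size, b the division rate, u the mutation probability per division, pi the probability of each mutation type, and rho the probability that the clone reaches size epsilon C. Reading p as the probability of the division type that leaves only one daughter cell of interest is an assumption.

```latex
% Probability of cancer by time t
P(T_n \le t) \;=\; 1 - \exp\!\left(-\int_0^t \lambda(s)\,ds\right),
\qquad
\lambda(s) \;=\; \sum_{v_1 \in \{S,F,M\}} \nu_{v_1}(s)\, P_{v_1}(t-s)

% Ingredients of the appearance rate for a mutation occurring at time z < s
% (schematic: the terms named in the lecture, multiplied together)
\nu_{v_1} \;\propto\; X(z)\; b\; u\; (2-p)\; \pi_{v_1}\; \rho_{v_1}

% Where the factor (2 - p) comes from: two relevant daughter cells with
% probability (1 - p), one relevant daughter cell with probability p
2\,(1-p) + 1\cdot p \;=\; 2 - p
```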
Now you integrate from conception, basically, all the way to s. That tells you what lambda is. This is just the basic intuition; it's actually a bit more complicated when you really look at the details. Okay.

So what do you get when you plug your lambda function into the expression here? You get the probability. And this part, your P of v1, I need to mention. This part is, as you remember: once that surviving clone has appeared, what is the probability that from that clone you will get a cancer before time t? That's this piece. And what is that probability? Well, it's basically the same idea. It's one minus an exponential function, where the rate is given by this integral, where now, given that there is a v1, you need to get a v2, and so on. So that's the intuition. As we wrote here, the various P of v1, P of v1 v2, P of v1 v2 v3 are calculated this way. You end up with a full formula depending on how many you need. If you're modeling blood cancer, then you can stop at two; for a solid tumor, maybe you need to go all the way to three, and so on.

Okay, so, sorry. Sure. Yes, it does. In fact, it's hidden. If you really want the details of those formulas, you have to look at the supplementary material of the paper. I'm just trying to give the intuition. It's a bit complicated to show a formula that makes sense without a ton of notation, but yes, there are some dependencies in the expression that are hidden.

Okay, so the probability of getting to cancer by time t is one minus e to the minus capital Lambda. What we show in that paper is that, under some simplifying assumptions, it reduces to this expression. Let's go through the terms. N is the number of stem cells.
Okay, makes sense: the more cells, the higher the probability. u here is the driver mutation probability, to the n, where n is the number of required drivers. And then you get this expression, where b0 is the proliferation rate before birth, with some other terms here, plus the proliferation rate after birth, which is b. This actually makes a lot of sense if you think about completely independent events. Then it would just be: tell me how many cells you have. Actually, let me do this: forget before birth, let's simplify and eliminate that. If we eliminate that, lambda becomes this; you just drop this term. And this makes a lot of sense, because up to some constant, what do we have? How many cells you have, what the mutation probability per division is, and then you're just counting how many divisions you have: this is the division rate and this is the time, so b times t tells you how many divisions you have. And you need n of those events. So I think it's actually pretty intuitive.

So this is under the simplifying assumption where we drop the possibility that these clones are created before birth, which you may not want to drop. In fact we are working on this development phase; there are some observations there too. But if you use this lambda in this formula, that's actually a Weibull distribution. Okay. And what is neat about this is that the Weibull, however you pronounce it, has been used a lot by statisticians in modeling cancer incidence. It's a distribution that makes sense: it models failure times, so we see cancer as a failure time. But there was never a justification for using the Weibull distribution, except for saying, yeah, cancer must be some type of failure, right?
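To make the Weibull connection concrete, here is a small Python sketch under the simplified form described above, where the exponent is a constant times N times (u b t) to the n. The constant kappa and all numerical values are illustrative assumptions, not estimates from the paper. The sketch only checks that this expression is the same thing as a Weibull cumulative distribution with shape n.

```python
import math


def cancer_probability(t, n_cells, u, b, n_drivers, kappa=1.0):
    """P(cancer by time t) = 1 - exp(-kappa * N * (u * b * t)**n)."""
    cumulative = kappa * n_cells * (u * b * t) ** n_drivers
    return -math.expm1(-cumulative)  # numerically stable 1 - exp(-x)


def weibull_cdf(t, shape, scale):
    """Standard Weibull CDF: 1 - exp(-(t / scale)**shape)."""
    return -math.expm1(-((t / scale) ** shape))


def weibull_parameters(n_cells, u, b, n_drivers, kappa=1.0):
    """Shape and scale that make the two expressions above coincide."""
    shape = n_drivers
    scale = 1.0 / (u * b * (kappa * n_cells) ** (1.0 / n_drivers))
    return shape, scale


# Made-up illustrative values: 1e6 stem cells, mutation probability 1e-6
# per division, 50 divisions per year, 3 required drivers.
params = dict(n_cells=1e6, u=1e-6, b=50.0, n_drivers=3)
shape, scale = weibull_parameters(**params)
for age in (20, 50, 80):
    p_model = cancer_probability(age, **params)
    p_weibull = weibull_cdf(age, shape, scale)
    print(age, p_model, p_weibull)
```

The shape parameter is the number of required drivers, which is why the risk curve bends upward with age rather than growing linearly.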
So here we are actually showing that, mechanistically, when you model all the events that you need, these multiple sequential events that you need to get to cancer, you end up being able to approximate all that complicated process by the Weibull. So now, if a statistician says I'm going to use the Weibull distribution for modeling time to cancer, there is, I would say, a much more solid justification for that. Okay.

Just to show you some plots of how these expressions work in different tissues: here you see breast, colon, and lung. By the way, this is assuming no exposures. As you can see, the number of cells and the division rates obviously have important effects on how the cumulative risk of cancer behaves. This one is kind of neat. A reviewer asked us: what if we didn't have the current lifespan? So here you see the density of this time to cancer. This shows you that in theory, if each one of us could live, say, 500 years, even without exposures, and if our body kept going the same way, each one of us would have every cancer in every organ. It's just a matter of time, basically. The density is as expected. Okay.

Now, another important consequence of this work. When you look at the Weibull distribution, this is the probability; it's just the Weibull. Now, if you do a Taylor expansion and look at the first term, the exponential becomes one minus the exponent, so the two ones cancel and you are left with that. So lifetime cancer risk can be approximated by this expression. All I did here, compared with what was there before, and I'm sorry, I keep changing some of the symbols on you, age is a here and before we had time t, is that I'm taking b and a together.
Okay, b and a, or t, as a lifetime, and I'm calling that product D, for the number of divisions. So as you can see there is a constant kappa, then u D to the n, and then the number of stem cells. What is interesting about that is this. If you remember what I told you on the first day, I now know that it was not quite right. What I did there, when I looked at different tissues, was to look at the relationship with the lifetime number of cell divisions, and therefore of these mutations accumulating in different organs. So I was looking at essentially the product of the number of cells and the number of divisions in the lifetime of that tissue for each one of those cells. It's that product. Okay.

Yeah? How did we estimate that? Yes, that's a question for afterwards. We can talk about that.

So that's the basic idea. It was to say: how many mutations accumulate in a tissue is going to be proportional to how many cells I have times how many times each one of those cells has divided. But actually, what I just showed you is that the more correct formula is to say that the probability is a function of, you remember the Weibull, u D to the n, and of the number of cells. So what does this mean? This is very important. If you've been following, the probability of cancer is linear with respect to the number of cells, but has a power law relationship with respect to u times D. What does that mean? It's actually very intuitive. Say, for example, that my brother has double the size of a given organ. So N is times two. The risk in terms of the number of mutations, the risk I'm modeling, should be approximately double, because there are double the cells creating those events, those mutations that may take a person to cancer. Okay.
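The scaling argument can be checked numerically. The sketch below uses the first-order approximation described in the lecture, risk roughly equal to kappa times N times (u D) to the n; the function name and all numbers are illustrative assumptions. It shows that doubling the number of cells doubles the approximate risk, while doubling the number of divisions multiplies it by 2 to the n.

```python
import math


def approx_lifetime_risk(n_cells, u, divisions, n_drivers, kappa=1.0):
    """First-order (Taylor) approximation of 1 - exp(-x) by x,
    with x = kappa * N * (u * D)**n."""
    return kappa * n_cells * (u * divisions) ** n_drivers


# Made-up illustrative values.
base = dict(n_cells=1e6, u=1e-6, divisions=3000.0, n_drivers=3)

risk = approx_lifetime_risk(**base)
risk_double_cells = approx_lifetime_risk(**{**base, "n_cells": 2e6})
risk_double_divisions = approx_lifetime_risk(**{**base, "divisions": 6000.0})

print("baseline risk:", risk)
print("ratio when N doubles:", risk_double_cells / risk)      # linear in N
print("ratio when D doubles:", risk_double_divisions / risk)  # power law in D

# How good is the first-order approximation at this risk level?
exact = -math.expm1(-risk)
print("exact vs approximate:", exact, risk)
```

The approximation is only reasonable while the risk is small; for large values of the exponent the full expression with the exponential should be used.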
So the expression is linear in N. But it's not linear in terms of how many divisions I have. A tissue that has double the number of divisions does not have double the cancer risk; it has more than that, because of this power law relationship. And we see this in the cancer incidence curves. When you look at them, there is this kind of exponentially shaped curve. It's not linear with age: the older we get, the more steeply our cancer risk rises.

That allows us, in this new study, to take the same points I showed you before, but now, instead of a 2D plot, we do a 3D plot where we split N and D. Before, they were just multiplied together and considered together; here, because we recognize that in one direction the effect is linear and in the other direction it is not, we split them along two axes. So now each point here, which is a tissue, has the number of cells as one coordinate, and the other coordinate, say the y coordinate, is the number of divisions to the power of n. And the mutation rate is a constant across all tissues. Is that clear? So what happens here is, let's see. This is the video, right? Okay, yes. As you can see, you get a lot better in terms of fit. In terms of comparison, we went from an adjusted R squared that was about two thirds, if you remember, because the correlation was 0.8, so R squared was about two thirds, to something that is now 0.8. Okay.

To conclude this part, let me just say this. I think one of the important roles that machine learners and modelers can have, in cancer or really anything in biology, is the following. I have this figure, which is taken from a book. I think it's the Tibshirani book, with Hastie; you know, the Stanford group of statisticians.
It's a very famous group in machine learning. The idea is this. The truth is here, and then there is some noise, so the realization, what you actually observe, could be somewhere around the truth, because there is always randomness. Then there is the model space. How close we can get to the truth with our models is represented by this red line. Then we often restrict the model space: we focus on how to reduce its dimension and optimize how close we get to the truth within it. One thing I feel that people in machine learning, people in quantitative science, should focus a bit more on is the model space itself. A lot of the work should be on the model space, on how we can get this red line closer to the truth. There's a lot of work on the other part of the line. And so I split it here, in one setting.

I know I'm going very fast, but if you're familiar with machine learning this should be very familiar. For example, look at the squared error loss. You can break that down, and I refer you to the book if you haven't seen this; it's actually a very good thing to know. I think I saw something along these lines the other day that you were doing in class, so you should know it. If I look at the squared error loss, I can break it down into an irreducible error, and there is nothing you can do about that, and then the model bias, and then the estimation bias, plus the variance. There is always variance everywhere, and there is estimation bias. My point is that there is a lot of work on this estimation bias. I feel there is not enough work on the model bias. But if you really want to understand a phenomenon, I think the model bias is really the most important one that you have to reduce.
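As a reminder of the decomposition mentioned above, here it is written out schematically. The notation is an assumption for illustration: f is the truth, f-star is the closest function to the truth within the chosen model space, f-hat is the fitted model, and sigma squared is the noise variance. The first line is the usual decomposition of expected squared error at a point x; the second line splits the bias into the two pieces discussed in the lecture.

```latex
% Expected squared prediction error at a point x
\mathbb{E}\big[(Y - \hat f(x))^2\big]
  \;=\; \underbrace{\sigma^2}_{\text{irreducible error}}
  \;+\; \underbrace{\big(f(x) - \mathbb{E}\hat f(x)\big)^2}_{\text{squared bias}}
  \;+\; \underbrace{\operatorname{Var}\big(\hat f(x)\big)}_{\text{variance}}

% The bias itself is the sum of a model part and an estimation part
f(x) - \mathbb{E}\hat f(x)
  \;=\; \underbrace{\big(f(x) - f^{*}(x)\big)}_{\text{model bias}}
  \;+\; \underbrace{\big(f^{*}(x) - \mathbb{E}\hat f(x)\big)}_{\text{estimation bias}}
```

The model bias term depends only on the choice of model space, not on how well the fitting is done, which is the point being made about where effort should go.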
And I feel that in the example I just gave you, that's what we did: we provided a better model for analyzing the data you get and for making inference out of that data. I think that is more important than asking, once I have the data, what the best methodology is to fit those points given a model. No, I think the most important thing is what model of reality I'm going to use. Okay.

To conclude this part, you can also use this approach to understand how many mutations an individual patient should have. At a given age a, imagine on this line that a patient got hit by cancer by age a. That means that if, say, there were three required drivers, there was a T1, a T2, and a T3. The division rate was b, the normal division rate, until the first driver; then, because of the driver, the division rate changes, and so that's b1; and then another driver hits and there is, again, another fitness advantage. If you assume that every time you have a division there is some mutation rate, some probability of getting mutations, then you can ask: by age a, how many mutations do I expect? Well, it's the mutation probability times the number of divisions: the ones that happened before birth, the ones that happened until time T1, where the division rate was b, the ones that happened between T1 and T2, with the new division rate caused by the fitness advantage of the first driver, and so on. Okay. And now you can compare this to sequencing data. I'll skip this part, just because of time. Here again, in the same paper, what we showed was done without fitting to the data.
Obviously we see the data too, but just using standard parameter estimates coming from the literature, we could fit, I would say pretty well, the average number of mutations by a given age in breast. Not as well in lung, for a very simple reason: here we are modeling just the normal accumulation of mutations, but when you look at lung cancer patients, a lot of the effect is due to smoking, which we have not included in our prediction.

And so when we look at a bit more complicated figure, here every color is a different cancer type and every dot is one patient. What I'm showing you on the x axis is the number of mutations that we would expect just due to these normal endogenous processes, and on the y axis is how many we actually observe in the patient through sequencing. Of course there is a lot of variation, but as you can see relative to this line, which is the line where expected and observed match, the tissues are in the right ballpark, I would say. There are two major exceptions, the yellow and the light blue tissues, but guess what those are. One is lung cancer, the yellow one: smoking really takes that tissue and adds a lot of mutations to it. The other one is melanoma, skin cancer, where again we know that sun exposure has a very powerful environmental effect on the tissue. Okay.

Sometimes I'm asked: how do you know that there isn't something very powerful in the water we drink that we just don't know about, something at the level of smoking or sun exposure, at the general population level? My answer, given this figure, is that I actually don't think that is possible.
The reason is that if there were, again at the population level, I cannot exclude it in a few people, in a small, select subset of patients, but in general, if there were, I would see a tissue or some tissues being big outliers. My point is, it took maybe 50 years to really prove that smoking causes cancer. It was a very big endeavor in the epidemiological field to prove it. Let's say we didn't have any of that. Just by looking at how many mutations you expect in a tissue normally, and how many you find in those cancers, anyone looking at this figure would have said there is something wrong going on in lung and skin. And the reason is that our DNA doesn't care about what we know and what we don't know. Our DNA is recording everything. So even if we had not yet done the research to show that smoking causes cancer, our DNA would be recording that effect. The point here is that if there were really major effects that we still don't know of, we would see them recorded in the DNA. We would see a major deviation from the expected values. Is that clear? Does it make sense? Okay.

All right. So here is the adjusted version. We try to adjust for the effect of smoking on lung cancer and of sun exposure on skin cancer. If we look at this figure, where now we are adding the mutations that smoking and sun exposure add, the plot becomes better. Now we're including the effect of the environment. It's still not perfect, and I think that's because the estimates are still not good. As I said, I think we are just not including it properly: the current estimates of how much the sun affects the skin, in this analysis at least, I would say are still undershooting the true values. Okay. So that was what I should have finished yesterday.
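The expected number of mutations by a given age, which is the x axis quantity in the comparison with sequencing data discussed above, can be sketched as follows. This is a simplified illustration of the bookkeeping described in the lecture (divisions before birth, then divisions at rate b until the first driver, then at a new rate after each driver), not the paper's calculation. The function name and all numbers are assumptions.

```python
import math


def expected_mutations(age, mu, prenatal_divisions, driver_times, division_rates):
    """Expected mutation count by `age`.

    mu                 -- expected mutations per cell division (assumed constant)
    prenatal_divisions -- number of divisions before birth
    driver_times       -- sorted ages [T1, T2, ...] at which drivers hit
    division_rates     -- [b, b1, b2, ...]: the rate before T1, between T1
                          and T2, and so on (one more entry than driver_times)
    """
    total_divisions = prenatal_divisions
    start = 0.0
    for i, rate in enumerate(division_rates):
        # Each rate applies until the next driver hits, or until `age`.
        end = driver_times[i] if i < len(driver_times) else age
        end = min(end, age)
        if end > start:
            total_divisions += rate * (end - start)
            start = end
    return mu * total_divisions


# Made-up example: three drivers at ages 30, 45 and 55, with the division
# rate (per year) increasing after each one.
expected = expected_mutations(
    age=60,
    mu=2.0,
    prenatal_divisions=50,
    driver_times=[30, 45, 55],
    division_rates=[10, 15, 25, 40],
)
print("expected mutations by age 60:", expected)
```

Comparing this expected count with the count observed by sequencing is what places each patient relative to the diagonal line: tissues with a strong external mutagen, such as smoking or sun exposure, would sit well above it.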
So today we can start with our last topic, which is liquid biopsies for the early detection of cancer. Let me just say that this is a very important topic, and I want you to at least be aware of it, because it's really an active field today in cancer research, and it holds a lot of promise. The motivation for why this is so important is the following. Actually, before I show this slide, let me ask you a question. You have probably heard that in the United States, President Nixon declared war on cancer. I don't know if you have ever heard this, but that was in 1971. So basically it's been 50 years. Do you have a sense of how much cancer mortality has been reduced in these 50 years, during this war on cancer? And by the way, declaring war on cancer wasn't just saying we declare war on cancer; it meant bringing a lot of funding to research to help solve this problem. So what has been the yield of these 50 years of research, just in terms of, let's say, cancer mortality reduction? Give me a proportion: by how much do you think we reduced cancer mortality in 50 years? Any guess is good. I just want to see what you think. Yes, that's a good point. Overall, just give me a sense: do you think we did pretty well? Yeah, that's a very good comment.

So I'll give you a number. It's 20%. We reduced mortality by 20%. By the way, that's a proportion. If we look at absolute numbers, we actually have more cancer deaths today than we had 50 years ago. That's how we are doing. And I would claim that one major reason is that cancer research has focused, up to today, mainly on therapy.
And so we end up trying to treat late stage disease, at high cost. That's the paradigm today, and I think it has been failing, obviously. To give you a comparison, mortality from heart disease has gone down by 50% in 30 years. I think one of the major reasons is that there, while of course we want to understand how to do treatment and therapy better, there has been a major focus on prevention: checking cholesterol levels and all those types of screening.

Okay, so how do you do prevention for cancer? There are two ways. This is an epidemiological definition. Since we are talking about prevention of cancer mortality, primary prevention means you stop the exposure. Avoid exposure. Primary prevention would be, for example, spending money to convince the population that smoking is bad, so you print scary messages on cigarette packs and things like that. Primary prevention, when it works, is great, because you stop the disease from occurring. It remains a great thing to do, and we need to do primary prevention. But there are two big limitations. One is that some people, no matter how much you tell them they shouldn't do something, will still do it. It's not going to work if people don't actually follow that advice. The second, even more important I would say: remember the table I showed you, the Harvard one, which says that if you do these things, basically this is all that causes cancer. Well, by now I hope I have at least somewhat convinced you that there is a lot of cancer that has nothing to do with external exposures or lifestyle. As of today, we just cannot do primary prevention for those cancers in general. I have an example of an exception, which I can tell you about later if you're interested.
But in general, the cancers that occur normally, just because of our body's cell divisions, will keep happening unless you stop living. As we just showed, if we live long enough, we will all get cancer in all of our organs. So what do you do for that part? Well, the other component of prevention is called secondary prevention. With secondary prevention, essentially, you cannot prevent cancer from occurring but you can prevent cancer from killing the patient. The typical way to define secondary prevention for cancer is early detection. If you can detect the cancer early, then you can remove it surgically while it is still localized, and that can have a major impact.

As motivation, here are two examples, breast cancer and prostate cancer. These are the big killers for women and for men; together with lung cancer they are the three big cancer types. As you can see, if I find the cancer in stage one or two, when it is still localized, after five years essentially 99% of women are alive. That means that the cancer was found, was removed surgically, and done. Or with radiation or whatever, but done. Similarly, with prostate cancer, essentially 100% of those whose cancer was found at the localized stage are fine after five years. Now look instead at when you find the cancer in stage four. For breast cancer, three out of four women are dead five years later. And for prostate cancer it's basically 70% of them. So this provides the motivation: especially for that part of cancer that you cannot prevent with primary prevention, you have to detect it early. And in fact, even for those where you can do primary prevention, you still want to detect them as early as you can.
So that's why these new methodologies are becoming so important in the field: they really hold the promise to finally change the outlook of cancer, which, as you all know, is just terrible.

Okay, so this is a paper we published in 2018, and it was the first paper to publish a method for multi-cancer detection using a blood test that analyzes both mutations and proteins. It was with a group of cancer researchers at Hopkins, with Dr. Vogelstein probably being the leader among them. On my side, I was developing the algorithmic side, and I was responsible for that. So there were, you know, five last authors, corresponding authors, depending on which part we were responsible for. We published this paper, and we showed that these technologies hold a lot of promise. I'll tell you more in a second; this was just a case-control study. But things are moving forward.

Did anyone talk about cell-free DNA before me today? I don't want to repeat things. Okay. What we know today is that when you have cancer, typically the cancer will shed some of its DNA into the blood. The reason is that even in cancer you have a lot of cell death, and when cells die, things break down and DNA enters the bloodstream. Typically it's a small signal, but you can find fragments of DNA that are not normal, that contain mutations. And some protein levels also change when you have cancer. So you can take a sample with just a simple blood test and then measure these. When we talk about cell-free DNA, we mean this DNA floating in the blood. Among the cell-free DNA there is the circulating tumor DNA, the fragments of DNA that come from the tumor. That's ctDNA. So we look at that and at proteins.
And with some methodology, we could classify people as healthy or not, whether they had cancer or not. Okay, and let me first say that this doesn't necessarily have to be blood. We did this for endometrial and ovarian cancer with PapSEEK, and using urine as the liquid for urothelial and bladder cancer. And you can also use these technologies for detecting recurrence of cancer. So instead of finding cancer early in patients that don't know they have it, you can use these technologies for patients that had cancer and had surgery, and now they wonder: is some cancer still there, is the cancer coming back? And again, by checking the blood, you can test that, right. So let me talk a little bit about the first one, and then the last one. The first one, which was called CancerSEEK, as I told you, was measuring circulating tumor DNA and protein levels. It was a pretty large case-control study, about 1000 patients and 800 controls, across different cancer types and also across different stages. Okay. And I want to go through how we thought about this because, to me, whether it's the model or the feature engineering, those are the most important parts, even from a machine learning perspective. As I'll show you in a second, I used logistic regression here as the classifier part. Okay, so nothing too fancy. You can improve a little bit by doing something super sophisticated, which we did later, and the improvement was very small. Really, the major improvements are in this part, and that's why I'm showing it to you. So, the first question is: let's say we want to look at mutations that are typical in cancer. How would you pick which mutations to look at? Okay. And here's what we did. There is this database, TCGA, which is The Cancer Genome Atlas.
And so it's sequencing data from many cancer types, and we said, okay, let's look at the mutation which is the most common among all cancers. Because we definitely would like to find that one. If we are trying to have a multi-cancer test, we'd better find that one, right. So we looked at all the sequencing data, about 30,000 patients, and we just asked which one is the top mutation in terms of frequency, so very simple. Okay, what about the second, and the one after it? So we ranked them. And then, for every tissue and also for all of them together, we asked how many patients we would catch. On the x axis here I have those mutations, ranked. When the value here is one, that means that I'm only picking the top mutation in the ranking, and I look at what proportion of patients in the TCGA data set would be caught if I had a perfect technology where, if the mutation is present in the patient, I would detect it. Okay, so this is an ideal scenario. This is a completely theoretical scenario, because in many cases, even if the patient has the mutation, I may not be able to catch it in the blood. But from a theoretical perspective, this is basically an upper bound on how well we can do: how many of the patients in TCGA had the mutation? Understand that the data we use from TCGA are sequencing data of the cancer. You literally take a sample of the cancer itself, you sequence it, and you look at the mutations. And because the mutations go through these clonal expansions, they are clonally or subclonally present there, so it's somewhat easy to find them. We will have the task of finding them in the blood, if some of the DNA of the tumor is shed into the blood, which is a much harder task, right. But let's pretend we have a perfect technology: if the patient has the mutation in the cancer, we are going to find it in the blood. How many would we find, based on TCGA?
And this is the curve that's represented here, for every tissue. What you can see is that with about 60 amplicons (we ended up choosing 61, but around 60) the curves flatten. Amplicons are just small regions of DNA, like an interval. So once we got to these 60 positions, these 60 little intervals on the genome, we were observing the curves flattening. And you have to think about sequencing cost. This is an optimization problem, right: if you don't have any issue of cost, you would just sequence everything. If you want to save in terms of cost, then what we observed is that with a very small number of positions, I mean look at that, it's only about 2,000 bases in total, we were reaching pretty much the upper bound of what you could find if you had an ideal technology that could catch everything. So we decided on the 61. I actually don't remember exactly why 61, but the point is that, from a machine learning perspective, at around 60 there were not really major advantages in increasing the number. Okay. So the way they were selected is what I was saying before: start with the most common mutation, and that's your first one. That is in one region, right. And if it happens that two mutations are very close to each other, they may be in the same region, so you capture them together. So is this clear? This is how we decided which ones to put into the method, the algorithm. Yes, please. I mean before metastasis. If it's metastasized, it's kind of too late. Once the cancer has metastasized and it's in multiple places, in general you're not going to catch up. So, in the TCGA data set.
I have no blood samples; I actually have the results of sequencing the actual cancer of about 30,000 patients. Okay. So I can take, for example, a particular TP53 mutation. Let's say that's the most common. Now you ask how many patients with colorectal cancer have that mutation. So I go to TCGA, I look at the sequencing data of the tumors, and I see what proportion of them have that one. And then I go to the second most common, and I ask: now if I consider the first and the second together, how many patients would I catch in the TCGA data set? But again, this is by looking at the sequencing of the tumor. So it's an upper bound, because of course our technology then has to look at blood, which is a very indirect way. But we want to do early detection. We don't have a cancer to sequence; if we have a cancer to sequence, then we already know that there is a cancer. Right. Is that clear?
So would it make sense to make specific tests, maybe for the most common cancers, or would they have a similar result to this one? So cost-wise it wouldn't make sense?
Yeah, so there are two parts to my answer. The first one is, as you can see here, something that we actually did not expect initially. The number here is 60 or 61, right, and as you can see, this flattening shows that this particular combination of mutations happens to catch pretty much what you can catch across different cancer types. Okay, it didn't have to be like this. It could have been that with the same 60 you catch a lot in some cancers and very little in others. It could have been that every cancer has its own mutations and there is no overlap whatsoever. So I would say the fact that with 60 you get to where each of the curves for the different cancer types flattens indicates that there is some major advantage in looking at them all together.
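The coverage curve described above (rank mutations by frequency, then ask what fraction of sequenced tumors carry at least one of the top k) can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the function name `cumulative_coverage`, the patient identifiers, and the mutation labels are hypothetical toy values, not TCGA data and not the code used in the study.

```python
from collections import Counter

def cumulative_coverage(patient_mutations):
    """patient_mutations maps a patient id to the set of mutations found in
    that patient's tumor. Returns (ranked, curve): mutations ranked by how
    many patients carry them, and for each k the fraction of patients that
    carry at least one of the top-k mutations (an idealized upper bound)."""
    counts = Counter(m for muts in patient_mutations.values() for m in muts)
    # Rank by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda m: (-counts[m], m))
    covered = set()
    curve = []
    for mutation in ranked:
        covered |= {p for p, muts in patient_mutations.items() if mutation in muts}
        curve.append(len(covered) / len(patient_mutations))
    return ranked, curve

# Hypothetical toy cohort (labels are illustrative only).
toy = {
    "p1": {"TP53_R175H", "KRAS_G12D"},
    "p2": {"KRAS_G12D"},
    "p3": {"KRAS_G12D", "PIK3CA_H1047R"},
    "p4": {"TP53_R175H"},
    "p5": set(),  # a tumor with none of the listed mutations
}
ranked, curve = cumulative_coverage(toy)
print(ranked)
print(curve)
```

In the toy cohort the curve stops rising once the third mutation is added, which is the flattening behavior discussed in the lecture; the real analysis grouped nearby mutations into amplicons, which this sketch does not model.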
The second part of my answer is: you're exactly right to ask the question, because for example in breast cancer there are already screening methodologies, right. And different cancer types have different incidence. So the value of catching a cancer that's very rare may not be as important as catching something that's quite common. But if you already have a lot of methodologies out there for catching that common cancer, then maybe the multi-cancer test is not really adding much to what is already there. It may be better to improve just on detecting, say, breast cancer specifically, since that one is very common, or lung cancer specifically. So that question also becomes an economics question. I have a paper, I believe it's at the end of my slides, or I can give you the reference later, where we did a cost-benefit analysis and where we give a sense of when it's convenient to have a multi-cancer test versus when it's not. Okay, because again, every cancer has its own situation, with screening methodologies available for some. Some are very good, some are not good, some cancers have no screening methodologies, and so on. Okay. Yes.
So we are taking the TCGA data set, so those are all tumor samples. We're not looking here into normal samples, which might share these mutations. And last class we were talking about how some driver mutations appear as early as 15 years before the onset of symptoms. So I'm just thinking of the practical implications of an early detection test. Say someone finds a driver mutation in their cell-free DNA, but the cells carrying the mutation have not yet reached a point where we can call it a cancer. Then it's like, you might get cancer but you might not. And then it's an indefinite wait till you get cancer.
So, you should work with us if you're interested, because you just mentioned maybe one of the most fundamental problems. And that's exactly what I'm going to show you next, which is: okay, fine, these are cancers, but how do we distinguish cancer from normal? What we mean by normal is not really normal. There are driver mutations, just not full cancer. Actually, if you remember what I showed you in the first class, I had one slide where it said "the new normal". What we currently know, thanks to a lot of research, mine included, is that we are full of these mutations normally, so that is normal. How normal is normal, and when does it become cancer? That's exactly the task of these early detection methodologies, to separate the two cases, and that's the challenge. And so I'll show you what we did. You can extract the DNA of a person, and you can attach unique identifiers to each molecule. Oh, please. I'm happy to answer now. Where is the chat? "Are these 16 genes generally connected to cancer?" Yes, absolutely. That's a great question. If you look at the list, and I'll refer you to the paper, it's a 2018 paper, the 16 genes are exactly the most commonly mutated genes in cancer. So the answer is yes, absolutely. That's exactly why we were looking at them, based on their frequencies. Okay, so now imagine you take a blood sample from someone, and from that blood you extract DNA. For each fragment of DNA that you find, there is this technology that allows you to attach labels, basically, to every fragment. These labels are UIDs; that's what the symbols here in the figure are.
So if the fragment is this orange piece in this figure, the preceding piece, which is red, green, purple or blue, is one of these labels, unique identifiers; that's what UID means. So they get attached. Why do we do that? Well, we want to keep track of every family as its own thing, to separate them. And then we amplify them. I'm sure you're familiar with PCR; if you're coming from systems biology you have heard about PCR. So I won't explain it, but basically you get a lot of copies. There is a slide for PCR if you don't remember. So basically, at the end of the process, instead of having just one copy of that fragment, you have multiple copies of the same fragment. Right. For example, say this one is a fragment with a mutation; now, because of PCR, you have a bunch of them. Now the problem, as you can see, is with what we call wild type, which is normal, meaning there was no real mutation in that region. Every process of amplification is imperfect, so through PCR and through sequencing you are adding mutations. But hopefully in PCR you are adding the mutation to only some of the copies. In fact, it depends, and you can understand that from here. Say this is your process of PCR: you start with one molecule and you create a million of them. The question is where the mutation happens, where the mistake in PCR is. In PCR it's like doing cell division, right: you take your DNA, you open it, and you duplicate it. But with duplication, as with cell division, every time you copy DNA you can accumulate mistakes. So then the question is, when is this mistake occurring?
Well, if that mistake occurs in the first division, the first cycle of PCR, then 50% of the population will have that error. It's not real. And actually those are the worst, because it looks really real, right, 50% of the population has it. And of course, in the ideal scenario, when it's real, 100% of the fragments with that UID have it. But you see now why we put labels: because now I can distinguish this set of fragments from that set. So what we do is, after doing PCR, we sequence and we look at what is called the mutant allele frequency, or MAF. The mutant allele frequency is how many mutations I observe out of all the fragments that I read in the family. Okay. And given that, I can build a distribution. So here I get to answer your question. What I showed you before was data from TCGA, so that's cancer data. Now, in our study we took blood samples from healthy individuals, apparently healthy individuals, certainly with some of those drivers, and from cancer patients. And we do this process of PCR and sequencing for all of them. So we observe the MAFs, these mutant allele frequencies, in healthy individuals for every mutation, and in cancer patients for every mutation. Now we have two distributions. We have healthy individuals that have a driver mutation, which will show up in our sequencing, and that becomes the normal. Right, so we observe some healthy individuals that have a particular TP53 mutation or KRAS mutation, and now we have distributions for how large those frequencies are. Because think about it: in a healthy individual you may have a KRAS mutation, but not in a very large clone, so there may be a very small signal. So when I look at those frequencies that I just showed you, these MAFs, I will expect that only a few UIDs have it.
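To make the bookkeeping concrete, here is a minimal sketch of how reads might be grouped by UID and how a mutant allele frequency could then be computed. It rests on several assumptions that are not from the lecture: the function names, the read representation as `(uid, is_mutant)` pairs, and the 0.9 cutoff for calling a whole family mutant are all hypothetical. The PCR helper assumes idealized perfect doubling, which is how the 50% figure for a first-cycle error arises.

```python
from collections import defaultdict

def family_mutant_fractions(reads):
    """reads: iterable of (uid, is_mutant) pairs, one per sequenced read.
    Returns, for each UID family, the fraction of its reads showing the
    mutation. A real mutation should be in essentially all reads of its
    family; a PCR error introduced mid-amplification is in only a part."""
    total = defaultdict(int)
    mutant = defaultdict(int)
    for uid, is_mutant in reads:
        total[uid] += 1
        mutant[uid] += int(is_mutant)
    return {uid: mutant[uid] / total[uid] for uid in total}

def sample_maf(reads, family_cutoff=0.9):
    """Fraction of UID families called mutant. The 0.9 cutoff is a
    hypothetical choice for illustration only."""
    fractions = family_mutant_fractions(reads)
    called = sum(f >= family_cutoff for f in fractions.values())
    return called / len(fractions)

def pcr_error_fraction(cycle):
    """Idealized share of final copies carrying an error introduced in the
    given PCR cycle (1-based), assuming perfect doubling every cycle."""
    return 0.5 ** cycle

# Toy reads: family A is uniformly mutant (looks real), family B carries a
# late PCR error in 1 of 4 reads, family C is wild type.
toy_reads = (
    [("A", True)] * 4
    + [("B", True), ("B", False), ("B", False), ("B", False)]
    + [("C", False)] * 4
)
print(family_mutant_fractions(toy_reads))
print(sample_maf(toy_reads))
print(pcr_error_fraction(1))
```

The design point is the one made in the lecture: without the UID labels, the single mutant read in family B would be indistinguishable from a real low-frequency mutation.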
Right, because out of all the blood I analyze there is just this little bit. Now if instead I have a large tumor, where a particular mutation is present in all of its cells, say, and a lot of this gets shed into the blood, I should observe a much larger frequency. So the MAF has to be much higher, right. I don't have a plot of the two distributions, but imagine two distributions with some separation. So what I can do is just a log-likelihood ratio. And this is weighted over four because we did this in four wells: we repeated the experiment four times, just to be more robust. Okay. And this was the score, this was it. Here, P of N is the probability to observe that frequency or higher in the normal population, and P of C is the probability to observe that frequency for the mutation in the cancer population. This is standard statistics; if you don't remember, just look up the likelihood ratio. So that's what I called the Omega score. Someone asked me why I called it the Omega score. I don't know, I just picked a Greek letter and ended up with Omega. Okay, so that's that part. Then, on the other side, we were looking at protein levels. Through a process that I won't explain because of time, we essentially selected these nine proteins that you see here. Actually eight; there is one here that I guess was dropped afterwards. And then we just used logistic regression. As you can see, we selected the proteins where the signal for cancer patients, which is the red, was generally much higher than for healthy individuals. Okay. Now I'll tell you something about where people like you can contribute to the medical field because, as I told you, historically doctors in the medical field are used to this.
And the quantitative staff are of service to them, doing a power calculation for them or whatever analysis. Okay. Today, because the data are becoming so complicated, you actually often inform the methodology. If we have the proteins in this paper, it is because at some point I said, guys, the signal from the proteins is also very strong, why not add them in? What was the doctors' problem with the proteins initially? It was when they were looking at these figures; I think these are called waterfall plots. They were saying, look, there is not enough separation. For example, for OPN, this protein here, there are some individuals, I don't know if you can see it, there is some blue even here, very close to the top, at the low end of the x axis, close to zero. Do you see how high these individuals are here? And these are normal. These are not patients. So they were saying, I cannot include this protein, because there are healthy individuals that have it pretty high, so it is not a good protein. But mathematically you can understand the point: look at the figure. There is definitely some degree of separation, right; red is definitely more present here than here. So if I put enough of these proteins together, even though this one on its own would not be very good for making the call, once you combine the information that comes from this protein and this protein and this protein, you can separate things actually quite well. It is very important in early detection to focus on very high specificity. For specificity, we want to have a very low false positive rate. We don't want to tell people they have cancer when they don't. To give you a sense of the game we are playing, we are playing at around 99% specificity. Anything lower than that is considered not really acceptable for these test methodologies, maybe 98, but we are there.
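The operating point mentioned above, about 99% specificity, amounts to choosing the score cutoff from the healthy controls and only then asking how many cancer cases exceed it. The following is a minimal sketch of that step, assuming some combined score per person is already available (for instance from a logistic regression); the function names and the toy scores are hypothetical and are not the study's data or code.

```python
import math

def cutoff_for_specificity(healthy_scores, target=0.99):
    """Smallest cutoff such that at least `target` of the healthy controls
    score at or below it. Anyone scoring above the cutoff is called positive."""
    ordered = sorted(healthy_scores)
    # Number of controls that must be called negative (tiny epsilon guards
    # against floating-point round-up in target * n).
    needed = math.ceil(target * len(ordered) - 1e-9)
    return ordered[needed - 1]

def specificity(healthy_scores, cutoff):
    """Fraction of healthy controls correctly called negative."""
    return sum(s <= cutoff for s in healthy_scores) / len(healthy_scores)

def sensitivity(cancer_scores, cutoff):
    """Fraction of cancer cases correctly called positive."""
    return sum(s > cutoff for s in cancer_scores) / len(cancer_scores)

# Hypothetical combined scores, for illustration only.
healthy = [i / 100 for i in range(100)]
cancer = [0.5, 0.985, 0.99, 1.5]

cut = cutoff_for_specificity(healthy, target=0.99)
print(cut, specificity(healthy, cut), sensitivity(cancer, cut))
```

Fixing specificity first and reading off sensitivity second mirrors the priority stated in the lecture: the false positive rate is the constraint, and sensitivity is whatever can be achieved under it.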
So that was their concern, right. So here is a mathematician who says to the clinicians, actually this is very good, and I'll show you how we can do this. And then they were convinced, and this became the test. The performance was pretty good, I would say. Of course it was a function of stage. Stage one cancers are the hardest to find because they are the smallest and they shed less DNA into the blood. For stages two and three it was much higher. But look at the performance. I mean, sure, it's not perfect. For example, in breast it's less than 50%, about 40%. But look across these cancer types that we analyzed. Even if it's not perfect, if you did this test, say, once a year, maybe you don't catch it this year and you catch it the next year. Okay. But especially look at these first five cancers: ovary, liver, stomach, pancreas and esophagus. There is no screening for them today. I have a friend, a mathematician, who on July 2 was found to have stage four liver cancer and passed away on July 24. Okay, we are in 2022; this is unacceptable. So, no screening whatsoever, and here there are technologies that can actually do a pretty good job of finding these cancers. Now, this is a case-control study, it is not a prospective study, right. In a prospective study the prevalence of cancer is very, very small: less than 1% of the people of a given age will have cancer. So you have to take all of that into account, but this is very exciting, I think. Okay. I have one more small topic and then we're done, but let me just say something about this technology. This was in my disclosure: if this becomes a test at some point, I have some royalties on it. But I wasn't part of the company. It was built at Johns Hopkins, and Johns Hopkins sold this company, a startup, for two billion. Okay, this is the value of this type of stuff today.
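The earlier remark about prevalence in a prospective study can be made concrete with Bayes' rule: when very few people tested actually have cancer, even a 99% specific test yields many false positives relative to true positives. The sketch below is only an illustration; the 50% sensitivity is an assumed round number, not a figure from the study, and `positive_predictive_value` is a hypothetical helper name.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a person with a positive test truly has the disease,
    by Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Assumed values for illustration: 50% sensitivity, 99% specificity,
# and 1% prevalence in the screened population.
ppv = positive_predictive_value(0.5, 0.99, 0.01)
print(round(ppv, 3))  # about 0.336: roughly one positive in three is real
```

Under these assumed numbers about two out of three positive results would be false positives, which is why performance in a case-control study, where cases are a large share of the sample, does not carry over directly to screening.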
Okay, the last thing I wanted to show you is how you can use this same type of approach for monitoring patients after surgery. In this case, this was published in June, in the New England Journal of Medicine, and it was in colorectal cancer patients. Yes? Yeah, that's a great question. So right now those tests are about $1000. There is one available, not this one, called Galleri; you can buy it for $949 in the United States. Well, if you have the money, I recommend it. But obviously at that cost health insurance, definitely in the United States, will not pay. So a lot of my work is on how to make that $200, and I think we are doing it. We are actually in the last month or two of finalizing exactly what you asked about. What I believe is that when these blood tests are $100 or $200, you have something that a person may decide to pay for, instead of going two, three, four times to the restaurant, even if health insurance is not paying, if that's the cost. And I'm in the planning of a 100,000-person study to test this blood test. It's going to be a three-year study, but if it works, it's going to be very exciting. I think we'll be successful. So, yes, here. I'll tell you, this is funny. Here the situation is that these patients have colorectal cancer, detected early, and the question is: do they need chemotherapy? Is there cancer left? Surgeons cut margins around the cancer just to be safe, right. But even for a surgeon, the cancer doesn't come with flags. You can kind of see where the cancerous tissue is, and that tells you how good a surgeon is, but it's a very imprecise thing. So the question is, can we use the blood test? Let's sequence the cancer of the patient. Let's see what mutations that patient's cancer has.
Can we find those mutations in the blood after a few weeks? If we do, then that patient should definitely undergo chemotherapy. And if we don't, then that person is spared chemotherapy. What we showed in the study is that, using this technique, half of the patients were spared chemotherapy. So half of the patients that would have gotten chemotherapy didn't have to, and this is great news for the patient both from a medical, physical point of view and from a financial point of view, right. And again, what we did there is something where we use a digital approach. I told you that we did the earlier experiments in four wells. Well, here we wanted to be even more sensitive, so we did it on a 96-well plate, in which 95 wells contain samples from the patient and one is a control. I'll just tell you the general principle here because of time, but you can find all the details in the paper. Essentially, for every mutation and for every sample, we look at the MAF of that well, and we score it with the same type of idea: how probable is this in a normal individual, versus what we observe in that patient? But now we have 96 wells, or 95, so we want to combine the information coming from each one of these 95 wells to be more powerful; instead of four wells, it's 96. And then, because a patient's cancer usually has more than one mutation, we do this not just for one mutation but for all the mutations that the patient has, so it could be two, three, four mutations. Okay, so then you combine all of these scores into one final score. And let me just say, as I said, in fact I forgot to show you, here you go: this is the paper, on the right. But if you look here, this was the preliminary study that allowed us to then do this randomized clinical trial.
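The general principle just described (score each well by how probable its MAF is in normal individuals versus in cancer, then combine across wells) might be sketched as a weighted sum of log ratios. Everything specific here is an assumption made for illustration: the empirical tail probabilities with add-one smoothing, the use of the same "that frequency or higher" tail for both populations, the per-well weights, and the names `tail_prob` and `omega_like_score`. The exact form used in the papers is not given in this lecture.

```python
import math

def tail_prob(reference_mafs, observed, pseudocount=1):
    """Empirical probability of seeing a MAF at least as large as `observed`
    in a reference population, smoothed so it is never exactly zero."""
    hits = sum(m >= observed for m in reference_mafs)
    return (hits + pseudocount) / (len(reference_mafs) + pseudocount)

def omega_like_score(well_mafs, well_weights, normal_ref, cancer_ref):
    """Weighted sum over wells of log(p_cancer / p_normal). Large positive
    values mean the observed MAFs look more like cancer than like normal."""
    total_weight = sum(well_weights)
    score = 0.0
    for maf, weight in zip(well_mafs, well_weights):
        p_normal = tail_prob(normal_ref, maf)
        p_cancer = tail_prob(cancer_ref, maf)
        score += (weight / total_weight) * math.log(p_cancer / p_normal)
    return score

# Hypothetical reference MAFs for one mutation, for illustration only.
normal_ref = [0.0, 0.0001, 0.0002, 0.0003]
cancer_ref = [0.001, 0.01, 0.02, 0.05]

high = omega_like_score([0.01] * 4, [1, 1, 1, 1], normal_ref, cancer_ref)
low = omega_like_score([0.0] * 4, [1, 1, 1, 1], normal_ref, cancer_ref)
print(high, low)
```

The same function works unchanged for 4 or 95 wells; a per-patient score over several mutations could then be formed by combining the per-mutation scores, for example by taking their maximum or sum, which is again a design choice not specified here.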
Here what you can see is the separation between patients. These were all patients having the same surgery, with the surgeons convinced they had removed everything. Right. With the blood test, four to seven weeks after surgery, we could tell, with a pretty good degree of separation, those that were going to recur from those that were not. Okay, and that's how we spared chemotherapy for half of the patients. And I'll conclude with this. I think it was in 2016 or 2017, the day I realized that what I was doing was actually having an effect on patients. What happened is that one day my medical colleague Vogelstein called me to his office. You know, I was producing these scores, right. The way it works is, this was a study done in Australia, with about 10 or 12 cancer centers in Australia, and us at Hopkins on the sequencing and algorithmic side. So when a patient underwent surgery, they would take the sample of the cancer and fly it to Hopkins, we would sequence and analyze it, and I would provide a score and send the score back. If it's high, then that patient will undergo chemotherapy; if it's low, the patient doesn't need chemotherapy. So one day Vogelstein calls me into his office and says, Christian, for this patient you gave me a score that's kind of in between. And I look at him kind of like, yeah, welcome to reality; I mean, sometimes it's just not clear. And so he says, okay, I understand, but so what should we do? And I said, well, if you want me to err on the side of being cautious, I would say put the patient on chemotherapy. And Vogelstein replies, well, okay, then you have heard what Dr. Thomas said. And I'm looking around like, who is he talking to? We were on speakerphone, and the Australian doctor was on the line. And he says, okay, I'm going to go to the next room to tell the patient that we start chemotherapy.
Okay, and that day I kind of freaked out a little bit, because I really felt, on a very personal level, that what we were doing was literally having an impact, a few hours later, on the life of a person. This is a very serious thing, you know, this is not a joke. It's not just a theoretical game. But on the other side, I think this is what is exciting about this field, right, that we can really have a major impact on the lives of people. So I'll stop here. Thank you. A reminder that if what you saw here was of interest to you, or you are interested in considering or even exploring working with me and my group, just send me an email. If it comes with a picture, even better, so that I can remember; I'll probably remember your face. And thank you for listening. Questions? So, then, thank you again, very much. Thank you. Oh, there is an online question: can the blood test for cancer detection be used globally? Yes, in principle, of course, there is no difference. I'll mention two things. The first one is, especially for developing countries, as you can imagine, the more expensive the test, the less probable it is that the test will be used. So that's one reason why it's so important to me to try to bring the cost of this test as low as possible. And also, unfortunately, there is a lot of bias, even in terms of race, in what has been sequenced until today. TCGA is 95% or so Caucasian. And one of the goals of the study that we are about to start, for example, is to have representation of Asians, African Americans, Africans, Hispanics and all of that, because what is normal may be slightly different in different ethnicities and races. And another question: can it detect in which organ the cancer is localized, and with what precision? Oh, that's a great question. So the answer is, yeah, you can. In fact, in that paper I showed you, CancerSEEK.
We were the first ones to do it. We used a random forest, putting everything together, the proteins and which mutations were present, and it was giving us, I don't remember exactly, about 80% accuracy in picking the top two tissues from which the cancer could have come. So yes. And the other test I mentioned, a competitor by Grail, the one that's already on sale right now, does that. In fact, it's a newer test that does that better. There are some major pros but also some very important cons in that approach. I've heard, and I expected this to happen, that if you tell a person, it looks like you have colorectal cancer, they may end up doing more exams when it's a false positive, and there are lots of false positives. Say the positive predictive value of these tests may be something like 20 to 30%. Okay, so this means that seven out of 10 times the result is a false positive. And it's still good. That's great; I hope for something like that. But then you deal with the 70% of people that don't really have it. Now these people go and do something to check the colon, and they don't find anything, and then what? What am I going to do? You guys told me it was in my colon. Well, maybe we were wrong; now do a CT scan to check the other organs. And then maybe the CT scan shows nothing, and then this person is going to stress out for the rest of their life about their colon, when it's really nothing, right. So, since at the end of the day you usually have to use a CT scan to detect the cancer, I would say that for now, across the board, my choice is to stick with just whether there is cancer or not. Localization is still a problem that needs a lot of work before it is applicable in the general population. Okay, thank you. Okay, so if there are no other questions, then we thank you again. We have a coffee break up there, and then afterwards the poster session. Thank you.