Let's give them a sec to get a seat. This was the useless stuff, which means all this, pretty much everything, but not Zoom. May I start? Yeah, please. Okay, thanks. Well, first of all, thanks for inviting me. It's a nice coincidence: the first time I came to Trieste was in 2017, when I was teaching similar topics at the CODATA summer school. It was summer, so it was a little bit better, I would say, but it's always a pleasure to come and see people on the other side of the city, let's say. Well, actually, now I have moved to Trieste. What I'm going to tell you is a little bit about how we can use computational models to... maybe should we close the chat on this one? Oh, I can do it myself, sorry. Or maybe I can just go for this. Yeah, that would work, right? I'm going to tell you a little bit about computational modeling of cancer sequencing data. Where's the pointer? So this is something that, from a quantitative point of view, is all about... does it work? No, it doesn't. Okay, I can probably move my... yeah, exactly. I'm going to discuss how we can use computational modeling techniques to understand tumors, so I'm focusing on something that we usually call computational oncology. It's a new field that has emerged, and I think it's very interesting, at least in my opinion, because there is a lot of technological development around it. Therefore there is a need for people who can understand the data and the models and can write them on a computer; there is a need for people like you guys, like us, I would say, who can make sense of this complex data. I'm going to start with a bit of an introduction to what we do as a lab.
We are based in Trieste. Well, you probably know the place: it's nice and, despite being a small place, it has a lot of interesting things, including super strong winds compared to the average parts of Italy. We are pretty close now to this castle. And yeah, we are a small lab. We started a couple of years ago, but I think we're growing fast, so we have a big derivative, I would say. In general, we're always on the lookout for interesting people whenever we have grants and money, and things are going quite well these days, so you might have some interest in getting in touch with us. What we do for a living is essentially drawing these kinds of things on the blackboard, which are the concepts of tumor evolution that I'm going to introduce to you over this set of lectures, at least at a beginner level. Then we build some mathematical models and we do some inference from the data to see if our hypotheses make sense in terms of the signals we expect to see in the data. It's really about drawing these kinds of things on the blackboard pretty much every day. We have an interdisciplinary team, and I want to show you just some of the people we have... what's new in Chrome? Yeah, we updated it. So we have a number of people: some are physicists who come from hardcore theoretical physics, some of them former SISSA people, actually. And then we have a number of PhD students who come from different angles, right? We have people who know quantitative biology and genomics by background, but then specialized with a master's in data science. And then we have pure data scientists, computational people. So there are a number of people who provide different insights on the problem, because we really try to do interdisciplinary research: we don't just want to work on the computational problem per se.
We really want to try to make some impact in terms of real biology and also clinical practice in some way, because at the end of the day we're still studying a human disease, so we want to have a hand in that as well. Most of these people are also co-supervised with several other scientists in Trieste or around. What we do for a living is actually developing a number of tools that we always release into the public domain. So we have these packages, some of which are going to be discussed over the next lectures. Most of them are developed in R, or sometimes they have some Python behind them, but pretty much we do everything in R. So all the practicals of my lectures will be done using the R programming language. Please try to get it installed on your computer; if you have questions, you can send me an email or something, but just come with RStudio for the next lecture so that we can do some practical analysis. So we do the talking today and then we go for the practicals over the next lectures. And you are invited to go and have a look at all these packages. Sorry, that's the right link. They're all available on GitHub, they have their own websites, and they have articles that explain how you can plot data; there is example data, and you can fit the models. Pretty much everything is in the form of well-documented materials. So if you're curious and you wanna work more on these kinds of topics, you have a lot of material you can start from, okay? What I wanna try to do with you these days is instead give you, well, an introduction to all the fuss about computational oncology, and to what I think are the most interesting things from a quantitative point of view as well. Are there any questions at this point? Please feel free to interrupt me anytime. I know you have different backgrounds. I'm gonna start from the simplest things and then we'll build on top of that.
Essentially, we're gonna revolve around three main areas. The first one is explaining to you what cancer genomics is, which might be a little boring for people who have a background in neuroscience, but it might be important for people who have a background in physics to have an idea, for instance, of what it means to talk about sequencing. What does it mean to have read counts from sequencing experiments? What does it mean to have DNA mutations or aneuploidy? These kinds of concepts are the bread and butter of people who work in the field of computational oncology, but they are not necessarily immediately obvious to everybody else. And then, you probably heard a lot of buzzwords regarding, you know, mutations during COVID: there is the new variant, blah, blah, blah, there is evolution of this variant, and now, well, I don't know what variant we are at these days. But I think that by the end of this lecture you'll have a little bit more of an idea of what it means to have a mutation, the fixation of a mutation in a population and things like that, which is exactly what we do when we think of tumors as an evolutionary process. That is the other side of this set of lectures: discussing the concept of clonal expansions, and how we can use mutations, which come from a cancer genomics perspective, to understand the clonal evolutionary process that is behind the formation of the disease and that is responsible for how the disease responds to treatment, and so on and so forth. For that we're gonna have to call into play concepts like a null model of neutral evolution and some form of selection at least. So you'll have a general idea about evolutionary processes.
And at the end of the day, we're gonna build on top of these concepts to have our models make some inferences based on cancer sequencing data, because what we want to do in the end is make some reasonable analysis of this data and try to understand tumors as evolutionary processes, because that's what they are. And that's the only way of giving them a dynamical flavor: understanding them as a dynamical process that changes over time, because it's a disease characterized, of course, by this kind of dynamical behavior. Does it make sense? Cool. Just to give you a heads up on the kind of machine learning that people do in this field: most of it is some form of unsupervised learning that looks like clustering in some sense, and also like putting colors on top of points. As you see in the example on the top left, you have this bunch of points that are black, and you want to make them colored and then make some sense of these colors in terms of the evolutionary process. And then you might also have dendrograms or things like this, when we think about phylogenies, right? This might be a concept that most of you are at least a little bit familiar with. So, I don't know if it's gonna let you down, but there's not gonna be any deep learning or anything like that in what we do. Mostly because, in some way, I think the kind of problems we want to solve are slightly different. It's really more about trying to find some structure in our data in a way that lets us understand the process: making some assumptions about the structure and looking at the data from the point of view of what the latent structure is, which in this case is gonna be about the cancer, practically. But there are no things like, I don't know, training a model over 35 million images of this or that. So it's kind of a different field, right? In some way also because we don't necessarily have these huge training sets, no?
Luckily, because otherwise there would be millions of people with the disease. And there are millions of people, by the way, but anyway, it's a different kind of concept from, I don't know, image recognition or other stuff. Even though there is some deep learning, especially among people who do things like variational autoencoders in the context of single-cell sequencing, et cetera, which is probably something that might be discussed by Catalina Vallejo in these lectures. She does mostly single-cell stuff, as far as I understand, while I'm more the genomics guy in the context of cancer, and there is Gabrielle on the epigenomics. So I think it's a nice school, because you get to see different types of stories, right? So let's start putting the disease into context. So the application, we want to... that's a sign. We want to focus on why we do this thing, right? Which I think is always a good start. It's a predominant disease in the sense that it has, of course, a strong diffusion. It has some form of equity, in the sense that it hits from the richest to the poorest people, because it's a disease that can essentially be explained by the basic functioning of cells and the formation of organs and tissues. So it's something that is really prevalent across all individuals. Of course, it has an incidence that differs spatially across countries, because it also has some relation to lifestyles and therefore to people's habits. So it has this kind of environmental component. There is also a very strong genetic explanation of how this disease comes about. So it also has relation to... Is this adjusted for life expectancy? That's a very good question. I don't know, because I didn't make this picture myself. I took it from Google, "cancer incidence map worldwide", something like that. That's where I stopped. But I have something that might connect to your question in the next slides.
It has an incidence that depends... there is a bias based on sex. So some people tend to get tumors of a certain type, and others tumors of another type. And you will see that, for instance, even the rate of mortality is different across tumor types. So they're not necessarily all of the same severity, I would say, but there is a huge variety of tumor types. Pretty much every different type of cell in the human body can lead to cancer, so there are roughly 200 different types of cancer. Some are more common than others. Some are more related to environmental factors. Some are more biased to hit males or females. Of course, breast cancer is more characteristic of women than of men, even though men can also get breast cancer, while prostate cancer definitely comes only to men. So, of course, there is some specificity to the individual. And what is interesting is that we are doing better at understanding this disease and at treating this disease. There are huge improvements in the way we develop drugs and use them to treat cancer patients, even though, of course, it's still one of the most prevalent diseases of mankind these days. It's not the first cause of death in the world, but it's, I think, one of the top three. The first one should be heart conditions. So its position is third, but it's really important. And it's really fascinating, because it's difficult to understand and also because it relates to the basic functioning of cellular processes. So it's something we can think about and try to measure with sequencing data. As you see here, for instance, there is a very strong increase in death rates from lung cancer, which you would probably speculate correlates with the diffusion of smoking habits around the world.
So you see these kinds of signals that are most likely telling us how much the environment can influence the probability of the disease in some way. But things are going better, in the sense that there is a decaying trend for a number of tumor types. So it's a disease where we're doing better. It's still very important, so it's still very much under the spotlight, I would say. So let me try to bring you inside the world of the main ingredients we need to think about if we want to do this kind of machine learning and data science. What is our interpretation of DNA in the context of the disease? What role does it play? Why is that important? We'll have to go from the idea of DNA to mutations and eventually to measurements, which are the things we're going to use to develop our models. So, just a quick recap; this should be familiar to pretty much all of us. We have chromosomes that constitute our genetic material. We have 23 chromosomes, two copies each. All of them are packed into things like chromatin, which is probably the thing that... I don't know what Gabrielle explained to you, but she was probably talking about chromatin, right? Things like that. So that's the epigenetics part of the story. The DNA wraps around nucleosomes and histones, so this is just the way the DNA molecule folds itself. It's in the form of a helix, we know that. But what is important for us is that it has this sequence of nucleotides, which are the bases that determine the actual information stored inside the DNA molecule. It's important because of this molecule, not all of it but about 2%, is made up of genes, which are the pieces of information that encode proteins. So pretty much 98%, 99% of the genome is called non-coding, because it does not contain genes. The small bit contains the genes, and the genes are the most important things, because the genes are what actually produces the proteins.
And the proteins are the things that make the actual functioning of the individual, of the organism, right? All the interesting stuff is carried out by the proteins. The process of production of the protein from the DNA is something very complex that goes through the intermediate product of RNA. You can think of it as going from a computer program, which is the DNA, something where I've written all the information that I want to write, through compiling this computer program and running it, to the executable, which is the making of the protein, right? So one is like the recipe of the pasta dish you want to make, and the other is the actual pasta dish that you're going to eat tonight. Things can go bad in cooking, and things can go bad in making proteins as well. In fact, when we talk about the presence of mutations, and in this context we mostly mean somatic mutations, we refer to the fact that we might have changes inside this sequence of nucleotides. Most of these changes will do nothing, or will be completely uninfluential to the final outcome of the proteins, if you want to put it as simply as that, but some of these changes are going to be important, in the sense that they might impair the production of the protein, or they might lead proteins to have different shapes. You have probably heard of things like AlphaFold, this kind of deep learning prediction of protein structure, et cetera. What makes the function of the protein is the shape of the protein. So the way it ends up folded, as I always say, is what gives it a certain function, because it's a kind of mechanical process in which things attach one to the other. So if you fold it in a different way, you might have a different function for the protein.
So some of the mutations, not all of them, might be such that the protein doesn't fold in the same way, and therefore you might have a malfunction of the protein. Essentially, this is one of the reasons things start going bad in a certain way. You might say, well, let's just invent some way of removing mutations. That would be a very bad idea, because mutations are the things that create diversity, right? We are all similar to one another, but we're not exactly alike. The difference between me and you is probably just some millions of mutations out of the three billion molecules, sorry, nucleotides. They are not really that many, but they make a huge difference. So one of the fundamental ingredients to create evolution is to allow for variation, right? And so mutations need to be there. Don't come up with crazy ideas about removing them, because that would be very problematic. Consider that there is a fundamental difference relative to epigenetics, for instance, because here we're talking about making one modification inside the DNA molecule of one cell. And this has profound implications for all the behavior of the cell, because the basic step that happens during the growth of a lineage, which is a set of cells that come out of an original cell, an ancestor cell, is that when cells divide, they copy the genetic material. So you have a very simple process in which you start from one cell and you go to two cells. I'm going to put things in a very simple way, but you want to go from a certain genetic material in one cell to two daughter cells that have exactly the same genetic material. That's the mathematical way of seeing cell growth, right?
But the reality is very different from that, in the sense that this process can be complicated... maybe I should zoom in here on the blackboard, if that is possible. Is there any way? I don't know, because I think the camera takes the full room, but maybe we can focus it. You want to make this bigger? Yeah, to focus this on the blackboard, otherwise they don't see it, right? Probably. I'm not going to use it much, but if I do... I mean, really zooming. Ah, there you go. Yeah, that thing. Okay, like this is fine, thanks. Yeah, sure, I'll do it myself, no worries. And I'm going to go for this. Okay, does it make sense? Cool. So we have this process of cell division, which is very important, because every time it happens, if you have a particular mutation at a position of the genome, this mutation is going to be passed along to the next generation of cells, right? That's the difference, in a very simple way, between genetics and epigenetics: by genetic changes we refer to something that is usually heritable. Therefore, if this is your time dimension, it keeps remaining over time, while an epigenetic change might be something that is reversible, so after a little bit of time it might not be there anymore. But one of the things that also happens is that, on top of this, whenever you have this process that requires duplicating the genetic material, it's not a perfect process, in the sense that there might be new mutations coming out in these populations. And these will be present from one generation onward, right? So there is this continuous process of mutation generation during the creation of cellular populations. As you can see from the panel over there, this X is the mutation that is present in the orange cell, and then this one gets Y.
So the blue cell has both X and Y, right? And the green one gets X and Z. So you have this idea of a genotype as a combination of mutations that, as time progresses, becomes more and more complicated, right? One of the key things you should keep in mind is that, in the context of this particular disease, there is no way of getting cancer with just one somatic mutation. There is no way of getting such a complex disease from just one mutation. There are diseases in which a single mutation is enough to get the disease, and this is not one of them: in the context of this disease, things are a little bit more complicated. You always need to have multiple mutations, one on top of the other, which justifies the drawing I was making just before, in order to get to the final disease. No single gene mutation can cause a cancer. There are some preferential types of mutations for a cancer, and they often depend on the cell of origin of the disease. So tumors in the lung look a little bit different from tumors in the colon; that's the latent structure in the population we were discussing before, right? But in a very simplistic way, there is this kind of view, for instance in the context of colorectal cancer, which is one of the most famous ones, in which you have these passages that go from the normal epithelium, the normal cells of the tissue, to the actual carcinoma, which is the official cancer thing. They are all caused by the subsequent accumulation of mutations impairing the functioning of important genes like APC, KRAS and TP53. Probably some of you who already work on sequencing and stuff have heard these names: P53 is the most famous, it's called the guardian of the genome. So some of them are very important.
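The lineage picture described here (an orange cell carrying X, a blue cell carrying X and Y, a green cell carrying X and Z) can be sketched in a few lines of Python. This is only an illustrative toy under simple assumptions: a genotype is modeled as a set of mutation labels, mutations are never lost, and the `divide` helper is a hypothetical name introduced for this example, not something taken from the lab's packages.

```python
def divide(parent, new_mutation=None):
    """Toy cell division: both daughters inherit every parental mutation.

    Optionally, one daughter also gains a new mutation, mimicking an
    imperfect copy of the genetic material. Mutations are never lost.
    """
    daughter_a = frozenset(parent)
    if new_mutation is None:
        daughter_b = frozenset(parent)
    else:
        daughter_b = frozenset(parent | {new_mutation})
    return daughter_a, daughter_b


# Ancestor cell without mutations; one of its daughters acquires X.
ancestor = frozenset()
_, orange = divide(ancestor, "X")

# The orange cell divides: one daughter additionally acquires Y (blue).
orange_copy, blue = divide(orange, "Y")

# The other orange daughter divides again and one daughter acquires Z (green).
_, green = divide(orange_copy, "Z")

# Mutations shared by blue and green must come from their common ancestor.
shared = blue & green

print(sorted(orange), sorted(blue), sorted(green), sorted(shared))
```

Because every descendant keeps the mutations of its ancestors, the mutations shared by two cells act as a record of their common history, which is the "barcode" idea used later when tumors are treated as an evolutionary process.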
It is believed that mutations in these genes cause important changes to the phenotype of the cancer cells. For instance, mutations in this APC gene are responsible for the switch from the normal epithelium to the early adenoma, which is the kind of classical thing we see that comes a little bit out of the normal epithelium and grows like a small mushroom in some way, right? And the more mutations you get, the more this structure becomes complicated, with the final end point of giving you the metastasis, which is just the way tumors move from one organ to another, right? You get a metastasis in the liver from a primary tumor in the colon. So there is this progressive accumulation of somatic mutations, in which more and more advantageous genomes get selected by a Darwinian evolutionary process, and I'll try to explain this over the next set of slides. Does it make sense so far? Okay. So, one of the things, which is the technology... sorry, yeah. Just a question about what you said before, the analogy you made with computer code, moving from source code... so basically these mutations are like breakage, like corruption of the source code that could lead to breakage of the... It's a bug in a computer program. This is my very computational way of putting it. Yeah, and also, before, you mentioned epigenetics and how epigenetic changes are not inheritable, but that's not always the case. Sure. I mean, it might be that epigenetic changes are also inheritable, for long time windows also, right? Yes. But it's more like this: once that molecule is broken, all the progeny gets the error as well, right? It might happen that if you break it too much, the cell dies, because you have impaired the basic functions of the cell so much that it has to die. But if it survives, it's going to carry over this legacy of mutations. What I'm saying is not necessarily always true.
There might be a way in which you lose some mutations, but it's perfectly fine to assume that this does not happen. While the epigenetic methylation, for instance, of the promoter of a gene is something you can lose over time, right? Yeah, so you are not taking that into account, right? In your kind of analysis, you always only consider genetic mutations. Well, in the context of understanding tumors as an evolutionary process, you need to have a barcode that tells you the history of the process, and mutations are your barcode, basically. I'm asking because I was thinking about some works I've seen recently where people have seen that, for example, if you compare chemo-resistant versus non-chemo-resistant tumor cells, you see some distinct differences in the chromatin structure. If you look at the methylation, you see a difference. Yeah. So it appears that there are epigenetic factors that drive it as well. You know, of course, I'm not saying that there aren't. There is a lot of dissing in the community, I'm right, you're wrong, you know, and genetics versus epigenetics is one of those things. Both of them play an important role. The classical view of tumor evolution comes from the genetic side, but now it's adapting to accommodate epigenetic evolution too. It's just much more difficult to study, even mathematically, because of this fact that it is not stable over time. In my opinion, this sometimes makes the mathematics extremely complicated. But yes, there are good reasons to believe in both one and the other. They play different roles. It makes sense to start from the genetics perspective, because that's the simpler one. Hi. So, in one of the slides you showed one stage of cancer progressing to the next stage, and each step is associated with one specific mutation. Yeah. So when you say that there is no single mutation causing cancer...
Is it that each of these steps needs to have multiple mutations, or is it that... What I mean by that is, let's try to use this kind of cartoon representation: it means that these cells carry the mutation in APC, and these cells carry both the mutation in APC and the one in KRAS, right? This is something that has been believed to be true. I don't think it's actually true; we have a lot of reasons to believe that this is not necessarily true. And here it means that there is a preferential route to reach this final endpoint of the disease, which consists in acquiring multiple mutations as a cumulative kind of effect, a cumulative fitness effect. All of them will be inside the same cell at some point. Okay. Can there be single mutations which prime cells towards their path to carcinogenesis, to tumorigenesis? For example, here, just assuming that this is true: if there's an APC mutation which leads to an early adenoma, there's no going back from early adenoma to normal epithelium, right? One of the fundamental things, as I was saying, is that you don't really revert back to the wild type. So yes, that would be one of them. You would not go back to the early healthy tissue, I would say, in general, but to progress further you need to get an extra mutation. Think about it as a necessary, not sufficient, condition to progress to the next step. Sometimes people think about it in those terms. Okay, thanks. Okay. What I'm trying to give is a view of this process which is not perfect but is true enough for what we have to do, right? There are a lot of more complicated things and special cases in which you can actually lose mutations, blah, blah, blah. I'm just curious: does one of these mutations have a higher chance of driving other mutations?
When you have the one, does it, like, improve the possibility of... Difficult question, I'm not sure I can answer yes or no. Okay, your question is: does the probability of a mutation increase given another mutation? Yes, like, to arrive at this stable state of metastasis. I would say, for instance, no from a Darwinian point of view, because the probability of a mutation should be independent of the fitness advantage of that mutation. It's selection that selects things, right? Okay. But this was really a fantastic question. Too difficult to answer at the first go. Just a comment on what she asked: I think if there are mutations in, say, genes responsible for DNA repair... Yeah, but that's a different story. Yeah, I didn't go there. So there is, for instance, some biological mechanism that is responsible for the, let's say, normal probability of getting these mutations when you divide. If mutations impair one of those genes, you might have an increased rate of mutations over time. So, in some sense, yes, but I think she meant it in a broader way. As I said, the level of truth I'm positioning myself at is enough to be not too false and not too useless, right? Otherwise... I'm not a biologist. So can I move forward? Okay. So what do we actually do when we sequence? We extract DNA from our cells. The kind of technology I'm going to refer to is bulk sequencing. Did Gabrielle tell you anything about that? No? Essentially, the idea is that we take our cells; we extract a bunch of cells here, right? And each one of them has its genetic material inside, the DNA. We extract all of it, and we pool it together inside, you know, some tube or anything, and we put it inside a machine that spits out short fragments of this DNA. The DNA comes out in short fragments; the length of the fragments depends on the technology we use, and the number of fragments depends on the money we put into the experiment.
And the quality of these things also depends on the technology, because sometimes, for instance, the machine has some kind of error probability per nucleotide. Each one of these things is essentially, more or less, 200 nucleotides, for instance, of one DNA molecule. One of the things we should observe immediately is that we don't have any information to map these back to the cells. This is like taking a lot of different fruits and making a big smoothie. You can taste that there must be some banana flavor in it, right, but you cannot really go back. It's not invertible as a function: you get the read, and where does it come from? You have no idea. There are other technologies in which you can instead barcode each one of the cells, and so you can assign the reads, the segments, to the cell of origin. That's called single-cell stuff. Different types of technologies can be used for different types of things. It's not what we're going to discuss in these lectures; we're going to focus instead on this short-read story, on bulk sequencing. It's called bulk because you make this bulk of cells and you make these reads. And what does a bioinformatician actually do for a living? The classical bioinformatician spends his life taking all these things and putting them on top of what is called the reference genome, which is the thing on top here, aligning them. The bioinformatician aligns reads; that's the classical view of bioinformatics. As a lab, we have to do it to get to the interesting data for the evolutionary inferences we want to make, but that's 100% part of the process. I'm not going to discuss this, but it's also a complicated problem, because it requires finding the right position for each one of these short segments. And imagine, if the segments are short, they might map to different positions of the genome, so it's kind of complicated as well.
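The remark that short segments can map to several positions of the genome can be illustrated with a toy exact-match search. This is a minimal sketch only: real aligners rely on indexed data structures and tolerate mismatches, and both the tiny reference string and the `find_exact_matches` helper below are made up for the example.

```python
def find_exact_matches(reference, read):
    """Return every 0-based position where `read` occurs exactly in `reference`."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Keep searching one base further so overlapping hits are also found.
        start = reference.find(read, start + 1)
    return positions


# Hypothetical 17-base "reference genome" containing a repeated stretch.
reference = "ACGTTACGGACGTTAGC"

short_read = "ACGTTA"     # 6 bases: occurs twice, so its origin is ambiguous
longer_read = "ACGTTACG"  # 8 bases: the two extra bases resolve the ambiguity

short_hits = find_exact_matches(reference, short_read)
longer_hits = find_exact_matches(reference, longer_read)

print(short_hits, longer_hits)
```

The shorter read is compatible with two places in the reference while the longer one fits only one, which is the intuition behind why read length matters for alignment.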
Right. There's a question from the back. I can repeat it: the question is, do we sequence all the genome? That's a very good question; let me just make a drawing for a moment. Okay, this is the whole genome. Let's say this is the coding part of the genome; this will be like 2%, right? We said 2% of it. It's a small part. Sometimes we sequence just this, and we call it whole exome; that's the name of the technology we use for sequencing. Other times we do whole genome: we sequence all of it. Other times we do something even smaller: inside the exome we look at some targeted panel. What do you think is the difference? The cost. Right, it's always about money at the end of the day, because this 2% is really, really small. If you want to cover your whole genome with reads, you need many more of them than you need to cover just 2% of it. Okay, but there is a fundamental difference: according to what you want to do, you might need to work with the full genome or only a targeted part of it. So you need to decide what your question is, and then do the experimental design in a way that lets you answer it. Many people come to me and say, I've done this, can I answer this question? And most likely I say no, because you should have done that. One of the cool things is that we are also involved in experimental design: I want to make this statement about this process, so what should I measure? That's where the science starts, basically, though sometimes it does not go that way. But this is a really good question. What we're going to discuss in these lectures is whole-genome sequencing: we are going to have a full genome.
And the reason is that the statistical signal we're going to look for is distributed across the whole genome, and you will see it in the practicals. In which way can you decide to sequence just a part of the genome? That's a very good question. I'm not a wet lab scientist, but they use primers to cut the DNA in certain positions, so that's how you do it. Basically, once you extract the DNA you can't choose what to extract, you get all of it, but then you can enrich to select certain areas over others. Say it again? It is known where it is located, but it's spread across all the chromosomes: some genes are on chromosome one, some on chromosomes two and three. It's not in one position; it's spread all over the place. Once we have taken our reads and done our bioinformatics thing, so aligned all the reads, what we have is essentially a reference sequence: what we think is there in, let's say, an average human, a mean-field human in some way, as a sequence of nucleotides. We really do have that, and it changes periodically as we update our belief about this reference. Then we can look at all of our reads and check if there is any difference from the nucleotide we were expecting to find at a certain position. A difference between the nucleotide we found at a certain position, an A, and the nucleotide we expected there, a G, will be a mutation found inside one of these reads, which we know comes from one of these cells; which one, we don't know. The quantity we pay the money for, and the quantity we use in what we're going to do in our lectures, is the variant allele frequency. It's the ratio between the number of reads that come up with the variant, the number of reads that come up with an A here, and the overall number of reads we can pile up at a certain locus.
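The ratio just described can be written down in a few lines. This is only a minimal sketch: the function name `variant_allele_frequency` and the toy pileup are hypothetical, and a real pipeline would read the counts from aligned reads rather than from a hand-written list.

```python
def variant_allele_frequency(pileup_bases, reference_base):
    """Fraction of the reads piled up at one locus whose base differs from the reference."""
    coverage = len(pileup_bases)  # total number of reads covering the locus
    if coverage == 0:
        raise ValueError("no reads cover this locus")
    variant_reads = sum(1 for base in pileup_bases if base != reference_base)
    return variant_reads / coverage


# Hypothetical pileup: eight reads cover a locus where the reference has a G,
# and four of them carry an A instead.
pileup = ["G", "A", "G", "A", "A", "G", "G", "A"]
vaf = variant_allele_frequency(pileup, reference_base="G")
print(vaf)  # 0.5
```

The sketch treats any non-reference base as the variant; with real data, sequencing errors would also end up in that count.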
So here we would have one, two, three, four, five, six, seven, eight reads as total coverage; coverage is what we call it, it's the green thing here. And one, two, three, four reads with the mutation. So the variant allele frequency will be exactly 50%. And because DNA is a huge molecule with three billion bases, we're going to have potentially thousands of these values for a full genome. Does it make sense? There's another way of representing that: in this case, this blackish thing is the read with the somatic SNV. This is terminology you might want to get familiar with: it stands for single nucleotide variant. So all the mathematics we do, all the inference we do, is about the variant allele frequency spectrum that we can measure through a sequencing experiment. The secret we want to get out of our cancer is all about relating how this spectrum changes over time as the disease progresses. Does it make sense? Cool. This thing costs a lot: consider that generating a full genome with a coverage of, let's say, 100 reads per position on average costs some thousands of dollars, so it's not really cheap if you want to do, I don't know, hundreds of samples. Luckily you can sometimes get these data for free from a lot of public databases, which is always helpful, especially if you are starting and you want to get your hands dirty without spending the money. Right, so. In comparing these values, am I correct that you are going to have some time slices, and at each time you have different reads and find these values, because they are changing over time? How can we say that changes in this measure can be attributed to the cancer growing? That's a good point; I should have put a proper slide on this. Most likely we don't have longitudinal data; we don't have multiple observations over time. We have a single snapshot of the process.
Still, I'll show you that we can get an idea of what's going on even with a single snapshot of the process. Why do we have a single one? There are many reasons; some are practical: collecting data from patients means chopping tissue out of people. Sometimes we actually have a temporal dimension, so you are correct in saying that sometimes we are able to collect things over time, and then we also have a time label, t1, t2, et cetera. Sometimes instead we chop out a large portion of tissue and get our sequencing done from different, spatially separated regions of it, so we sequence position one, position two, and so on. Sometimes we also have this over time. Okay. But, sure. Okay, that's a very good question; I omitted to go into the detail of that. When I was saying that we extract DNA out of all our cells, we don't make any statement about whether they're tumor or normal cells. Most likely you will be collecting normal cells as well, so some of these reads will come from normal cells. In the representation here, contamination refers to the fact that some of these reads come out of normal cells. So the variant allele frequency is not adjusted, not normalized for tumor content, as we say, because it's just the overall fraction of reads observed in the sample. We cannot distinguish them, but we can normalize, which we're going to do in the practicals when we work with real data. Very good question. So variant allele frequency is just the notion of a variation at a specific genomic locus; that's what this notation means. Yes. And often we have a lot of samples. What you see here is just the classical way people in genomics make sense of this data.
They try to represent it like this: they put up these huge matrices where each one of these columns is one patient, or one sample, whatever it is. Each one of these rows is, in this case, a gene, and the gene name is reported here on the left. The different color of the alteration is the type of mutation you have seen. What I've discussed so far is one particular type of mutation, called a single nucleotide variant, and actually many more things can happen. I don't really care about that for now. What we need to take away is the fact that this kind of thing is like a binary zero-one matrix, if you want. Okay. So the rows of your matrix are genes in this case, and the columns are the samples. When I came in 2017, I taught how we analyze this. Now we do something else, sorry. But this is the kind of data that comes to you. Now we do something more fancy, more complicated, at the level of each one of these individuals. We go inside each one of these columns, and we want to make a story about this particular process in each one of them, not across individuals: it's not a story about multiple patients, it's about one single patient at high resolution. Okay. And then up here, I think in this case, yes, it's a grouping of people. It's not a picture that I made myself; I apologize, I should have put the reference for that. Wnt is a pathway; these will be functional groups of genes in some way. But there is one thing you should be looking at: there is a huge degree of mutual exclusivity across the mutation patterns in different individuals. This has some important biological implications when we look at data from different individuals. But let me move forward. We understand there are somatic mutations, we know we can sequence mutations, and therefore why didn't we solve cancer yet?
That would be the natural question, right? I gave you the secret: it's the mutations, blah blah blah. We have the technology to measure them, so let's do some stuff and solve it. The problem is that it just doesn't work in this very simple way. Evolution is a dynamical process, so we need to do more than look at somatic mutations, and I'm going to try to show it with a simple example. This is data from a very beautiful cohort released in the New England Journal of Medicine in 2017, where the matrix you see on top is a little bit like the matrix in the previous slide. Each one of these things on the columns... let me check. Okay, sorry, the things on the rows are somatic mutations. Yeah, I was confused. And the things on the columns are again patients, as in the previous representation. Here I'm taking all the mutations that I find in one particular patient; I put a one if there is a mutation, or zero otherwise, and then I do some standard hierarchical clustering to see if there is any structure in my data. What I want to look for is a grouping of the columns of this matrix, so a grouping of the things on top here. But clearly there is no strong signal; there is nothing that makes good clusters in this data. So what I might decide to do is increase the minimum frequency I want to look at and say, sure, let me look at mutations that happen in 5% of my cohort, for instance. I'm going to drop rows out of this matrix, keeping the same number of columns. And still, well, there is a little bit of separation here, but things don't go very well. So I'm going to go on and increase this cutoff more and more, and eventually I get to something where, okay, there is a little bit more grouping structure in my data.
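The filtering step described here, dropping rare mutations (rows) while keeping every patient (column), can be sketched on a toy binary matrix. Everything below is hypothetical: the mutation names, the four-patient cohort and the helper names are made up for illustration, and the Hamming distance merely stands in for whatever distance a hierarchical clustering routine would be given.

```python
def filter_by_recurrence(matrix, min_fraction):
    """Keep only the mutations (rows) found in at least min_fraction of the patients (columns)."""
    n_patients = len(next(iter(matrix.values())))
    return {
        mutation: row
        for mutation, row in matrix.items()
        if sum(row) / n_patients >= min_fraction
    }


def hamming_distance(matrix, i, j):
    """Number of mutations whose presence/absence differs between patients i and j."""
    return sum(1 for row in matrix.values() if row[i] != row[j])


# Toy cohort: rows are mutations, columns are four patients (1 = mutated, 0 = not).
matrix = {
    "mut_A": [1, 1, 1, 0],
    "mut_B": [1, 0, 0, 0],
    "mut_C": [0, 1, 1, 1],
    "mut_D": [0, 0, 0, 1],
}

filtered = filter_by_recurrence(matrix, min_fraction=0.5)
print(sorted(filtered))                  # ['mut_A', 'mut_C']
print(hamming_distance(matrix, 0, 1))    # 2
print(hamming_distance(filtered, 0, 1))  # 1
```

Raising `min_fraction` shrinks the matrix and changes the distances between patients, which is the trade-off discussed in the lecture: a cleaner grouping at the price of discarding most of the data.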
But to get here I had to throw away most of my data, first of all, and the signal is also a little bit disappointing: there is nothing particularly striking. And there is a huge amount of noise in this kind of thing; not noise from a technological point of view, but the fact that there is some intrinsic variability across individuals. There is a fundamental part of this process which has nothing to do with the actual function of the process, but which makes it impossible to look at two data points and compare them in this standard, classical metric space, because there is a huge number of ones that make no sense for the overall functioning of the process. So this is an important confounder, not a confounder in the classical statistical sense, but a fundamental problem in terms of what the actual signal is that we should be looking at. You should do something else instead: because it's a dynamical process, you should think in terms of how the process happens over time, how it evolves over time. That is what we do for a living: we put this kind of evolutionary interpretation on top of the sequencing data, which is what I'm going to teach you in the practicals of these lectures. If you do that on the very same cohort, this one here, the data will look like this. I challenge you to tell me that this is the same kind of structure as the thing we've seen before: now we have very well separated clusters, with some features here that are somehow different. There is a much clearer and cleaner signal in this interpretation of the data. This is done using a computational method we developed when I was in Edinburgh with Guido. It thinks not about the particular presence of mutations in one sample, but about how we can give a temporal explanation to those mutations.
And now we can compare data points not by the type of mutations they have, but by the way they are evolving over time. So we bring a kind of trajectory dimension into the process, to compare things by the way they evolved rather than by what they look like now. Yes, there is a question in the chat from Beatriz. She asks, do you also account for the possible...? Yes, I didn't make any distinction at this point, in the sense that we use any somatic mutations: synonymous mutations, non-synonymous mutations, we use everything for the process. Well, it's related to that last question: what is a silent mutation? In this case it refers to mutations that don't have any functional effect. Right. But the point I want to give you here is that each one of these edges and arrows we draw is a way of representing the process as in the cartoon I was showing before: APC, KRAS, et cetera. So it brings a temporal dimension into the explanation of each one of these things. The method in this case, for instance, believes that for this group of patients here, the one in red, the co-presence of mutations in these genes, TP53 and FAT1, also suggests a temporal accumulation: first acquiring one mutation and then acquiring the other. So it makes those mutations not just informative of presence, but also a sort of latent clock in the process of mutation accumulation. The main point I want to make here is that this brings a temporal dimension into the way we think about the process. Come again? So you have enlarged your starting data set, adding more data that you have computed. I haven't completely understood how. No, I didn't say how we compute that. Yeah. You change from one feature space to another feature space, if you want.
So instead of looking at the presence of mutations, you lift your problem: the same data matrix becomes a matrix of this form. Let me try to summarize, to see whether I understood well or not. You have the frequency of every mutation for a patient, okay, so that is a column in the previous chart, right? More or less, yeah. And based on what you have measured, you have predicted what should appear over time? I think you could also see it like that, yes, in the sense that you can predict what might be coming next in some way, because you bring in a temporal dimension. But it's more the fact that you reverse time, you go backwards in time, in the sense that you look at the genotype today and you say, this thing today came out of this thing in the past. So you try to make an educated guess of how you got to the point where you are now. But if you know how to do that, you also know how to predict what's coming next, because it's basically just a flip. If you have not completely understood how to do this temporal mapping: it is complicated, but the practicals we're going to see and the inferential models we're going to discuss now are about putting a temporal dimension on the process. So I hope you will make it. If you go back one slide: is it because you observe more mutations in TP53 than in FAT1, and in all cases where you have a mutation in FAT1 you have a mutation in TP53? So there is some sense in having a positive statistical dependency, like the one you mention now: you find co-occurrence, the marginal of one is higher than the marginal of the other, plus you have a high joint probability, which is the way in which you would, for instance, infer these kinds of models as Bayesian networks, if you want to put an edge of dependency between those two things.
This is not how it is done in this particular method, but it is what I taught in 2017 in the other room, so yeah, it's a good guess. Another thing we need to think about, and now we get to our modeling part, is that we should think of the process of mutation accumulation inside one patient as having this kind of latent tree of cell divisions, like the thing I was picturing here. Most of these cell divisions create diversity, but they are not functional, in the sense that all these cells have the same color; colors instead distinguish populations that are different from one another. So the main difference is that these first blue cells acquired some special mutation relative to their ancestor, and these green cells acquired something special relative to their ancestor. But the green ones are all alike among themselves: they are part of the same family of cells, something we should be calling the same clonal population. They have different genotypes, so different DNA molecules, because every time they divide they acquire different mutations, but from a functional point of view they are exactly the same. We could call the colors phenotypes: different colors have different phenotypes, and the phenotype might in fact be the ability to evade a drug, drug resistance as you were saying, or to metastasize. So our point is reducing the complexity of this process to the evolutionary trajectory, this sequence of steps, by looking at the variant allele frequency spectrum from read count data. Does that make sense? And by the way, we have been thinking about this model since 1976, thanks to a scientist called Nowell, and it is called the clonal evolution model. Nowell, I think, was called Peter, yeah.
So the dynamics of cancer evolution need to be discussed now; these are the general dynamics of somatic evolution. Every evolutionary process should have at least three ingredients: mutation, drift, and then selection. Neutrality is a combination of mutation and drift. Mutations are the things that I was discussing before, with the probability of acquiring these mutations being independent of how functional they are, going back to your question. Neutrality is essentially the answer to the question: what happens when nothing happens? You have this ever-increasing set of mutations accumulated inside each one of these cells. They make cells different from one another, but nothing in particular happens to these cells. Drift is essentially the probability of finding high-frequency mutations not because they're functional, but just because they are drifting at random in a population; it is the result of a random stochastic process whose events become important when you have low numbers. For instance, a mutation becomes over-represented in a population not because it's important, but just because all the other cells in the population die for some reason; that's a drift type of event. This is random, essentially, while selection depends on the reproductive capacity of the cells, on the fitness of the population: a population that has two important mutations should grow faster than a population that has a single important mutation. That's the idea of having these additive fitness models. One of the important things is that both of them reduce heterogeneity, and that selection creates bottlenecks. Right.
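To make the contrast between drift and selection concrete, here is a minimal sketch of a two-type Wright-Fisher model. This is a textbook population genetics model chosen purely for illustration, not necessarily the model used in these lectures; the function name, population size and fitness values are all assumptions.

```python
import random


def wright_fisher(pop_size, initial_mutants, fitness, generations, rng):
    """Mutant frequency over time in a fixed-size population.

    fitness is the mutant's relative reproductive capacity (1.0 means neutral,
    so any change in frequency is pure drift).
    """
    count = initial_mutants
    trajectory = [count / pop_size]
    for _ in range(generations):
        freq = count / pop_size
        weighted = freq * fitness
        # Probability that one cell of the next generation descends from a mutant.
        p = weighted / (weighted + (1.0 - freq))
        # Resample the whole population: this sampling step is the drift.
        count = sum(1 for _ in range(pop_size) if rng.random() < p)
        trajectory.append(count / pop_size)
    return trajectory


rng = random.Random(42)
neutral = wright_fisher(pop_size=50, initial_mutants=25, fitness=1.0, generations=100, rng=rng)
selected = wright_fisher(pop_size=50, initial_mutants=25, fitness=1.5, generations=100, rng=rng)
print("neutral final frequency: ", neutral[-1])
print("selected final frequency:", selected[-1])
```

In small populations the neutral trajectory wanders and can hit 0 or 1 by chance alone, while a fitness above 1 biases the walk upwards. Both 0 and 1 are absorbing states in this sketch, which is one way to see the remark that drift and selection both reduce heterogeneity.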
If, I don't know, I were to have all the kids of Trieste, then five years from now my genetic background would be enriched relative to the actual genetic background of people in Trieste now, because I underwent strong selection. Despite what some people might think, or what might be the intuitive explanation, selection reduces heterogeneity, because one lineage tends to dominate over the other lineages. In the slides before, you showed this latent clonal structure, the phylogenetic tree that you uncover. Once you have this, how explicit is the information you have about the various groups? Like, can you say... That's what we're going to try to infer: we're going to take our sequencing data, and our job is going to be putting colors on the tree, establishing which colors there are and how many. Yeah, and then can you say what makes this color different from the other color? Hopefully, okay, hopefully you will try to do that as well. Okay, thanks. So essentially, the difference between selection and neutrality is that selection shapes the evolutionary tree; the kind of thing you see on the left is a phylogenetic tree. The one you see here is completely balanced because it's neutral, while the one you see on the left is unbalanced because it has been shaped by selective forces: there are more offspring of this set of cells than of the other cells, which is basically what it means to undergo some form of positive selection. I also want to give you some idea that it's possible to observe the process in reality. This is a very beautiful video from a study carried out at Harvard, where people have set up this mega Petri dish, a huge Petri dish several feet long.
They put different concentrations of an antibiotic going from left to right towards the center, and vice versa. These numbers refer to the concentration units of the antibiotic. The antibiotic is a barrier for the population of bacteria to grow, because it kills them. So here there's no antibiotic, here is one unit, then 10 units, and so on. What you are going to see now is evolution in real time over several weeks of growth. There is a gradient that pushes things towards the middle, and you will see that this population starts growing and hits a barrier over there, pretty much where the first concentration of antibiotic is, and cells start dying. They die because they run into the antibiotic, but all of a sudden, while they keep dying, they get more mutations that make them resistant to this one unit of antibiotic, and they start growing until they hit the second barrier. They keep dying again, but they keep growing at the same time, and they get more and more mutations. They physically create clonal expansions that you can see by eye, driven by this continuous evolutionary process with a strong negative selection induced by the antibiotic. Eventually the fully resistant population manages to reach the center of the plate, where you have 1000 concentration units of the antibiotic. What they can do here is trace the process back to the bacterium of origin and infer the evolutionary history of the process from how things went over time. Does it make sense? This is exactly the same thing that happens inside a cancer. Divisions acquire more and more mutations, and mutations make the bacteria resistant to the antibiotic. Eventually, the super bacteria that come out are fully resistant to the antibiotic.
If you want, you can think of cancer in exactly the same way, and not just cancer: evolution on this planet should be thought of in these terms too, the evolution of species on this planet, and the evolution of species on other planets, if it exists. There is a characteristic time scale to the process. This process happens over two weeks; evolution on this planet happens on a time scale of millions or billions of years. Cancer happens on a time scale of, what, five years, ten years, at most the lifespan of a human. There's a question from the back. I'd just like to ask: what plays the role of the antibiotic in the body? This is a fantastic question. Yeah, I forgot to say that we are complex. Despite the way we behave sometimes, we are complex organisms, and we have cells that need to feed on a certain amount of nutrients and need to coordinate their patterns of growth. If there are ten sandwiches, we've got to share them so that everybody gets a slice. A tumor is just a kind of renegade cell that starts doing whatever it wants. So you always have negative pressure against this kind of non-self cell, caused for instance by your immune system. Imagine the negative barriers to tumor growth to be exactly the fact that we have acquired, over millions of years of evolution, a number of mechanisms that keep cellular growth controlled. You don't grow tissues as much as you want. Tumors overcome the classical mechanisms we acquired during evolution and become cells that can grow in a completely uncontrolled fashion: they can feed as much as they want, and they don't respond to the standard mechanisms of cell death. Cells are very kind to each other: one cell might tell another cell to go kill itself, and the cell actually does kill itself. A tumor cell doesn't do that. Right.
So they have the capacity of acquiring, you know, an unlimited life in some way. They can do all these things. So the negative barrier you can think of is the basic homeostasis of a tissue. There's a question in the chat: whether the antibiotic is the same type and only the concentration changes. It's the same type of antibiotic, as far as I remember from the video. In this particular case, which depends on the bacteria and the type of antibiotic (I think it's a Science paper from these people), you have this kind of gradient: the more mutations you have, the more resistant you become. That is not necessarily true in cancer, where sometimes even one single point mutation might be sufficient to stop responding to a drug. But this was just to have a simple example, and also, we cannot make these videos on people, which is fundamental. This is why a lot of studies on evolution are done on bacteria: you can control the system. So let's start discussing, just to give a little bit of the idea; the statistical models will be presented in the next lectures, where we're going to see the model and play with it. All our story is about the variant allele frequency, the measurements we get from sequencing. We have many such mutations, so our data will look like a histogram of variant allele frequencies, made by taking all the mutations present in a genome that we have sequenced and putting them together. And we're going to reason about the colors in the phylogenetic tree, for instance how many colors there are and what they are doing, by looking at the shape of this distribution. We can predict what we want to say about the disease process, for instance how many colors there are, based on what the site frequency spectrum looks like, this particular type of distribution.
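A toy version of such a histogram can be simulated to see how its shape reflects the populations in the sample. The sketch below rests on several assumptions that are not from the lecture: a diploid genome, a sample with no normal-cell contamination, clonal heterozygous mutations at an expected frequency of 0.5, one hypothetical subclone making up 30% of the cells (expected frequency 0.15), and binomial read sampling at a coverage of 100.

```python
import random


def simulate_vafs(n_mutations, expected_vaf, coverage, rng):
    """Binomial read sampling: each read carries the variant with probability expected_vaf."""
    vafs = []
    for _ in range(n_mutations):
        variant_reads = sum(1 for _ in range(coverage) if rng.random() < expected_vaf)
        vafs.append(variant_reads / coverage)
    return vafs


def histogram(values, n_bins):
    """Count values in [0, 1] into n_bins equal-width bins."""
    counts = [0] * n_bins
    for value in values:
        index = min(int(value * n_bins), n_bins - 1)  # a value of exactly 1.0 goes in the last bin
        counts[index] += 1
    return counts


rng = random.Random(0)
clonal = simulate_vafs(n_mutations=200, expected_vaf=0.5, coverage=100, rng=rng)
subclonal = simulate_vafs(n_mutations=100, expected_vaf=0.15, coverage=100, rng=rng)

n_bins = 20
counts = histogram(clonal + subclonal, n_bins)
for i, count in enumerate(counts):
    print(f"{i / n_bins:.2f}-{(i + 1) / n_bins:.2f} {'#' * count}")
```

Two separate peaks should appear in the printed histogram, one per simulated population. Counting and characterizing such peaks in real data is the inference problem the next lectures address.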
And to do that we're going to bring in some ideas from the mathematical modeling of evolutionary processes, which might be similar to what Thomas is going to teach you, because as far as I understand he does a lot of stem cell dynamics, so I would expect him to tell you a little bit about that. This essentially brings in the temporal dimension through linear birth-death stochastic processes, something that is very, very common in physics. We combine that with an observational model for the frequencies that come out of sequencing, which is more machine learning style. We put both of them together and we get our statistical inference method, which can look at sequencing data over a full genome, so at the variant allele frequencies, and tell us basically how many populations there are, and whether one of them is growing faster than the others, which is a little bit the question about the tree that came up before. The resolution at which we can make this type of statement really depends on the resolution of our data. If there is a population of three cells out of a billion, no way, we'll never see that. But if you have a billion cells and some hundreds of millions of them, say 20%, 30%, 40% of them, behave in a different way, in the sense that they are outgrowing the other cells, we will be able to see that. Does it make sense? So there is a limit to what we can measure, due to intrinsic technological limitations and the cost of sequencing. There are other things that go down to the microscopic level; for those we most likely need temporal data, because something that is small here is maybe going to be large here. But we're going to be able to do the kind of inference we want, at least for large populations. And to bring this in, we're going to have to use some ideas from population genetics to create a null model of evolution.
We can use some Dirichlet-based clustering models to have a nice way of accounting for probabilistic signals in our data. That's it. This is going to be the topic of the next lecture: how we do this from a single time point. Then, most likely in the second lecture, we're also going to discuss how we do this if we have multiple of these points; we can do everything together. What I just want to give you a heads-up about is the fact that the shape is what matters here. So maybe this thing has a different story than this thing, and this one is also different from that. So let's just think about how we can predict a certain shape, given the clonal structure of our cancer, if you want. Okay, questions? One question, maybe naive, about this thing with the shape. If I understood correctly, these histograms are measured for every position? Yeah, you have your full genome, and in the sample you have detected, I don't know, 10,000 somatic mutations. You get the variant allele frequencies of these 10,000 things, so it's a vector in 10,000 dimensions, and you make a histogram of it. Does that make sense? Okay, because I thought that you had, say, 10,000 positions, and for every one of those positions you... No, no. Okay, that's clear. Thank you. So essentially you take your reads, for every position you compute the variant allele frequency, and that is what you use. Okay. Question in the chat: at which level of the mutation is it possible to identify the type of cancer among others? I'm not sure I understand exactly what "at which level of the mutation" means. Which level of the process, probably? At which level of the process of evolution?
Well, usually when you collect the tumor you collect it from an organ, so you already know where it comes from in some way. It is true that sometimes there are situations where this might be a little bit more difficult, and in fact there are tumors of unknown site of origin, that's how they're called, but most likely you know which organ you did the resection from. So I think the answer is explicit in your sampling of the data. For instance, you chop off a piece of colon, and that's because you have a colon cancer. So I would say it's kind of explicit.
Next question: in the case of bacteria in an experiment, they die or they survive. What could be an evolutionary reason for cancer progression and drug resistance? Why tumor despite it can be transmitted to the organ?
That's a very good question, and in fact some people might say, well, the growth of a tumor is detrimental to the host, right? In the sense that the tumor grows but the host dies, so is it really an evolutionary process? That might actually be a criticism of defining this as an evolutionary process. But I think that at the time scale at which we look at it, which is the lifespan of an individual, it makes sense to think of it as an evolutionary process. That is the reason I made the example of evolution over a planet over billions of years and evolution within a single person over the lifespan of an individual, right? And this was evolution over two weeks, right? So there is a characteristic time scale to the evolutionary process.
Is VAF the same as minor allele frequency? No, that's something that relates to copy number events, which I'm not sure you're going to see because it would require too much work. But if you have particular questions about minor allele frequencies, or anything you want, you can just drop me a line.
I'm due to end in six minutes, so I think I want to wrap up. Do you want me to spoil what comes next or not? Why not? But super fast, like this. Next lecture we're going to discuss a little bit how we look at allelic frequencies and infer some latent structure of the population. We're going to start putting down some models. You're too smart, so I need to be faster, or you won't come to the next lecture, right? We're going to frame the problem as a density estimation problem, then we're going to have to go back a little bit to some mathematical models and discuss the process, and eventually we'll come to discuss how we make inferences with a model that we published a couple of years ago in Nature Genetics, which I think puts all these kinds of ideas together. Then we're going to go practical: I'm going to give you some details about the statistical model, but we're going to go practical on the analysis of real patient data. This is real patient data that has already been analyzed, so in the practicals we're essentially going to reanalyze it. That way you can go home and, if you ever end up having some of this data yourself, you will have an idea of where to start your analysis, or, if you read papers and they give you mutation data, you could download it and analyze it yourself. Okay, questions?
So just like one at the end of the day so Ari and I want to spoil that. If you are in general interested in these kinds of things, we are organizing a conference in April in Istanbul, which is a satellite event of RECOMB. It is a computational cancer biology conference and it happens every year. This year it is going to be organized by me and Gabrielle Schweikert so they had a lecture for the PG90 expert of these classes so if you develop statistical models that can be applied to the context of cancer, it might be a nice opportunity to come and send your papers.
We should also have proceedings in a journal, but that will be announced over the next few weeks, I would say. Okay. Thank you.
Thank you. So now we have a coffee break, I guess. Antonio, is there coffee? Okay, so there is coffee upstairs, and then we reconvene here at 4 for the spotlight talks. Thanks.