Yep. There we go. Now we are recording. We are in the working folder where I have released the data for you, the top-level folder of the Dropbox link that I shared with you, and we are working in this R session. Once you install devtools, you have to run the first two lines to install the two packages that we are going to use right now for the analysis of these samples. There is some old code here which is not used right now.

Let's start by analyzing one dataset from the PCAWG folder. Inside it there is this folder of RDS files, essentially a folder that contains a number of distinct files; notice that the sizes of these files differ. This is data released from the public ICGC portal, one of the portals I discussed in the first lectures when I was mentioning data availability, so you can find the data over there. I just preprocessed it a little bit, in the sense that I took the data, split it by sample, and created one RDS per patient where I pooled the information regarding somatic mutations, copy number events and metadata for the sample we are looking at, which is something we're going to inspect right now. Essentially I saved you from parsing some large text files: I wrapped everything inside an RDS and gave you the RDS, so the data is already available inside it.

So what we can actually do is run the loading of the RDS file, which comes up as, you know, readRDS and the path to the RDS file. I decided to work with this sample for no particular reason; it's just the first one that I picked from the list, but there are many, so whatever we do now can be done on several such samples. Once we have loaded the object, we can have a look at what is actually inside it. The object contains several fields. One is called copy number alterations, and I'll give a very brief explanation of that; then there is mutations, which is the mutation sequencing data I discussed over these two lectures, the thing that comes out of the sequencing experiment. Then there is metadata for the sample, and the list of drivers. We will look into these right now.

Let's start with the metadata, which is the simplest information. I'm a huge fan of tibbles, so I work within a dplyr type of environment; therefore I'm not using data frames, I'm using tibbles. Is everybody familiar with tibbles? A tibble is just a way of working with something which is not a classical data frame but comes with a nice way of formatting the data. It shows you that this is essentially a structure of one row and eight columns, where these are the column names: sample, purity, ploidy, purity_something, wgd_something, tumour type, et cetera. Essentially this is just a nice way of avoiding a lot of the drawbacks of data frames, with row names, implicit factor conversion and related things. I really suggest that in general you use tibbles to represent your data; they are a core part of a set of packages called the tidyverse that make this kind of analysis very simple. For those of you familiar with Python, this is similar to the pandas style of managing data frames; some of you might instead know data.table from the R world for handling large data sets, which is essentially another way of not using raw data frames, which is always a good thing: at the very least it hides them inside something more consistent, and there are plenty of explanations of this on the web.
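In code, the loading step might look like the following; this is a minimal sketch, where the file name and the element names of the list are assumptions based on the description above.

```r
# Minimal sketch of the loading step; the path/file name is hypothetical and
# the element names are assumed from the description in the lecture.
x <- readRDS("PCAWG/sample_01.rds")

names(x)    # expected: copy number alterations, mutations, metadata, drivers
x$metadata  # a one-row tibble: sample, purity, ploidy, WGD status, tumour type, ...
```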
Anyway, at the end of the day you have this information: this is the sample ID, which is of course the same as the sample name of the RDS file, and this is the purity and ploidy of the sample. What we care about is the purity parameter, which tells us, and some of you asked questions about this, how much of the DNA we sequenced is coming from tumour cells relative to normal cells. So 64% of the sequenced DNA comes from the tumour. This has been estimated by the people who performed copy number calling, which we didn't discuss, but we need it to get through the analysis.

The other kind of thing we want to have a look at is the mutation data. As you can see, the mutation data contains 17,000 somatic mutations for this sample: that's the set of mutations called over the full genome of this individual. So there are 17,000 positions in the genome of this patient that show a difference relative to the reference, and this is the table that reports exactly that piece of information. The table reports the chromosome, which is a code for the chromosome name, chromosome 1 in this case; the position of the mutation, from and to, telling you exactly where the mutation is located across the genome; the reference nucleotide, so the nucleotide we were expecting to read, and the alternative nucleotide. So this would be called an A-to-G kind of mutation, right? The A was expected and the G was found. This is a substitution of a single nucleotide: the A nucleotide has been substituted by the G nucleotide.

I left this data rich in information, because it contains 42 more variables on top of the first ones; there are 54 total columns in this data. The ones that we care about are essentially the following. Are you familiar with the dplyr style of piping with the pipe operator? No? Well, sorry, I was assuming this was pretty standard these days in R. You should install dplyr as a package, otherwise this wouldn't work. This is a way of saying: take the mutations field inside x and pass it to a function which is called select, okay? In a pandas data frame, this would just be subsetting some of the columns of your data structure. Sorry, I forgot, I need to run my require(dplyr) call first, because if I run this instruction without having loaded the dplyr package, the pipe symbol is not recognized. Once I run it, the dplyr package comes up, it imports the pipe operator, and now I can actually rerun the instruction. And this is exactly just a view over our data that shows us the information we are going to use for this type of analysis. No?
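A sketch of that column subsetting with the pipe; the column names (chr, from, to, ref, alt, DP, NV, VAF) are assumed from the fields discussed here.

```r
# Load dplyr first, otherwise the pipe %>% is not recognized.
library(dplyr)

# Keep only the columns used in the analysis; names assumed from the lecture.
x$mutations %>%
  select(chr, from, to, ref, alt, DP, NV, VAF)
```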
This is the read counts data that I explained in the previous lectures. It says that at a particular position, 2 million something, where I was expecting to find a nucleotide A, I actually found the G nucleotide; I had eight reads covering this position of the genome, and out of those eight reads, four were showing the variant allele. And the variant allele, or alternative allele, whichever way we call it, was the G. That means that four reads had a G, and we can conclude that the other four had an A, because the total was eight. The variant allele frequency is just obtained by dividing NV by DP, and that's why it comes out to be 50%. Notice that the coverage is not the same across the human genome, because the position immediately afterwards has a coverage which is much higher, 41: many more reads pile up at that particular position, and that position is about 200,000 bases apart from the previous one. No? Despite the larger number of reads mapping on top of this location, I was still expecting to find a nucleotide C, while actually I found some reads harbouring the nucleotide T. And in this case the variant allele frequency is half the one I was seeing before, because only 10 out of 41 reads had the alternative nucleotide. No? So the variant allele frequency is different. This should already make you start thinking about the histogram I was discussing in the previous lectures, where we had different values of the variant allele frequency.

So, if we want to inspect these things a little bit: are you familiar with ggplot? Yes? Who is familiar? Who never heard of ggplot? Never? Okay, it's just the plotting system we use. I wouldn't worry if I were you; you can't learn it right now, there is plenty of training material out there, but just have a look at what I'm doing. What I'm going to do is plot some information from this data. The simplest thing I'm doing is a histogram, so I'm using geom_histogram. First of all, let me actually source the required package. I'm not going to spend too much time on how it works, but essentially what I'm doing here is a histogram of the column called DP: I take this column DP, the depth of sequencing, and I make a histogram of it, and of course not with binwidth equal to one but maybe, I don't know, with 50 bins, something like that. So if I actually run this, I get to see what my data looks like: of these 17,000 mutations, what I can see immediately is that the coverage is on average around 45 or so.

Come again? Yeah, well, this is something we can decide ourselves right now for the analysis. We could decide to filter out some mutations because they have low coverage; yes, that is a possibility. But the first thing we have to do is look at the coverage, and then we can start to choose a cutoff. On data like this you could decide, well, I don't know, let me show you this. Say we want to colour, or remove, the ones that have coverage below 10. You see what that means, right? I'm asking to colour the ones that have at least 10 reads covering them, and most of them do, so 10 wouldn't be a very stringent cutoff; maybe you could raise it up to 15, something like that. Essentially, we're talking about chopping off the left tail of this distribution. Of course, if you push it too far, then it's going to be problematic, because you throw away most of your data. This is really something you have to think about yourself when you analyze data; there is no best practice for this, that's what I mean. It literally depends on the quality of the data. You want to balance between throwing away low-quality data and not throwing away too much data, because otherwise you have no signal anymore.
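A sketch of the coverage histogram and of the colouring by a minimum-coverage cutoff, assuming the DP column from above.

```r
library(ggplot2)

# Histogram of sequencing depth, with ~50 bins rather than binwidth = 1.
ggplot(x$mutations, aes(x = DP)) +
  geom_histogram(bins = 50)

# Colour mutations by a minimum-coverage cutoff (10x here; a choice to tune).
ggplot(x$mutations, aes(x = DP, fill = DP >= 10)) +
  geom_histogram(bins = 50)
```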
Now, one thing we can do that is a little more interesting is to look at our elephant in the room, which is the variant allele frequency. We use the same kind of approach, but now with 0.01 as the width of the bins, because this is a value ranging from 0 to 1; essentially I'm taking all the values with 100 bins, if you like. So what do we see here? What do you observe, and what were you expecting to see? We were expecting to see a bump of clonal mutations, we said, somewhere at high frequency, and we were expecting to see the tail, right? What do we actually see in this case? Do we see the bump of clonal mutations? Yes, that's what we see, right? Do we see many subclonal mutations? Do we see the tail, for instance, which is a particular type of subclonal signal? We don't see that. Why? Apparently we're missing the tail, this part of the low-frequency spectrum. The problem here is that the resolution of this data is too low to see that type of signal, because the average coverage is only 40-something. What is the median coverage? You can actually ask for it: the median coverage is 40, right? So it's not really very high resolution, if you were thinking about seeing things that are very subclonal, at very low frequency.

The expectation with a 40x tumour that is 100% pure is to find about 20 reads carrying each clonal heterozygous variant, so a peak at a frequency of 50%, and everything subclonal goes to frequencies lower than 50%. But if we look at this sample, what we immediately see is that our expected peak is not at 50%: it is skewed toward the left. And that is because in the denominator of the formula of the variant allele frequency there is the total coverage, and the total coverage is a combination of the reads coming from the tumour plus those from the normal cells. What I was doing in the mathematical explanation the other day was assuming the ideal scenario where I sequence only the tumour. But because of this spillover of normal cells into the sequencing machine, this normal DNA, some of the reads I get are from normal cells, not from the tumour, so they don't carry the variant, right? Because the mutation is only in the tumour cells, no? Therefore I get a shift of the actual allelic frequency towards the left, and this shift is proportional to this quantity, which is the purity. And in fact things seem to be reasonable, because I have a purity of 64%, which means I would expect my peak in the frequency distribution to be at roughly 32%, half the purity. Just as 100% purity would peak at 50%, 64% peaks at 32%, right? And this peak is not really very far off from that, no? I can put a line over there if I want, just like that. You see? It goes pretty much where it is supposed to go, right? So it means the purity was estimated correctly. But this makes it impossible for us to see the signal of the tail, because it sits at too low a frequency. Our only hope is to cluster these clonal mutations, which is what we are going to do right now.
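A sketch of the VAF histogram with the expected clonal peak marked; for diploid heterozygous mutations the peak is expected at purity / 2, so 0.64 / 2 = 0.32 here.

```r
library(ggplot2)

# Expected clonal peak for diploid heterozygous mutations: purity / 2.
expected_peak <- x$metadata$purity / 2  # 0.64 / 2 = 0.32 for this sample

ggplot(x$mutations, aes(x = VAF)) +
  geom_histogram(binwidth = 0.01) +  # ~100 bins over [0, 1]
  geom_vline(xintercept = expected_peak, linetype = "dashed", colour = "red")
```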
So what we have to do is bring in the copy number data. Copy number data is something that looks like this: it is a representation of regions of the tumour genome, wide areas, with the number of copies of the maternal and paternal alleles, or rather the major and minor copy numbers, since we cannot tell maternal from paternal. We need that piece of information to map the somatic mutations, essentially. Without explaining this in too much detail, what I'm using is a package that we developed exactly to preprocess data like this. So we just run this command, and as you can see the tool says: well, I found this information in your data, including some values with NAs that are being removed. We started with 17,000 somatic mutations; we could map 16,000 single nucleotide variants, the ones that are just a change of a single nucleotide, and 800 mutations were a little more complicated, involving more than one nucleotide, which are called insertions/deletions (indels). All the mathematical theory I explained to you is completely independent of this distinction; these are just different categories of somatic mutations that we can find in the genome. And then there are a certain number of copy number segments.

The good thing about all this is that this object now comes with S3 methods and can be used to look at the data. You see it has pulled in all the information, like the purity and so on; everything is now inside this object. You don't have to work anymore with your, you know, table files: the object keeps everything together in a consistent way and allows you to do nice things like, for instance, this kind of plot. It has fancier visualization than the one I was using before; built on top of that, it shows, for example, the variant allele frequency distribution split by SNVs versus indels. The package contains a lot of interesting things for data visualization, but that's not something we're covering right now. The important point is that it maps all the data onto the human genome.

What we have to do when we want to make these analyses, at least to a first approximation, is to take only the somatic mutations that map onto certain areas of the tumour genome, and the simplest possible thing is to start by taking the mutations that map onto diploid areas, because you know that the tumour genome can be very complicated. In fact, I think we can have a look at what a full tumour genome looks like, again using this package, which is specialized in copy number processing. This is what a tumour genome looks like. Each one of these blocks is a chromosome: one, two, three, four, and so on. And this is the result of segmenting the tumour genome and working out how many copies of each chromosome we have. Because one thing you know is that we are diploid organisms, our cells are 2N. But sometimes in a tumour very nasty things can happen, which means we can gain multiple copies, or lose some copies, of the actual original DNA.
Let me fix the screen sharing; okay, so I'm going to go here. In this visualization, the diploid case, the standard situation where we have one copy of the major allele and one copy of the minor allele, is marked by this greenish background, which the plotting function puts there exactly for this reason. So all of chromosome 2 is diploid, heterozygous diploid it's called. Chromosome 3 is diploid; chromosome 1 is diploid only on its left part, so only the p arm of the chromosome is diploid. Chromosome 7, for instance, is amplified, because it contains two copies of the major allele and one copy of the minor allele, so it's a trisomy: there are three copies of that chromosome. Here there is a loss, as much as here and here, because we have one copy of one chromosome and zero copies of the other, so we have a loss of heterozygosity, as it is called. It has a lot of implications for tumour growth and so on, but what we want to do here, at the end of the day, is to highlight the green areas and pull out the mutations from those green areas. And we get this operation for free by asking the object to give us only the mutations that map on top of the segments called 1-1, because they have one copy of the red segment and one copy of the blue segment; this one is 2-1, this one is 1-0, okay? This is our notation for representing them.

So I can do this, and it subsets the tumour genome to retain only the chromosomes where I have the signal I want to look at, and then I pull out the mutations by using this other function. This runs a few computations, and at the end of the day what I'm getting is 11,000 mutations. We started from 17,000, we end up with 11,000, okay? We dropped some because they mapped onto those weird genome areas that we don't want to look at right now, and we are left with the data we can use to start doing our analysis. 11,000 is still quite a large number.

Just for the sake of seeing how things have changed, because we dropped some mutations: we still have the same type of information, we still have the VAF column, so we can use the same plotting code. It looks a little cleaner than before, right? So you know what I can do? Let me do this: f1 and f0. For assembling plots in R we can use patchwork, which is fantastic for this kind of thing. The plot on the left is the one where we have only retained the diploid mutations; the one on the right is the original data. The main difference is this somewhat long tail on the right. We can deduce, because of what we removed, that the long tail was caused by mutations mapping on the non-diploid copy number areas, right? Complicated stuff. As you can see, this was affecting the shape of the allele frequency distribution, and as we said, our whole story is about interpreting the shape of that distribution. So, technically speaking, this is a statistical confounder: the shape of the allele frequency distribution would be caused, and therefore explained, by the latent copy number state rather than by the actual evolutionary process of tumour growth.
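The lecture does this with the dedicated package; purely to illustrate the operation, a hand-rolled sketch of the 1-1 subsetting and of the patchwork comparison could look like this, assuming segment columns chr, from, to, Major, minor.

```r
library(dplyr)
library(ggplot2)
library(patchwork)

# Keep segments with one major and one minor copy (the "1-1" karyotype) and
# retain mutations whose coordinates fall inside them. Column names assumed.
diploid_segments <- x$cna %>% filter(Major == 1, minor == 1)

diploid_mutations <- x$mutations %>%
  inner_join(diploid_segments, by = "chr", suffix = c("", ".seg")) %>%
  filter(from >= from.seg, to <= to.seg)

# Side-by-side VAF histograms: diploid-only (f1) versus all mutations (f0).
f1 <- ggplot(diploid_mutations, aes(VAF)) + geom_histogram(binwidth = 0.01)
f0 <- ggplot(x$mutations, aes(VAF)) + geom_histogram(binwidth = 0.01)
f1 + f0  # patchwork assembles the two plots
```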
So we had to remove that, or else include it explicitly in the analysis, and so on. My point is that we had to get to this point to have the assumptions of our analysis in place: we look at somatic mutations for which the only information that shapes the variant allele frequency distribution, this VAF space, is the actual way the tumour is growing.

So once we have this diploid object, what we can do, and we're going to do it right now, is the inference. I'm not going to analyze all the mutations; 11,000 takes a little more time, I don't know, four or five minutes, and I don't want to wait four or five minutes now, I want to make it fast. So I'm going to sample just 2,000 somatic mutations. One thing you can do, and we're going to do it afterwards, is check how much the distribution changes when you subsample the data. Then I'm going to use the mobster_fit function from the mobster package, the one you installed with the line at the top, to make the inference.

Let me bring this function up on the big screen. Where are you? There you go, here. Let me go first to the help: I put a lot of effort into making nice documentation. It has a number of parameters. Oh, now they're in colour; I had no idea, they must have changed something. You can specify how many Beta components you want to use in your model, from 1 to 3. It's a local optimization algorithm, therefore you want restarts: how many initial conditions do you want to sample, like the restarts of an expectation maximization. There are different algorithms for initialization; "peaks" is one based on a preliminary k-means clustering of the data, and then you start initializing your fit from there, which makes sense. Then you can say, for instance, whether you want to look for a tail or not; we already know that there should be no tail in this data, but in principle, in general, you don't know. Then there are parameters about, I don't know, how many inference steps we want to do at most, and when we want to stop: when the mixing proportions don't change by more than a factor of 10^-10, something like this, which means we reached a reasonable convergence. Then there is model selection, and then we have the possibility of running on parallel cores; if you have a machine with multiple cores you can definitely run with 30, 40, 50, as many cores as you want, which makes things faster, of course, but for this simple data it's going to be easy anyway. Plus other parameters.

One of the cool things is that we implemented this auto_setup = "FAST" configuration, which essentially parameterizes the inference in a way that makes it fast. It's really not the most exhaustive type of run, but it's very good for getting a quick idea of what is going on with this data; once you have done that, you can go for a more thorough analysis, but this is the good one to start with.
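Sketching the call itself; the subsample and the fast auto-setup are as just described, while the seed value and the input name are assumptions carried over from the earlier sketch.

```r
library(mobster)
library(dplyr)

set.seed(1234)  # hypothetical seed, for a reproducible subsample

# Subsample ~2,000 of the ~11,000 diploid mutations to keep the fit fast.
input <- diploid_mutations %>% slice_sample(n = 2000)

# Quick fit with the fast auto-setup described above.
fit <- mobster_fit(input, auto_setup = "FAST")
```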
So let me run it; this is the highlight of the lecture. Let's look at what's happening: there are all these nice outputs. It says that it's going to fit these 2,000 mutations; it's testing these numbers of Beta distributions, and it's testing the model with and without the tail. Then it's doing a little bit of post-processing of the clustering, applying some filters. And it's saying that it's not going parallel, it's running sequentially on the CPU. It's testing essentially two numbers of Beta components, sorry, two values of K, each of them both with and without the tail, so there are four configurations, and for each one of them it tests two initial conditions. So at the end of the day it's testing eight different fits, and it uses as a score function for the fits this reICL, the reduced-entropy integrated classification likelihood that I discussed the other day: it puts in the entropy of the latent variables, among other things, to give a score to the model. And, well, it says 19 seconds were enough for this computation, so it's a really fast thing.

At the end of the day, we get back the list of all the fits computed by the algorithm. Let me check this: we have a table of the fits, we have the results from each single run, we have what is considered the best model, and a table for model selection, the one used to decide which model is best. The simple thing to do is to have a look at the best model.

Do you know what S3 methods are in R? An S3 method is a way of building a kind of object such that whenever I perform some action on the object, some method is called in a completely implicit way. So when I type x$best at the console, it's actually the same as calling this kind of method over the object, a print function that makes nice outputs and so on. We tried to make them nice, so that you get information straight from looking at the screen while you code, instead of making plots in a PDF. I can already see things here: this is telling me that it fit the model, that one Beta component was the best fit, which of course we expect because the data looks exactly like that, and that it had no need for the tail, because we didn't see a tail, right? The algorithm is not intelligent, but it computes some statistics and says: I don't need that type of mixture component to explain the data, because that signal is not in the data. So it's doing what we do as intelligent human beings, but in a statistical fashion. And this tells us that, from a clustering point of view, 100% of the mutations get assigned to the unique component of the fit. So this is really the most boring fit ever: one component, one cluster, and that's it.

But there is some more information we can have a look at. Sorry, there is a deprecated feature I need to work around. The good thing of having S3 methods is that the plot method is also an S3 method that behaves differently according to the type of object you pass, right? You know, when you do a regression in R and call plot on the regression, every plot is different. How does that happen? It happens because somebody implemented a particular S3 method for that class to work in a different way. It can't be magic, right? Somebody needs to have done the dirty job.
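A toy illustration of the S3 mechanism just described, unrelated to mobster itself: print() dispatches on the class attribute of the object.

```r
# A toy S3 class: an object with a class attribute...
my_fit <- structure(list(K = 1, pi = 1), class = "toyfit")

# ...and a print method registered for that class.
print.toyfit <- function(x, ...) {
  cat("A toy fit with", x$K, "cluster(s)\n")
  invisible(x)
}

my_fit  # auto-printing at the console implicitly calls print.toyfit()
```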
So what this is actually doing here is plotting the fit, no? The fit reports a density function, a mixture density, which in this case is a unimodal density, because there is only one mixture component. So it's a trivial density. The formula is f(x) = Σ_{k=1}^{K} π_k f_k(x), and in this case K = 1, so f(x) = π_1 f_1(x), which is just f_1(x), because π_1 = 1: everything is in the same mixture component. And in fact this is what I see here in the fit. The colours represent the clustering assignments, so where my points get assigned; the red curve is the density function of component f_1, and the black dashed line is the density function of the overall mixture. In this case the two collapse, because they are the same density: that component is all the density, essentially. So we have this kind of inference; it is trivial, and notice that it makes sense that it gets done in a single step, because there are no latent variables in a model with a single cluster: whatever you do is maximize the likelihood. So this is the MLE estimator for the single Beta component that you have. No? Does it make sense? Correct? Cool.

So there are a number of other data sets, and I invite you to play with this very same piece of code and see if you find something else; if you do, drop me a line and we can discuss it. What I want to show you now, because the full run takes a little bit of time, is the vignette of the other analysis, which is more juicy, right? There is more stuff. I want to send you home with something nice, not with this boring sample that we looked at just now. So let me... one sec, I want to share this. Where are you? Sorry, I can't find the screen-sharing button. Okay, here, I can share this.

So let's have a look. This is done with a nice R Markdown, so we can definitely focus on the markdown without running anything, okay? The code is the same code we have used now. We're going to use these very same packages to preprocess the input data and map the mutations; we want to use the package for the subclonal deconvolution, plus another package which at the end of the day will position all these clones on a temporal model, to give us the clone tree inference for our data. These are the instructions for installing the packages we depend on, but we don't have to do that right now.

In this case the data is available as independent CSV files: the mutations in one file, the copy number data in another, the metadata for the sample in a third, which we read with the standard functions for comma-separated values files. Then I have the sample ID, the tumour purity estimate and the reference genome. And if we look at the SNV data, the single nucleotide variant data, in this case insertions and deletions have already been filtered out; I left you only single nucleotide variants. The format is pretty much the same as what we have seen before: we have again the chromosome, the position of the mutation from and to, the reference allele and the alternative allele, plus the read count information, where DP is again the depth of sequencing and NV the number of reads carrying the variant.
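A sketch of the vignette's loading step; the file names are hypothetical, and readr's read_csv is one standard choice for comma-separated files.

```r
library(readr)

# Hypothetical file names; one CSV per data type, as in the vignette.
snvs     <- read_csv("snvs.csv")      # single nucleotide variants
cna      <- read_csv("cna.csv")       # copy number segments
metadata <- read_csv("metadata.csv")  # sample id, purity, reference genome

purity <- metadata$purity
```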
What do we see as a difference to the previous sample? What can we observe immediately, just looking at the numbers? Anybody? Much closer to 50%? True, but that's not the thing I want you to notice; it's something else. Before, most of our reads were around 40 in terms of coverage; the first ones here are 192, 63, 133. So I don't know for sure whether it's true, well, actually I do know, but as an educated guess I would say that in this kind of sample we might have much higher coverage. What does that mean for us? Probably we have somewhat higher quality: higher resolution at the low end of the frequency spectrum. So if something is small, we have more chance of seeing it. It's like a telescope: the more sequencing coverage you put in, in some way, the more you can see small things. And the more you want to see, the more you pay; that's the drawback. But here the coverage looks higher: a lot of these numbers are above 100, which is almost three times more than the other case. The allelic frequency is just the result of the same computation as before. The fact that it is closer to 50% is not, by itself, the point here: from a theoretical point of view, if you were to sequence all the cells you would get the true value, and with lower coverage there is more noise around the allelic frequency value. The more important thing is really that you can go to lower frequencies because of the higher coverage.

Then I'm just showing simple things, like the number of somatic mutations per chromosome, and the types of somatic mutations, which turn out to be fairly similar: a lot of mutations are C>T substitutions, cytosine to thymine. These are just preliminary things one does to have a look at the data; nothing related to inference, just looking at things. You always have to look at things to spot the unexpected: maybe your sequencing didn't go well, maybe something went bad. The button where you push it and the finished paper comes out at the other end does not exist. I don't know if they ever told you that; it doesn't exist in reality. You really have to be careful about your data.
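Those quick exploratory looks might be done like this, assuming the snvs tibble from the previous sketch with chr, ref and alt columns.

```r
library(dplyr)

# Number of somatic mutations per chromosome.
snvs %>% count(chr, sort = TRUE)

# Substitution types; C>T should dominate, as noted above.
snvs %>%
  mutate(substitution = paste0(ref, ">", alt)) %>%
  count(substitution, sort = TRUE)
```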
At the end of the day we switch here and do exactly the same thing we did before: we pull together all these pieces of information, the copy number, the somatic mutations, the purity, et cetera. And when we print the object to the screen, this time we see again that the majority of the tumour genome is in the heterozygous diploid configuration: one copy of the major allele, one copy of the minor allele. And for those of you who work on copy number, tumour suppressor genes, et cetera, you immediately see here, for instance, that there is the combination of a non-synonymous mutation on a gene called TP53, which is a very important gene, sitting on top of a loss-of-heterozygosity segment, a segment where you have zero copies of the minor allele, no? What I can actually do then is make these plots: a joint plot, on the left side the allele frequency space and on the right side the depth of sequencing, which is in log space, clearly. And what we see is that the median coverage is 150, so this is really much higher coverage than the one we have seen before, right? Really, really high coverage. Plus there are these colours here, which represent somatic mutations sitting on different types of copy number segments; the ones that we want to use for our analysis are the green ones, as before.

But look at the variant allele frequency spectrum. Let's just forget about this blue thing on the right, forget about that. Does this look like the one we were seeing before? It looks different, right? Because instead of just seeing the peak of clonal mutations, we see this extra peak here, which sounds rather interesting. Plus we also see this, which looks like a neutral tail, right? It looks like the tail we were discussing in the theory, the one we predicted by mathematical arguments.

What other observations can we make on this frequency spectrum? Anybody? It's much closer to 50%, the peak, right? Which means what? That there is much more tumour DNA in this sequencing run than there was in the previous one. This depends on a number of things, including, for instance, the type of tumour we are working with: certain tumours, due to difficulties in collecting the starting DNA, will have more contamination from normal cells. This one is pretty clean, because we know that a pure tumour will hit 50% on that frequency for mutations in diploid areas; therefore being very close to 50% indicates high purity. And in fact the purity of this sample is extremely high: it is reported here as 95%, right? The previous one was 64%. So this one has much higher coverage than the previous one and much higher purity, and in general we can expect to find more complicated stuff in this signal.

It is not the case that in the previous sample the tumour was simpler; that's not something we can say. What we can say is that in the previous one the data did not have enough signal to understand what is happening down at low frequencies. Keep this in your mind, because it is a fundamental difference. The data we had before was not high resolution enough to make complicated statements; this data seems to be high resolution and high purity, and whether or not we end up making complex statements depends on what we find, but at least here we have a chance. Do you get this? Does it make sense? Essentially, low resolution means that you can't say too many things, right? High resolution means you can maybe say a few more things. Before we were in a low-resolution scenario, now we are in a high-resolution scenario. It might still turn out to be as boring a tumour as that one; we don't know until we analyze it.

So let's make again the plot we had done before. This time it comes out like this, with the greenish areas, which are the ones that are diploid. Again, we see that most of the tumour genome is in the diploid state, so we are good to go. We can skip some of these plots, which we would need for other analyses. And then, as you can see, we go back to the same type of analysis, where we get the diploid mutations by subsetting the ones that sit on these 1-1 segments, and the histogram of these looks like this; same thing we were seeing before. Okay. And now we can ask the tool to run the analysis. There is a little bit of noise: these things are not expected to be here, they are missed calls in the set of copy number events, essentially, so we just chop off these tiny little dots here.
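In code, this chopping and the second fit, described next, might look like the following; the cutoff and the idea of reusing the fast configuration follow the lecture, while the seed value and the object name for this sample's diploid mutations are assumptions.

```r
library(dplyr)
library(mobster)

# Drop the lowest-frequency dots (missed copy number calls): keep VAF > 0.5%.
# `diploid_mutations2` is an assumed name for this sample's 1-1 mutations.
input2 <- diploid_mutations2 %>% filter(VAF > 0.005)

set.seed(1234)  # hypothetical seed
fit2 <- mobster_fit(input2, auto_setup = "FAST")
```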
So we cut the frequency to be at least greater than 0.5%, we set a seed, and we use again the same kind of configuration for the model inference we used before. And the thing starts: there are 5,000 somatic mutations, the model tests the same things as before, it now takes 16 seconds, and the result that comes back is the following. For sample H04328GK, the best fit from a statistical point of view comes up with two peaks in the frequency distribution plus a tail, which is a completely different story from before. Before we had one Beta peak and no tail; now we have two Beta peaks and a tail, which matches the intuition we had when we looked at the data in the very beginning, because we saw that type of signal.

And then we get information like: 74% of the mutations can be assigned to cluster C1, 14% to the tail and 12% to cluster C2, and therefore the tail accounts for 746 mutations. And these are the parameters learned by the model. This is the exponent of the power-law tail, which comes out equal to 2, which is exactly what we expect: the mathematical theory tells you that the exponent should be 2, so this really matches 100% what you expect from mathematical arguments. And then these are the estimates for the peaks of the Beta distributions: one sits at around 47% variant allele frequency and one at about 23%. What we can also do is have a look at the model selection table. You see, for each one of these runs you have all the scores, so you can quality-control your inference and see if it makes sense. It reports the negative log-likelihood, the Bayesian information criterion, the Akaike information criterion, the entropy of the latent variables, the ICL score, the reduced entropy, the reICL, the number of model parameters, the tested number of components, the tail flag, et cetera.

And at the end of the day this is what we obtain: something that matches exactly our intuition. We have one clonal peak of mutations, on which we can also map one of those important driver mutations, one on a gene called TERT. It is a promoter mutation, not even a mutation in the coding part of the gene, it's on the promoter, so it's something very interesting from a biological point of view; it's a very famous type of event in the context of glioblastoma. I don't know if I said it: GBM stands for glioblastoma, so this is a brain tumour. And what we see is this. This is our guy: the blue peak is the subclone associated with cluster C2, and that is what the model has selected as a possible explanation of this extra peak in the frequency distribution. For you, that means that you should believe you have evidence of an ongoing subclonal expansion inside this tumour. So inside this tumour, which is already a complicated tumour, blah, blah, blah, there is another population that is growing, one that you can most likely explain by means of positive selection pressures. So it's growing up in time: that means that if you had collected this sample one year later, this thing would have been larger in size, eventually sweeping through the population. But you have evidence of this kind of process in the act. This is really catching evolution in the act: a genuine estimation of a subclonal expansion from a whole-genome sequencing assay. And you also know how big this population is, because it accounts for roughly 25% of the tumour cells: one in every four cells of this tumour belongs to the subclonal expansion, and three out of four belong to the ancestor of this expansion.
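To inspect all of this at the console, the same S3 accessors as before apply; a sketch, where the exact field names beyond $best should be checked with names() in your own session.

```r
fit2$best        # best model: two Beta peaks plus a power-law tail here
plot(fit2$best)  # S3 plot of the fitted mixture over the VAF histogram

# The run also returns the full list of fits and the model selection table
# (negative log-likelihood, BIC, AIC, ICL, reICL, ...); check the names:
names(fit2)
```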
From the point of view of the kind of drawings we were doing, we have this red guy, and we have something blue that is growing inside it, which right now is at one out of four versus three out of four; but if we were to follow this up in the future, the blue one would be larger. Do you see that? Because it's growing; because we see it in the frequency spectrum. It might also have been dying, but in this particular case it's actually growing. Does it make sense?

All our job, all our discussion, everything we have said was about understanding a tumour as an evolutionary process. So we have discussed at length what kind of sequencing information we want, and why we want to look at the DNA. And we have explained the fact that we want to look at somatic mutations because they act as a sort of barcode of the process: they store the history of the process inside the DNA molecule. We have developed an intuitive mathematical approach to think about two aspects of the process. The first one is the bumps in the frequency spectrum due to alleles that go up in frequency; the other, a joint theory, is for neutral alleles, so mutations which are non-functional but that create diversity. They are important for the process anyway, and they are part of our statistical signal. We have then discussed a Dirichlet finite mixture model to cluster that data, pooling together the mathematical distributions predicted by population genetics and canonical machine learning. We have developed a theory of inference for it, and we have written an algorithm to learn the parameters of the model and even to select the optimal number of components to explain the data, as is good practice in clustering algorithms. We have also reasoned about what the expected distributions would be; we did that exercise in reasoning about the model. Then we have taken data from a real patient, two data sets: one of low quality, for which we realized we couldn't say much about subclonal stuff, and one of far higher quality, where we see exactly the type of signal we were expecting.

We didn't bias the model to choose that explanation, right? As you have seen, we allow the model to try to explain these things in many different ways, and the model, on a statistical-theory argument, decides that this is the best explanation of the data. This allows you to confidently assess the complexity of this tumour and conclude the beautiful insight that within this tumour there is an ongoing subclonal expansion associated with that set of somatic mutations. So this is your first time, and for many of you probably the only time, because you don't do this for a living, but this is the first time you see an evolutionary process: you see the competition of a species, the subclone if you want, competing against its parent, father, mother, whatever you like, it doesn't really matter, its ancestor, and outgrowing its ancestor, which is a fascinating thing, right?

There is a question from home, and it's a good question: how do you know the blue cluster is the new one and the red one is the old one? Because of the fact, which we explained in the previous lecture, that allele frequency is a proxy for time. You always expect a pool of clonal mutations at high frequency, which has to be the red cluster.
If the roles of those two clusters were exchanged, there would have to be a third cluster at even higher frequency than the red one. If you do the drawings I was doing on the blackboard in the previous lectures, you will come up with the expected allele frequency distribution, and you should agree with what I said. I hope this helps. Questions, comments, in the last two minutes of our time together? Everything clear? So I have left you with, not a lot of data, but some data; I have way more data if you want, I can pass it to you, it's not a problem, and the code for the analysis. So if you really want to join the field of computational modeling of sequencing data in the context of cancer genomics, I hope you have enough pointers to start approaching this type of research area. Thank you. Bye, everyone.