Okay, good afternoon, everyone. My name is Andrew McPherson, and I'm going to be teaching module five, the scDNA-seq module: single-cell genomics on the DNA side. I'll tell you a little bit about myself. I started my graduate studies at UBC in Canada, working with Dr. Sohrab Shah and Dr. Samuel Aparicio, at first on computational methods for bulk whole genome sequencing, trying to deconvolute the multiple clones that were evident in that type of data. I was happy when I could start to work on single-cell techniques, because they allowed us to look at clonal structure in a way that didn't require us to overcome the issues of bulk sequencing. With Dr. Shah and Dr. Aparicio, we developed the direct library preparation (DLP) platform in collaboration, and I worked mainly on the computational side. From my experience with DLP and single cell, I was then offered a position at MSK in New York, and that's where I am right now.

So I'll just jump right into it. This is the Creative Commons license. It basically means that you are free to share, copy, distribute, and transmit the work, remix it if you like, and adapt it in your own work, under the conditions that you attribute the work in the manner specified by the author or licensor, and that if you alter this work, you share it under the same license. I believe those are the stipulations.

Okay, I'm sure you've already been introduced to this somewhat and probably know this already, but cancer is definitely driven by an evolutionary process. That evolutionary process involves mutational processes acting on the genome, together with imperfect DNA replication or DNA repair that has malfunctioned in some way, promoting genetic changes that transform cells from a phenotype that favors the organism's survival to a selfish, malignant phenotype that favors unbounded cell division.
Increased cell division together with disabled repair mechanisms promotes genome instability, resulting in further clonal heterogeneity. From that heterogeneity, populations will be selected that are able to invade adjacent tissue, evade the immune system, survive treatment, and metastasize. Cancer evolution is really the reason we're interested in heterogeneity: by looking at heterogeneity we're able to reconstruct and understand the evolutionary patterns that led to it, and then, with a complete enough understanding of that heterogeneity, we could potentially predict what will happen clinically to a patient in whom we've observed that cellular heterogeneity.

Every single-cell talk, I think, has to have a slide with a smoothie analogy like this. The picture we'd like to see is this colorful picture of all these different types of berries. Of course, with the traditional techniques that we apply to the sequencing of genomes, we take that very interesting mixture and blend it up, and then we have the difficult process of trying to deconvolute what went into the mixture that produced the bulk genome sequencing.

At the end of this lecture, I would hope that you have a knowledge of single-cell DNA sequencing techniques; that you understand how scDNA-seq data is analyzed for copy number changes and single nucleotide variants, specifically for understanding the evolutionary patterns of SNVs and copy number changes; that you're aware of the pitfalls of each type of analysis; and that you know how scDNA-seq has been used thus far to study cancer.

Single-cell sequencing, in effect, is multiplex sequencing of single cells using DNA barcodes. I think that actually describes all of the many various techniques that we can call single-cell DNA sequencing.
The basic steps on the molecular biology side are: dissociation into single-cell suspensions; isolation of single cells; cell lysis, which also means removing bound proteins from the DNA, so cleaning the DNA; followed by indexing or barcoding the cells so they can be identified later in downstream analysis; and then amplification and library preparation. Most of the protocols involve some variation on these steps.

Single-cell dissociation is the first step, involving dissection, rinsing and mincing, then digestive enzymes and incubation. These steps are definitely tricky and very important to optimize, especially for fragile cell populations from difficult-to-sequence tumor specimens. So it's an important step.

Isolation is the next step in the process, and there's a variety of different techniques that have been used. There's limiting dilution, where we dilute to the point where, statistically speaking, if we spot a certain quantity into a set of wells, it's very unlikely by chance that there will be two cells in any well; of course, often there are zero cells, which is one problem with that solution. There are techniques like capillary pipetting that are somewhat laborious. Laser capture microdissection can to some extent be thought of as a single-cell technique. Microfluidics, I think, is one of the more important techniques, as it allows these platforms to scale. FACS and blood collection are also techniques that have been used.

Once we have been able to isolate the cells, perhaps into an array of wells, the next step is to index them. There are two types of indexing shown here; the one I'm most familiar with is combinatorial indexing.
That type of indexing involves having a unique barcode per row and another unique barcode per column, possibly with duplication of those barcode sequences, but done in such a way that each unique pair of barcodes uniquely implies the location, within the set of wells, that the cell came from. There's also non-redundant indexing, in which every well has a completely unique pair.

In the early days of single-cell whole genome sequencing, these were the three main amplification techniques, and they're still somewhat relevant today. DOP-PCR is a technique that primarily used PCR. MDA uses an isothermal amplification technique with a more accurate polymerase. And there are also hybrid techniques that are a little more involved, like MALBAC or PicoPlex. The main takeaway is that there are differences among these techniques with respect to the error profile and what you would be able to do with the resulting data: DOP-PCR has a high rate of errors introduced by the polymerase, whereas MDA has a low rate of introduced errors, which allows it to be used comparatively better for single nucleotide variants. DOP-PCR has relatively higher uniformity (stated as a double negative in the slide), and because of that uniformity of coverage across the genome, it is better for copy number calling.

An alternative, which is the technique we use in our lab for DLP, is direct tagmentation of an unamplified genome, prior to any kind of amplification applied after the fact. This uses a transposase, a Tn5 transposase loaded with library adapters, to fragment the DNA and at the same time add adapters for sequencing to that DNA.
What happens is the DNA is targeted by these Tn5s: they cut the DNA and introduce these primers, and the primers are then used for dual barcoding, in which we do a minimal amount of PCR, just enough to introduce the barcodes into the amplified sequences, so that the DNA is barcoded for identifying the cell of origin of the sequences later.

Another technique that I think is quite interesting and novel is single-cell combinatorial indexing sequencing. This uses a split-and-repool approach to ensure that, for the most part, cells are uniquely barcoded, even though they're never isolated into individual wells, one per cell. In the first part of the experiment, the first set of barcodes is introduced into the nuclei, keeping the nuclei intact, using a Tn5 transposase; so you have pools of nuclei, and each pool is uniquely barcoded. These pools of uniquely barcoded nuclei are then mixed together and redistributed, and during a second barcoding step, using PCR, they're barcoded with a second set of barcodes. The hope is that it's statistically unlikely that a cell or nucleus would end up in the same pair of wells, across these two pooling and redistributing steps, as another cell: no two cells should end up in the same well at the top and the same well at the bottom, and therefore all cells would be uniquely barcoded. There is some error rate to this process; I think maybe 5 to 10% of the barcode pairs end up representing two or more cells.

And finally, the method that was developed in our lab, and that we now have at MSK, is called direct library preparation. This uses cell isolation with a piezoelectric dispenser, a droplet dispenser that dispenses into a nanowell array.
By using a small acoustic pulse to create small droplets, together with an onboard camera, we are able to observe, as we're dispensing droplets, when there is a cell within the dispensation region. When the machine observes a cell, it automatically moves the nozzle and dispenses the next droplet into a well in the nanowell array. That's followed by steps that do additional imaging, add lysis buffers, do Tn5 tagmentation with additional reagent steps, and then pool the resulting tagged DNA and do regular sequencing on an Illumina sequencing machine.

Maybe I'll also mention the 10x CNV solution, though unfortunately it was discontinued. This technique uses an emulsion of droplets. There are two steps here: a first step in which cells or nuclei are placed into droplets using a microfluidic system, and then those droplets with cells in them are run through another microfluidic device that incorporates gel beads carrying the additional reagents and barcodes needed for the next steps, which involve amplification within each of these droplets, followed by pooling and then regular Illumina sequencing.

The final type of single-cell sequencing I'll mention, just because it now has a commercial platform and we're starting to see more publications using it, is the Tapestri platform from Mission Bio. This is also droplet based, with a very similar set of steps: cells or nuclei are incorporated into droplets of oil in an emulsion, and then within those droplets, independently, we have distinct i5 and i7 barcodes together with gene-specific primers, and we do a PCR to amplify specific genes and also add cell-specific barcodes. So this is a targeted approach that involves developing a panel, but it can be very high throughput.
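The 5-10% barcode collision figure mentioned earlier for the split-and-repool approach can be ball-parked with a short calculation. This is only a sketch under illustrative assumptions (384 first-round barcodes and 25 nuclei per second-round well, numbers I've chosen for illustration rather than from any specific protocol):

```python
# Estimate the split-pool doublet-barcode rate: a nucleus collides
# if another nucleus in its second-round well carries the same
# first-round barcode, since the pair would then be shared.
def collision_fraction(nuclei_per_well, n_first_barcodes):
    """Probability a nucleus shares its first-round barcode with
    at least one other nucleus in the same second-round well."""
    p_distinct = (1.0 - 1.0 / n_first_barcodes) ** (nuclei_per_well - 1)
    return 1.0 - p_distinct

print(round(collision_fraction(25, 384), 3))  # -> 0.061
```

With these illustrative parameters the estimate lands at about 6%, in the ballpark of the 5-10% quoted for the real platforms; more barcodes or fewer nuclei per well drives the rate down.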
Okay, in the next part of the talk I'll describe scDNA-seq data and analysis. The analyses that are possible using scDNA-seq are: uncovering clonal substructure, so looking at different populations of cells and how they differ in, say, their driver genes or mutational processes; clonal lineages, so understanding the evolutionary relationships between those clones; tumor evolution, perhaps over time, where we have time-series measurements of clones from which we can understand changes in clonal abundance; and mutation co-occurrence.

Let's first look at one of the two data types I'll talk about the most in this set of slides, which is low-coverage single-cell whole genome sequencing. This encompasses DLP; 10x CNV would have also been a low-coverage single-cell technique; and some of the other techniques like DOP-PCR are a bit higher coverage, but still much lower coverage than regular whole genome sequencing. This is what you would first see when you open up one of the BAM files in IGV. As we zoom out a bit, we can see that hopefully we get even coverage on average, and that the coverage, when we look at any two cells, isn't biased toward particular regions of the genome; that's what we've optimized for in DLP, and what others have optimized for with, say, DOP-PCR or 10x CNV.

Given that the coverage is low for these types of sequencing assays, these really are copy-number-first assays, and so the first step is almost certainly a copy number analysis. Copy number calling in single-cell whole genome sequencing is, I think, quite similar to what Serena will have introduced you to. The reads are binned into regular, fixed-length bins across the genome; GC correction and mappability correction are the next steps; then removal of outlier bins and removal of outlier cells, followed by segmentation and copy number calling, and sometimes those last two are done jointly.
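The first steps of that pipeline, binning reads and regressing out GC bias, can be sketched with synthetic data. This is a minimal illustration (simulated read positions and per-bin GC values, and a quadratic fit standing in for the loess fit a real pipeline might use), not a production implementation:

```python
# Minimal sketch of the first pipeline steps: count reads in
# fixed-width bins, then regress out GC bias with a polynomial fit.
# A real pipeline would take positions from a BAM file and GC
# content from the reference genome; here both are simulated.
import numpy as np

rng = np.random.default_rng(0)
bin_size = 500_000
chrom_len = 50_000_000
n_bins = chrom_len // bin_size

# Simulate read start positions and count reads per bin.
reads = rng.integers(0, chrom_len, size=200_000)
counts = np.bincount(reads // bin_size, minlength=n_bins).astype(float)

# Simulate per-bin GC fraction and fit a quadratic to counts vs GC.
gc = rng.uniform(0.3, 0.6, size=n_bins)
coeffs = np.polyfit(gc, counts, deg=2)
expected = np.polyval(coeffs, gc)

# Corrected signal: observed / expected, roughly 1.0 for a neutral bin.
corrected = counts / expected
print(corrected.mean())
```

The corrected ratio is what then goes into segmentation; deviations from 1.0 across many adjacent bins are the copy number signal.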
This gives an idea of how important GC correction is. This is one of the cells that we sequenced with DLP. If you overlay the average GC content of each bin, you can see that there's a strong relationship, and if you look at that relationship as a scatterplot, with reads on the y-axis and GC on the x-axis, there's a sort of Nike-swoosh shape that shows up, the sort of bias characteristic of this sequencing. This bias is usually corrected by some kind of polynomial or loess regression fit to this curve, and then we effectively regress out the contribution of GC to the biases in read depth coverage.

Segmentation involves a few different approaches. We can have a sliding-window approach; CBS, which stands for circular binary segmentation, is one of these. It's a procedure where we iteratively look for intervals of the genome that are outliers with respect to coverage, and we iteratively build up a list of regions that have differential signal relative to their neighboring regions. Hidden Markov models are another option for this type of analysis, and these are what we use primarily in our lab.
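To make the HMM idea concrete, here's a toy Viterbi decoding over integer copy number states, assuming Gaussian emissions around each state's value and a "sticky" transition matrix that makes state changes rare. All parameters here are illustrative, not those of any published caller:

```python
# Toy Viterbi decoding of integer copy number states from binned,
# normalized read depth. The transition matrix encodes the prior
# that copy number changes are rare along the genome.
import numpy as np

states = np.array([0, 1, 2, 3, 4])      # candidate copy number states
stay = 0.99                              # prior: transitions are rare
trans = np.full((5, 5), (1 - stay) / 4)
np.fill_diagonal(trans, stay)

depth = np.array([2.1, 1.9, 2.0, 3.1, 2.9, 3.0, 1.0, 0.9, 1.1])
sigma = 0.3

# Log emission probability of each observation under each state
# (Gaussian, dropping constants that don't affect the argmax).
log_emit = -0.5 * ((depth[:, None] - states[None, :]) / sigma) ** 2

# Viterbi dynamic program with backpointers.
log_trans = np.log(trans)
score = log_emit[0].copy()
back = np.zeros((len(depth), 5), dtype=int)
for t in range(1, len(depth)):
    cand = score[:, None] + log_trans       # prev state x next state
    back[t] = cand.argmax(axis=0)
    score = cand.max(axis=0) + log_emit[t]

path = [int(score.argmax())]
for t in range(len(depth) - 1, 0, -1):
    path.append(int(back[t][path[-1]]))
path.reverse()
print(path)  # -> [2, 2, 2, 3, 3, 3, 1, 1, 1]
```

The decoded path gives both the copy number states and, implicitly, the segment boundaries wherever the state changes, which is the "joint calling" described next.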
They allow joint calling of the segments and the copy number state through this Markov process at the bottom right here, which basically describes the different probabilities of going from one copy state to another as you move across the genome. Some of the intuition you have can be encoded as a prior in this kind of transition structure: for instance, you don't often transition into a homozygous deletion state, because you don't expect many homozygous deletions in your typical genome. Learning that transition matrix lets you see what kinds of transitions are more likely, and at the same time infer the actual states for different regions of the genome and where the boundaries between those states are.

There's another approach, the objective-function-based approach, that doesn't do copy number calling at the same time, but does a segmentation with a specific mathematical objective function it's trying to optimize. It's similar to the sliding-window CBS approach, but instead it optimizes a specific objective. These are nice approaches, and I think they're a good option also.

[Student] Excuse me. In the previous sessions, they talked about how different positions in the genome have a relationship to each other; they called it linkage or something like that. With a Markov model, the assumption, as far as I remember, is that the probability of going to the next state doesn't have any relationship to earlier states, but if two positions are related, how can you make this kind of assumption?

Can I ask which module that was? I think linkage would probably refer to genetic linkage. These are genomic positions, aren't they?
[Student] So, yeah, doesn't that mean that the copy number at different positions should have a relationship to each other?

Yeah, that is a very good point. Maybe I misunderstand your use of the term linkage, but you're absolutely correct, because one thing I'm not talking about much in this set of slides, one thing we're missing here, is the different connections between segments that are implied by genomic rearrangement. Those certainly violate the assumption in the Markov process that a bin's copy number is only dependent on its adjacent bins and not on distant bins in the genome, because we could have instances where a bin has been rearranged by these very complex processes that take different parts of other chromosomes and juxtapose them at different positions on a different chromosome. In those situations, yes, this is very much a simplification, a model that is more useful than it is realistic.

[Student] That makes sense. So we assume it, but it's not a realistic model.

Yeah, it's useful more than realistic, for sure, like most models.

[Student] I didn't understand the rearrangement that you talked about. Can you please explain that a little more?

Sure. Can you see my cursor on the slide? Okay. So we're saying that these bins here, that I'm circling in red, are only really dependent on the neighboring ones, or the ones perhaps neighboring in this nearby amplified set of bins. Only the adjacent bins impact the probability of your state under a Markov process.
But what if this purple region on the right has been cut from the right part of this chromosome and joined to some part of this green chromosome? I'm trying to come up with an example off the cuff, but actually here's maybe a better one: this segment is amplified, but we also don't know the orientation of this green segment; it could have been reversed and amplified. That would mean that these bins here are actually adjacent to those bins over there. Does that make sense?

[Student] Yeah, that makes sense. Thank you.

Okay, so here's an example of copy number in a single cell. I thought this cell was a lymphoblastoid cell line, near diploid; actually, I'm wrong, this is the OV2295 data, the high-grade serous ovarian cancer cell lines that we'll be working with in the lab. So this is baseline diploid: a median copy number of two, but you see that there are quite a few amplifications and deletions. It's a female sample, obviously, so it doesn't have a Y chromosome. And this is copy number in a tetraploid cell from the same sample. One question is: how can you tell from this data that this is a good solution, that it's actually tetraploid rather than diploid? The key thing to look for here is these copy number states at a level of three or five; and here's a copy number one segment. In terms of evaluating whether or not you have a good copy number solution, one of the questions you should ask is whether the ploidy is correct, or makes sense given the data. We can see that it probably does, because we have these odd-numbered copy number segments that help anchor us to this solution. Of course, we could end up with a solution like this.
Here the question is: there are no odd-numbered copy number states helping us ascertain that this is a tetraploid cell. There are no significant segments at any of the odd-numbered states, so we could divide the entire set of copy numbers by two and get an equally reasonable genome that would be baseline diploid. This is the identifiability problem of copy number, which doesn't go away with single cell, and is perhaps exhibited here. This plot shows how well we can fit an HMM to the data using different ploidy solutions. This is the log likelihood, actually the negative log likelihood, so smaller is better, and you can see that we could just keep increasing the ploidy and we would keep getting a better fit according to our model. So, unfortunately, things like measuring the likelihood of your model in an HMM don't give you any idea of whether you should pick a tetraploid over a diploid solution.

Okay, so that's copy number; I'm going to jump into phylogenetic analysis. First, let's talk about the different models that have been used for phylogenetic analysis of scDNA-seq data. The most prevalent, because of its simplicity, is the infinite sites assumption, which leads to the perfect phylogeny model. The infinite sites assumption is that each mutation occurs only once throughout the tree, and that there are no reversions of those mutations and no parallel mutations. An example of this is shown in the bottom right; this is actually from Nextstrain, which is about viral evolution, but the principles are really the same. These mutations, once they originate, propagate to all of the descendants: this blue mutation is exhibited in A, B, C, D and E, and it originated near the root of the tree. And this red mutation is only in C.
And this yellow mutation is in all three of its descendant genomes, or sequences. The rationale for the infinite sites assumption is that the genome is quite large, and mutations, although they happen frequently, are relatively infrequent compared to the size of the genome: you maybe get 10 or 20 thousand, maybe 100,000 mutations in a very highly mutated patient sample, while the genome itself is three billion bases of sequence. So, relatively speaking, the chance of any one particular nucleotide in the genome being mutated is low.

Of course, like an HMM, this is a useful but imperfect model. There can be violations of the infinite sites assumption, including convergent evolution, where there is a strong fitness advantage to mutating a particular gene in cancer. There's also the possibility of back mutation; those have been observed, for instance, in BRCA patients treated with PARP inhibitors, where you can get a reversion of a BRCA mutation. And you can also have instances where mutations are lost by copy number change.

Two alternative models here. Finite sites is, I guess, a free-for-all model, if you could put it that way: mutations can occur, they can reoccur, they can back-mutate, so anything is possible in this model. Another model that I've used in my work is the Dollo model: mutations can occur only once, somewhat like the infinite sites assumption, assuming the genome is large relative to the rate of mutation, but they can be lost, for instance through a copy number change that deletes the segment that harbors the mutation, and once they're lost, they cannot reoccur.

In a study published last year, through an analysis of single-cell data, the authors claim to find quite a bit of evidence for violations of the ISA in real single-cell sequencing data, including parallel evolution and loss of heterozygosity.
That is, deletion of a region that contains the SNV, and also back mutation.

Within the context of single cell, here's something that is not so much a model as a simplification in the way modeling is done, because often single cells are quite similar. We find, for instance, that this red mutation differentiates this population of three cells from all of the other cells, but within this population of three cells, the cells are genetically equivalent. What this has led to is something called a mutation tree, which uses mutations on internal nodes to show the structure of the evolution of mutations, and then attaches cells to those internal nodes in clusters to represent which mutations are in which cells. So, for instance, this set of cells has all of the mutations that were acquired above in ancestral nodes, plus these two blue mutations and this red mutation. You'll notice that this mutation tree on the right is using the perfect phylogeny model.

One of the first methods that came out that did this type of inference, trying to build a mutation tree from noisy SNV presence/absence data, is called SCITE. The data input to this method is just observations of presence or absence of a particular mutation in a particular cell. The upstream analysis, done as part of a bioinformatics pipeline, will give you counts of reads for each mutation in each cell, for the reference and the variant allele. As a first step, the SCITE authors just distill that down: did we see enough mismatches in the reads supporting a particular SNV to say the SNV is present? If we did, mark a one in this matrix; if we didn't, mark a zero; and if we didn't really see any reads overlapping that particular mutation at all, then we have a missing value.
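That thresholding step can be sketched in a few lines: turn per-cell reference/variant read counts into a presence (1) / absence (0) / missing (None) matrix of the kind a mutation-tree method takes as input. The counts and the threshold values here are made up for illustration:

```python
# Distill per-cell (ref, alt) read counts at each SNV site into a
# presence/absence/missing genotype matrix. Thresholds are
# illustrative, not taken from any particular tool.
def genotype(ref_reads, alt_reads, min_alt=2, min_total=3):
    total = ref_reads + alt_reads
    if total < min_total:
        return None            # too few reads covering the site: missing
    if alt_reads >= min_alt:
        return 1               # enough variant support: mutation present
    return 0                   # covered, but no variant support: absent

# cells x mutations: (ref, alt) read counts per site
counts = [[(10, 0), (4, 5)],
          [(1, 0), (12, 3)]]
matrix = [[genotype(r, a) for (r, a) in row] for row in counts]
print(matrix)  # -> [[0, 1], [None, 1]]
```

The None entries are exactly the missing values discussed next, and in low-coverage single-cell data they can dominate the matrix.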
Unfortunately, missing values are quite prevalent in this type of data, as you'll see when we do the lab, because of something called allelic dropout that happens in the amplification: you can have preferential amplification of one region over another, resulting in very few reads coming from a region of interest in a particular cell. So the SCITE authors propose a method that takes this data and produces this mutation tree; I won't go into the details of the method.

Another method is SCIΦ. This is more relevant because we are going to be using a SCIΦ-inferred tree in our lab. It's a slightly different approach, because they propose a model that incorporates both the read-level data and the tree modeling the presence and absence of mutations across the set of cells. In a combined fashion, they call the presence or absence of particular SNVs in each cell and reconstruct the tree at the same time, in an integrated way, and then call mutations by doing posterior inference over the inferred model.

Finally, a method that my colleague and I published in 2016 doesn't do phylogeny inference, but I'm showcasing it here because it gives you some insight into one of the potential pitfalls of this type of data and analysis, which is doublets. I find the colors here a little bit confusing, but if you'll allow me to explain them: A, the reference, means that only reference reads cover this particular mutation in this particular cell; AB is heterozygous; and B is homozygous for the variant. So we have this collection of cells down here that have essentially all of the variants we profiled; again, the columns are variants and the rows are cells. So these cells in clone three have all of the variants, at least heterozygously.
Compare that to these populations at the top, clones zero and one, which have just a subset of the variants. A priori, we happened to know in this experiment that these particular cells, or at least the data in these particular rows, came from multiple cells, and likely from combinations of the cells in clone zero and clone one; clone three is essentially pairs of cells from clones zero and one. And of course, if you infer a tree from that kind of data, you get something that's much more complex than you expect. By contrast, if you use a method capable of filtering doublets, which is what we proposed in this paper, you end up with a much simpler and more realistic tree.

Right, what about phylogenetics from copy number data? That's certainly an important problem, and there's been much less work on it, probably because it's quite a difficult problem, and possibly also because single-cell whole genome sequencing is quite a bit newer than techniques that look at SNVs in single cells. The method we proposed recently in a bioRxiv paper is called sitka. This is Bayesian phylogenetic inference from scWGS-derived copy number data. The idea is to use transitions in the copy number data of individual cells as binary phylogenetic markers, and then perform inference on those markers. The assumptions are that we don't have any loss of change points through deletion, which is a somewhat strong assumption in some contexts, and that change point distance approximates evolutionary distance, as I'm showing at the bottom here. So what do we mean by using copy number transitions as markers? For instance, in panel A we have an amplification that results in two copy number transitions.
One at the beginning of the amplification, another at the end. Stacked on top, perhaps, we have a subsequent amplification that introduces two more copy number transitions, so two more markers. In the middle, you can see an instance of a duplication overlapping a deletion that results in the removal of these change points; this is an instance where the model will not sufficiently capture the changes that differentiate these genomes.

The data for sitka looks like this: we take a copy number matrix, where the rows of this heatmap are again cells and the other axis is chromosomal position. From data that looks like this, we try to infer these binary indicators of transition points, and then build a tree, as you see on the left.

A competing method that also recently came out on bioRxiv, and that I think is quite interesting, is called SCICoNE. The difference is that they look for transitions in copy number across all of the individual cells, but then, from those transitions, they build a segmentation of the genome into regions, and they reconstruct the phylogeny over the copy number changes within those regions. In an integrated fashion, they take these noisy raw read counts, cells by bins, as you can see in panel B, and simultaneously infer a tree, as you see on the right, and call copy number, as you see at the bottom in the middle.

Another type of analysis you can do is pseudobulk analysis. Let's say we're interested in SNV evolution in low-coverage data, like DLP or 10x CNV. What we can do is first cluster the cells by their copy number profiles.
Alternatively, you can use sitka or SCICoNE to generate an evolutionary tree and then cut that tree at certain branch lengths to generate clusters. So the first step is to generate clusters of cells, as you can sort of see here on the left; now we have an annotation of the cluster across this population of cells. Having clustered the cells, you can then merge all of the data within each cluster and use the merged profiles effectively as independent WGS datasets, from which you can call SNVs, call breakpoints, do any of the analyses you would do with regular whole genome sequencing data, so long as you are able to cluster the data into groups of cells that are large enough. And then you can do something like this, which is the data you'll be analyzing in the lab: we've called SNVs and breakpoints across these nine clones, I believe, from the OV2295 samples, and then reconstructed a phylogenetic tree across the clones, with all of the driver events annotated here.

In the last part of the talk, I'll go through some of the studies that have already been done leveraging single-cell DNA sequencing, and talk about where the field has progressed up until now. What we've been able to study so far is clonal evolution; therapeutic resistance, so understanding how clones and fitness are affected by drug treatment; metastatic dissemination and the dynamic processes by which clones populate other tissues; and invasion in pre-malignancies.

One of the first studies was published in 2011 by Nicholas Navin, who is now a pioneer in the field of single-cell genomics. They used DOP-PCR to sequence 200 cells. This is, I would say, the first big single-cell sequencing study.
What they found, when they reconstructed a phylogenetic tree from the copy number inferred for each of these cells, was what they described as punctuated clonal expansion with few persistent intermediates. A subsequent study, also from Nicholas Navin's group, looked at clonal evolution in breast cancer using a technique called nuc-seq. Nuc-seq is effectively MDA applied to cells in G2/M phase, so they have double the amount of DNA available for sequencing, which helps with things like allelic dropout: if you have four copies of a particular locus as opposed to two, it's less likely that you will lose an allele entirely when you amplify from those four copies.

They applied nuc-seq to study breast cancer evolution, similar to the previous study, and found that aneuploid rearrangements occur early and point mutations evolve gradually. They were able to show this because they combined DOP-PCR for copy number with nuc-seq for SNVs, a strategy used subsequently by a few papers: combining multiple types of sequencing, where one type is strong for SNVs and the other is strong for producing copy number. So there have been quite a few studies and methods surrounding the combination of those two types of data.

A study by our group that came out in 2016 used a large amount of bulk whole genome sequencing together with targeted single cell sequencing to try to understand clonal evolution and clonal spread in high-grade serous ovarian cancer. There were quite a few data modalities in this particular study, but the one that was most useful for understanding evolution was the targeted single cell sequencing, which used custom microfluidics and a custom panel designed for each of the patients we were studying.
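To put a rough number on the allelic dropout point from nuc-seq: if each template copy of a locus independently fails to amplify, an allele carried on more copies is much less likely to be lost entirely. The failure rate below is an illustrative value I'm assuming, not a number from the study.

```python
# Illustrative arithmetic for allelic dropout (assumed failure rate):
# an allele carried on k template copies is lost only if all k copies
# fail to amplify, so the dropout probability is p_fail ** k.

def allele_dropout_prob(p_fail, k_copies):
    return p_fail ** k_copies

p = 0.3  # assumed per-copy amplification failure rate
# Normal cell at a heterozygous site: one copy of each parental allele.
print(allele_dropout_prob(p, 1))
# G2/M cell (nuc-seq): the genome is doubled, so two copies per allele.
print(allele_dropout_prob(p, 2))
```

With these toy numbers the dropout risk falls from 0.3 to about 0.09, which is the intuition behind selecting G2/M cells for amplification.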
Using that targeted single cell sequencing data, we were able to fully resolve clonal genotypes, understand which mutations were concomitant in the same cell populations, and then study those cell populations more accurately as we observed them in different locations within the peritoneal cavity. What we found was that there was polyclonal seeding in some instances, so multiple clones seeding the same metastatic site within the peritoneum. We also found examples where mutation loss through copy number change was driving genomic diversity in some of the tumors.

Much more recently, the paper I'm showing a figure from here is the DLP paper. I really like this particular result from a fine needle aspirate to which we applied DLP. We were able to extract 220 cells from a very, very limited sample, so limited that it may not have been feasible to apply whole genome sequencing or other sequencing techniques to that amount of DNA. But using DLP, we were able to isolate a small number of cells, realistically, 220 cells from the sample, sequence them, and obtain very high quality copy number that allowed us to identify clonal and subclonal gene amplifications.

In the same study, we looked at events that happened in rare populations or in individual cells, where we see copy number changes affecting individual chromosomes, or spread throughout the genome affecting many chromosomes, even, I would say, catastrophically affecting most of the genome. These are likely the result of errors during mitosis, or chromosome missegregation, that happen sporadically and contribute a small amount of heterogeneity, and they may provide a reservoir on which selection can act if, by chance, one of these mitotic errors happens to delete or amplify a chromosome in a way that confers a fitness advantage.
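A back-of-envelope way to see why such a rare event can matter: a single cell with a fitness advantage s grows as (1 + s)^t relative to a neutral background, so even a tiny starting fraction can come to dominate. All numbers here are assumed for illustration, not fit to any dataset.

```python
# Toy selection model (all parameters assumed): fraction of a mutant
# clone after t generations, growing at relative fitness (1 + s)
# against a neutral background with relative fitness 1.

def mutant_fraction(f0, s, t):
    mutant = f0 * (1 + s) ** t
    background = 1 - f0  # neutral background stays at relative size 1
    return mutant / (mutant + background)

# Starting from 1 cell in 10,000 with a 10% fitness advantage:
for t in (0, 50, 100, 150):
    print(t, round(mutant_fraction(1e-4, 0.1, t), 4))
```

Under these toy parameters the clone is still invisible at 50 generations but is the majority population by around 100, which is one way a sporadic missegregation event could later surface as a dominant clone.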
On the right here, we're showing that for the top two datasets, gains are more prevalent, and this is associated with the fact that the top two are diploid. The bottom right is tetraploid, and in that sample we saw a prevalence of losses over gains.

Even more recently, and I had in my notes to give the number of cells sequenced in each of these studies but forgot to mention them: the first few studies I described, from Nicholas Navin's group and others, ran in the hundreds of cells, around 200; I think our Nature Genetics study on high-grade serous cancer was less than 1,000; the DLP paper was around 50,000; and this recent paper from Navin's group is around 10,000 cells. So you can see that with the advances in the technology, this type of data generation is becoming much more scalable.

In this paper, again on breast tumor evolution, they sequenced I believe eight TNBCs and reconstructed phylogenies, investigating the copy number structure of these tumors using a new method that is very similar to DLP. From this data, together with in silico simulations, they moved from their model of punctuated evolution and clonal stasis, which is the model they first proposed back in 2011, to a model in which there is a punctuated burst that generates a lot of heterogeneity, followed by some transient instability that generates a small amount of additional heterogeneity and subclonal structure, and then ongoing CNA evolution at a lower rate. I think the key question they're trying to answer is: is there a mismatch between the amount of heterogeneity we see when we look at individual cells and the amount of heterogeneity we would expect, historically, based on a reconstruction of the evolutionary history?
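One crude way to frame that comparison, with numbers I'm inventing purely for illustration: divide the average pairwise difference between present-day cells by the depth of the reconstructed phylogeny to get an implied historical rate, then compare it with a rate measured directly from ongoing cell-to-cell variation.

```python
# Toy illustration (all numbers invented) of comparing a phylogeny-
# implied historical CNA rate with a directly observed contemporary one.

def implied_historical_rate(avg_pairwise_events, tumour_age_generations):
    # Two lineages each accumulate events for ~age generations,
    # so divide the average pairwise difference by 2 * age.
    return avg_pairwise_events / (2.0 * tumour_age_generations)

historical = implied_historical_rate(40, 1000)  # events per generation
contemporary = 0.002  # assumed directly measured per-generation rate
print(historical)
# A ratio well above 1 is the kind of mismatch that points to an
# early punctuated burst rather than a constant rate.
print(historical / contemporary)
```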
In other words, is there a mismatch between the historical mutation rate and the present-day, contemporary mutation rate?

These last few slides will finish on studies that use single cell SNV sequencing, and this is a study that uses Mission Bio. The studies here are really starting to increase in scale: this is a study of 735,000 cells from 123 AML patients, so a huge study. They used two different driver gene panels, and they investigated several different aspects of the evolution of AML. In this slide, I'm showing mutual exclusivity between the drivers in individual patients: in each of these heatmaps, and in the plots on the right, one patient is at the top and one patient at the bottom. You can see, especially at the top, that essentially no cell harbors two or more of these driver mutations. So there's a striking amount of mutual exclusivity, independent evolution of different drivers, and presumably clonal competition between these different cell populations, each harboring its own driver mutation.

From that data, they were able to look deeply at what happened to the different populations under pressure from chemotherapeutic treatment, and they found a number of different signals. For instance, for clonal selection, here we have a large bottleneck due to the imposition of the treatment, followed by a resurgence of new populations that are able to evade the treatment. In contrast, we also see failure of treatment to reduce the burden of the disease, with a significant amount of clonal competition between these different populations with distinct drivers.

The last study I'll talk about is one we recently published, in which we applied DLP to serially passaged triple negative breast cancer xenografts. You can see some of the data that was generated here; I think we had 130,000 cells in this study.
Probably about 10,000 of those were in this particular dataset alone, from which we constructed a phylogenetic tree that you can see at the top right, and then tracked, over different courses of treatment and without treatment, which clonal populations took precedence within these xenografts. You can see that distinct populations have the highest clonal fractions at the end of this time series, but they differ between untreated on the left and treated on the right, as shown by the clonal fractions at the top and the fitness coefficients on the bottom. Oh, I didn't mean to include that slide; I'll go to the next one.

So, challenges and future developments. One challenge, and also something that's a definite positive, is that I'm pretty sure data generation will only continue to increase as single cell sequencing methods become more routine and more scalable. And as sequencing becomes cheaper, we'll be able to increase the coverage per cell at the same cost, and also increase the utility of this data, eventually reaching a place where we have more accuracy for things like SNVs, perhaps even with low-coverage sequencing data.

Yet there are many challenges; this is definitely a very short list that I've jotted down here. At the coverage levels we currently have in single cell data, we can't call SNVs or copy number with the accuracy that is possible with some of the assays that have been optimized for SNVs or copy number alone, so that needs to change at some point as well. And finally, I think there's quite a bit of work yet to be done on phylogenetic models of copy number in single cells; as I've shown in these slides, there are only two methods of significance right now and they're both unpublished. So I think in the future there will be a lot of computational development on that side.