It's my pleasure to introduce Dana Pe'er from Memorial Sloan Kettering Cancer Center, where she is Chair and Professor in Computational and Systems Biology. She is one of the world's leading researchers in computational biology. She did her PhD in computer science at the Hebrew University in Jerusalem, was a postdoc at Harvard Medical School, and is now, as I said, at Memorial Sloan Kettering, one of the leading cancer research centers in the world. She has won a large number of prizes and awards for her work. I highlight two from the bioinformatics side: the ISCB Overton Prize for outstanding work by a young researcher in computational biology, and, as I saw last week, she was named an ISCB Fellow for her continued contributions to the field. We congratulate you on that, and we congratulate ourselves on having you here to tell us about machine learning and its role in single-cell biology.

So thanks for inviting me. I'm going to do something completely different, both from the previous, very didactic and methodological talk, and from the talk that Fabian gave this morning. I coordinated a bit with him, so while I was sound asleep, I more or less know what he told you. I know that right now machine learning means AI, but for me, in the field of single-cell biology, that's actually not the best approach. I'm really interested in understanding biology, particularly understanding cells and tissues: how they work, how they develop, and how they respond. One might ask how this relates to biomedicine. In biomedicine, a lot of the problems, certainly in the machine learning and computer science fields where people are comfortable, are classification problems: will you respond or not respond to a therapy, or finding biomarkers, a small number of markers we can measure in a patient and feed to our favorite classification approach. These are very powerful and very important; they tell us what medicine we should give a patient and what their chances of responding are. But in cancer, which is my passion and my field of research, we're in a situation where for too many people the answer is: we don't have a good drug for you, we have nothing for you, your prognosis is horrible. There are a lot of people we can help, and the field is making progress, but my goal and passion is to find new drugs and new therapeutic approaches, and my strong belief is that the way to do that is to really understand the underlying biology. That's what I'm going to focus on, and rather than give a very didactic talk, I'm going to give you a lot of vignettes about my philosophy and some things that are important but often neglected.

When I started, I started as a computer scientist and viewed cells as little computers that get input from their environment and make decisions. Are they going to proliferate, or differentiate into this fate or that one? Are they going to respond to a signal such as activation or a stress response? And rather than transistors and wires, the computing devices of this computer are molecules.
So way back, before single cell was even a phrase and a fad, based on just multi-color flow cytometry, we took the approach that if a cell is a computer, and if we can measure some of its molecules in a multiplexed fashion and observe a lot of different cells and what they're computing, we can try to learn the computation directly by looking for statistical dependencies between molecules in the data. And a pretty naive approach using vanilla Bayesian networks worked surprisingly well. A dataset that we could measure in a single afternoon, run for a minute on a computer, using a fairly vanilla Bayesian network circa the 1990s, actually recapitulated two decades of biochemistry correctly, with very little error, and even predicted novel interactions between these classical, well-studied signaling proteins that were later validated to be true. So this was a very powerful approach using a vanilla Bayesian network, but the underlying principle is that we can look at the data itself, treat each cell as an observation, and learn these relationships from the data. And now, with all these new single-cell genomics technologies, the droplet-based technologies, and the ability to measure thousands, and now hundreds of thousands to millions, of cells, we can really learn networks from single-cell data. We can even collect enough cells, say 10,000 cells from a single patient, to learn patient-specific networks.
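To make the idea concrete, here is a minimal, self-contained sketch of learning dependencies between measured proteins from many single-cell observations. The original work used Bayesian network structure search over interventional flow cytometry data; this toy stand-in uses a Gaussian graphical model (partial correlations) on simulated data instead, and the protein panel, data, and threshold are purely illustrative.

```python
# Minimal sketch (not the original method): learn statistical dependencies
# between signaling proteins from single-cell measurements, using partial
# correlations as a simple stand-in for Bayesian network structure search.
import numpy as np

rng = np.random.default_rng(0)
proteins = ["RAF", "MEK", "ERK", "AKT", "PKA"]            # illustrative panel
X = rng.lognormal(mean=0.0, sigma=1.0, size=(10_000, 5))  # stand-in for flow data

logX = np.log1p(X)                         # work on log intensities
prec = np.linalg.inv(np.cov(logX.T))       # precision matrix
d = np.sqrt(np.diag(prec))
partial_corr = -prec / np.outer(d, d)      # partial correlations
np.fill_diagonal(partial_corr, 1.0)

# An edge between two proteins is kept if they remain dependent after
# conditioning on all the other measured proteins.
threshold = 0.05
for i in range(len(proteins)):
    for j in range(i + 1, len(proteins)):
        if abs(partial_corr[i, j]) > threshold:
            print(f"{proteins[i]} -- {proteins[j]}: {partial_corr[i, j]:+.2f}")
```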
So now I'm going to jump all the way from 2005 to 2021 and give you a specific example from the clinic. This is a collaboration with Cathy Wu at the Dana-Farber Cancer Institute, led by Elham, who was a postdoc in my lab and is now independent faculty at Columbia. We wanted to understand the difference between responders and non-responders, which is the canonical, most important thing someone might want to understand. But getting data from the clinic: I've worked with cell lines, I've worked with mice, and the clinic is the messiest data to work with. There are so many confounding factors. A lot of these patients have different comorbidities, ate a different breakfast before they came to give their sample, have different medical histories, different drugs. And you can't really control it; the patient comes first. With mice, you can control everything to optimize your experiment; with patients, their comfort and medical well-being are obviously the top priority, and that often impacts how reproducibly and how well-controlled your samples can be collected. So one of the most important things to think about is that your cohort is going to have huge heterogeneity, both biological, from the variety of different people, and technical, from how the data was sampled. The best thing you can do is get as good a clinical cohort as you can, and putting thought into assembling it pays off: prior therapy, for example, has a massive impact on the cancer, so if you can control for prior therapy as much as possible, that's your chance of actually getting something out. When you can combine good, well-preserved data, a well-annotated, as-homogeneous-as-possible cohort, and strong computation, you can do quite a bit. In this case, we wanted to understand response to DLI, donor lymphocyte infusion.

These are chronic myeloid leukemia patients who had a bone marrow transplant; things looked good for a while, and then they relapsed. In these relapsed patients, one thing that sometimes works is to re-infuse them with lymphocytes from their donor. That sometimes helps and sometimes doesn't, and we wanted to understand: how does this revive the immune system, and what is the difference between responders and non-responders? In the plot, this black-blue-orange line shows the tumor burden: in the good case the tumor burden goes down, and in the bad case it continues to grow. We collected a pretty small cohort, but what was important about it is that it was high quality and longitudinal, and none of these patients had any prior therapy other than the same shared treatment history. As a reference, at first we tried normal healthy donors, and that was so far off in left field that our references became people who had the transplant and never relapsed, which worked a little better as a reference. So again, I cannot stress enough how messy clinical data is.

And one thing we haven't really solved: Fabian showed you this morning some autoencoder-based ways to integrate samples, and that's pretty nice, except it puts you in a latent space that you can't really get out of. You can't translate it back to the genes in a good, powerful way, and if you talk with biologists, if you put them in a latent space where they can't interpret what's going on at the level of the genes, they won't be happy. So if you don't want to use an autoencoder for your normalization (and even the autoencoders don't really work), then how to normalize in a way that distinguishes true biological differences between patients from all the technical reasons why samples differ, which are huge, is a real unsolved problem. Here is a t-SNE image where each dot is a cell colored by patient ID. You see dramatically different patients here and a lot of patterns. If you looked at it naively, you could give it an interpretation: these yellow cells are high in interferon, so you might think this patient really has biologically high interferon, but you can't know whether it's technical or not. And we do like having mixing. This was the first large-scale dataset of tumor-infiltrating immune cells in patients, so we didn't even know what to expect when we first looked at it, because cancers are actually very different. But one of the things we now know is that the immune system should not look this different across patients; you want mixing. That is, if you look at the cells most similar to each cell, you want a lot of patients in that neighborhood, and when you see immune cells that look this different, you know you have an artifact that you have to fix. And there's no free lunch: none of these algorithms works perfectly. One algorithm that I really like in the case of patient data is the strongest, most biologically driven one, and its price is that it's very messy and very heavy to use. It doesn't scale well beyond 100,000 cells, but it really is based on the biology.
And its key assumption is that gene-gene relationships, the covariance relationships in the data, the same kind of statistical dependency that allowed us to learn that Bayesian network before, represent real biology, and that these relationships somehow survive batch effects. The batch might impact the levels of individual genes, but the relationships between genes should be preserved and can give you some inkling of signal across the different batches and batch effects; in other words, the batch effect impacts genes rather than their relationships. Since we want to learn covariance structure, we model each cluster, each biological cell type, as a multivariate log-normal that represents the covariance relationships that matter in that cell type. We assume that the corresponding clusters across all the different patients share the same multivariate relationships, and we look for a correction to the data that converges on these shared relationships, which we also learn from the data, a little bit messed up because of the batch. This actually works incredibly well, and as I said, it's computationally heavy, the direct opposite of the autoencoders, which can scale to whatever number you want. But it gives you interpretable biology: it gives you the gene-gene relationships that are driving the model, it gives you the differentially expressed genes parametrically, it mixes the patients very, very well, and at the end of the day you see very strong signal for all the expected immune cell types in the data. I stress again: when you're working with real data, the immune cells should mix, and when you see that the immune cells aren't mixing well, you know there's a problem. But this covariation completely breaks down in the tumor cells. Each tumor really is its own beast, and you don't want to overcorrect the tumors, because then you're normalizing away real biological variation. That's something to worry about, because many people overcorrect, overnormalize, and push samples that are biologically different to look the same.

This is so powerful that we could take two different technologies, two patient cohorts collected two years apart, one on inDrops and the other on 10x, one measuring the 3' end of the RNA and the other the 5' end, and because these covariance relationships are such a fundamental biological entity, we found a near one-to-one matching across 34 specific T cell clusters between the two cohorts; we believe we are capturing something real. Even in protein data we found the same kind of covariance relationships. Applying this to our data, to really make sure we accounted for all these sources of variance in the T cells, we found 43 distinct T cell states. Now, I'm not going to tell you the entire story, which is up on bioRxiv, but I do want to tell you that with these clusters in hand, we can understand their relationship to treatment: we can see which ones are predictive of response and which ones change dynamically during response.
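Before moving on to the dynamics, here is a minimal sketch of the moment-matching intuition behind the covariance-based correction described above. The actual method (BISCUIT, mentioned again in the Q&A) is a hierarchical Bayesian mixture model, not this whitening trick; this only illustrates the assumption that gene-gene covariance is shared across batches while batch effects shift per-gene moments. All names and numbers are illustrative.

```python
# Sketch: whiten a batch's log-expression with its own moments, then
# re-color with shared reference moments for that cell type, so that the
# corrected batch matches the reference mean and covariance.
import numpy as np

def match_to_reference(X_batch, mu_ref, cov_ref):
    mu_b = X_batch.mean(axis=0)
    cov_b = np.cov(X_batch.T) + 1e-6 * np.eye(X_batch.shape[1])

    def sqrtm(C):
        # Matrix square root via eigendecomposition (C is symmetric PSD).
        w, V = np.linalg.eigh(C)
        return V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

    W = sqrtm(cov_ref) @ np.linalg.inv(sqrtm(cov_b))  # whiten, then re-color
    return (X_batch - mu_b) @ W.T + mu_ref

# Toy demo: two "patients" whose cells share covariance structure but
# differ by a batch shift and scale.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
ref = rng.multivariate_normal([0, 0], cov, size=500)
batch = rng.multivariate_normal([3, -2], 2 * cov, size=500)
corrected = match_to_reference(batch, ref.mean(axis=0), np.cov(ref.T))
```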
And so we used Gaussian process modeling to see which T cell subtypes respond to this therapy, which very, very specific T cell subsets grow and expand once you infuse the DLI. We built these Gaussian process models with a few adaptations to take into account the messiness of the data: the sizes of the clusters, and the fact that some samples were really good while in others most of the cells had died and we had very few left in the freezer. Basically, we set a prior on what to expect from a really small cell population versus a large one, and from a really crappy sample, which we don't want to weight heavily, versus a sample that's rich in cells. We encoded all of this into our prior on the Gaussian process, and we actually managed to learn the dynamics. We found that terminally exhausted T cells basically follow the tumor burden. They are actually predictive of whether you'll respond or not, but they follow the tumor burden. The progenitor exhausted cells, in contrast, which start as very, very tiny populations pre-therapy, massively and consistently expand following therapy across all the responder patients, and we see two distinct progenitor exhausted clusters, or cell types, that show this expansion. The tumor again is in gray, and as the tumor burden goes down after the therapy, the red line shows the growth in the size of this population; we see this across all the responders in blue. In the non-responders this isn't happening: the progenitor exhausted cells aren't changing, the terminally exhausted cells are just doing whatever they want, and there's no relationship.
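Here is a minimal sketch of how sample quality can be folded into a Gaussian process fit as a per-point noise prior, in the spirit described above. The actual model and its priors differ, and all the numbers below are made up for illustration; scikit-learn's GaussianProcessRegressor accepts a per-sample alpha, which acts as heteroscedastic observation noise.

```python
# Sketch: fit cluster-frequency dynamics over time with a GP whose noise
# term reflects data quality; small or poor samples get larger per-point
# noise so they pull the fit less. Illustrative values only.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

days = np.array([0.0, 7, 30, 60, 120]).reshape(-1, 1)     # sample times
freq = np.array([0.001, 0.004, 0.03, 0.06, 0.07])          # cluster frequency
cells_in_sample = np.array([8000, 500, 6000, 7000, 400])   # proxy for quality

# Per-point noise: fewer cells means a noisier frequency estimate.
noise = freq * (1 - freq) / cells_in_sample + 1e-6

gp = GaussianProcessRegressor(
    kernel=ConstantKernel(1.0) * RBF(length_scale=30.0),
    alpha=noise,                 # heteroscedastic noise on the diagonal
    normalize_y=True,
)
gp.fit(days, freq)
grid = np.linspace(0, 120, 50).reshape(-1, 1)
mean, std = gp.predict(grid, return_std=True)
```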
So, as I said, we learned a lot about this from the data, and I'm not going to tell you the whole story, but the main question is the why, which brings us back to networks. One of the most powerful tools for learning networks, in addition to RNA, is ATAC-seq. We actually collected sorted ATAC-seq for these populations; this was before 10x single-cell ATAC-seq. We really wanted to understand what is driving all these differences between responders and non-responders: what's going on before and after therapy that causes this expansion of these cell types? One of the main things we found is that the difference is predetermined. Before and after therapy, the epigenetic landscape was very, very similar; the big difference was between responders and non-responders. The epigenetic landscape was, a priori, wired very differently between responders and non-responders. In the responders, there were populations that could respond, and they responded not by changing but by expanding.

To understand these networks, and this is work with Elham and with Cassandra, a graduate student in my lab, we tried to build a generative process. The ATAC peaks of each individual cell population (remember, at the time we had bulk; we also have a version that works with single-cell ATAC-seq, where you don't have to do this deconvolution, but with bulk we observe multiple replicates that we want to deconvolve into cell-type-specific regulatory networks) serve as a prior for which transcription factors regulate which targets. So the ATAC-seq peaks and the motifs in them serve as a prior to guide a model of the regulatory interactions and influences between transcription factors and their targets, which in turn drive the observed covariance structure in the data, which we also want to estimate. We can observe all of this in the single-cell data, which we can cluster to empirically measure the covariance relationships that are part of the model. So the bottom row of the model is observed, and the top row is latent variables that we're trying to learn. An important point is that the ATAC-seq only serves as a prior: about half of transcription factors do not have a known motif, so you can't even use ATAC-seq to make inferences about them, and many of them play important roles. We built a plate model, with an EM implementation in Edward, and it's really a generative model that tries to mimic transcriptional regulation.

This model works beautifully. We benchmarked it in PBMCs, where it recapitulated all the known biology. When we applied it to this data, particularly in the responders, we managed to pull out a lot of the canonical master regulators, TFs that have strong effects on all these very important immune checkpoint genes, and we saw completely different regulators for each of the cell types: the terminally exhausted cells that follow the tumor burden, and the two progenitor exhausted populations that grow. We recapitulated a lot of known biology, including some poster-child transcription factors such as TOX and TCF7. We found TOX right as it was being discovered elsewhere; as we were finding it, lots of Science and Nature papers suddenly came out. I stress TOX because it's a poster child for response to immunotherapy, and it's one of those important transcription factors that does not have a motif: if you limit yourself to motifs, you absolutely cannot learn what it's doing. And you can go all the way down to patient-specific regulatory networks. Now, I do believe the master regulators are captured, because theirs is a very strong signal: they regulate a lot of genes. I'm not going to bet my savings on each individual link here, but a lot of known links were recovered, and what I like about this is that you can go back to the data and see negative co-expression, negative regulation, positive regulation; you can actually look at the data supporting each one of these edges, and many known edges were found. This really excites our immunotherapy friends, because they see in these networks a lot of new targets and new avenues for potential therapies and new ways to get this to work. And I know that deep neural networks are the thing to do these days, but I actually like Bayesian graphical models because they give you generative, interpretable models. If we want to understand the biology, it's critical to be able to interpret things, and we're not quite there with the autoencoders for this type of problem. I'm actually working a little on how to make them more interpretable, but for me, interpretability is one of the most important features of a model.
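In that spirit, here is a minimal sketch of the motif-as-prior idea, not the actual plate model: each target gene is regressed on transcription factor expression, with (TF, target) weights allowed only where the ATAC/motif prior permits. The real model is generative, fit with EM in Edward, estimates covariance jointly, and can handle TFs without motifs; this masked ridge regression on simulated data just shows the skeleton.

```python
# Sketch: motif-in-peak evidence as a structural prior on which
# (TF, target) regulatory weights are allowed to be nonzero.
import numpy as np

rng = np.random.default_rng(1)
n_cells, n_tfs, n_targets = 2000, 20, 100
TF = rng.normal(size=(n_cells, n_tfs))                  # TF expression
prior_mask = rng.random((n_tfs, n_targets)) < 0.2        # motif-in-peak prior
W_true = prior_mask * rng.normal(size=(n_tfs, n_targets))
Y = TF @ W_true + 0.5 * rng.normal(size=(n_cells, n_targets))

lam = 1.0
W_hat = np.zeros((n_tfs, n_targets))
for j in range(n_targets):
    allowed = np.where(prior_mask[:, j])[0]              # TFs the prior permits
    if allowed.size == 0:
        continue
    A = TF[:, allowed]
    # Ridge solution restricted to the allowed TFs for this target.
    W_hat[allowed, j] = np.linalg.solve(
        A.T @ A + lam * np.eye(allowed.size), A.T @ Y[:, j]
    )
```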
And of course, we can look at a validation cohort and see the same trends in a completely new cohort. Even at the level of protein, where we threw in some CITE-seq, all the canonical markers that make immunologists happy were there. The populations that we predicted should expand in responders again responded and grew in the new cohort, and the same master regulators we found previously were again the master regulators in the validation samples, showing the robustness of the approach on a completely independent new set of samples.

So, to recap: the manifolds we saw earlier, the manifolds Fabian showed you this morning, are really driven, at the bottom line, by gene-gene covariation. That is what creates these structures. And yes, gene-gene covariation is nasty because it's computationally heavy, but it's an important biological entity, and using it we can really try to understand what's happening at the level of individual patients and distinguish responders from non-responders, as long as you treat clinical cohorts with a lot of care.

Now, one of my favorite things about single-cell data, the thing that makes my passion tick, is that the data are asynchronous, so you can actually get dynamics from a single sample. If you take a single bone marrow sample from a patient, or from any person, you get all of hematopoiesis: the early hematopoietic stem cells, all the progenitors, and all the distinct mature immune cell types, the dendritic cells, the monocytes, the lymphocytes. Development is, in fact, the first principal component of the data. We first noticed this in 2011, when even the first principal component of B cells looked something like B cell development, and you can get dynamics and regulation. The same holds for EMT: you get the epithelial state, the mesenchymal state, and every stage in between, really giving you the ability to infer the regulators of the system. A common way to represent these manifolds is with graphs: each cell is a node, connected by edges to its most similar neighbors. Rather than working in the high-dimensional Euclidean space, you work in the low-dimensional geodesic space, traversing only regions where cell phenotypes, where cell states, actually exist. If you want to learn development, you have to go through a series of observed cells that actually exist. And as I said, I really like working with graphs: they're computationally slower, but they leave me in a very interpretable domain. So pseudotime, the ability to recapitulate differentiation, has been one of the most powerful uses of single-cell data. The idea is to order cells from a single sample along the pseudotime of their developmental maturity, and we were able to get really accurate progressions in the context of development and discover the order and timing of key events.
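A minimal sketch of the geodesic idea: build a kNN graph over cells and measure progression as shortest-path distance from a chosen early cell, so that distances only traverse observed cell states rather than cutting across empty Euclidean space. Real pseudotime methods refine this considerably with waypoints and probabilistic averaging; the data and parameters here are toy.

```python
# Sketch: pseudotime as graph (geodesic) distance from a root cell on a
# kNN graph, instead of straight-line distance in expression space.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import dijkstra

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 30))            # stand-in for PCA-reduced expression
G = kneighbors_graph(X, n_neighbors=15, mode="distance")

root = 0                                   # an early (e.g., stem-like) cell
pseudotime = dijkstra(G, directed=False, indices=root)
order = np.argsort(pseudotime)             # cells ordered by maturity
```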
The first time we did this, in 2014, we actually managed to find a tiny population. The critical thing, again, is that we had to collect many cells, because this population was very rare: seven in 10,000 cells. But a lot of important stuff happens in this population. This is where VDJ recombination is happening; this is when the B cell receptor is being shuffled and the DNA gets messed up. And right in these cells there's an active signaling checkpoint. In the cells just before them, nothing is happening; this checkpoint, this signaling event, is very specific to this tiny population. To give you a feel for how tiny it is: it's that little red sliver here, expanded into a circle, and population three right here is the smallest slice of the pie out of that tiny sliver. To get this population, we had to have very accurate algorithms and collect a lot of cells; this is why we really push for so many cells in single-cell RNA-seq. And this is the population that goes awry in pediatric leukemia; this is the checkpoint that gets messed up there. By knowing normal biology, by understanding normal development, we could finally understand pediatric leukemia, but we had to identify this rare cell type to do so. In biology it's often the rare population that matters, and a lot of machine learning attends to the average and ignores these rare events, which I have found to be the most important in biology.

Moving to the more modern era: instead of finding a linear trajectory, we can try to understand development and fate potential by modeling fate probabilities with a Markov chain. We have an undirected neighbor graph; we can use pseudotime to orient it, giving a directed graph from which we can build a Markov chain, and then we get access to all the math, tools, and power of Markov chains. But again, when you're building these things: in talks, people tell the big picture, and yes, here we built a Markov chain, and it seems very simple. But one thing I've experienced in computational biology is that the devil is in the details. It's all the small details of how exactly you build the cell-cell similarity and how you sample this graph properly. One important thing to know about these manifolds is that they're very non-uniform: there are about five orders of magnitude between the frequencies of cell types. Many algorithms assume a uniform density on the manifold, including the famous UMAP algorithm, but these manifolds are very non-uniform, and as I've stressed, sometimes the rare populations are the most important ones; they're certainly the ones driving these transitions. So we have a max-min sampling algorithm that tries to cover the entire manifold, and covering the rare populations effectively is critical for any success. Once you've constructed your Markov chain with a little bit of thought, you can use its full power: you make the terminal states absorbing states, and you have all the closed-form solutions of linear algebra from powering the matrix. For each cell, you can predict, in closed form, its probability of reaching each terminal fate. I really like watching algorithms in action; I think it gives a lot of intuition. This is the original Markov graph, where every row and column is a cell and the dots represent edge strength. As you power the matrix, you take longer and longer paths, reaching farther and farther states, and you get the probability of reaching each terminal state after 50 steps, 500 steps, and so on, as this converges in closed form. For each cell you get the probability of each terminal fate as well as the plasticity: how uncertain, how plastic it is with respect to reaching the different end states. From this Markov chain, the structure, the process of development, cell fate choice, differentiation, is all encoded in the graph structure, and you can make a lot of inferences using simple linear algebra.
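The linear algebra behind the closed-form fate probabilities is worth seeing once. For an absorbing Markov chain with transient-to-transient block Q and transient-to-absorbing block R, the absorption (fate) probabilities are B = (I - Q)^{-1} R, and the entropy of each row of B is one natural plasticity measure. A toy five-state example (the transition matrix is made up):

```python
# Closed-form fate probabilities from an absorbing Markov chain.
import numpy as np

# Transition matrix over [3 transient states | 2 terminal states]; rows sum to 1.
P = np.array([
    [0.5, 0.3, 0.1, 0.1, 0.0],
    [0.2, 0.5, 0.2, 0.0, 0.1],
    [0.1, 0.2, 0.5, 0.1, 0.1],
    [0.0, 0.0, 0.0, 1.0, 0.0],   # terminal fate A (absorbing)
    [0.0, 0.0, 0.0, 0.0, 1.0],   # terminal fate B (absorbing)
])
Q, R = P[:3, :3], P[:3, 3:]
B = np.linalg.solve(np.eye(3) - Q, R)   # fate probabilities per transient cell

# Plasticity: entropy over terminal fates; high entropy = uncommitted cell.
plasticity = -(B * np.log(B + 1e-12)).sum(axis=1)
```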
It sounds simple, but it's actually an incredibly flexible framework, because there are a lot of knobs we can turn. We can change the features and the way we measure the similarity matrix, choosing biologically meaningful genes. We can use epigenetic marks, which I showed you were so effective in that first CML application. And we can change the way the edges are oriented, and I'm going to show you some examples of that. We don't have to use just pseudotime; for example, we can use genetic mutations and their accumulation in the cancer case.

Applying this to a case of early development: this is the developing embryo, actually a mouse embryo, and this is the very first lineage decision of the developing embryo, between embryonic and extra-embryonic cells. Because we can order the cells so accurately along pseudotime, we can see what changes and what happens to these cells as they make a fate choice between two of the earliest lineages. We can see how the major regulators in the cells change along this pseudotime and actually figure out what needs to happen. This dotted line is the probability of going to the PrE lineage, and when a cell chooses the PrE lineage, what happens is combinatorial regulation of these two receptors: when you have this combinatorial interplay of the two receptors, the cell is driven toward the PrE lineage, which we could validate by imaging. We can also discover some really surprising things. Here again is that high-entropy area where the cell is trying to decide whether to be embryonic or extra-embryonic. Then it makes the decision and everything seems fine, but we actually found another surprising region of high entropy, where cells that had decided to become epiblast, to become embryonic, pattern the body, and be part of it, change their mind and become extra-embryonic, contributing to the placenta, the yolk sac, and the other extra-embryonic tissues. That was Palantir's prediction, and the biologists didn't really believe us; they thought it was an artifact. We went in and did all the due diligence of asking how strong the signal was and how much we believed in it, and we said we believed in it. The biologists went back in and used lineage tracing and Cre-lox systems, two different systems, to validate our algorithm's prediction, and discovered a really novel trans-differentiation in early development. This work, together with Kat Hadjantonakis, really rewrote the developmental biology textbook on five major findings.

Now, we can apply this not only to genes; as I said, we can change the features. This is work with Prisca Liberali, and here the features actually come from imaging of organoids: not individual cells but entire organoids, their shape, their size, their symmetry breaking, the number of ways in which they break their symmetries. You can extract these features from the images, and some of the features are actually protein expression. This is the exact same algorithm. The only thing we're changing is the features we feed it; the only thing we're changing is how the graph is constructed.
And now we can connect protein expression, in this case YAP1, with symmetry breaking, and actually see which protein, which factor, is driving this phenotypic outcome of symmetry breaking in the organoid, showing how powerful a biological discovery tool this very simple Markov chain concept is.

Now, getting the right orientation is a really important thing here. One really popular method for orienting these graphs, to try to get a handle on causality, addresses the fact that all the classic pseudotime methods make a really strong assumption: that cells go from less mature to more mature. You have to know what the starting point is, and it's a very strong assumption that cells go forward in development; we know that's not always the case, for example in regeneration. So here one uses the ratio between unspliced mRNA, which comes first, and spliced mRNA to try to get at causality. RNA velocity has been really powerful, and you see these very beautiful visualizations of the data, but they are very, very smoothed-out two-dimensional representations, and something smoothed onto two dimensions to look nice can be very misleading. In fact, take anything in two dimensions with a grain of salt, because two dimensions can often misrepresent what's happening in the higher dimensionality. When you look in the higher dimensionality, you see that these velocity vectors can change even between neighboring, similar states, point all over the place, and point outside the manifold, in directions that are impossible for cells to go. They're actually really, really noisy. So together with Fabian, from this morning, and led by a great student in his lab who visited my lab, Marius, we decided to marry the best of both worlds: Palantir, the Markov chain on the graph manifold, the idea that this graph manifold captures the legal, possible regions of cell states and gives you the global structure of the data, and these local velocities, which locally give you some inkling of causality, some inkling of direction. Marrying the two means that when we direct our Markov chain, we give higher priority to directions that align with the velocity, but we put them in the context of the Markov chain. We also use expression, particularly in areas where the velocity is noisy, and we look globally at longer-range trends through the Markov chain, and you get the best of both worlds, constraining progress through the phenotypic manifold. And again, a lot of the devil is in the details, not only in the modeling assumptions but in the math to get these approximations done well. A very important part of CellRank is using Generalized Perron Cluster Cluster Analysis (GPCCA) to get a coarse-graining, which helps in both accuracy and speed: it takes these very high-resolution single-cell transition matrices down to metastable states, to really understand the dynamics of the system and automatically infer initial and terminal states.
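A minimal sketch of the marriage just described, loosely in the spirit of CellRank's velocity kernel (the real implementation is sparse, models uncertainty, and combines several kernels): the kNN manifold stays the backbone, and each cell's outgoing edge weights are biased toward the neighbors its velocity vector points at. Data and parameters are toy.

```python
# Sketch: orient a kNN graph by cosine alignment between each cell's
# velocity vector and the displacement to its neighbors, then row-normalize
# into Markov transition probabilities confined to the manifold.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 20))              # cell states (e.g., PCA space)
V = rng.normal(size=(500, 20))              # velocity vector per cell

k, temp = 15, 0.25
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idx = nn.kneighbors(X)

T = np.zeros((500, 500))
for i in range(500):
    nbrs = idx[i, 1:]                       # skip self
    disp = X[nbrs] - X[i]                   # displacement to each neighbor
    cos = (disp @ V[i]) / (
        np.linalg.norm(disp, axis=1) * np.linalg.norm(V[i]) + 1e-12
    )
    w = np.exp(cos / temp)                  # favor velocity-aligned edges
    T[i, nbrs] = w / w.sum()                # row-normalized Markov transitions
```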
In biology we don't always have ground truth, and it's really great when you can find it. Here is a case of cell tagging in an in vitro reprogramming experiment, where cells are tagged in such a way that we know the progeny of each cell. Over the days, we can see which cells are progeny of which, which cells share a cell tag. During the reprogramming, many of the cells succeed and reach the endoderm lineage; this is success, and CellRank automatically identified the initial state and the success state, while other cells reach a dead end. Now, if you look at the two-dimensional velocities, they look really good, they look like they explain the data, but these velocities are all wrong and misleading: there's no path to success, even though we know many cells succeed, and there even seems to be a path from success to a dead end, which is biologically incorrect. But when we applied CellRank, which overcomes all this by looking at long-range structure, we could actually predict very correctly. We could take cell clones and ask where their progeny are going to go, whether the progeny will succeed or fail, and we see very high accuracy in CellRank's predictions. The accuracy obviously gets better and better the closer we get to the endpoint, the further we are into the reprogramming, but even 14 days before the reprogramming ends, we have a pretty accurate prediction, 82% accuracy.

Now, a lot of people like these in vitro systems, because in vitro you can actually build a ground truth. But I have to warn you that in vitro systems are always much, much simpler than real biological in vivo tissue. There are a lot of algorithms that work beautifully in vitro, with wonderful behavior, so you get the idea that these algorithms are working really, really great, but they've been optimized to the simplicity of in vitro, and they completely break in vivo. You have to be aware of that and not do everything in vitro, even though it's appealing because it has a ground truth: it's just far simpler in its behavior and distributions, and if you optimize an algorithm to something too simple, it's going to break when life gets complicated. So I like testing things in the in vivo setting too, in cases where it's harder to find ground truth but the ground truth is known. This is a well-known system, pancreas development, where people know where each of these progenitors is supposed to go, because it's been solved through very careful biology. Here CellRank automatically finds the path to the delta cells, automatically identifies their progenitor cells, and even identifies the key driving regulators underlying delta cell identity, and this signal completely doesn't exist in plain velocity, so you really need the marriage of the two. We also tested this in regeneration, which is a place where Palantir fails miserably, because it's de-differentiation. Here CellRank predicted regeneration, a de-differentiation from goblet cells to basal cells, going the opposite way, from more mature cells to less mature cells, where regular pseudotime would break. We also predicted all the intermediate stages that this goes through, and this was validated to some degree by imaging. And this is new: going forward, basal cells always go through club cells; there's no direct forward path between basal and goblet cells, but we found this interesting reverse de-differentiation, because velocity did not predetermine the direction, and CellRank could go in the backward direction here. But again, velocity, as I said, is not a magic bullet; it can be incredibly misleading.
Splicing dynamics really depend, first of all, on whether you're within the roughly eight-hour window. A lot of people try to use velocity over two- or four-week windows of development, or in cancer, and that's not going to work. You have to really understand the assumptions of your system, where something can work and where it can't: velocity has a time limit of about eight hours. There are also systems that simply don't have enough introns to capture good velocity, where all the master regulators, all the important regulators and drivers of the process, don't have any unspliced reads for the algorithm to get a grip on. Hematopoiesis is one of these. Hematopoiesis is where velocity goes systematically wrong, and while CellRank can make some corrections, when things are systematically wrong, when velocity is completely off, even CellRank can't save it. So here we developed a new approach, again back to regulation. Here's Cassandra again, my graduate student, and she said: let's use regulation itself, the idea that transcription factors precede their targets, to derive these velocities. So here we use a regulatory model to build the velocities, and we go with a linear model. We know linear is a simplification, but it's robust, it has a closed-form solution, and we can build a genome-wide model. With this model, we can use regulation to predict a future cell state given the transcription factors and the regulatory model of the current state. Of course, you need to learn the regulatory system, which TFs regulate which targets and how, and you want to learn this from the data. But you have two unknowns: you don't know the slopes of the velocities, and you don't know the regulatory matrix. That's where EM comes in, when you have two unknowns, and of course EM needs good constraints and good starting points to succeed. I'm not going to go into all the details, but rather show you that it really works: it gets hematopoiesis quite accurately, and it works not only at the global level of velocities. You can also look at what each individual gene is doing, all the genes we care about, as cells decide, in this case, between megakaryocytes and erythrocytes: the right genes go up and down, and we can try to understand the behavior by going all the way back to the regulatory models and asking which activators and repressors drive this observed behavior. Again, we get a lot of correct inferences here; not everything in the entire global model is correct, but enough is correct to at least get the global dynamics of the system right. And one of my favorite things here is that we can do an ablation experiment, an in-silico ablation experiment, and ask which are the master regulators, the real regulators, that are driving the cell fate changes we care about. For hematopoiesis, for example, we managed to correctly predict the most important regulators, and they're not necessarily the most highly expressed ones: we've actually computed their impact on their targets, and shown that when you knock out a regulator and remove it from the model, there's a huge change in the behavior and the velocities.
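A minimal sketch of the in-silico ablation idea under a linear regulatory model: fit velocity as TF expression times a regulatory matrix W, zero out one TF's row, and score how much the predicted velocity field moves. The actual method learns W jointly with the kinetics via EM; here the fit is plain least squares on simulated data, and all names are illustrative.

```python
# Sketch: in-silico TF knockout under a linear regulatory velocity model.
import numpy as np

rng = np.random.default_rng(4)
n_cells, n_tfs, n_genes = 3000, 15, 200
TF = rng.normal(size=(n_cells, n_tfs))
W = rng.normal(size=(n_tfs, n_genes)) * (rng.random((n_tfs, n_genes)) < 0.1)
velo = TF @ W + 0.3 * rng.normal(size=(n_cells, n_genes))

# Least-squares fit of the regulatory matrix (closed form).
W_hat, *_ = np.linalg.lstsq(TF, velo, rcond=None)

def ablation_impact(tf_index):
    """Change in the predicted velocity field when one TF is removed."""
    W_ko = W_hat.copy()
    W_ko[tf_index, :] = 0.0                  # in-silico knockout
    return np.linalg.norm(TF @ W_hat - TF @ W_ko)

impact = np.array([ablation_impact(t) for t in range(n_tfs)])
master_regulators = np.argsort(impact)[::-1]  # biggest movers first
```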
We're also applying this in the cancer setting. Here we have a rapid autopsy, with multiple samples from the same deceased individual: multiple primary samples of the tumor as well as multiple metastases from the same individual, including three separate liver metastases, which is the closest thing you're going to get to a biological replicate. Here, too, we see these transitions, these continuums from primary to metastatic states, and different paths to metastasis, which we're learning. So, to recap: these Markov chains, these trajectories, this ability to get regulation from the data, are really powerful for understanding how the biology works and what's driving it. If you know what's driving it, you can potentially, for example, stop metastasis in its tracks.

But sometimes the simplest things work best. Here's a case of leptomeningeal metastasis, where we went to the clinic and got data. This is a really horrible disease: patients suffer a lot of physiological effects, which is why these cells are removed and we can actually look at them and assay them; survival is 3.5 months, and there's nothing to be done. And here the simplest of things, clustering and differential expression analysis, got the answer. We saw that two regulators of iron transport get overexpressed in the cancer cells, even though normally they're only expressed in immune cells, and this happened in every single patient: in leptomeningeal metastasis from breast cancer and in leptomeningeal metastasis from a primary lung tumor. This gave us a new therapeutic approach that's going into patients, in under six months from data to clinic.

I don't have much time left, so I'm going to go really quickly. What we learned here is that cell-cell interactions really matter, so the future is really imaging. This is a great technology that allows us to look at a complete lymph node, but actually creating the images is pretty hard, because multiplexed data is complex and there's a lot of damage to the tissue. When you realign it and try to get to high single-cell resolution, these little marks are supposed to align, they're all supposed to be on the same cell surface, and you see that they have moved, and moved differently in different parts of the tissue, so no single transformation can fix it; it's not a simple linear transformation. This, finally, is where I think deep AI works, and we've developed our own little learning method, which we call SPIRIT, to realign these tissues, getting an essentially perfect picture of the lymph node. Now we can actually look at these interactions and learn tissue modules at a multicellular scale. We can automatically recognize the germinal center, which has very different cells and cell-cell interactions, and even within the germinal center we see additional structure, and at another scale, different regions. Here we see a region where the B cells, in blue, are proliferating, in yellow: these are the B cells that are growing and preparing antibody, perhaps after you got a COVID vaccination, and you see the proliferating region and the T cells around it in cyan. Right next to it is a completely different region where the B cells in blue are not proliferating, and that's because there is a regulatory T cell, in red, inhibiting them. Of course, inhibition is critical; otherwise our immune system would go out of control. This is again very early work in progress, but the future really is going into these tissue contexts with imaging,
and that's where I actually think a lot of the deep neural network approaches will be very powerful and can help us interpret it. So, in summary: single cells really allow us to understand the system, the biology, the transitions, the regulation. If you understand that, you can know where to interfere. And of course the next frontier is space, where there's a lot of work to do. This is work by a lot of people in my lab and a lot of great colleagues at MSKCC and elsewhere: particularly Cathy at Dana-Farber, Kat locally, and Sasha, with whom I'm doing the imaging, as well as many great members of the lab, again highlighting Elham, Manu, who is also now independent faculty at Fred Hutch, and Cassandra and Doron, who are graduate students in my lab. Thank you for your attention.

Thank you for a very inspiring talk and an excellent overview of the challenges and possibilities of single-cell research. Thank you very much; we send you a round of applause virtually. Now it's time for questions. Are there questions from within our network? Please raise your hand if so. If not, I'll go directly to the questions from the YouTube audience. There's one question here: in the DLI example, how did you find the correct number of clusters of temporal patterns?

Okay, there are a couple of steps to it. The clustering was done by BISCUIT, and BISCUIT automatically finds the number of clusters based on the number of densities it finds in the data. Of course, "automatically" is a bit of a cheat, because it all depends on the prior and the tuning of the parameters, but it pretty much converges, and we see that the clusters are quite distinct and separable, each with its own differentially expressed genes and its own covariance structure. The fact that we see the same clusters in different cohorts, that they reproduce, gives me some confidence that these are good clusters. Now, many of these clusters are actually really small, and again, I skipped over a lot of the details: to match clusters and to get sizable clusters, we actually went to meta-clusters, merging very similar clusters, and then used those in our dynamic model. It's really hard to learn; clustering is imperfect, T cells are a continuum, the borders are imperfect, the data is noisy, so learning the dynamics of tiny clusters that are a small fraction of the data is hard. So we pooled a bit here, and again we put very strong priors on the fact that our measurements aren't very accurate, in order to be able to get those dynamics. And when we did all that, the dynamics were clear and reproducible, and importantly, reproduced across multiple patients.

Thank you for that answer. Joe Bani from our network, a PhD student, has a question. Joe Bani, please.

Hi, thank you. First of all, it was a really fascinating talk. I have a somewhat practical question. Since you mentioned a few times the comparison with neural networks and that type of machine learning, what I wanted to ask is: how scalable are these methods, and how much data do you need to get good results? For example, I noticed the case in which you applied it to a set of six patients, which for other methods might not be a very big sample size.

So again, each method is very different. The Bayesian graphical model methods scale to the order of 100,000 cells, and as you can see, that was sufficient to learn something. The graph-based approaches and the Markov chain based approaches can scale to the low millions, especially if you use
computational tricks. CellRank actually has a lot of computational tricks; again, we use the sampling in Palantir as one computational trick and the coarse-graining in CellRank as another, so that scales to millions. The cell-to-cell similarity computation is the painful part of the process, but again, there are all sorts of tricks to make that quicker. Now, I do want to stress that usually six patients doesn't cut it. That was a combination of luck and good choices, and that's why I stress getting a good cohort; we often have cohorts that are much bigger. Biology depends on the strength of your signal and how pure your cohort is. When I work with mice, and they're genetically the same and grown in the same cage, so I can really have a true biological replicate, then five and five are enough for responders and non-responders; that's a big effect size. In patient cohorts, the reason we could succeed is that all these patients had the same treatment history and were treated with the same drugs. The moment you have heterogeneity in the patients' past drug treatments, your homogeneity and your chance of success go out the window. If you can't get a cohort that shares a treatment history and has some equivalence in the way the patients were treated, your ability to learn anything from it is gone. Actually curating such patient cohorts is really hard, and kudos to Pavan for searching through the entire DFCI biobank for such a curated dataset.

Thank you for the answer. Now a question from the YouTube audience about the clinical time-series analysis: how did you deal with not having matched time points across patients, and would your approach also work on bulk data?

That's two questions. So, the Gaussian process: one of the reasons we went with a Gaussian process is that it can by nature handle unmatched time points, because it puts time in as one of its variables. I'll quickly re-share the screen and show that the time points really were not matched. We did make sure that our cohort had one matched time point right before therapy, and then an approximately matched time point some number of days after therapy; not perfect, this is patients, but close. For the rest of the time series, you can see the samples came at all sorts of odd time points (the x-axis is time), and the Gaussian process handles exactly this non-matching. But there are a couple of important points: getting a sample close before DLI, I think, was very important for success here. As for the question about bulk data: I just think single-cell data is so powerful. Yes, I had only five patients, but I had 80,000 cells, 80,000 data points from which to learn these networks, and I could break them down. For example, the progenitor exhausted cells, the ones that grew during response, would be completely drowned out in bulk; they're a tiny part of the population. Even the terminally exhausted cells that predicted response to therapy are a relatively small proportion of the population. So from bulk I wouldn't have seen much of the predictive signal. Yes, the terminally exhausted signal was actually visible in bulk, but certainly the progenitor exhausted cells, the responding populations, understanding the response, I would not have been
able to do from bulk. And for me, yes, getting patient samples is hard; it's a lot of work, and a lot of clinical trials only have 30 patients, even if you have all the money and resources in the world. So the one place where you can get more information is by looking at a lot of cells, and that's why I love single cell.

Thank you. I have one technical question, namely: you mentioned, correctly, that details matter in the construction of these graphs, and the first thing you mentioned was this cell-cell kernel. What about the parameters of the actual neighborhood graph construction: how many neighbors do you connect each cell to, and what is your threshold for cutting off similarity?

First of all, because details matter, I'm a stickler for writing really long and detailed supplementals; I really get annoyed when people skimp on that. So for anything I told you that's published, all the gory details are in these 80-page supplemental documents. In fact, my lab laughs at me for the kind of supplementals I demand of them; they say there's peer review, and then there's Pe'er review, which is much worse. But specifically, as I said, the cell density is what matters, and you want to even out the cell density. If you want to find fine structure, you want the k of the original graph to be rather small, so that you can really resolve these fine structures, but of course you need k large enough that the graph is connected; if you have a disconnected graph, you're stuck. So we look for a fairly small k that keeps the graph connected and is also pretty robust. For anything we do, we always show that we get the same result for a range of parameters; if a parameter really mattered and we got a different result, we wouldn't consider that robust, so we also show that the particular choice of k didn't matter. The moment you have a k large enough to keep the graph connected and large enough to keep your results robust, a good range of k's where nothing changes, that's the k you want for your k-NN graph. And then, because this is a Gaussian kernel and you don't want dense regions with dominant nodes to completely control the graph (you want the graph fairly uniform), you say: for each node, I'm going to take a sigma for the Gaussian kernel that gives me approximately k neighbors. It's an adaptive kernel, giving an approximately k-neighbor sigma neighborhood for each cell, depending on its local density, and again, that k keeps the graph connected and the results robust. That's the way we generally do it; the details are slightly different in some of the different papers, but all of it, what we did and why, is described in gory detail in each supplement.
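A minimal sketch of the recipe just described, with illustrative defaults: a small k, a connectivity check that grows k if needed, and a per-cell bandwidth set to the distance to the k-th neighbor, so each cell gets an approximately k-neighbor Gaussian neighborhood regardless of local density.

```python
# Sketch: adaptive-bandwidth Gaussian kernel on a connected kNN graph.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def adaptive_knn_kernel(X, k=15):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]          # drop self-neighbor
    sigma = dist[:, -1]                          # per-cell bandwidth: k-th NN distance
    w = np.exp(-(dist ** 2) / (sigma[:, None] ** 2))
    rows = np.repeat(np.arange(X.shape[0]), k)
    A = csr_matrix((w.ravel(), (rows, idx.ravel())), shape=(X.shape[0],) * 2)
    A = A.maximum(A.T)                           # symmetrize
    n_comp, _ = connected_components(A, directed=False)
    if n_comp > 1:                               # grow k until the graph connects
        return adaptive_knn_kernel(X, k + 5)
    return A
```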
Thank you. I'll repeat one more question that Fabian was asked this morning, asked by someone else, not by me. The question was: what is, or what will be, the first clinical application of single-cell technology?

Well, I think I've shown you an example. Right now I know of at least seven clinical trials that have already been driven by single cell, where people have found new targets and new combination therapies. I showed one example, with the iron transport in leptomeningeal metastasis; I have two more of these at MSK. Aviv Regev has about three, where she found new approaches, new therapeutic targets, new combination therapies, and Nir Hacohen, I know, has one. So this is a very powerful method for getting to new targets, new combination therapies, new molecular targets, new treatments, and I'm expecting that to expand. So we're past the first.

Yes, exactly; that's why I didn't say "will be" but "is". So it has already happened.

Yes. Again, when you look at this resolution, you can really understand what's going on, why something is happening, what the best place to target it is, and what cell type it needs to be targeted in, and this is how you discover new therapies.

Exactly, a very exciting topic. Thank you so much for joining us today, Dana; we all enjoyed it very much. We're very happy that you took the time to speak here and that we learned more about machine learning and its role in single cell. Thank you very much.