So I'm going to continue from where we were yesterday. Just to briefly review: yesterday we discussed some ideas about cell fate choice and then looked at the way these questions are addressed in cell biology by tracking lineages and defining cell states. We then looked at some single-cell technologies and focused on single-cell RNA sequencing. We took a fairly deep dive into the statistics and error modeling and the decisions that are made in order to try to find structure. Today I want to focus mostly on some formal methods, and then on some practical applications of those methods to a problem in hematopoiesis. And just to remind you, the idea is to try to separate out the heuristics, which we can do by visualization, from thinking more formally about the data. Yesterday I also showed you that we had fun making k-nearest-neighbor graphs in order to visualize continuum trajectories, and I showed you a few examples. Today the focus of the second half of the talk is going to be to come back to this hematopoietic data set and to try to understand what statements can be made based on the topology of the structure. To give you a sense of where this is going, because I'm going to stop talking about data for a bit: wouldn't it be nice if we could take every cell in the hematopoietic trajectory (this is the same picture that we had in the previous slide; I just cut off the large branches so that we can focus) and assign it a vector of fate probabilities, describing where the cell might end up if we actually followed its dynamics? This would be a set of predictions. And the question is, can we generate those predictions? If those predictions turn out to be useful, then remember that every one of these cells has a transcriptome associated with it.
So we can then start to ask how specific factors, such as transcription factors or receptors or chromatin modifiers or cell cycle, associate with different parts of this plot. We could, for example, cut out all of the cells which are committed to the erythroid or basophil lineage and ask what determines whether you're going to become an erythrocyte or a basophil, and so on. That's essentially where we would like to go. It's a bit fancier than just parameterizing this space, because we are going to try to make a statement about how the cells behave rather than just describing the space. So the types of questions are, once we've got that: we can ask whether there are particular states which correspond to cells that are committed in a very particular way; we can try to decompose this space and ask whether there are certain correlations in fate, and so on. This is the picture that we'd like to somehow get to. And then, of course, we could try to put it to the test. Again, I'm skipping ahead, and then I'm going to leave this for a bit: we could sort out subpopulations of cells, such as shown here with these red dots, then look at their fate in an assay, such as here, where this is an in vitro colony-forming assay, and ask what we would predict. If we can do this repeatedly and gain confidence in our predictions, we might start to believe that we've really described the population structure in a good way. So that's basically the idea. And I think at this point I'm going to switch off the screen for a bit and move to the board. I should say as a major caveat, one of the disadvantages of being in a medical school is that I haven't actually presented this on a whiteboard before, so you're my guinea pigs here. And please give me any feedback, particularly good feedback. No, I'm just joking.
Give me any feedback and do stop to ask questions along the way, because I haven't tried to work through the theory on a whiteboard in front of an audience before. So let's just see how this goes. And for those of you who are less mathematically inclined, hopefully these pictures have convinced you that it's worth waiting until we get to the end of this. OK, so now the question is: we have sampled cells, and we have to ask how we are going to generate some predictions for their behavior. One starting point (there could be others, and this one can be criticized; I actually want to break down the assumptions here) might be to think about cell dynamics as a Langevin process. So we can write down that the dynamics are some velocity, dx/dt = v(x), and in principle, if we wanted to be very general, we would say that it depends on the state. Here x is the state of a cell. It's not the x, y, z coordinate of a cell; we're not taking into account spatial structure, and we'll discuss that in a minute. We're just taking into account a list of all of the gene expression states. So x1 is the first gene in our list, x2 is the second gene, and so on. And v is the dynamics of gene expression: anything that's implemented by a gene regulatory program, by signaling and so on, leads to changes in gene expression, and those changes are these dynamics. Now you can imagine that there is some set of noise terms which, in principle, are also state dependent. So I'll put a mobility here, a diffusion coefficient D(x, t) which is state and time dependent, and couple it to some noise sources: dx/dt = v(x) + D(x, t) η(t). A general observation here is that D can be rectangular, because we don't know how many noise sources are really coupled to the system. There could be a number of noise components which are orthogonal to, or not aligned with, the coordinates of the system.
So there could be an independent component of noise for every gene, but there could also be noise components which couple between genes. Yeah, already a question, yes. This is an equation for a single cell. What we are saying is that if I give you the complete state of a cell, then I should be able to predict what it does next. Well, x, in this case, treats gene expression as a continuous variable. Of course, gene expression is not really a continuum; there are discrete numbers of transcripts. But now let's really start to break this down, because this gets to the question of the assumptions going in. There are a lot of assumptions, and a lot of reasons why this Langevin process could be completely wrong in its spirit. So let's think about it. First of all, our measurement of state is mRNA, and the complete description of the state of the cell certainly doesn't stop with the list of mRNA transcripts being expressed. mRNA is translated into protein. There's post-transcriptional regulation and post-translational regulation. There are post-translational modifications. The level of a phosphorylated form of a protein can completely change the behavior of a system, and that won't be captured at all in our measurements. So the observable space that we have is not the entire state space of the system. In addition to that, we have an external signaling environment which in principle is part of the system but is not captured by the transcriptional state. And then there's measurement noise, and the fact that the system itself is noisy. All of this is to say that there's a good chance this won't work. But let's now think about where all of this can enter the system. I would argue that for processes which are fast, we can treat all of the latent variables of the system as essentially noise terms. We're essentially saying that there are some fast degrees of freedom here.
They may be, say, very rapid post-translational modifications which will eventually lead to a change in transcriptional state. And if those changes are occurring sufficiently fast, then we can just tuck them into the noise term, or we can tuck them into the rules of the dynamics. So anything that's fast is either a memoryless noise term or sits in the dynamics: if a particular change in transcription deterministically leads to a change in phosphorylation state, which leads to another change in transcription, then that is a slow process which can be incorporated into v. We're not going to spend too much time defining precisely how all of these unknowns fit into the problem. What we're going to say, generally, is that there are certain situations where this could go terribly wrong, but we're going to develop it, see how far we can get, and see whether the predictions we can make based on this view are correct. It's really useful, though, to at least make a list of all of the ways this could be wrong. So the assumptions, just as bullet points: proteins, post-translational modifications (PTMs), signaling. These are all things which are not well dealt with by this formulation, and we're going to assume that none of them is fatal. All right. Now, of course, we're not following any individual cell over time. What we're looking at is a population of cells. So the generalization of this would be to think about a Fokker-Planck equation, or a continuity equation. Let's now define C(x, t) as essentially a density of states: the density of cells at state x. And for everything that follows, I'm going to massively simplify and assume that D is a scalar, which is constant. So again, we're going to make some simplifications, and I'm going to throw more and more simplifications into the system along the way to make it tractable.
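As a minimal illustration of the single-cell picture, the Langevin equation can be integrated numerically with an Euler-Maruyama step. The two-gene drift below is invented purely for illustration (a bistable first coordinate as a caricature of a fate choice); it is a sketch, not the actual hematopoiesis model.

```python
import numpy as np

def simulate_langevin(v, x0, D=0.05, dt=0.01, n_steps=2000, seed=0):
    """Euler-Maruyama integration of dx/dt = v(x) + noise, with a constant
    scalar D; the noise is scaled so the corresponding Fokker-Planck
    diffusion term is (D/2) * laplacian(C)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    traj = np.empty((n_steps + 1, x.size))
    traj[0] = x
    for t in range(n_steps):
        x = x + v(x) * dt + np.sqrt(D * dt) * rng.standard_normal(x.size)
        traj[t + 1] = x
    return traj

# Hypothetical drift: bistable in gene 1 (two "fates"), decaying in gene 2.
drift = lambda x: np.array([x[0] - x[0] ** 3, -x[1]])
traj = simulate_langevin(drift, x0=[0.1, 1.0])
```

Launching many such trajectories from a snapshot and recording where they end up is exactly the kind of forward prediction we would like to make from data.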
So we're going to see at the end that there are many, many reasons this could go wrong. Now, we can write down that dC/dt is equal to something like (D/2)∇²C. That's just the noise term. Again, we could make this inhomogeneous and so on, but we're going to keep it simple; you could imagine later taking this formulation and elaborating it for spatially inhomogeneous, anisotropic diffusion. It's not difficult to extend if you wanted to. Then there's a term minus the divergence of the flux, −∇·(Cv), where the flux is the density of states times the velocity. And now we're going to add three more terms. The cells are dividing, so their number is not constant, and we might want to add a source term, R₊C, proportional to the number of cells. The cells may also be removed from our system, a term −R₋C, either because they're dying or because we have an experimental system which is purifying cells based on a particular profile. In our case, just to remind you, we are selecting cells based on their expression of a particular receptor known as Kit, so any cell that stops expressing Kit is effectively removed. And finally, in certain situations, we may need to worry about the fact that our purification scheme doesn't capture the base of the hierarchy, so cells suddenly appear in gene expression space; this would be a concentration-independent term. To enumerate: the first extra term is division, the second is death or depletion, and this final term is a source caused by depletion upstream. Essentially, it means that there could be a net flux into a gene expression state which is not explained by the existing cells; they just suddenly appear because of the way we've purified the cells. Sorry? So depletion means that in an experimental context, I have decided to remove cells at a particular point in the space.
And of course, the removal will be proportional to the number of cells in that state. If I have a lot more cells and only 90% removal efficiency, then of course I will still have more cells left. So the death or depletion term is first order in the concentration, but this last term is zero order in the concentration. Oh, right, right. This is the source due to depletion. The idea is, let me try to show this graphically. Say this is a small region of gene expression space, and I have depleted everything that sits to the left; I've removed those cells because they don't express the marker that I've selected for. But there's a net flux through the system. At this point, the flux into the state is no longer dependent on the local concentration; it depends on an adjacent concentration that we've removed from the picture. So I need to worry about that. This is just being careful. In fact, in the particular experimental system that we're considering, there's no reason to suspect any such term, so I've included it here, but we're now going to drop it. This was just to think through the various terms. So, yes? [Audience question about whether cells jump in state when they divide.] The idea is that if there is a change in state upon cell division, it's never discontinuous. You would always be able to find cells that are somewhere along the way, even if there are very, very few of them. So there's no asymmetric cell division? There could be asymmetric cell division, and, OK, that's an interesting point. An asymmetric cell division would be a point where, from immediately before division to immediately after division, you have a discontinuous relocation in gene expression space. That's a very good point. We don't have asymmetric cell division included in this picture, OK? Yeah, excellent point; ignoring it is definitely not safe in general, and I hadn't thought about that carefully.
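To make the population equation concrete in one dimension, here is a crude finite-difference sketch of dC/dt = (D/2)C'' − (Cv)' + (R₊ − R₋)C with zero-flux boundaries. The drift, rates, grid, and initial density are all invented for illustration; with a confining drift and no net division or loss, the density should relax to a steady bump.

```python
import numpy as np

n, dx, dt, D = 200, 0.05, 1e-4, 0.2
x = np.arange(n) * dx
v = -(x - 5.0)                        # illustrative confining drift toward x = 5
R = np.zeros(n)                       # net division minus loss, switched off here
C = np.exp(-((x - 5.0)) ** 2)         # arbitrary initial density

def step(C):
    Cxx = np.gradient(np.gradient(C, dx), dx)   # diffusion term C''
    flux_div = np.gradient(C * v, dx)           # advection term (C v)'
    Cn = C + dt * (0.5 * D * Cxx - flux_div + R * C)
    Cn[0], Cn[-1] = Cn[1], Cn[-2]               # crude zero-flux boundaries
    return np.clip(Cn, 0.0, None)

for _ in range(20000):
    C = step(C)
# C has relaxed to a narrow steady bump centred near x = 5
```

Switching on R would let you watch how division and depletion reshape the steady state, which is the effect the extra terms above are meant to capture.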
So we're not going to take asymmetric division into account. And let's now see. Very soon I'm also going to tell you that we're going to consider a steady-state system. In fact, we have done some work to extend this to the non-steady-state case, where we have multiple time points, but I'm not going to discuss that today. So I'm going to drop the time dependence of the various terms here. Of course, all of these terms are time dependent, and I'm just going to treat the system as steady state. And just to remind you, we're looking at an adult tissue that's turning over, so the steady-state assumption here is very well justified. OK. So now let's define an inverse problem. What is it that we're actually trying to solve? Why are we writing this down? I'm going to pose a fairly restricted problem: given C (actually, you'll see later that we don't really need C itself), and given R₊ and R₋, that is, some rough estimate of where the cells are and what the proliferation rates in the system are. And it turns out that this really doesn't have to be very accurate; you can get very good predictions if you're just roughly in the right ballpark. Oh, and I should add one more thing: some idea of the stochasticity in the system. That's going to be a very poorly controlled parameter, this D, and we're going to play around with it and see what gives reasonable results. Given all that: estimate v(x). So can we, essentially, from a static snapshot, reconstruct some statement about the dynamics? And of course, the reason this would be interesting is that if we knew the dynamics locally, we could then integrate from the state of every cell to find out where it would end up, by forward integration of the Langevin equation.
Formally, of course, we're not going to do that literally, but that's conceptually why this would be an important problem to solve. [Audience question.] Yeah, it's the same v; this is the velocity. And D, yes, this D is the same D. Over here, what I did was to write down something very general and then restrict it: there, D is some very complex mobility, which can vary in space and time and is a rectangular matrix; here, I've said, let's just treat it as a constant scalar and see how far we can get. Because you can always try to fit a problem more accurately; let's just add some noise to the system and see what happens. OK, fine. So that's the formulation of the problem, and now we're going to see whether we can make some statements. But immediately there's bad news. Does anybody want to tell us the bad news? Can somebody see immediately why we can never solve this? Why is this a poorly posed problem? Sorry? Yeah, there's no unique solution, right? This is very simple to see: if v is a solution, then v + φ, where ∇·(Cφ) = 0, is also a solution (sorry, I'm switching my vector notations midway here, which is probably bad). In other words, in three dimensions you could say that a suitably weighted curl of any field, or the general high-dimensional generalization of a curl, can be added, and that will also be a solution to the problem. So given a steady-state density, there is an infinite family of dynamical processes which will give rise to the same density. Another way of saying this is that I can add a rotation or an oscillation to my system, and I would never be able to detect it from a static picture. So at this point, you're probably thinking: what's the point of doing any theory, if we're throwing in one assumption after another?
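The non-uniqueness can be checked symbolically in two dimensions. The density C and the rotational field φ below are invented for illustration: φ is scaled so that Cφ is a pure rotation, its weighted divergence vanishes, and so adding φ to any drift leaves the steady-state equation untouched.

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
C = sp.exp(-(x**2 + y**2))        # an illustrative steady-state density
phi = (-y / C, x / C)             # chosen so that C*phi = (-y, x), a pure rotation

# The steady-state equation only sees div(C*v), so any phi with
# div(C*phi) = 0 is invisible to a static snapshot.
div_C_phi = sp.simplify(sp.diff(C * phi[0], x) + sp.diff(C * phi[1], y))
print(div_C_phi)  # → 0
```

This is the formal version of the statement that a hidden rotation can never be detected from a static picture.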
So I'm going to throw one more assumption into the problem, which is to say: let's look for the solution which is curl-free, the solution which can be written down as the gradient of a potential. We discussed yesterday how bad potentials are, and I think what's nice here is that we can see exactly how bad they are, because the potential solution is just one member of an infinite family of solutions which would give rise to the given density. A potential doesn't have to exist, and it probably doesn't exist in any formal sense, at least as this problem has been posed. So let's write down: we're going to look for v = −∇F. And F is not a free energy, because we're not making any such statement; this is not a closed thermodynamic system. It's just a formal statement about one special case of solutions which will solve this equation. OK. [Audience question, partly inaudible: is there a physical analogy; why should a potential exist?] You could argue that a potential encodes the rules of the gene regulatory network. But of course, if you write down a random gene regulatory network, it will not be described by a potential. The fact is that only a very restricted set of gene regulatory networks would give rise to potential-driven dynamics. So I think the way to think about this is that we have a potential which we can try to use to make predictions. Maybe we shouldn't pay too much attention to whether it somehow reflects reality; it's just an analytical device. And if this works exquisitely well again and again, we'll start to treat this potential as having some real meaning. But for now, I'd be very, very conservative. OK?
[Audience question, partly inaudible: does the rescaling apply to the whole equation, or is C still the same C?] No, only to the term over here. Yeah, OK, you're absolutely right; sorry, I just rescaled my C. Very good, thank you, that's a good point. You can still define this term right here. OK, good. So, to those of us who have stared at Fokker-Planck equations, this doesn't seem terribly challenging. I had a pretty picture for the non-mathematically oriented, but I've switched the screen off, so never mind; you're just going to have to look at me instead. OK, fine. So now let me write down the simplest form that we're going to work with, the problem we're actually going to solve, plugging it all in: 0 = (D/2)∇²C + ∇·(C∇F) + (R₊ − R₋)C. We're going to try to solve this, and I want to remind you that we're not solving for C, we're solving for F, which is actually a much simpler thing: given C, this is just a first-order equation in F. It should not be very complicated. Let me make a bit of sense of it; we're going to run into a problem in a minute which will require a bit of a twist, but let's get there first. It's useful to think of decomposing F into two parts, F = u + v, and it will become clear in a minute why this is useful: u is a confinement potential, which fights diffusion, and v is a transport potential, which delivers cells from sources to sinks. The reason is very simple: this is an inhomogeneous first-order differential equation in F.
So you can just think of F as the sum of special solutions to the two inhomogeneous terms: one term is diffusion dependent, the other term is transport dependent. That naturally decomposes F into two parts; that's just a detail. So we can now write: 0 = (D/2)∇²C + ∇·(C∇u), and 0 = ∇·(C∇v) + (R₊ − R₋)C (there was a stray half in the second one; let's get rid of it). Now the first equation is incredibly simple: we can immediately write down that u = −(D/2) log C (I may be off by a factor of a half there; something like that). OK, that's easy. And now we run into trouble; this is the point where we get stuck. v depends on the sources and sinks and on C. For those who are very formally inclined: of course, there is a general solution which depends on boundary conditions, and if we assume zero-flux boundary conditions, this is the complete solution, up to an additive constant. So now let me let you in on a dirty secret, which you actually all know: we don't know C. This is a very high-dimensional space, and what we've actually done in our experiment is to sample a certain set of cells from C. We don't know the underlying density, and partly for that reason this problem is unsolvable in a direct sense. In very high dimensions, estimating densities is almost impossible; this is the curse of dimensionality. As you increase the number of dimensions, the Euclidean distances between pairs of points all become essentially equal, so it's essentially impossible to estimate densities accurately. The second issue is that because this is a real problem with real data, we are likely going to have to approach it numerically.
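Before turning to the practical difficulties, the confinement-potential solution is worth a sanity check: the claim that u = −(D/2) log C kills the diffusion term can be verified symbolically in one dimension, for an arbitrary smooth density.

```python
import sympy as sp

x = sp.symbols('x', real=True)
D = sp.symbols('D', positive=True)
C = sp.Function('C', positive=True)(x)   # an arbitrary smooth density

u = -D / 2 * sp.log(C)                   # candidate confinement potential
# Plug into 0 = (D/2) C'' + (C u')' and check the residual vanishes identically
residual = sp.simplify(D / 2 * sp.diff(C, x, 2) + sp.diff(C * sp.diff(u, x), x))
print(residual)  # → 0
```

Since the residual is identically zero for any positive C, the factor of a half in u is indeed the right one for the (D/2) diffusion convention used here.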
Now, we're talking about a problem which lives in 20, 50, or 100 dimensions even after dimensionality reduction, and there's no computer in the world which would be able to solve a partial differential equation in that number of dimensions, even if we binarized the data. Even if every gene were either on or off, 0 or 1, we would need 2^50 states just to keep track of all of the possible states of the cells. So that's not going to happen. That's the bad news; we now have a practical problem. This is essentially unsolvable as posed, so it reduces to a problem of inference: how are we going to find an efficient and fast way of working through it? At this point, we have to think about graphs, and I'm going to talk about a theorem which comes from Michael Jordan's group at Berkeley, from about five years ago. So it's a pretty new piece of work; it builds on a field which has been around for a while, but the results I'm going to discuss really come from this new piece. This is Ting, Huang, and Jordan, on the arXiv; they didn't bother publishing it in 2011. OK, so we're going to think about a graph. Actually, we're going to start with a point cloud of data points; we started writing this down as formulas yesterday. We have a set of data points x_i which have been sampled from C. So we sample these points from a distribution, and we now create a graph on them, with a normalized adjacency matrix: a_ij = k(x_i, x_j) / Σ_l k(x_i, x_l). OK, so that's just a normalization. And we're going to define an operator on the graph, which is a matrix. The nodes have a coordinate, x, but we can also just enumerate them as the first node, the second node, the third node, in no particular order.
And we can think of functions on the graph as vectors indexed by the nodes. Spectral graph theory asks about the properties of matrices as operators acting on such vectors. So we're going to define an operator known as the random-walk graph Laplacian: L = I − A, the identity matrix minus the normalized adjacency matrix. (There are going to be minus signs coming; this is often defined as minus one times this, but it doesn't matter, it can be defined in either direction.) OK, so yes? [Audience question about the degree matrix.] Right, so the unnormalized graph Laplacian would be D − A for the degree matrix D; the normalized one is I − D⁻¹A, and I've just tucked the degree normalization into the definition of A here. I apologize, this may not be entirely conventional use of the terms, but the underlying mathematical structure is the same. [Audience question: does every node correspond to a particular point in gene expression space?] Yes, every node corresponds to a point in gene expression space. But notice that this operator L is not acting on gene expression space; it acts on the nodes of the graph. It's a square matrix: if we have n points, it's n by n, and it acts on vectors of length n. It doesn't have any clue where the nodes are sitting. So: we have a continuous space, and a continuous density of states on that space; we sample a finite number of discrete points; we define a graph where the weight between any two points is given by a kernel, which depends on the positions of those two points; and we can then define an operator based on this adjacency matrix. Yes, those are two different sampled points. OK, so now, to cut to the chase.
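This construction, a kernel, a row-normalized adjacency matrix, and the random-walk Laplacian L = I − A, is only a few lines of code. The Gaussian kernel and the random point cloud below are purely illustrative choices.

```python
import numpy as np

def random_walk_laplacian(X, sigma=1.0):
    """L = I - A for a point cloud X of shape (n, d), where the adjacency A
    is a kernel matrix row-normalized so each row sums to one
    (a_ij = k_ij / sum_l k_il)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq_dists / (2 * sigma ** 2))  # example kernel; any positive kernel works
    np.fill_diagonal(K, 0.0)                  # no self-edges
    A = K / K.sum(axis=1, keepdims=True)
    return np.eye(len(X)) - A

X = np.random.default_rng(0).normal(size=(50, 3))  # 50 sampled "cells", 3 dimensions
L = random_walk_laplacian(X)
# Each row of L sums to zero, so L annihilates constant vectors:
print(np.allclose(L @ np.ones(50), 0.0))  # → True
```

Note that L only knows the pairwise kernel values; exactly as said above, it has no idea where the nodes sit in gene expression space.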
At this point, this is a very, very general statement about how to construct a graph from a point cloud, and in principle it lets you connect any two points. The weighting has to be non-negative: k is usually a positive number or zero. The precise values depend on a particular choice of kernel, which in principle could be anything. To give specific examples: for a k-nearest-neighbor graph, you choose one particular kernel (and I'll show you what that is in a minute); for a simple Euclidean graph, or a diffusion graph, you could put e to the minus the Euclidean distance squared. So I can put whatever I want over here, and there are many, many ways of formulating the problem. This is just a very general recipe for constructing a graph from a point cloud; of course there are other ways of doing it, but this is a particular family of graphs. Sorry? Let's put the biology aside for a minute; there's no biology behind these statements, and Michael Jordan's paper is not about biology. I just want to get through a bit of formalism, and then we'll go back and apply it, OK? All right. So now the main result of the paper, the theorem, which is really quite beautiful, is this: for a family of kernels k, which I will define in a minute, in the limit of infinite sampling (so, if I get enough cells), this random-walk graph Laplacian, acting on any arbitrary vector (I'm going to use a hat to denote a vector defined not in gene expression space but over the nodes of the graph; f̂ is a vector with n coordinates, irrespective of how many gene-expression dimensions we use), will converge to a continuum operator of the following form: (1/2)σ²∇² + μ·∇.
So this is essentially making a link between an operator acting on a graph, on a vector defined only at the points of the graph, and a continuum differential operator acting on a continuous function. And the family of kernels we're going to consider here (in fact a bit simpler than the family in the original theorem, just to keep things simple) is the following: k(x_i, x_j) = W(x_i, x_j) H(|x_i − x_j| / r(x_i, x_j)), where H(x) = 1 for x ≤ 1 and H(x) = 0 for x > 1. So H is just a top-hat function, and W is strictly positive, so that's OK. OK, good. For this family there is this convergence, and, which is what's beautiful about the theorem, there's a recipe which tells you, if you know the kernel, exactly what continuum operator you would be generating. Yes? That's r. So what is the interpretation? W is the weighting on the edges: it's essentially the weight on a_ij. And r can be thought of as what the theorem calls a bandwidth; essentially, it's a cutoff. What this says is: if you're closer than r(x_i, x_j), then you get included, and otherwise you're zero; and if you are included, you may have a weight which is not 1. That's the statement. You could imagine, for example, setting r to infinity everywhere, in which case you just have a fully connected graph with some weighting W(x_i, x_j) between points. Or you could have a graph with the weight being 1 everywhere, where you're just either connected or not connected, depending on some threshold.
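The kernel family is easy to write down directly. Below, W and r are passed in as callables and H is the top-hat cutoff; the two example graphs at the bottom correspond to the two special cases just mentioned (weight 1 with a fixed threshold, and r = ∞ with a Gaussian weighting). All specific choices here are illustrative.

```python
import numpy as np

def kernel_matrix(X, W, r):
    """k(x_i, x_j) = W(x_i, x_j) * H(|x_i - x_j| / r(x_i, x_j)),
    where H is a top hat: 1 inside the bandwidth, 0 outside."""
    n = len(X)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = np.linalg.norm(X[i] - X[j])
            if d <= r(X[i], X[j]):         # H(d / r) = 1 exactly when d <= r
                K[i, j] = W(X[i], X[j])
    return K

X = np.random.default_rng(1).normal(size=(30, 2))

# Fixed threshold, weight 1 everywhere: connected or not, nothing in between.
K_eps = kernel_matrix(X, W=lambda a, b: 1.0, r=lambda a, b: 0.8)

# r = infinity everywhere: fully connected, with a Gaussian weighting W.
K_full = kernel_matrix(X, W=lambda a, b: np.exp(-np.sum((a - b) ** 2)),
                       r=lambda a, b: np.inf)
```

Row-normalizing either matrix and subtracting it from the identity gives the corresponding random-walk Laplacian, so each choice of W and r yields its own operator.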
So you can create a very large family of different graphs by playing around with W and r. It's a very flexible framework for making different formulations, and each one will give you a different operator. This is why it's such an exciting theorem: if I want to approximate a continuum operator, I can do that based on sampled data, by choosing the graph which will give me the right operator. And we're going to use this theorem twice: once in order to solve for the potential v, and once in order to generate the Langevin process which allows us to predict the fates of cells. OK. So this is really the main formalism. Now I'm going to give you what σ and μ are, and then (wait, I thought I needed a small correction here; no, this is all correct) we'll take a really simple example of a lattice, which everybody has good intuition for, and quickly see how that works. Yes? [Audience question, partly inaudible: for this to work, how densely do we have to sample? The theorem is a limit as n goes to infinity; what happens in practice?] OK, I'll answer this in two parts. First, you've caught on to the fact that there's a very simple statement being made here: we're approximating a continuum by a point cloud. The key point, the big difference from the 2^50 problem, is that in reality very, very little of gene expression space is ever used, and this allows us to focus on the dynamics only where cells actually ever go, and not anywhere else. Now, in the theorem, they have a section discussing convergence, and they admit that they haven't looked at the rate. So formally, we don't know the answer.
What we've done is numerical simulations of Langevin processes, and shown how low we can sample and still get a very accurate description, even in very high dimension. And the answer is that, with as little as 100 cells on a single fate choice in 50 or 100 dimensions, you can very effectively predict the future behavior of a simulation by taking a static snapshot of cells in that simulation. So the answer is that you don't need many cells. What you do need is enough cells to sample the topology of the structure. And once you've done that, you get a decent approximation of what's going on, OK? So the formal answer is not well established. I think intuitively it's sort of clear: you need the least dense region which matters in your process to be adequately represented, and in a very hand-wavy sense, that gives you a sense that if you can see the structure, it'll probably work, and if you can't see the structure, it'll probably fail. That's essentially the suggestion, OK? OK, so now, yeah, so f is a vector defined on the nodes of the graph. But of course, every point on the graph has a location in space. So as I make the graph infinite, I'm essentially representing points in space. And you can imagine x being the label on my graph, OK? So f(x) becomes the continuum limit of a vector on a point cloud. That's the relationship between the two, OK? Yes? [Audience question:] Well, L is the identity minus A. Why do we need the identity? Maybe I'll tell you what, we'll do a one-dimensional lattice and you'll immediately see why it has this structure, OK? Just to prove by example, OK? But I just need to write down a couple more points, because I haven't given you the full answer. The full answer would be to tell you what sigma and mu are in terms of the kernel. So here is the full answer. I'm going to go back here. The full answer is that sigma squared is going to equal r²(x, x), all right?
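As a toy version of the kind of numerical check described here (illustrative code with a made-up quadratic potential and made-up parameters, not the actual simulations from the talk), one can integrate a Langevin process and record each "cell" at a random time to build the static snapshot:

```python
import numpy as np

rng = np.random.default_rng(0)

def langevin_snapshot(grad_v, n_cells=100, n_steps=500, dt=0.01, D=0.1, dim=50):
    """Euler-Maruyama integration of dx = -grad V(x) dt + sqrt(2 D dt) * noise.
    Each cell starts at the origin and is recorded at one random time step,
    mimicking a static snapshot of an asynchronous population."""
    X = np.zeros((n_cells, dim))
    sample_at = rng.integers(1, n_steps + 1, size=n_cells)
    snapshot = np.zeros_like(X)
    for t in range(1, n_steps + 1):
        X += -grad_v(X) * dt + np.sqrt(2 * D * dt) * rng.standard_normal(X.shape)
        snapshot[sample_at == t] = X[sample_at == t]
    return snapshot

# Illustrative potential: V(x) = |x|^2 / 2, so grad V = x
snap = langevin_snapshot(lambda X: X)
```

The point cloud `snap` is then what the graph methods below would be run on, and the continued simulation gives the ground-truth fates to compare against.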
So sigma squared, that is r²(x, x), is a spatially varying function, OK? And mu is a vector, and it's going to equal, and I'm just going to drop the vector notation here for speed, r²(x, x) times grad c over c, plus, and let's be precise here, w has two arguments, so we're taking the gradient with respect to the second argument, grad_y w(x, y) divided by w(x, y), evaluated at y equals x, plus, and I need to introduce one more notation here, d, which is the number of dimensions, (d plus 2) times grad_y r(x, y) over r(x, y), also evaluated at y equals x. So mu = r²(x, x) [grad c / c + grad_y w / w |_{y=x} + (d + 2) grad_y r / r |_{y=x}]. That's a bit of a mouthful, but it's just a formula. We can now go out there, try a bunch of graphs, and see what continuum operators they give us. And if we can find a continuum operator which is the one we need to solve our problem for the potential, we're done, OK? That's essentially the idea. This is how we use this theorem. We don't have to be very creative now, we can just take this and use it. OK, so before we do that, let's just quickly take the case of a lattice, just to get a really quick, yes? [Audience question about whether the ambient space is too big for this to work.] So the argument is that the extrinsic dimensionality could be very large. Even after principal components reduction, we may still have many dimensions, and anything more than about 10 dimensions is intractable to solve on a grid. But within there, we have some very, very confined density clouds. Those define railroad tracks, OK? All of the dynamics is occurring along those railroad tracks. And what we're now doing is confining our solution to focus only on where the cells actually exist. And that will be a very, very small subset of the large space. Some of these dimensions are created because, as cells differentiate, they just take a turn in gene expression. One gene switches on. Each one of those is a new dimension, but it's a very, very simple curve.
And this formalism just very efficiently localizes you to where the cells are actually going, and measures the dynamics along those curves, OK? So OK, so now I'm just going to give an example. [Audience question:] So I guess you mean you trivially don't connect anything? Yeah, I'm not sure I've thought about the trivial case. Because it could be that there are some, I think there might be some continuity conditions. So if you've broken up your graph, then it may be that there's a problem. I'm not sure I've thought about that enough, OK? So I think that formally this is taken in the n-goes-to-infinity limit, and the assumption is that you make a continuum description of your structure. So k equals 0 is a problem, because you're taking that limit faster than you're taking n to infinity here, and that destroys things, right? So if you take k to 0, but slowly, then you're probably fine. But then, you know, that's not what, yeah, OK. So let's think about a much simpler process. Let's think about a one-dimensional lattice, OK, just to make it really, really simple. That will give a really simple intuition for what's going on. So, 1D lattice. For a 1D lattice, we're going to set the weights to be 1 everywhere, OK, because we're just connecting every point to its neighbors. So there are unweighted edges. We're going to have a threshold which is constant, and it doesn't really matter what it is. And our point cloud, we also know, is just perfectly periodic, because we've just defined it that way. So imagine that I give you a set of points sampled from a perfectly periodic lattice, so the density c is uniform, which is a very artificial condition. So let's now imagine, first of all, what is the matrix A that I get? So, 1D lattice. Let's think. We're going to connect two points with an edge of weight 1 if they are within a distance r of each other. And otherwise, we're not going to connect them. And then we're going to row normalize, OK?
So what we're going to end up with is something that roughly looks like, and we're looking at a long structure, lots of zeros off to the sides, and then we're going to have a connection just off the diagonal the whole way along, and zeros along the diagonal. And that's row normalized, so it's going to be a half and a half the whole way along, OK? That's our matrix A, OK? And our random walk graph Laplacian, if I take the minus of it, is going to be equal to minus ones along the diagonal and plus halves just off the diagonal. And hopefully you can immediately recognize this as a discrete approximation to a diffusion operator, OK? So what this looks like is, it just looks like del squared, right? And now, if we go to our recipe, this is again super trivial: sigma squared is a constant, because we've set r to be a constant. And there we have it. And mu, well, grad c is 0, because we've just defined this as a uniform distribution, a regular point cloud. So grad c is 0, grad w is 0, and grad r is 0. So we basically have no transport term, and we just get simple diffusion, OK? So this was very, very simple. It's just to see how this works in an incredibly simple case, OK? So that basically means that indeed, we get a Laplacian out of it, OK? So for this case, I can also write down sigma squared equals a constant, and mu equals 0. And I'm probably going a bit too low on the board at this point, so OK. OK, so now we can apply it to our problem. And I guess that I'm now just going to tell you the answer, which is which graph you have to choose to get the operator that we want. So let's see, what do I want to keep? I think I want to keep that over there. Oh, I also want to keep, oh, I want to keep everything, I'm greedy. I'm going to delete the kernel. And OK, so what we need is a very simple, mutual, unweighted, k nearest neighbor graph, OK? It's what we actually use for visualization.
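Coming back to the lattice for a moment, that claim can be checked numerically in a few lines (illustrative code, with periodic boundaries as in the periodic lattice above): minus the row-normalized graph Laplacian is exactly the half central-difference diffusion stencil.

```python
import numpy as np

n = 8
A = np.zeros((n, n))
for i in range(n):
    # connect each lattice point to its two neighbours (periodic boundaries)
    A[i, (i - 1) % n] = 1.0
    A[i, (i + 1) % n] = 1.0

P = A / A.sum(axis=1, keepdims=True)  # row-normalize: halves just off the diagonal
L = np.eye(n) - P                     # random-walk graph Laplacian

# (-L f)_i = (f_{i-1} - 2 f_i + f_{i+1}) / 2, a discrete approximation
# to (1/2) d^2 f / dx^2 (up to the lattice spacing)
f = np.cos(2 * np.pi * np.arange(n) / n)
stencil = 0.5 * (np.roll(f, 1) - 2 * f + np.roll(f, -1))
```

So on this trivially uniform point cloud the recipe and the matrix agree: pure diffusion, no transport term.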
It just so happens that it's actually an efficient way of encoding the dynamics of the process. And let's now see why. So first of all, it's unweighted. That means that w(x, y) equals 1, OK? By the way, I've been sloppy in one respect: there are length scales in this problem, and I'm just ignoring them for now, OK? Just to have one less thing to keep track of. But those are constants, so let's not worry about them. OK, and then the next thing is we need the bandwidth. And we discussed this yesterday, actually, the bandwidth for a mutual, so this is an OR graph. In other words, I draw an edge if you're my k nearest neighbor or if I'm your k nearest neighbor. So the way to think about this bandwidth is that it's going to be proportional, in some sense, to the local length scale in the system, which is the density of cells to the power minus 1 over d. So c is a density, with units of one over length to the d, which means c to the minus 1 over d gives me a length scale, which is the typical separation between cells. And this is evaluated at x or evaluated at y, whichever gives the larger length scale, so it's a max, OK? So that's essentially the form of the bandwidth for a k nearest neighbor graph. And now there's a slight mathematical hiccup, which is that we need to take the gradient of this at the point where y equals x, and a max function is not analytic at the point where the two arguments are equal. Given that I've spent a whole hour on this, we're going to speed up after this bit. So we're going to use an analytic continuation here, and there are several ways of doing it. I'm just going to skip to the end, which is that for the gradient of r, I'm going to use the idea that d max(a, b) / da, evaluated at a equals b, is a half, OK? Because if a is greater than b, the derivative is 1, and if a is smaller than b, it's 0.
And you can show that this is an analytic continuation of that discontinuous derivative, OK? So this is just a small hiccup. It has essentially no practical consequences, other than that it's important the value is a half; the idea of making an analytic continuation here is not controversial. So now we essentially have r, we have w, we have the set of formulas, and we're running out of time. So I'm going to write down what the final operator looks like. And I guess at this point I'm going to sacrifice this line. And I'm going to write that the operator is going to be, with a factor of a half I'd forgotten, one half times c to the minus 2 over d, times [del squared plus (1 minus 2 over d) times (grad c over c) dot grad], OK? And I can simplify this slightly, which gives one half times c to the minus 1, times grad dotted into c to the (1 minus 2 over d) grad, where the operator acts forward on everything to its right, OK? So that's it. And now we can see that this isn't quite what we want. But now we can just make one final statement, which is that if we're in sufficiently high dimension, in other words, if we now take the limit that both n and d are large, then the random walk operator acting on f goes to one half times (1 over c) grad dot (c grad f). And this operator is the operator that we were trying to approximate. So in other words, what we've now shown is that if we create a k nearest neighbor graph on our data set, then the random walk graph Laplacian on our data, in sufficiently high dimensions, should describe the continuum operator that we need in order to solve for the potential. So now we've essentially reduced the problem to the following.
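The graph construction being used here can be sketched directly (illustrative brute-force code, not the actual pipeline; a real data set would use a fast neighbor search): the OR-type unweighted k nearest neighbor graph and its random-walk Laplacian.

```python
import numpy as np

def knn_or_graph(X, k):
    """Unweighted kNN graph, symmetrized with OR: edge (i, j) exists if j is
    among i's k nearest neighbours or i is among j's."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)          # exclude self-edges
    A = np.zeros((n, n))
    nearest = np.argsort(D, axis=1)[:, :k]
    for i in range(n):
        A[i, nearest[i]] = 1.0
    return np.maximum(A, A.T)            # OR-symmetrization

def random_walk_laplacian(A):
    P = A / A.sum(axis=1, keepdims=True)  # row-normalized transition matrix
    return np.eye(A.shape[0]) - P
```

The claim from the derivation is that, in sufficiently high dimension, applying this L to a vector f sampled on the point cloud approximates one half of (1/c) grad·(c grad f).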
We can now just solve a simple matrix equation, which is that the random walk graph Laplacian acting on the vector of potentials v is equal to R₊ minus R₋, which are the inputs that we give as our guesses for the sources and sinks in our system. We can now just invert this and get a solution. So this is a very, very fast, very efficient solution to a partial differential equation in high dimension by a single matrix inversion. And it only gives us the answer at the places where we have observed cells. So this essentially is, in some sense, the best that we could get out of our data. Because if we don't observe cells somewhere, then it would be hard for us to estimate properties of the system there. (And I think there's a minus sign missing somewhere, but that's essentially it.) OK, so that's good. So now we have a potential. And now, very quickly, I'm just going to make a statement about the second problem we can challenge ourselves with. I mentioned this earlier: can we use the same formalism to come up with a graph which would propagate a density of cells from a starting point? So essentially, solve the Langevin equation, or solve the Fokker-Planck equation, for a particular cell localized at a particular point in time. And if we have that, we can then reconstruct the dynamics of any single cell, and from there infer fate probabilities, and infer mean first passage times to sources and to sinks. We can get a sense for pseudo-time, in some sense a measurement of how much dynamics has occurred. So I'll just very, very quickly say, we're going to generate a Markov process where the probability of transition from an observation i to an observation j is going to be proportional to e to the (v_i minus v_j) over D, for (i, j) in the k nearest neighbor graph, and 0 otherwise. And this is going to be row normalized. So there's going to be a normalization constant, which we can call z_i here, and z_i just normalizes so that the total probability of a transition is 1.
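A minimal sketch of these two steps (illustrative code; it assumes R is a balanced source/sink vector, and uses a pseudoinverse because the row-normalized Laplacian has constant vectors in its null space):

```python
import numpy as np

def solve_potential(L, R):
    """Solve L v = R for the potential v on the observed cells.
    L is singular (constant vectors are in its null space), so use the
    pseudoinverse; R should balance sources against sinks."""
    return np.linalg.pinv(L) @ R

def biased_markov_chain(A, v, D):
    """P[i, j] proportional to exp((v_i - v_j) / D) on kNN edges, 0 elsewhere,
    row-normalized. Small D: almost always move down the potential;
    large D: more diffusive, uphill moves become likely."""
    W = A * np.exp((v[:, None] - v[None, :]) / D)
    return W / W.sum(axis=1, keepdims=True)
```

The row normalization here plays the role of the constant z_i, and the sign convention means a step to lower potential gets the larger weight.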
So what this is intuitively saying, and notice that diffusion has come back into the system now, is that if I start off a cell at a particular point and run a Markov process for it probabilistically in time, I can have it explore various states, where its likelihood of traveling down the potential depends on the gradient of the potential, and the size of the diffusion constant D sets how shallow or how deep the landscape effectively is. So a high-noise environment will let me go against the potential gradient fairly often, and in a very low-diffusion environment, where D is small, I will almost always travel down the potential gradient. So this is essentially a solution. And if you plug this in, so you can define a w and an r associated with this, the r is identical, because we're setting anything that's not on the k nearest neighbor graph to zero, but we're now setting a non-zero w, so the weights now have this particular form. And if you plug this in, you will find that you recover the operator that acts on c. OK? So I'm going to, OK. So that's it for the theory. And now, yes. [Audience question about this choice of transition rule.] Yeah, so this is formally what's right. On the question that you asked: even when you're running a simulation, there are two ways of running such a simulation; one of them is called a Metropolis algorithm, and the other one has a different name that escapes me. But basically, even in thermodynamic simulations, there are situations where you would choose to do it this way. As it happens, we weren't thinking very hard about that; this is basically just the solution. The fact that it relates to thermodynamics is probably obvious to some; I didn't think about it. OK, so now, how do I switch the projection back on? And of course, we don't really need to simulate this, because this is just a Markov chain, which is a matrix. So we can look for its zero modes.
And those are the absorbing states of the system, which correspond to the sinks, and so on. So basically, we can very, very efficiently make calculations based on this structure. So OK. So now, this is basically just simulations. So here, we have a two-dimensional potential, sorry, it's a 50-dimensional potential, where I'm showing two dimensions, which has a fate bifurcation. And we now randomly seed cells at the source. We let them proceed, and at a random point in their lifetime, we sample them, and we then generate a static point cloud. We then introduce some knowledge of the inputs. I'm only showing you here the ideal knowledge, but we've shown that if you get these wrong by a factor of five, you still basically generate very good predictions. You just roughly need to know where the inputs and outputs are. And you can then predict the fate probability of every cell. So if you then continue the simulation and find out whether a cell ends up in sink one or sink two, and you correlate that with what you would have predicted based on the static snapshot reconstruction, applying essentially the forward Markov chain, you get a very good correlation. And if you look at the mean first passage time back from a cell to the source, so you can essentially work out a time since the source, you also get a very good prediction there. So this, as we said, is intrinsically high-dimensional, because you really need high dimensions to get rid of this d-dependence. Now, if you were doing this in low dimensions, that would be totally fine; you would choose a different graph construction. So there are graph constructions which are not dimensionality dependent, but they're very inefficient in high dimensions. If you were, say, just making a fixed-radius neighborhood, that would give you a dimensionality-independent result, but it would be very, very bad for sparse high-dimensional data.
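For reference, the "absorbing states" calculation just mentioned is the standard fundamental-matrix computation for an absorbing Markov chain; here is an illustrative sketch (assuming the sink states have already been made absorbing, i.e. P[sink, sink] = 1):

```python
import numpy as np

def fate_probabilities(P, sinks):
    """For an absorbing Markov chain with transition matrix P, return the
    probability that each transient state is eventually absorbed at each sink:
    B = (I - Q)^{-1} R, with Q transient->transient and R transient->sink."""
    n = P.shape[0]
    transient = [i for i in range(n) if i not in sinks]
    Q = P[np.ix_(transient, transient)]
    R = P[np.ix_(transient, sinks)]
    B = np.linalg.solve(np.eye(len(transient)) - Q, R)
    return transient, B
```

Each row of B is then the vector of fate probabilities for one cell, which is exactly the quantity painted onto the plots, with no stochastic simulation required.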
So you can play around with this theorem and use it for low-dimensional problems as well. Yes? [Audience question about additional sources and sinks.] Right. Yeah, so in this source-and-sink model, they're not. Now, if you add in additional sources and sinks, then that's equivalent to, so for the dynamics of a cell after it's born, the Langevin process tells you that you're memoryless; it just depends on the state of the cell. And a sink corresponds to a finite probability of going into an absorbing state, with a rate proportional to R, at any particular point. So essentially, if you have arbitrary sources and sinks, you can still make these predictions. I mean, that doesn't change your ability to make predictions. Yeah. Yeah, so OK, fine. So then the other point is, if you now play with this, so we've done this in 50 dimensions here, if you increase it to 100, it stays fine. If you go down to 10, it's still OK. Below 10 dimensions, the predictions start to get a bit bad, and that's because of the approximations that we're making. And of course, we get a parametrization of the space here now, in terms of the probabilities and in terms of time, but we're not imposing any tree-like structure. There have been a lot of attempts to describe the differentiation process where you start off assuming a tree. We don't need to do that here. OK, yes. [Audience question about how the endpoints are chosen.] So what we need to do is know enough biology to decide that these are endpoints. That we need to know. Because, for example, it could be that in another case, this is actually the source, and these are the two sinks. So we need to know enough biology to roughly describe where cells start from and where they go. And in real data, we don't set one source. We actually assume that the whole structure is proliferating, and we can make some estimates about the proliferative rate in different places. So the source is an ambient source. And then we have sinks, which are the endpoints, which are known from the biology.
And now what we can do is link the inside of the structure, where the biology is maybe unknown, to probabilities of ending up at the known ends, which are well-defined cell types that everybody recognizes. Yes, Bert. [Audience question about tree structure.] I guess I'll show you an example where it is not a tree, just from the data. OK? OK, good. So here, I showed you this picture before. This picture, which is of course uninterpretable because it's just colors, is actually formal in the sense that we can assign a seven-dimensional vector to every cell, which is its fate probabilities. And then by using color mixtures, we can sort of paint each cell with its probabilities. So there are really no surprises here at a high level: the cells on this side are committed already to becoming erythrocytes, the cells over here are committed to becoming granulocytes, and so on. The tips are completely deterministic. And over here, this picture is uninterpretable, but this big mixture corresponds to cells which have not yet committed. And there's going to be some structure to that population. So now, before we try to understand that structure, because maybe this is all nonsense, let's go and try to put this to the test. So I showed you this picture before. We can now go back to experiments where cells have been sorted out and their fates have been observed. We did not perform these observations; these are taken from papers. And then we can map these sorted cells onto our structure and see where they sit. So these cells, which are a so-called single uniform population, you can actually see don't correspond to a very pure population at all. And many people now looking at single-cell hematopoietic data are coming to the same conclusions, that these FACS gates that people have been using to sort out populations really correspond to distributions over gene expression space. And when we now look at our predictions versus what we see, you can see that there is a reasonable correspondence.
So over here, I apologize, this is really hard to see. Each one of these is a different experiment taken from the literature, where we have a known FACS gate and a known cell fate, and we have a predicted fate probability. And we can now take all of that and put it together on a plot of the experimentally observed versus the predicted bias. And there is one sample, which corresponds to two fates over here, which we get completely wrong. And in the last two years, it has been shown that for that one sample, the original experiment behind those FACS gates is actually not reproducible. So I'm sort of including it here. It was nice. We found this and we got confused, and then we found one paper, and as soon as we found that, another paper got published, both showing that this doesn't exist. So I just show you here: those cells which don't work are really committed to the erythroid lineage, on this side here, and the proposal in the first paper to report these cells was that they have megakaryocyte potential. So we really disagreed with the original observation, and it seems like we're on the right side of history there. OK, so let me see. I have just a few more minutes. So now let's see what kind of predictions we can make. I'm going to really race through this. First of all, we can cut out one lineage. And actually, the motivation for this experiment came from somebody studying erythropoiesis. We can now walk from the multipotent progenitor cell state to the end, using the probability of fate commitment to help guide us, although you could do this by eye to some degree, and then ask where fate commitment occurs. And that's colored here in red. So we would predict that fate commitment occurs at this bottleneck region. And then we can start to look. And so, just for a minute here, I'm not going to do any. Oh, yeah, so we can now define FACS markers for these populations. And this is a bit of a data dump.
So I guess that in two minutes we're not going to cover this. The key point is that we have identified from our data a set of FACS antibodies that we should be looking at in order to isolate these early erythroid progenitor cell states. And we can then sort out those cells and very quickly validate by qPCR that we're indeed localizing, that we're picking up cells which have the same transcriptional profile we would expect. So that doesn't tell us what they do; it just tells us that our FACS is working reasonably well. And now what we can do is take these cells and culture them, and again, this is a very dense slide, and ask what their fates are. So this P1 population, which we predict is entirely committed towards erythroid fates: when we look at these cells at day zero, they're already starting to express markers just prior to terminal differentiation, and within two days they're already expressing markers of terminal differentiation. This P2 population, which we say should be on the way to commitment: none of the cells express markers of differentiation early. Within two days, some of the cells have gone through the process, and some of them are still showing early markers. By five days, all of the cells that we can observe look like they're terminally differentiated. But by seven days, we see some of the cells have gone back, and this orange and green correspond to markers of megakaryocyte and basophil fate. So a small number of these cells is not committed and will proceed, and so on and so forth. So we can test whether our predictions for fate commitment are correct. The other thing we can do is now walk along this process and find genes which vary, and we can pick out receptors which are localized to this region, which is this new region of early erythropoiesis that we're studying. This is a Wnt receptor; this is a cytokine receptor.
Here's another cytokine receptor, which is not only localized to, but certainly expressed in, the region of interest as well. And we can then go and add the ligands and ask whether we perturb fate. I'm not going to spend time on these very dense slides; the key point is that all of these ligands actually have an effect. We can also ask about cell cycle signatures, but I think I'm going to skip this bit as well, just for lack of time. So here's really the next thing. That was all looking at one lineage and finding that we can make a lot of new statements about biology. We're not going very deep; it was very superficial. We picked out a number of pathways and perturbed them, and we looked at some fate assays only in vitro. There's a lot more to be done. But it looks like this structure, these fate probabilities, might actually encode some real information. So now let's go back and ask: since these fate probabilities seem to explain the data that's been collected over the last 20 years, except for this one outlier, and since we seem to be able to make novel predictions about fate commitment, what does this structure tell us about differentiation? And we can now answer that by looking at the vectors of fate probabilities and asking which fates are more likely than random to be highly probable in the same cell. So, for example, it's no big surprise that basophil and erythroid fates are enriched together, because there's a population here which is poised to become one or the other. Here there's a bit of a surprise, and this is the loop that I was suggesting exists, where we see that monocytes and dendritic cells are coupled, monocytes and granulocytes are coupled, but dendritic cells and granulocytes are not coupled. And formally, the way to think about this is that if you now hierarchically build this up, you can build a picture of the tree.
This tree actually agrees very, very well with what people think: you can reach a monocyte fate in two different ways, where your probability of becoming a monocyte increases along one path or the other. And this can be understood very simply in terms of the combinatorics of two transcription factors. So this type of convergence is not too surprising. Okay, so we can now reconcile this view with the types of tree-like structures that people have generated so far, but we don't have to use the tree; we can use the full structure of the data. And in fact, this tree is almost certainly an approximation which is just based on which fates happen to be more correlated. Okay, but now we can go back and be bioinformaticians and use this fate structure in order to find receptors and transcription factors which are strongly correlated with fate choices at different branch points, and we could extend this list. And many of these are known regulators, but there's also a handful over here of new ones which we're now in the process of testing; I should say testing rather than validating, since validating is a bit presumptuous. So hopefully that gives you a sense. We did some very, very formal work; that was fun. That formal work allowed us to efficiently encode fate probabilities, and those fate probabilities look like they really let us make predictions about the commitment of cells, at least in vitro so far. And it also provides us with an efficient parameterization of the data structure, which allows us to pull out which genes might be correlated with different fate choices. So, this is me with five minutes to go. OK, this is just a very quick summary. So I think I'm gonna stop there. And yeah, I'll take questions. Thank you very much.