Okay, so let's start. We're going to talk now about single cell RNA sequencing, and the next few slides are going to be mostly experimental, then a bit about technology, and then we'll get back to thinking about data analysis. So a starting point here is maybe just to rewind a few years. This is a fairly young field, so first of all let's start off with the bad news. Wait, am I allowed to start? No, that's great. That's good. Okay. All right. So single cell RNA sequencing can be noisy, expensive, and in the early days, where early days is a few years ago for a young field, it worked with very few cells. And this plot over here is actually a very good way of taking a first look at single cell data. Every point on this plot is a gene, and we are looking at the coefficient of variation squared of a gene, which I assume you are going to be familiar with. So this is the variance of a gene divided by the square of its expectation, plotted versus the average, okay, so versus the expectation. And this is a log-log plot, and what you can see is intuitively something that makes sense, which is that genes that are expressed at lower levels are also more noisy. That's what you would expect from Poisson statistics, and genes that are expressed at higher levels are less noisy. But what's somewhat depressing about this is that it looks like it's a very broad distribution. And if you are now, say, looking for genes which are much more variable than you'd expect by chance based on Poisson statistics, you're trying to shave off the top of a distribution here, and it's really questionable how well you can do that. These blue points over here show spike-ins, so these are synthetic mRNAs which have been added to the sample, and you can now ask how variable they are, and you can see that they're not very tightly localized around the line. And certainly this line is not what you'd expect from a Poisson process. So it looks like using this to look for population structure might be challenging. And the problem of few cells is also a serious issue. So here's, this is from actually a very nice paper which did some lineage tracing of the type that we discussed earlier, but also included some single cell data, looking at the airway epithelium, so it's looking at cells from the trachea, and they've isolated over here 17 basal cells from the trachea. And now if we're trying to say something about population structure, you get a heat map, and you know, what does this mean, right? I mean, you don't get a very clear view of what's going on. You can probably say that some genes are formally more variable, but it's a noisy measurement, and there's very few cells, and it sort of raises the question about whether there's any big picture here at all. So now this is getting to be a bit more of a research seminar, so I'm going to talk a bit more about my own work. So I started my lab two years ago, and in my postdoc, I was interested in doing some single cell analysis, and I was thinking at the time that it would be useful to do this on a much larger scale than a few cells at a time. So I collaborated with a soft matter physics group, this is David Weitz and two postdocs in his lab, to adapt a technology which Dave had developed in his lab, known as droplet microfluidics, for use in single cell analysis.
So this is a really cool technology, and the idea is that we have a microfluidic device which has an aqueous inlet and an oil inlet, and the oil is mixed with surfactant, which stabilizes the oil-aqueous interface. And there is a nozzle through which these two phases are forced, and because water and oil don't mix, you get a hydrodynamic instability forming at the nozzle, which leads to very monodisperse formation of droplets. So this is now slowed down about 400 fold, you wouldn't be able to see this by eye, and you can see these beautifully monodisperse droplets forming. And these droplets remain stable, here they're being collected, and you can heat this up and do a polymerase chain reaction in them. And the devices are very cheap and simple to produce. So the challenge was, if we can now take an entire cell suspension and capture it in droplets, maybe we can use this such that every single droplet acts as a separate reaction vessel, and we can then run a barcoding reaction where we label the contents of every cell separately. So this was a project that I worked on for a few years, transitioning from theoretical work to experimental work in my postdoc. And there were actually two papers that were published with similar efforts, and they have slightly different versions, and I'm just going to talk about one. And there are now actually commercially available versions of this. And actually, this is, I think, a point where I also have to give a disclaimer. I'm a founder of one of these companies, so now you know not to trust anything I say. So I'm not going to talk really about the commercialization at all. So here's the general principle. We have droplets, we've just shown you droplets, those are easy to make. We're now going to take cells and we're going to take a little bead, which is made of polyacrylamide so it's squishy. And every bead carries a payload of about a billion primers for the reverse transcription reaction, and each one of these is a single strand of DNA, which ends with a poly-T tail, which can be used to capture RNA. And then a unique barcode, and these colors are sort of representing the idea that there's different barcodes on each one of these gels. And now we can co-encapsulate those into droplets, and then lyse the cells, run a reverse transcription reaction, and at the end of this reaction, the material from every cell gets a different barcode. And then we can sequence the material and work backwards and figure out which RNA came from which cell. So the main point over here is that we need a very large number of different barcodes, and we use a combinatorial synthesis method. So we first of all make these blanks, and then we add on a fragment of the barcode, and then we add on another fragment, and so on. We actually do two steps now, but it can be extended. And this is actually the device. So we now have this nozzle, and here we have all of the reagents. So now we're making water-in-oil droplets just like before, but we now have three aqueous inlets coming in, not just one. And in one of these, we have these closely packed squishy gels. And then we have these cells coming in as a Poisson process, so essentially very rarely does a cell come through. So about one in every, say, ten droplets will get a cell. And then we have some goodies that have to come in, the lysis reagent and so on. This is just a cartoon, just to make it a bit easier.
So these pale blue spheres are the squishy gels, and these red dots are the cells, and we can now mix it all together and capture them. And the nice thing about these gels being squishy is that they can be tightly packed, and therefore we can pretty much get one gel into every droplet, so we can synchronize the arrival of the gels with the formation of the droplets and get a very high capture efficiency. So cell loading is a simple Poisson process: most of the droplets are empty of cells, but occasionally a cell comes through, and pretty much all of the droplets, about 80 or 90%, will have a gel in them. So here's a movie I'm gonna play, this is about 100 fold slower. So we've got the three phases, and these big orbs are the gels. And now if I play that, this has been slowed down. You can see here a gel is coming through, and there's some oil in between two droplets. And at some point a cell will float by. Let's see if I can, there comes a cell right now. So there's a tiny little dot there, that's a cell. And here comes another one, that tiny thing here. So these have actually been loaded at slightly higher density to make a movie, because otherwise we'd be sitting here and watch 50 droplets pass before we'd see much happening. But you get the general idea, and now we can collect these. So this is just a collection outlet. And you can see that occasionally you've got a droplet which doesn't have a gel, but most of them have gels. And then occasionally a cell comes through, there's a cell right there, that little dot, and so on. And then this is collected, this is again about 100 times faster. Just collected, we got about four cells a second, something like that. And we collect them into a test tube. And then we put this on a heat block, and we run a reverse transcription reaction. And then we can break the droplets and everything has now been barcoded, okay? That's the general idea. So that now allows us, right now, about 15,000 cells an hour, and about 80% of cells get barcoded. And we can work with reasonably small samples, which means that if we're, say, working with a rare cell population, or with a single embryo, we can put that into the system. Okay, so that's the technology. And there are now many other single cell technologies, so what I'm gonna show you now is not uniquely dependent on that technology. But now you have some idea of how you can generate this data. And this is now what happens next. So we have our reverse transcription. We prepare a library. We submit it to a sequencing facility. And then we get back these huge data files full of sequence fragments. And then there's a really, really dull and fairly automated task, which is you have to take those fragments and you have to align them to a reference. And at the end of that, you've got a giant table where every column is one barcode, every row is a gene, and each entry is the number of times that we found that gene with that barcode. So this is our gene expression table. And the barcodes are random, so we don't really know which cells they came from. We can see cell profiles, but we don't know anything about those cells, except that they were in our system. Okay, so now everything is really gonna be about what we can say about this table; I'm not gonna talk about the low level stuff, that's very, very boring. So just a few things, again, just to remind us. This data is very, very sparse.
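To make that table concrete, here is a minimal sketch, not part of the pipeline being described, of turning aligned (barcode, gene) observations into a genes-by-barcodes count table; the variable names and the toy observations are hypothetical, and a real pipeline would also collapse reads to unique molecules first.

```python
# Minimal sketch: turning (cell barcode, gene) observations from the alignment
# step into a genes-by-barcodes count table. Names and toy data are hypothetical.
from collections import Counter

from scipy.sparse import coo_matrix

# Each entry is one molecule assigned to a cell barcode and a gene.
observations = [("ACGT", "Actb"), ("ACGT", "Gapdh"), ("TTGC", "Actb"), ("ACGT", "Actb")]

counts = Counter(observations)
barcodes = sorted({bc for bc, _ in counts})
genes = sorted({g for _, g in counts})
bc_index = {bc: i for i, bc in enumerate(barcodes)}
gene_index = {g: j for j, g in enumerate(genes)}

rows, cols, vals = [], [], []
for (bc, g), n in counts.items():
    rows.append(gene_index[g])   # rows: genes
    cols.append(bc_index[bc])    # columns: cell barcodes
    vals.append(n)

# Sparse storage matters: most entries of a real table like this are zero.
table = coo_matrix((vals, (rows, cols)), shape=(len(genes), len(barcodes))).tocsr()
print(table.toarray())
```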
So if we were to spike in a molecule with known concentration, so we know the absolute number of molecules going in, and we count the absolute number of molecules coming out, there's a very low conversion rate. In this case, this is about 7% conversion, and often it's less than that. Okay, so we're not looking very deeply into a cell. Now, this is pretty much pure Poisson statistics. So if we had, say, an average of n molecules of a particular gene expressed, then the probability that we wouldn't detect it at all would be e to the minus n for a Poisson process. And therefore the probability of detecting a gene is just trivially one minus e to the minus n. Actually, sorry, this would be if we could detect every single molecule. And if we're actually only detecting a fraction beta of molecules, then we are really looking at very low detection probabilities, because this number is now gonna be very, very small. So to a first approximation, for very low counts, the detection probability is of the order of beta n, right? So now what that means is that if we had a gene which was, say, expressed at 100 copies, we can detect it. That's up here. The actual average number that we detect is only seven copies, and e to the minus seven is a small number, so for 100 copies we'd be fine. If we had only 10 copies, we'd detect less than one copy on average, and at that point we would really miss out on single genes. So at a single gene level, these methods are very, very noisy. There's another issue, which is that we may not have precisely the same detection efficiency everywhere. So this beta is essentially a binomial sampling, and we have to assume that we're randomly sampling, but beta itself can be noisy between different droplets. It's actually not very noisy, but there's probably about a 20% variation in the efficiency between droplets. So if we assume that we can correct for amplification errors, and there's a technical reason why we can, we can now work through some very, very simple statistics and ask, how does the coefficient of variation that we actually observe relate to the coefficient of variation of our true biological sample? And it turns out that this super-Poissonian component, so the difference between the coefficient of variation squared and one over the mean, is linearly related to the biological super-Poissonian noise. But it is also affected by the measurement error itself. These are very, very simple calculations, but they're helpful for understanding how different experiments done on different days could affect your readout for variation. And maybe a more telling way of looking at this is thinking about the Fano factor. The Fano factor, as you may know, is just a different flavor of the same moments: it's the variance divided by the mean. For a Poisson process, we know that the coefficient of variation squared should go roughly as one over the mean, but the Fano factor should just be constant, it should just be one. So the nice thing about thinking about a Fano factor is that if you're interested in knowing whether a gene is more variable than you'd expect by chance, you're just looking for Fano factors greater than one.
But of course, with measurement noise, you now have to be very careful, because the efficiency of your measurement and the noise in your measurement can affect what you call a variable gene. So the way to read what's going on here is actually also intuitively simple, which is that if I were to plot the mean versus the Fano factor of a gene, and I had a pure Poisson process, I would get a flat line at one. And if I now add measurement noise, even to a pure Poisson process, I get deviations from that line. And if I now have a gene which is more variable than I would get from a simple Poisson process, it will sit above the line. And as my efficiency drops, as beta drops, I'm gradually sinking down onto this line. So of course, the trivial observation is that in the limit that I can't detect anything, everything looks like a Poisson process, because I just detect each gene once in one cell, and that just looks like a very inefficient Poisson process. So this is also very important: different experiments done on different days can give different levels of super-Poissonian noise. Okay, so that's what we're looking at here; that's one gene at a time, asking whether we could pull out a gene and ask whether it's variable. What about if we take pairs of genes? So here also, and this is an important thing that will affect us later, if I look at the correlation between two genes, then approximately, in the limit where there's low sampling efficiency, the observed correlation is going to be proportional to the biological correlation. Now there's an error there, there should be a pre-factor beta. So the correlation is approximately down-sampled by the efficiency of the method. And it's rescued, or rather it's not dampened, if the two genes being correlated have very high Fano factors. But if they have very low Fano factors, then you see even less of a correlation. So intuitively, obviously, if you have two very variable genes, then you should be able to pick out whether they're correlated or not. This also means that whenever you look at single cell data, you should expect to see very weak correlations in the data, which are weak not because the biology is weakly correlated, but because the measurement method gives you weak correlations. Those correlations are still incredibly statistically significant. They just look very weak. So here's now an example. If we just take spike-in genes, so these are synthetic genes, and we plot their CV versus the mean, we can now see that they do sit on this curve, the CV squared is one over the mean, and there's some tail where we look at the method noise, which is fairly low. Okay, so these are just controls to convince ourselves that our data is good. And now if we look at a real population of cells, still looking one gene at a time, we'll have genes which are more variable than one over the mean would predict, or in this case one over the square root of the mean, because we're looking at the CV, not the CV squared. So we have many genes which are sitting above this line. So this is now very similar to the first plot I showed you, but now done with many more cells. And because the droplets are less variable than wells, we get higher uniformity, so lower measurement noise, so we can pick out these variations. Okay, so this is one gene at a time. Of course, the real point is not necessarily to just pick one gene and follow it up, but to think about the structure of the data. So how should we visualize this data?
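As an aside, the effect of detection efficiency described above is easy to simulate. The sketch below is illustrative and not taken from the work being presented: it draws a super-Poissonian gene and a correlated partner gene, binomially downsamples both with an efficiency beta, and shows the Fano factor being pulled toward one and the gene-gene correlation being dampened. All numbers are made up.

```python
# Sketch: how a detection efficiency beta dampens super-Poissonian noise and
# gene-gene correlations. Illustrative numbers only.
import numpy as np

rng = np.random.default_rng(0)
n_cells, beta = 5000, 0.07            # ~7% capture, as in the spike-in example above

# A bursty gene: a Poisson rate that itself varies from cell to cell (super-Poissonian).
rate = rng.gamma(shape=2.0, scale=25.0, size=n_cells)   # mean ~50 copies per cell
true_counts = rng.poisson(rate)

# A second gene correlated with the first through the shared rate.
true_counts2 = rng.poisson(0.5 * rate)

# Measurement: each molecule is captured independently with probability beta.
obs = rng.binomial(true_counts, beta)
obs2 = rng.binomial(true_counts2, beta)

def fano(x):
    return x.var() / x.mean()

print("Fano, true     :", round(fano(true_counts), 2))   # well above 1
print("Fano, observed :", round(fano(obs), 2))            # pulled down toward 1
print("corr, true     :", round(np.corrcoef(true_counts, true_counts2)[0, 1], 2))
print("corr, observed :", round(np.corrcoef(obs, obs2)[0, 1], 2))  # weaker, still nonzero
```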
And now just for a quick show of hands. So how many of you are familiar with principal component analysis? I would assume probably many of you are. How many of you are not familiar with principal component analysis? Okay, so there's a few of you who are not familiar. So the general starting point is that we have a very high dimensional data structure, and we can't possibly look at the full data structure. But more than that, we don't actually believe that just because there's 20,000 genes, each one of them is behaving independently of the others. So what we're going to do is we're going to try to rotate our space and pick out directions which contain most of the information about the population structure. Principal component analysis is a simple linear method for picking out directions in gene expression space. And formally, what we're doing is we're looking at the eigenvectors of the covariance matrix, ranked from the highest eigenvalue to the lowest. The idea being that the eigenvector with the highest eigenvalue is the direction which explains the most variance, then the next eigenvector is the direction which explains the next most variance. And because the covariance matrix is real and symmetric, these eigenvectors are orthogonal. So we are essentially picking out directions which are each at 90 degrees to the others and gradually covering space. In this very simple example, this is just going from two dimensions to one dimension. You've got a scatter of points, and the first principal component will just be along this elongated streak, and the second principal component will be precisely 90 degrees to it, and will explain the smaller variation. So this is a way that we can start to reduce dimensionality. So now the question is, how many principal components do we need? We can certainly plot two of them. How would you go about answering how many principal components you need? Sorry? Say that again? For an arbitrary data set, which I give you with 20,000 genes, I ask you how many principal components contain the meaningful variation in this data set. Three, why three? Okay, so I could certainly visualize three. That's a good reason to go for three. But it doesn't necessarily mean that three is the number, right? So we need a formal approach here. And this is actually a nice piece of old physics which comes into play. To answer this question, what we can do is go back to a problem in high energy physics from the 50s, where people were looking at essentially the energy levels of a nucleus. And they were asking, well, what happens if I had random interactions? I could get a spectrum of energy levels which was essentially random. So this gave birth to the field of random matrix theory. And in random matrix theory, what you start off with is saying, what if I just took a giant matrix, of gene expression in our case, and instead of populating it with real data, I just threw noise into it? So normally distributed noise; it doesn't actually really matter what the noise structure is. And let's make this matrix really, really large. So I have essentially an infinite number of cells and an infinite number of genes, but they're all noisy. And the ratio between the number of genes and the number of cells stays finite; call it q.
At this point, I look at the spectrum of the eigenvalues, and it turns out that they have a very, very clean limiting distribution. This work was first done by Marchenko and Pastur, so this is the Marchenko-Pastur distribution, which has a hard upper cutoff, this lambda plus, and a hard lower cutoff, lambda minus, and there's a very particular shape to this distribution between the upper and lower limits. So what we can now do is we can say, well, look, our single cell data has a lot of noise in it; let's go and look for this limiting behavior in our data set. Or, if you just want to do bioinformatics, let's just randomize our data and generate eigenvalues. That's essentially the less elegant way of doing it. And you get a very, very clear cutoff, and anything that's to the right of this cutoff are eigenvalues that you could not possibly have found by chance. You can now count those, and that gives you an upper bound on the number of meaningful directions in your data set. So in this case, this is from embryonic stem cells, and it turns out that there are 14 eigenvalues sitting above noise. So that gives you some idea, okay? It may or may not have a meaning. It's only defined from a purely statistical perspective as the directions which maximize the variance in a population. Whether those actually turn out to correspond to meaningful gene regulatory modules is really a very interesting question. And it turns out in many cases they do, but they don't have to, okay? Each one of these eigenvalues has an associated eigenvector, which corresponds to one set of genes coherently changing in expression. Okay, so this is good news, but it's also bad news, because what are we gonna do now with 14 principal components? How do we visualize this data? So the other point to say is, we've now drawn a box around the data. It's a 14 dimensional box. That's very useful, because we've gone down from 20,000 dimensions to 14. The ES cells are actually very low complexity; in many of our data sets, the actual box is 100 or 150 dimensions, so it's much higher dimension. But we have to realize that this is what I might call the extrinsic dimensionality of the data. It's: if I were to just draw a box around the data, what box would I need? So to give you an idea, if I take this cube, this cube is three dimensional. But if I ask about the surface of this cube, it's two dimensional. So in fact, I could probably parametrize the entire surface of this cube using two coordinates and not three, okay? So that means that there is another, more restrictive notion of dimension, which is the intrinsic dimensionality of the data. I can have a very complex structure that's been folded in on itself into, say, 14 dimensions. Here's a very simple example: here's a one dimensional structure, but I have to draw it in two dimensions because it snakes around. I just need one coordinate to know where I am, but of course the extrinsic dimensionality is two. Okay, so that's good. It turns out that these problems of intrinsic and extrinsic dimensionality have been thought about in the mathematical field for a while, and there are really practical problems in machine learning and computer vision, say, feature identification in images.
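Here is a minimal sketch of the eigenvalue-counting idea described a moment ago, on a hypothetical toy matrix with two planted gene modules; for unit-variance noise the Marchenko-Pastur upper edge is (1 + sqrt(q)) squared, and the shuffling at the end is the "less elegant" randomization route. This is an illustration, not the analysis from the talk.

```python
# Sketch of counting significant principal components against a noise baseline.
# Toy data and names are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
n_cells, n_genes = 2000, 500

# Pure noise plus two planted "gene modules" that covary across cells.
X = rng.normal(size=(n_cells, n_genes))
module_activity = rng.normal(size=(n_cells, 2))
X[:, :50] += 0.5 * module_activity[:, [0]]       # module 1: first 50 genes
X[:, 50:120] += 0.5 * module_activity[:, [1]]    # module 2: next 70 genes

Xz = (X - X.mean(0)) / X.std(0)                  # z-score each gene
eigvals = np.linalg.eigvalsh(np.cov(Xz, rowvar=False))

# Marchenko-Pastur upper edge for a genes/cells ratio q and unit-variance noise.
q = n_genes / n_cells
mp_upper = (1 + np.sqrt(q)) ** 2

print("MP upper edge:", round(mp_upper, 3))
print("eigenvalues above the edge:", int(np.sum(eigvals > mp_upper)))   # ~2 here

# The brute-force alternative: shuffle each gene independently and re-measure.
Xperm = np.column_stack([rng.permutation(Xz[:, j]) for j in range(n_genes)])
null_max = np.linalg.eigvalsh(np.cov(Xperm, rowvar=False)).max()
print("largest eigenvalue after shuffling:", round(null_max, 3))        # near the MP edge
```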
In the image case, the features would be, say, pixel values, but the actual number of ways that the pixels can be arranged with respect to each other is much, much lower than the space of all possible pixel values. And when you think about every measurement being a single point, you now face a practical problem. So on this box, there's really no ambiguity about whether you're in the cardboard or outside of the cardboard, because the atoms are very densely packed and there's a very clear separation. This line is a continuum; you're either on the line or off the line. But when I draw a point cloud, you might start to notice that the dimensionality I assign depends on the length scale at which I'm focusing. So if we have a look at a single point, it's essentially just a zero dimensional point. If I zoom in very, very locally, it looks like there are no real features here; I would probably need two dimensions to describe this, it's just a random scatter I could fit a surface to. If I zoom out further, actually it looks like a one dimensional curve would describe this data pretty well, with some noise around that curve. And if I zoom out even further, I might say, well, this entire structure is a feature, and maybe I actually need two dimensions again, because it's part of a wider picture and there's another feature out here, and it would be inefficient to describe this as a curve because I need another set of coordinates to tell me whether I'm on another feature. So it could be that at a much larger scale, if I zoom out even further, this would look like part of a bigger two dimensional picture. So clearly there's no single answer for what the intrinsic dimensionality of your data is. So one of the challenges is now to ask, can you discover meaningful features by looking at different length scales in the data? And the length scales here are in gene expression space. Okay, so let's go back to this idea that we're in 14 dimensions and we're trying to discover some sort of lower dimensional structure. So again, this is a problem for which we can just borrow freely from other fields. Principal components, and this idea that we can stick to three principal components, is all about projection: I'm going to just take three directions in space, I'm going to rotate and only look at those three directions, and I'm going to throw away all of the other dimensions. So I'm going to collapse all my data onto a single projection, or onto a very small number of projections. A more general approach, rather than projecting, projection being a very special form of mapping, is to find a way of mapping high dimensions into low dimensions. So for example, one challenge we could pose, and there are different ways of formulating this, is: find a two-dimensional layout such that the error in the pairwise distances between any two cells i and j in two dimensions is minimized with respect to the pairwise distances in high dimensions, okay? So if two cells are close together in high dimensions, I want to put them close together in two dimensions. And if cells are far apart, I want to keep them far apart. Now clearly there's no unique solution to this. A very simple example: imagine a tetrahedron. These are four points where all of the points are equally spaced. If I now try to draw a tetrahedron in two dimensions, I can never put four points equally spaced in two dimensions. There's just no exact solution of that kind.
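The tetrahedron point is easy to demonstrate with classical multidimensional scaling, the most basic of the embedding methods mentioned just below; this is an illustration, not code from the talk, and some residual error (the stress) is unavoidable.

```python
# Sketch: four mutually equidistant points cannot be laid out in 2D without
# distorting some pairwise distances.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# Vertices of a regular tetrahedron (all pairwise distances equal).
tetra = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], dtype=float)
d_high = pdist(tetra)
print("3D distances:", np.round(d_high, 2))            # all identical

# Metric MDS: minimize the mismatch between low- and high-dimensional distances.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords_2d = mds.fit_transform(squareform(d_high))
print("2D distances:", np.round(pdist(coords_2d), 2))   # necessarily unequal
print("residual stress:", round(float(mds.stress_), 3))  # the error we must accept
```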
So whenever I reduce dimensionality, I'm going to introduce some error, and now it becomes a bit of an art, or rather a question of which errors you're willing to live with, to put it most generously. We can choose different distance metrics and different error functions. Minimizing means that we're minimizing some error: we can choose L2 norms, L1 norms, Frobenius norms. We can define distances based on correlations, cosine distances, Euclidean distances, Mahalanobis distances; there are many, many different ways that we can think about distance, and you can invent your own, right? So there are many, many different ways of doing this, and the different ways of doing this give rise to different techniques, each of which has its own acronym. So multidimensional scaling is the most basic form. Then stochastic neighbor embedding is another one. T-distributed stochastic neighbor embedding, which treats the penalty as having a fat tail, a t-distribution, is currently one of the most powerful methods, and I'll show you some data with t-SNE plots very soon. It's very, very effective. And then locally linear embedding and so on. And of course, each one of these has its own parameters. So because of the idea that scales really matter, you can choose which scale to focus on for each one of these problems. Yes? [Audience question, inaudible.] Yeah, so actually, this raises a really, really good point which I wanted to raise and I forgot. There's a really big question, even when you're doing this, even when you're doing principal component analysis, about scale. In this picture, if I rescale my x and y-axes, I will rotate my principal components around. I could, for example, amplify the y-axis 100-fold and compress the x-axis, and then my first principal component would almost entirely align with the y-axis. So all forms of dimensionality reduction are sensitive to scale. And that is a problem for gene expression, because we do not have a natural scale to think about gene expression. Is a low-expressed transcription factor less important than tubulin? Tubulin will be expressed at very high levels. So what about ribosomal proteins? They're basically much of the mRNA in a cell. If we don't think about scale, we're just going to be doing ribosomal protein profiling whenever we look at data, because those are the most abundant genes, and small fluctuations in those genes contain much more variability at an absolute molecule level than other genes. Just to drive home the point that there's really no unique solution here, let's think about different ways that we could treat our data. So let's take pure Poisson data as a starting point. We know that the variance of a Poisson process is just equal to the mean; that's basically Poisson. So if we were now to just take Poisson distributed variables and we don't normalize them, our principal components are going to be dominated by high-expressed genes. All right, well, that's no good, because we don't want to pay too much attention to the high-expressed genes just because they happen to be highly expressed. So what we can then do instead is we can say, well, let's instead look at the coefficient of variation of the genes. So the practical suggestion here is that we take x and we rescale it by the mean. All right, so let's just call this y.
All right, well, now if we wanted to work out what the variance of y is, it's just going to be the variance of x divided by the mean squared. So it's just going to be the CV squared of x. But that's no good, because we know that this now goes as 1 over the mean. In other words, now all we're going to do is pay attention to the lowest expressed, noisiest genes. So if you just mean-normalize your data, you're just going to be looking at the noise. If you don't normalize your data at all, you're going to be looking at just the most abundant genes in your data set. There's no right answer. And there are many different normalization schemes, so this is my own preference, I'm not making this an official recommendation because I haven't thought about this deeply enough: what we do is we define our data as x divided by sigma x. In other words, this is the equivalent of z-scoring, defining a z-score where you normalize by the standard deviation. Yeah, so that's it, but the subtraction, shifting the whole distribution, I mean, yeah, sure, that's actually irrelevant, because a translation doesn't change anything. Translating your data set doesn't change anything, right? So you're absolutely right, what we're doing is we're zero-centering the data around the mean. That actually is less important; principal component analysis would be insensitive to that. So this is actually a really important point: when you look at data, you have to make a choice. What this does is it essentially treats every single gene as equal. And now principal component analysis is not looking at the eigenvectors and eigenvalues of the covariance matrix; it's looking at the eigenvectors and eigenvalues of the correlation matrix. So what it's saying is that the most important directions are the ones that couple the most genes together, irrespective of whether they're high expressed or low expressed. That's a very particular decision that we're making here, and it could be wrong. Why would a process which involves 100 genes be more important than a process which involves 10? This is going to rank those principal components according to how many genes are implicated. Okay. Anyway, those are the tricks of the trade, or the dirty tricks of the trade, yeah. X is, yeah, so I didn't define my terms. X is the number of counts of a gene in a cell. So X i, say, is the count in cell i, and the idea is that X is sampled from a probability distribution over X, right? And I was just giving some intuition by thinking about a Poisson distribution, but of course it could be sampled from a different distribution. Because a major feature of our data is Poisson-like, this gives a very good intuition for what's gonna happen, but of course some genes are much more variable than you'd expect. So some genes are very highly expressed in a very small set of cells, and then of course this is not quite true. And in fact for those cases, it doesn't matter how you normalize, right? You just get good features anyway. So if your data is very block diagonal, so that there are genes expressed only in specific subsets of cells, it just doesn't matter how you normalize, because you're basically looking for presence or absence of a gene. But in many cases, this sort of thing does matter. Okay, yeah. Okay, so now let's get back to where we were. We're going to try to minimize a distance, and the choice of distance depends on the choice of scale, right? So two genes will be further apart under one normalization than another.
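Here is a small sketch of the three choices just discussed, raw counts, mean-scaled counts, and z-scored counts, on a hypothetical toy table with one abundant housekeeping gene, one low-expressed noisy gene, and a small module of genes that actually separates two cell states. Which genes dominate the first principal component changes completely with the normalization.

```python
# Sketch: the same toy data under three normalizations, and which genes end up
# dominating the first principal component. Toy table and names are hypothetical.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
n_cells = 1000

# Three kinds of genes: a huge housekeeping gene, a noisy low-expressed gene,
# and a small module of genes that actually separates two cell states.
state = rng.integers(0, 2, n_cells)                        # hidden cell state
housekeeping = rng.poisson(2000, n_cells)                  # high mean, Poisson noise
low_noise = rng.poisson(0.5, (n_cells, 1))                 # low mean, Poisson noise
module = rng.poisson(20 + 30 * state[:, None], (n_cells, 3))

X = np.column_stack([housekeeping, low_noise, module]).astype(float)

def top_pc_loadings(M):
    """Absolute loadings of the first principal component."""
    return np.abs(PCA(n_components=1).fit(M).components_[0]).round(2)

print("raw       :", top_pc_loadings(X))                            # the big gene dominates
print("mean-norm :", top_pc_loadings(X / X.mean(0)))                 # the noisy low gene dominates
print("z-scored  :", top_pc_loadings((X - X.mean(0)) / X.std(0)))    # the state module dominates
```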
And this is just an example of what the aspiration is. This is actually a real example, where we start off with a three-dimensional space, but there's really an intrinsic manifold which is just one dimensional at a particular scale. And using a technique called t-distributed stochastic neighbor embedding, which was developed in 2008, we can unfold this structure and have a look at it. And you can imagine now trying to do that for very complex structures. Now, all of these methods introduce their own distortions and biases, so you have to be very careful with them. Okay, so another point here, while we're doing the very technical aspects, is you can ask, well, how well have I really captured the relevant linear subspace? And there's a very nice paper from Matt Thomson, who's just joined Caltech, published in Cell Systems, which looked at this problem from a formal perspective. Matt also has training as a physicist, and it comes across, because he asked a very simple question. He said, well, let's imagine that we have a true covariance matrix and then a measured covariance matrix which is noisy. And just to get an intuition for the problem, we can ask how the eigenvectors of a matrix change when I make a perturbation to that matrix. So this is just first order linear perturbation theory; it's exactly what you would have done in quantum mechanics as an undergraduate, and you can write down that the difference between the observed eigenvector and the true eigenvector is essentially the projection of the perturbation onto the other eigenvectors, and it depends on the difference between the eigenvalues. So the key insight over here is that if you have very well separated eigenvalues, you don't get much error, but if you have degenerate eigenvalues, then measurement error can make your principal components swing around. It doesn't necessarily mean that you've lost the subspace, though. So here's a more practical way of looking at things. This is real data which has been down-sampled, so down-sampling is introducing the error here, and now we look at the fractional error in the principal components. This is the first principal component, the second one and so on. And you can see that the first principal component can undergo the most down-sampling before it starts to show errors, whereas by the time you're at the 10th principal component, which is a much smaller principal component, you're much more sensitive to noise. So down-sampling is a way of asking how deep you need to go. And here's a practical way of looking at this, this is Matt Thomson's work: you take some real data and you down-sample that data, and you have multiple subpopulations of cells, which I can see here by just looking at two of the principal components, PC1 and PC3. And now I can ask how well I can cluster them and resolve them. And you can see that I can down-sample almost a hundred fold before I start to make serious errors in the classification of these cells. So this shows that the linear subspace can really be very, very robust to down-sampling. And the reason for this is that even though we have 20,000 genes and we're measuring each one of them in a very noisy way, the actual dimensionality of the data is much, much lower.
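Here is a sketch of that kind of down-sampling check, in the spirit of the analysis just described rather than a reproduction of it: binomially down-sample a toy count matrix, recompute the leading principal components, and compare them to the full-data ones. The structured direction survives heavy down-sampling; the pure noise directions do not.

```python
# Sketch: robustness of principal components to binomial down-sampling.
# Toy data and names are hypothetical.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
n_cells, n_genes = 2000, 300

# Two cell populations differing in a block of 30 genes.
state = rng.integers(0, 2, n_cells)
mean = np.full((n_cells, n_genes), 5.0)
mean[:, :30] += 20 * state[:, None]
counts = rng.poisson(mean)

def leading_pcs(M, k=5):
    Z = (M - M.mean(0)) / (M.std(0) + 1e-9)
    return PCA(n_components=k).fit(Z).components_

full = leading_pcs(counts)
for keep in [1.0, 0.1, 0.01]:                     # keep 100%, 10%, 1% of molecules
    down = rng.binomial(counts, keep)
    sub = leading_pcs(down)
    # |cos angle| between matching PCs; 1 means the direction is unchanged.
    overlap = np.abs(np.sum(full * sub, axis=1))
    print(f"keep {keep:>4}: PC overlaps {np.round(overlap, 2)}")
```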
So when you're down to 10 dimensions, you have a huge amount of information to decide where you are in those 10 dimensions. So that's really good news. It means that we can use very shallow data to very rapidly characterize the structure of a manifold. And in fact, one of Matt's sort of controversial statements from this paper has been that there's not a single single-cell data set published today where, if you down-sampled it a hundred fold, you wouldn't still be able to make all of the conclusions that were made in the paper. So people are overspending enormously on their single cell data, and you can get by with very little information per cell and still see what there is to see. Okay. So that's been very technical and I haven't really shown you much biology; it's been more about methods. Before the break we very briefly mentioned different single cell profiling methods and maybe the considerations in choosing them. I told you a bit about droplet microfluidics as a platform. We discussed a bit about developing models for thinking about noise in single cell data. We introduced ideas of linear or extrinsic dimensionality and intrinsic dimensionality of data sets, and the idea of nonlinear dimensionality reduction to have a look at data. And I guess that we also discussed a bit about some of the arbitrary decisions that have to be made in looking at data sets. Okay. So let's now just, let's see. I don't know how far this is gonna go. I'm gonna show you some pretty pictures, I guess. And feel free to stop and ask me questions. And, you know, this might end very, very quickly, in which case we could start tomorrow's lecture and I'll just think about what to do tomorrow. Well, let's just see how this goes. So we're gonna start off with a very simple paradigm, you know, places where we just need to cluster data. And I'll just show you how this comes together now. So this is related to the pancreas; I showed you a cartoon of the pancreas. It's work that's just been published. So over here is the experimental design. It's very, very standard; there's really nothing very challenging over here. We have a few human donors and we get cells from them, and we run a droplet microfluidic experiment and we profile the cells, and we get some single cell data, which is shown here as a heat map where every column is one cell and every row is one gene. And now what we can do is we can try to cluster this data. And actually if we go down to the bottom over here, this is one of those t-SNE plots. So this is non-linear dimensionality reduction in action, and it pulls apart this data set into discrete clusters. This very large cluster in the middle is beta cells, which are the main body of these pancreatic islets. So I should actually be precise: this is not the whole pancreas, it's just the endocrine system, so it's the pancreatic islets. But it's decorated by a number of different cell types. And we can then, this is a very, very well-studied system, we can go and look for some marker genes and figure out what these cells are and eventually make a catalog of cells. And then we can cut out these clusters and try to recluster and try to find internal states, and so on, and eventually we can make a big catalog. It's an atlas. So this is stamp collecting at its finest, or at least its 21st century version.
Certainly not the finest. This is how we do stamp collecting. So this is useful. It's a reference. Yeah. [Audience question, partly inaudible, about a small cluster.] Yeah, so in this particular case, you have a little cluster which is a mixture. My guess is that these are cells which are highly stressed, and therefore what's clustering them together is a stress signature. Okay, these were very poor quality samples; they sat 48 hours on ice, and the pancreas is very good at digesting itself. It was a very, very challenging sample. Yeah, so there was some stress signature there. Yeah. You know, in this case, this is a fairly well-studied system; I'll show you some examples where there are surprises. In this case, the important aspect is that it has not actually been possible to generate a whole transcriptome catalog for these cells up until now. For example, epsilon cells, which are incredibly rare, there's just no way to purify them from the rest. So what's been useful is that people actually working in this field care, because now they can look and see, for example, let's say you wanna make epsilon cells: you can now look up which transcription factors are expressed and force that expression in an embryonic stem cell, or in an early endodermal progenitor you've made from an embryonic stem cell, and try to make these in a dish. Or, let's say, you can now look at GWAS studies and find which genes localize to the different cell types based on their expression and maybe hypothesize which cells are implicated in different disorders. So there's a practical use for this. I mean, this is very important for some people. I don't think that this room is gonna get very excited by the data here, but maybe the challenge of getting to this point might be interesting. Yeah. That was embryonic stem cells. I haven't shown the equivalent over here, but it's almost certain that the number of dimensions is greater than the number of cell types we see here, because every cell type is really an orthogonal dimension, and then there's some internal heterogeneity within these states. So we have some small dimensions which just explain heterogeneity in one state. Oh, are there 14 here? It's a pure coincidence. These are different data sets. Pure coincidence, yeah, yeah. I can't remember how many dimensions there were here, and we don't look at that so often anymore. We usually include a few noise dimensions and throw them in as well, so that we sort of overshoot; there's a trade-off between finding features and getting noise. So, okay, I made the mistake of getting involved in some immunology work. So this is immunology data from the lab, and this is now looking at cells isolated from healthy lung and from lung tumors, based on the expression of markers which mark these cells as hematopoietic in origin. So these are immune cells, and you can see, again, this is a nonlinear dimensionality reduction, it's a slightly different technique, it's not so important what we're using here, and you can see these different clusters appear. And there's actually a mixture of both healthy and tumor cells over here, and some of them really separate by state, and neutrophils undergo major gene expression changes, so they completely separate in the tumor. And some of these, for example macrophages, come from monocytes.
The monocytes are sort of over here, and there are no other healthy cells in this cluster except for these monocytes, but then in the tumor you get these activated macrophages and dendritic cells that emerge. Actually, these are alveolar macrophages, which are tissue resident, and they only appear in the healthy tissue; I mean, actually, they don't disappear, but they drop about 10-fold in the tumor, and that's because these tissue resident cells are being excluded from the tumor, and monocytes are infiltrating in and differentiating into macrophages, but they're quite different in their gene expression. And so on: these are T cells, and you can see that there's internal structure, and there's natural killer cells and B cells and so on, and then there's dendritic cells, there's plasmacytoid dendritic cells; there's a whole zoo of immune cells over here. What is sort of interesting here is that even though to a first approximation what you see here is distinct cell types, you can see that, oops, you can see that there are these elongated structures, and these elongated structures correspond to very distinct changes in behavior. So as a sort of side observation, and this is again storytelling: neutrophils in a normal healthy lung, or normal healthy tissue, will enter the tissue, only live for about three days, and then they die, and they're extremely abundant, so the flux of cells into the tissue is very high, and as they enter the tissue they undergo gene expression changes. So what you're really looking at here is essentially a neutrophil differentiation trajectory that's been laid out, and you can really identify genes which are closer to blood neutrophils on one side, and then these are markers which have been associated with neutrophil death or maturation on the other. But now we can really walk along this trajectory and start to see which waves of gene expression occur as the cells transition from an immature to a mature state. And that's really not even what this project was about; it's really about the tumor. So these data sets can be fun, and they're very overwhelming, because you're looking at this and you suddenly see this interesting differentiation trajectory, but it's actually what's going on in the tumor that we're thinking about. There are many, many different hypotheses that emerge from this for follow up. Okay, yeah, the previous figure, right? That's a really, really good question. So for certain problems, like mice, it's pretty easy to combine samples together. Under very controlled conditions there are absolutely no batch effects, but typically you do get some global changes, which are even to do with the fact that you did the experiment on different days and so on. With humans, the technical term is we're screwed, because of the amount of variation between people: there were different ethnicities, there were different ages. The story behind these samples is obviously pretty sad: you have somebody in a car accident and they've donated their organs to science, and you get something 48 hours later, and it comes from a young child or it comes from an adult, and you just get whatever you get. So there are real differences. So what we did in this particular study is, this is one donor, and we did this for different donors, and then we just made sure that things looked the same.
There are more sophisticated methods, but we have a simple one where we define a principal component subspace for one donor, and then we project the other donors onto it; we use those principal components to define the subspace for all of the other donors. So that forces us to look for the variation in the subspace defined by one donor. Now of course, if there's some very interesting rare subpopulation, we wouldn't see it that way, so we also have to do it separately to make sure that we're capturing all of the information. But if we see that we basically get the same structure, this idea of using principal components to project data sets onto each other seems to work very effectively. Yeah, so for example here there are multiple healthy mice and multiple tumor mice, and they mix pretty well. There's some batch effect, I think, say this lobe has more of one mouse over here and more of another mouse over there, but by and large they still cluster together and separate, and the structure is mostly the biology here. That's a very good question. Okay, so more pretty pictures, fine, you know, for the heck of it. So we've been sort of turning the crank for the last year, and people come with a sample and we just run it through. So this is our neighbors downstairs, our neurobiologists. I guess the reason I'm showing these pretty pictures is because hopefully it'll excite you about a problem that you're working on, where you think, hey, these tools could actually be pretty useful to open up some new biology. And maybe also to think about, and I can talk a bit about, the technical problems which are unsolved. So we've been borrowing tools from other fields, but there are really unique challenges to this data which are completely different from machine learning. This is not Netflix trying to figure out which movies you'd like. We're using those techniques, but these data sets have very different features. So there's a lot of interesting, okay, now back to this, and then we can talk about that a bit later. So this is an example; they're now up to about 100,000 cells, but the general structure, this is a t-SNE plot again. So again, it's very good at finding these clusters. This big purple blob, actually, if you recluster it you'll find out there's internal structure there. So this is one of the major challenges: can we essentially see all of the structure at all scales with a single picture? Right now it's sort of a manual job of cutting out a cluster, reclustering it in its own intrinsic subspace, and so on. And there are good reasons why it's a hard problem to solve. Here, so this is a neurobiology lab, they're interested in neuronal transcriptional responses to light stimulation. So they're taking mice, they're exposing them to light, and then they're taking a time series of single cells and asking which subsets of cells are responding to light excitation. And it's a bit faded here: the blue is showing low expression, and what you can barely see as pink is showing higher expression. But just as a quick way of orienting ourselves, you can take marker genes. So the first part of the talk was all about using single markers to define cell types. Here we can take these marker genes and overlay them, and we can see that some cells just correspond to a single cluster. Other marker genes, like the oligodendrocyte markers, actually light up multiple subclusters and so on. So we have these different clusters, and that gives us some sort of idea.
It also gives us confidence in the data, and some idea, if we now look for differential gene expression between experiments, of how we can relate different cell types to each other. Okay, so this is really all about clustering. It's a very, very simple form of data analysis where we're just looking for distinct clusters. So, okay, now let's talk a bit about developmental differentiation, and this is maybe a major point I wanna make. So, we're just looking at pretty pictures, and I think for the rest of today we're just gonna continue doing that. I wanna separate the idea of looking at data from actually using it in a more formal sense. So when we look at data, we need something low dimensional. You know, that's where the three principal components came from, right? We need something low dimensional. There's probably not a unique way to look at data, and different dimensionality reduction techniques are gonna look different. We probably wanna explore the data and get some intuition for it, right? So to look at it: which genes are expressed? Which ones do I know about? So to get some sense of it. When we're trying to say something more formal, for example, what is the structure of a differentiation hierarchy, I can get things very, very wrong if I look at a two dimensional picture of my data, because I would have to rely on the fact that the two dimensional picture did not distort the key aspects of my data structure. So when we're really trying to make more formal or stronger statements, it would be useful to stay, if we can, as much as possible in high dimensions. We'd like to have a unique answer with clear assumptions about what it is we're assuming, what it is that's led us to a particular point. And it probably requires some sort of handle, some sort of formalism, right? So okay, fine. Next session I'll tell you a bit more about that. But so, how should we visualize a hierarchy? So here's an example, I'll just show a few slides on this. This is from a collaboration where we've been looking at the airway epithelium. So we discussed the intestine before, which has a hierarchy of cell types. The airway also has a hierarchy of cell types, but it's organized very differently. We have basal cells; it's a pseudostratified epithelium. I guess I can draw it, maybe it'll give a bit more appreciation for it. So this is a race through different tissues, but hopefully it also brings across the commonalities in some of these challenges. So this is a basement membrane, and there are some fibroblasts down here which we're not gonna look at. And then we have a pseudostratified epithelium. So every single cell in this epithelium is in contact with the basement membrane; that's why it's pseudostratified, it's not really separated into layers. But some of the cells have a luminal surface, and some of them don't have a luminal surface at all. And these cells are actually very, very tightly packed. So sort of a better picture would be maybe as follows. And they're so tightly packed that their nuclei squish together. So if you just look at a section, it might look like it's stratified, because it looks like there are nuclei running all the way up and down this thing. But every single cell, if you trace it back, has a basal footprint, and only some of the cells have an apical footprint.
So in this tissue, we have these basal cells, and these cells are known to self renew and then give rise, very broadly, to secretory cells and multiciliated cells. And putting this in the context of function, you all know this from having had a cold or a cough: you have secretory cells producing phlegm, mucus, and then you have multiciliated cells which are beating and pushing the mucus upwards and out, so it doesn't clog the airways. Okay? And that way we're also clearing infectious bacteria. And some of these secretory cells are also producing antimicrobial agents; there's a whole system for producing hydrogen peroxide, and there are other antimicrobial agents. So we take this, this is a structure: we have the spatial structure, we have an idea for lineage. And now we can look at this data, and we can just look at a heat map, and that's the data. So this is one way of looking at it. It obviously doesn't give us too much information. We could cluster it maybe. Maybe I should add to this: we can also label these with marker genes. So this might be keratin 5; there's actually another state here which is keratin 14 positive. So this is very much in the spirit of the first part of the talk; it's the way that we think about these cell populations, we find a marker. Now it's pretty amazing, right? Keratins are intermediate filaments. What the hell are they doing here? Actually, you know, they're just used as markers; nobody's really thinking about their role in this process. So secretory cells, well, we may have some mucins here which would label these cells. There's actually an early luminal cell which is keratin 8 positive. Again, it's just highly expressed, there's a good antibody for it, and it was discovered to label a subset of cells. And multiciliated cells, which have markers of multiciliated cells, as you'd expect: FoxJ1 is a master transcription factor of multiciliated cells, and then there's tubulins and so on. Okay, so we have different ways; there are other secretory proteins that we can put in. Okay, these are just a subset. Okay, so now we could try to look at these clusters and see what they are. We could apply t-SNE to this. You get something. It looks like there are some patterns here. How do these two relate? Well, you know, you could start to tell a story. Maybe these are differentiation trajectories. You know, if you see a skull, please leave the room. And so we were trying to think about this, and the problem with t-SNE is that it's really a professional clustering algorithm. It was actually developed in machine learning with examples like, say, face recognition. So there's nobody halfway between me and Stefano, right? Either I go in one bin or I go in the other. And it's also been shown to be very effective for handwriting, for digit recognition. So again, there's nothing halfway between an A and a B. So it's really trying to find those differences and break things apart. So we were thinking about ways of doing this, and a student in my lab came up with a very simple scheme. We're gonna do dimensionality reduction first, down to say the extrinsic dimensionality of the problem, say 20 to 100 dimensions. We're then going to do something very, very simple, which is to link every cell to its neighbors using a Euclidean distance. It actually doesn't really matter which distance metric you use, but of course you could get slightly different results.
And now we have a graph where the nodes of the graph are cells, and we're gonna draw edges between neighbors. And now we have essentially a gene-free representation of the data, which is just the topology or the structure of this graph, okay? So does it have branches, and how many branches does it have? Does it have loops, and so on? Is it hierarchical? Yes, question. So there are many things you could do, because at this point we have no fundamental theory for what we're doing, so we could come up with any representation. I'm just giving you a grandmother's recipe for visualizing data. You could pick any other recipe you wanted, and it may work better for you or worse for you, okay? In this particular case, we took unweighted edges. So what we're basically doing is we're saying there may be regions of space which are very sparse and there may be regions of space which are very dense, and we're going to correct for that by just putting an unweighted edge between nearest neighbors. So even where the density changes very rapidly, it's still just going to be the same strength of edge. But there are many other ways that you could come up with a graph. So the Euclidean distance is on a principal component analysis of the correlation matrix; that is, the Euclidean distance here is actually computed on the Z-score standardized data. Yeah, it's a choice, okay? No, this is a k nearest neighbor graph. So in a k nearest neighbor graph, I take every cell, I pick a distance metric, and I draw edges between cells. Actually, I should define it properly. So the graph G: I've got a gene expression space with coordinates X, and I've got a set of cells i, where x_i is the vector position of cell i in that space, okay? And I've also got a set of edges. The edges are given by an adjacency matrix, before any normalization, which is essentially a kernel: A_ij = 1 if the distance between x_i and x_j is less than R_ij, and we'll define R_ij in a minute, and A_ij = 0 otherwise, that is, if the distance is greater than R_ij. And R_ij is defined so that it's an inclusive graph and it's symmetric: R_ij = max(d_i,k, d_j,k), where d_i,k is the distance from cell i to its kth-ranked nearest neighbor, okay? So basically I make a list of all distances from a cell, I take the kth distance, anybody that's closer than that distance gets an edge of exactly the same weight, one, and anybody outside of that doesn't, right? And that's basically it. And again, this is a choice of a particular graph. Tomorrow I will show you that there's actually a very deep reason that this graph is a useful thing to look at, because of particular continuum properties of the operators acting on this graph, okay? So there's gonna be a good reason why this is a good graph to look at, okay? So the minimum number of neighbors every cell has is k, but because you're taking the max, you could have a cell which has more than k neighbors. This is pretty simple to imagine. Imagine that I had five points over here and one point over here, okay? And let's say k is four.
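Just to make that definition concrete, here is a minimal Python sketch of the construction, assuming a cells-by-genes matrix that has already been depth-normalized. The z-scoring, the PCA, and the max rule for R_ij follow the definition above, but the function name, the parameter defaults, and the exact preprocessing are illustrative, not the lab's actual code.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def knn_graph(X, n_pcs=50, k=4):
    """Inclusive ('or') k-nearest-neighbor graph as defined above.

    X : (cells x genes) expression matrix, assumed already normalized.
    Returns an unweighted, symmetric adjacency matrix A with
    A[i, j] = 1 if d(i, j) <= R_ij = max(d_i,k, d_j,k), else 0.
    """
    # Z-score each gene (equivalent to PCA of the correlation matrix), then reduce dimension.
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    pcs = U[:, :n_pcs] * s[:n_pcs]          # cells in principal-component space

    # Euclidean distances in PC space.
    D = squareform(pdist(pcs))

    # d_i,k : distance from cell i to its k-th ranked neighbor
    # (column 0 of the sorted distances is the zero self-distance).
    kth = np.sort(D, axis=1)[:, k]
    R = np.maximum(kth[:, None], kth[None, :])

    # Unweighted edge wherever the distance is within R_ij; no self-edges.
    # Using <= keeps the k-th neighbor itself, so every cell has at least k edges.
    A = (D <= R).astype(int)
    np.fill_diagonal(A, 0)
    return A
```

The dense distance matrix keeps the sketch short; for large numbers of cells one would normally use a tree-based or approximate nearest-neighbor search instead.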
So all of these nodes, all of these cells, are gonna be tightly connected together, right? Because that's basically the full connectivity. But for this guy, the nearest neighbors are over here, so I'm now gonna have one, two, three, and so on. So k is the minimum, but some of these nodes now have more than k, okay? And there are other recipes. You could come up with a mutual kNN graph. So this one is an 'or'; you could come up with an 'and', which means that only if you're within my k nearest neighbors and I'm within yours do I make an edge. Then you get a much sparser graph; it breaks apart a lot more. So there are different recipes that you could come up with for this, okay? You can make a directed graph, but then the arrows have to mean something, okay? In our case, we're interested in an undirected graph, okay? And at this point, our graph has essentially topological properties. Yes, question. So what I'll show tomorrow is that in the limit of large numbers of cells, k is irrelevant. But for a finite graph, qualitatively, the heuristic is basically that you want the graph to be dense enough; what you really want to do is to try and capture the topology of the structure. So if you make k too high, you'll start connecting regions which shouldn't be connected. And if k is too low, then you'll break apart regions that shouldn't be broken apart. So the weakest links in your structure, the lowest density regions of your structure, are very sensitive to the choice of k. In the limit of large numbers of cells, no region is undersampled, and then you have a wide range of k which will give you very similar representations, and, as I'll show you tomorrow, very similar predictions about behavior. Yeah, so the simplest example is, for example, in this graph, if I reduce k. Well, actually, let's imagine that I had two cells over here, or maybe three. So these cells will first of all connect to each other, and they will all connect to this cell here. If I now reduce k by one, I will break the graph apart. If this was a continuum trajectory and I tried to predict any dynamic process occurring on this graph, I would of course have now suggested that there are two completely independent lineages which never interact. Alternatively, if these were originally disconnected and I made a connection, I would suggest that one cell type can transition into another, and that is of course not a good description either. So all we can say is that if you sample enough, you can make k an unimportant variable, but it depends on the problem how much is enough. And of course it's not a local problem, because if I have a very high density of cells in one region and I sample more cells, I'm just gonna throw more cells into that region. I need the region where the interesting dynamics are occurring to be sampled enough, and often those are very rapid transitions which have few cells. So often you need a lot of cells to generate a meaningful picture. And I'm sort of hinting at some of the analysis we'll do tomorrow. But for now, we're just gonna use this as a way of making pictures. So, yeah, this is sort of an advertisement: there's a very nice TED talk, for those of you who want to look at it, about why it's really important to play around with high-dimensional data. It sounds like a silly thing, but a lot of bioinformatics is about trying to come up with the picture of how your data looks.
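Purely as an illustration of those two points, the 'and' versus 'or' recipe and the sensitivity to k, here is a small follow-on sketch. It reuses the hypothetical knn_graph function from the previous block; the mutual variant and the component-count sweep are my illustrations of the discussion, not the lab's actual pipeline.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

def mutual_knn_graph(pcs, k=4):
    """'And' (mutual) variant on PCA coordinates `pcs`: an edge only if i is
    within j's k nearest neighbors AND j is within i's. Sparser, breaks apart
    more easily than the inclusive ('or') graph above."""
    D = squareform(pdist(pcs))
    kth = np.sort(D, axis=1)[:, k]                   # k-th neighbor distance per cell
    A = ((D <= kth[:, None]) & (D <= kth[None, :])).astype(int)
    np.fill_diagonal(A, 0)
    return A

def components_vs_k(X, k_values, n_pcs=50):
    """Sweep k and count connected components of the inclusive graph.
    With enough cells there should be a wide plateau of k over which the
    structure is stable; low-density regions are the first to disconnect."""
    return {k: connected_components(csr_matrix(knn_graph(X, n_pcs=n_pcs, k=k)),
                                    directed=False)[0]
            for k in k_values}

# Usage sketch: components_vs_k(X, k_values=[3, 5, 10, 20, 40])
```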
There is no unique way to show this data, and having a method which allows you to play with your data is really a good starting point. It sounds very unscientific, but it's actually quite serious. So here's just a force-directed layout in action. So here we have our force-directed layout; this is our human airway data. And this is just labeling some genes which mark basal cells; you can see that there's a gradient here. So now the way to think about this structure is that it's actually in some way reflecting the hierarchy here. And this is self-organized; we haven't done very much. I have to actually say we didn't even develop this method for this data set, and we were sitting on those t-SNE plots for months. And one late night I just decided, let me throw the data into this, and we got this beautiful trajectory out. So this was a very honest self-organization of the structure, which completely caught me by surprise. And now, this is just for fun: we've developed a web tool where you can upload your data and make these plots. And this is not really for you guys, but for biologists; they like this kind of thing. You can look at which genes are enriched in different cell populations, and then you can figure out that those are multiciliated cells: they express tubulin and they express FoxJ1. FoxJ1 is also enriched there, and you can see that. So you can convince yourself very rapidly of what's going on. And actually, what was the surprise here for us is this branch, which is a cell type which nobody had really discussed in the lung before. And now the paper, this is how papers go, the paper has actually really been about defining what this cell type is. So fine. One thing I won't show you: this is from human cultured cells; we have the same thing for mouse primary tracheal cells, and we also have them after wound healing, where we've essentially killed off all of the cells except for the basal cells. And it's quite interesting because of the dynamics. We then have a time series, and the dynamics do not quite reflect the steady state picture. So these multiciliated cells really come out of an early luminal compartment, and what happens after wound healing is that there's a different branch which reaches here; you go directly from a basal to a ciliated cell. So it's sort of fine, you can then see how these things change. And because we're looking at a k nearest neighbor graph which is based on distance, there's nothing stochastic about this. So the structure of the graph: its layout of course is arbitrary, but the structure of the graph is purely deterministic. And that's one of the other reasons we like it more than the existing dimensionality reduction methods, which are often trying to find an optimal solution and therefore end up exploring different local minima. It's sort of hard to rearrange them once you've got a layout; you just have to go with what you have, and you don't have the graph to use as a reference. So we like it; it's very reproducible and so on. Okay, so this is the data set I'll talk about more tomorrow. So yeah. Yeah. Oh, how does it lay it out? So we're using what's known as a force-directed layout. This is a very, very well established approach for representing graphs, where every node is treated as a repelling charge and the edges are treated as springs.
So you now have an interplay between the springs, which are being stretched and want to compress, and the nodes, which are repelling each other, and the system relaxes into some sort of local minimum. In fact, none of these are global minima. So one of the things we do in the interactive tool is you can push and pull your graph and rearrange it, and get an intuition for whether it's well projected into two dimensions or whether it's not. So sometimes a branch might look like it's coming out between two branches, but you really find that it's coming from a completely different place in the graph and, you know, that's just the local minimum. So this is a nice way, before you make any strong statements based on your visualization, to quickly pull on the graph and see what happens. Yeah. Yeah, absolutely. So, no, it's interactive: once we've laid it out, we look where the basal cells are and we pull them up to the left and so on. So I would be very, very careful about making any formal statements based on how the graph looks. But we're not really making any formal statements; we're just showing you the pictures, and we haven't distorted this graph. It's not like I've taken individual cells and pulled them around; this is a layout. It's roughly the equivalent of running t-SNE until you get the picture you like, which is, you know, what a lot of people do. And then the formal statements come from asking about the properties of the graph itself. So for example, spectral clustering: we'll look at the modes of the graph and we'll identify branches. And that is completely formal; it doesn't matter how you push and pull, because it's insensitive to the lengths of the edges and so on. So we can cluster the data using very formal methods, or we can look for, say, the mean first passage time between any two nodes and obtain an effective distance between them. So there are many, many ways we can formally interrogate this structure. Clustering is really a... Yeah. Yeah. Yeah. So in topology there's topological data analysis, and there's this idea of persistent topology. So the idea would be, and k is a perfectly good example of this, it's often done with a distance epsilon, this is known as a Rips graph (I may get the spelling wrong). What you do is vary k, okay, and look for topological structures: they could be a loop or a branch. Each row over here is one distinct structure that I've defined, one topological feature. So this might be, for example, a cluster: at very low k that cluster appears, but as I increase k it disappears. Actually it doesn't drop out, it just exists for this region of k. This now might be a continuum, and I see that that continuum coalesces when those clusters disappear, and now it's very, very persistent over a long range of k, okay? I might have a loop, and the loop might form just at the right length scale, but once I connect too many points together it disappears, because now I've connected points across the loop. So formally you can start to ask, and I mean we don't actually do this, we do it by eye as it were, but in principle you could apply formal tools from topological data analysis to ask under what conditions your structures are persistent.
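To make the layout step concrete, here is a minimal sketch of laying out the adjacency matrix from the earlier sketch with a force-directed algorithm, using networkx's spring_layout (a Fruchterman-Reingold spring-and-repulsion solver) as a stand-in. The interactive tool described in the talk is not this code, and, as discussed, any such layout is only one local minimum, so different seeds give different pictures; the marker_expression vector in the usage comment is a hypothetical per-cell array.

```python
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

def layout_knn_graph(A, seed=0):
    """Force-directed 2D layout of an unweighted kNN adjacency matrix A.

    Edges act like springs and nodes repel each other; the result is one
    local minimum among many, so the layout (unlike the graph) is arbitrary.
    """
    G = nx.from_numpy_array(np.asarray(A))
    pos = nx.spring_layout(G, seed=seed)   # Fruchterman-Reingold force-directed layout
    return G, pos

# Usage sketch: color nodes by the expression of a marker gene
# (e.g. a basal-cell keratin) to see gradients along branches.
# G, pos = layout_knn_graph(A)
# nx.draw(G, pos, node_size=5, node_color=marker_expression, cmap="viridis")
# plt.show()
```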
Our experimental approach is to say: if we get three times more cells, if we vary k by a factor of three, basically we just try to answer this experimentally. If we look at three donors, if we look at three mice, if we perturb the system. So we're trying to get a more experimental view, but formally you could try to get a topological view of this; this is sort of persistent homology groups. Okay. So yeah, any other questions at this point? I guess we're almost out of time, right? 6.30 is the end, right? Yeah. So it's gonna be beautifully timed; we're basically almost at the end. Let me just, just so that maybe we can get to the end. So here's now hematopoiesis, which I discussed at a point in the earlier part of the talk. Here what we've done is we've isolated the top 2% of the differentiation hierarchy. So these are cells that express a marker, a receptor known as KIT, in the mouse. This labels anything from a stem cell up until fairly early on in differentiation, so 98% of the cells in the bone marrow no longer express KIT. And we now create one of these k nearest neighbor graphs, and in two dimensions the projection looks like a blob with radiating branches coming out of it. So one of the things we're gonna ask tomorrow is how this relates to all of the careful in vitro and transplantation assays that people have done. And I'm gonna argue that the structure that we see over here, even though it doesn't look like a hierarchy at all, it looks like a radiation, very nicely recapitulates and contains all of the fate information which people have looked at before. And in fact, we can make novel predictions based on the structure of this hierarchy. So the one thing to point out is that some of these branches look very long and some of them are very short. This is purely a feature of the way that we've purified the cells. KIT is switched off at a much later stage in erythropoiesis, which is marked by some of these cells which have started to turn on hemoglobin, shown in yellow, and in granulopoiesis, which is marked by cells that are starting to switch on lysozyme; I forget exactly what it is that's switched on over here, but it's a very late marker. These other branches, which are basophils, megakaryocytes, dendritic cells, lymphoid cells, which are gonna be B, T and NK cells, and then monocytes, they switch off KIT very early, so we just see a small stub of differentiation and then those cells are lost. Okay, so again, we think that this performs nicely compared to t-SNE and to just looking at a heat map. Okay, so now just for fun. Other experiments are going on in the lab where we're differentiating ES cells and we're trying to make motor neurons, or rather we're just looking at a motor neuron differentiation protocol. And these are motor neurons, but you can see these other branches that split off. So it's very evocative of the fact that we're not creating just the cells that we want, but now we can look at those branch points and see what's different. This is another piece of fun, from James Briggs in my lab, where we're now taking two different protocols for making motor neurons. And if we put them together on a graph, we can see how they each contribute cells. One of them is a very low efficiency method; it makes less mature cells, and then there are some differentiated cells.
And the other one gives rise to many more motor neurons, but along the way we can see that there's a divergence, and you get this loop and two different trajectories. One of these really recapitulates embryonic states very nicely, and the other one generates a set of gene expression profiles which would never appear in a normal developmental context. So we're going into an incoherent state, okay? So I showed you this immune data set before. This is just taking monocytes from a tumor, and we see that they differentiate into dendritic cells and into macrophages, but there are actually multiple branches, so we can start to look at heterogeneity of macrophage differentiation and so on. And actually there are two guys in the lab doing this; I don't really have much to show over here, but it's one of the most exciting things for me, where we've been looking at whole embryos and dissociating them over a time series. This is one piece of work with Sean Megason's lab looking at the zebrafish, and one with Marc Kirschner's lab looking at Xenopus. And we now have a more complex data structure, because we have time points, and as time progresses you can imagine that clusters look like they're very, very distinct, but we can work this all backwards and start to make a differentiation tree of development. And this is now pre-clustering the data. It's a bit hard to see, but we have a sort of branching trajectory where we've mapped the clusters we see in every time point to the clusters in the previous time point. And we can now start to ask many, many questions, like the reuse of transcription factors over time and over different lineages, or the speed at which different lineages emerge, because we have two organisms and we're hoping to scale this up. We can start to look at evolution: how do new cell types emerge? Where are they connected? So these are examples; this is still sort of early days for this project. Yeah, no, no, so we lose the spatial information, and I should really say that's true for all of these currently existing single cell transcriptomic methods: you lose that information. Yeah, so there's a lot of developmental biology, particularly for these two organisms, on early development. We're looking at the first 24 hours of the fish and the first 22 stages of the frog. These are pretty well studied, so we can orient ourselves to some extent, but we do find some cell types which are completely new here. And when you look at the in situs for these, they typically end up being cells which are distributed across the embryo, so it would have been hard to pick them up as a distinct population. So what do you mean by determined? So this tree partly involves very careful manual curation; there is a simple algorithm behind it. Yes, oh, I see, I see. So we wouldn't get that from this data. What we're getting over here is how deterministic a gene expression profile is, and that's no big surprise, right, that there's a particular trajectory. If you ask about individual cells, I think it's fairly well established that these cells are not nearly as stereotyped as in C. elegans. And actually the extreme case, I think, is the zebrafish, which has very rapid differentiation and a large amount of cell movement.
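As mentioned above, the trees combine careful manual curation with a simple algorithm that isn't spelled out here. Purely as an illustration, here is one plausible recipe for linking clusters at one time point to clusters at the previous one, by majority vote over each cell's nearest neighbors in the earlier time point. The function name, the vote rule, and the assumption of a shared coordinate space are mine, not the method behind the trees shown in the talk.

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import NearestNeighbors

def link_clusters_to_previous(X_prev, labels_prev, X_curr, labels_curr, k=10):
    """Map each cluster at the current time point to a parent cluster at the
    previous time point, by voting over the k nearest earlier-time-point cells
    of each current cell. X_prev and X_curr are assumed to live in a shared
    (e.g. PCA) coordinate space; labels_* are numpy arrays of cluster labels."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_prev)
    _, idx = nn.kneighbors(X_curr)                  # (n_curr_cells, k) indices into X_prev
    parents = {}
    for c in np.unique(labels_curr):
        neighbor_labels = labels_prev[idx[labels_curr == c].ravel()]
        parents[c] = Counter(neighbor_labels).most_common(1)[0][0]
    return parents   # {current cluster -> most likely previous cluster}
```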
And Sean's lab, who I've been doing this work with (and actually Tom here is from Sean's lab, so you can ask him questions later), do live imaging, and they can really see cells move between different domains. For example, in the neural tube, which James Briscoe discussed, you can see cells which move between different domains and only commit at a later point to one domain or another. So it's certainly not C. elegans. Okay, so it's actually 6.30. I'm sort of running over, so a summary so far, and then we'll continue tomorrow. I've broadly discussed the idea of clustering versus thinking about continuum structures. The idea that visualization is very ad hoc: we discussed how many arbitrary choices you could make, and I've sort of promised you that we'll do something a bit more formal; so far it's just been mostly visual. The idea that kNN graphs are cute for exploring data. I've shown you some pretty pictures, and I've sort of suggested that maybe visual heuristics are a good initial way of generating hypotheses for differentiation. All right, so maybe you don't need very formal approaches, because you just look at these, you make a guess, and you go and do an experiment. But anyway, hopefully tomorrow. So tomorrow what I'll do is tell you how we can use flux conservation to make predictions based on single cell data: the idea that the number of cells at a particular point in space is not arbitrary, but reflects a balance between the cells entering and leaving, and we can use that to link state and future fate. We're gonna have to go through a bit of spectral graph theory for that. I'm a total amateur, so I can't go very deep, but I'll take you through what I know. In this calculation a potential field will emerge; we're not gonna take it too seriously, but we're gonna explain exactly what assumptions are involved in this potential. And then we're gonna show how this applies to hematopoiesis, and how we're gonna use it to discover new progenitor cell states, to find some growth factors that regulate hematopoiesis, and to identify a cell cycle dependent switch. So that's tomorrow; hopefully that gives you a feeling for what's happening. Okay, thanks very much, and sorry for running over.