Welcome, everyone, to the new season of the SIB Virtual Computational Biology Seminar Series. Today we have the pleasure of hosting Bart Deplancke from the Laboratory of Systems Biology and Genetics at EPFL. His group has also been affiliated with the SIB Swiss Institute of Bioinformatics since 2013. Bart grew up in Belgium, where he studied biochemical engineering at the University of Ghent. In 2002 he received his PhD in Immunobiology from the University of Illinois at Urbana-Champaign in the US. He then did a first postdoc jointly at the Department of Cancer Biology, Dana-Farber Cancer Institute, and the Department of Genetics at Harvard Medical School in Boston, Massachusetts. In 2003 he did a second postdoc at the University of Massachusetts Medical School in Worcester, also in Massachusetts, and in 2007 he decided to move back to Europe and became assistant professor in Systems Biology and Genetics at the Institute of Bioengineering, School of Life Sciences, at EPFL. Since 2014 he has been an associate professor. At the Laboratory of Systems Biology and Genetics, Bart and his team use high-throughput sequencing, single-cell genomics, microfluidics, large-scale yeast screening, and computational approaches to characterize the regulatory code in Drosophila and in mammals, and they are also interested in examining how variation in this code affects molecular and organismal diversity. In addition to these research interests, the group is also actively developing new research tools and pipelines that enable a better characterization of gene regulatory networks. So today Bart will give us a primer on single-cell gene expression analysis. Bart, thank you again for accepting this invitation, and the floor is yours. Thank you, Ayala.
So I'm actually looking forward to this, because I think it's not going to be your conventional talk where someone presents their own data; instead, I decided to give a more technical talk to introduce you to the difficulties that are typically associated with single-cell analyses. I think many of us are really excited about the potential of these single-cell analyses, but there are a lot of caveats that one needs to take into account. It's also a very rapidly evolving field, so I predict that many of the methodologies I will show now, which we are trying to implement in order to deal with these very complex data, might already be outdated in a couple of years. Many methodologies are being developed, as we speak, to deal with single-cell analyses. I also want to thank very much two outstanding senior postdocs in the lab, Petra Schwalie and Vincent Gardeux, who set up a single-cell biology course last year; it is from this course that I assembled, and in fact condensed, the series of slides that will hopefully introduce you to some of the complexities of single-cell analyses. So again, a big thanks to them. So why single-cell analyses? Well, I think we're all aware that the genome encodes a remarkable complexity of phenotypes, cell types, organs, systems, and so forth. There's a remarkable diversity in specialization, and you can see some of the well-known systems here. We are slowly but steadily coming to grips with the sheer complexity that we're facing, in that our body has more than 100 trillion cells, and so far we have defined only on the order of 200 cell types.
Well, I think we all realize that so far we've only scratched the surface. More and more we see very intricate interactions between cells that seem to be driven by the different identities of the interacting cells, yet we have very little understanding of what these identities actually are. That this needs to be mapped and resolved has been recognized by many funding agencies, not only public ones but also private ones, such as the newly created Chan Zuckerberg Biohub, which has pledged $600 million, at least part of which will be devoted to starting to map the human cell atlas — an initiative that will probably get under way by next year. So going back to these tissues: so far, gene expression has been really good at defining what tissues look like, at least molecularly; tissue identity is captured by gene expression, to the extent, indeed, that gene expression patterns are more similar across homologous tissues of different species than between diverse tissues of the same species. There have also been great advances in trying to understand the regulatory mechanisms that give rise to these particular tissue systems, but there are great limitations associated with genomic applications as well, namely minimum starting-material requirements: the techniques are typically applied on millions of cells, which leads us to the big caveat in all these analyses, that everything is averaged. We're always looking at, say, a liver as an average system, not as a system of hundreds of millions of cells of different types. Moreover, rare cell types and states are typically not analyzed, because they simply disappear in the mass of gene expression data being generated.
So again, I want to emphasize here: each sample that we look at with bulk RNA sequencing, or bulk transcriptomics, is an average, and we have no idea of the underlying values in single cells or of the heterogeneity of the specific tissue. This is nicely illustrated here with a simple cartoon from a nice review: the bars represent the bulk analysis, but if you look a bit more carefully, you can see that the blue gene, for example, is only expressed in those three cells and not in the others. In the bulk analysis, however, you will not notice that. And I recently came across an interesting analogy in which bulk RNA sequencing is compared to trying to dissect a smoothie: what you really want to know is which fruit went into the smoothie. I think that brings the message home really nicely. So what is single-cell analysis trying to do? Well, we're trying to identify rare cell types, from early development to stem cells, or circulating tumor cells in the blood as a diagnostic tool. We'd like to look at heterogeneity in tissue composition and in cancer. We'd also like to understand temporal processes of differentiation, which I think can be better resolved using single-cell analyses. Gene regulatory networks can also be better inferred using single-cell analysis, because you're dealing with non-confounded correlations. And finally, something that I'm not going to cover here: there are also very interesting phenomena at the single-cell level that need to be studied, such as gene expression stochasticity or the appearance of monoallelic expression. There are already a couple of very nice examples. For example, here, researchers looked at early developmental programs in mouse and human, and what they found is that transcriptomes group cells according to their developmental stage.
So you can see this here; you can already see a kind of temporal pattern, going from the oocyte to the morula. If you want some more information, this is illustrated here. It also seems that each developmental stage is characterized by a small number of functional modules of co-expressed genes, as illustrated here: in these early developmental phases there first seems to be enrichment for protein transport and GTPase signaling, then transcription regulation, RNA processing and translation, and then mitochondrial biogenesis. Now, something interesting that also appeared is that the developmental phases are in fact not entirely conserved between human and mouse, as you can see between the different cartoons. Here you have the pre-major zygotic genome activation, and here the embryonic genome activation, and you can see that the stages differ a little bit in when this actually occurs. So the temporal developmental pattern is clearly different between mouse and human — again, something you can only grasp when you study these processes at the single-cell level. Also, recently, a very impressive paper from the Sten Linnarsson group analyzed whole tissues from the brain — namely the somatosensory cortex and the hippocampus — where they analyzed thousands of cells. This allowed them, for the first time, to obtain an incredible resolution of neuronal cell-type hierarchies, which are listed here. And for each of these, specific transcription factors were uncovered, again showing how you can use single-cell analyses to truly infer gene regulatory mechanisms.
An interesting finding that came out of this kind of analysis is that, for example, interneurons of similar types were found in both of these brain regions. They were also able to identify oligodendrocyte subtypes, which is illustrated here, and to identify microglia associated with blood vessels and distinguish them from perivascular macrophages — all findings that speak to the power of single-cell analysis. So that brings me to the core message of this presentation, which is to go through the kind of analysis one needs to do in order to arrive at the kinds of findings illustrated here. I will assume that most of you in the audience, or listening later, have at least some basic understanding of bulk RNA sequencing. What I will do, therefore, is focus on what is specifically peculiar about single-cell analysis, and emphasize that as we go through these analyses. First, a very brief introduction to the experimental workflow. As you know, you first need to capture the cell before you can analyze it. Then you lyse it, you reverse-transcribe the mRNA, and you pre-amplify, because you are working with very small amounts of mRNA. Then you do the library preparation, and then you proceed to sequencing. A couple of pointers here. There are many different techniques nowadays to capture these cells, from micromanipulation to FACS, microdroplets, and microfluidics. At the reverse transcription step — something I will come back to — you can add artificial RNA spike-ins in order to better control your data, or at least the noise that will be in your data. And you can also include unique molecular identifiers (UMIs).
I will come back to those in a moment — again, they are for reducing technical noise. And you can also include cell-specific barcodes, which then allow you to multiplex, for example, thousands of cells in one sequencing run. Now, an often-asked question is how many sequencing reads per cell one actually needs to truly identify, or molecularly characterize, a cell. There is some consensus in the field that one million reads might be sufficient; in fact, there are even reports that 50,000 are already sufficient, at least if you just want to identify the cell type. If you really want to fully characterize the cell molecularly, one million is probably a more conservative estimate. OK. So, single-cell RNA-seq versus population RNA-seq. The global workflow, as I already stated, is similar, both experimentally and analytically. But there are really fundamental differences between single-cell and population RNA-seq experiments that we need to consider in the analysis. First of all, in population RNA-seq you typically deal with fewer than 10 samples, and you also include replicates. The starting material is typically millions of molecules, which are stable across a sample in time. And again, as we already discussed, these are average measurements. Most genes will also be measured, because you are looking at an entire system and not one specific cell, and we assume that the expression of most genes is unchanged across conditions. These are a couple of assumptions that go into most of the analytical tools you would use for population RNA-seq. Now, the story is very different when you deal with single cells. Here you go from 100 to, nowadays, almost 14,000 cells that you analyze, and each cell contributes a transcriptome. OK.
Each cell, in a way, is unique, because it will present a somewhat different transcriptome — but whether it is also a different cell type is another question. The starting material is of course very limited, so you will typically deal with fewer than 1 million molecules, and it tends to be highly variable across cells in time. There are also highly variable influences on the measurements, such as physical stress — for example, certain cells experience stress when you FACS-sort them into a plate. The amount of material also differs per cell. Again, all of these things can, to some extent, be controlled for using spike-in controls and unique molecular identifiers. And then, very importantly, you are typically dealing with sparse data, because many genes will not be observed, either because they are not expressed or simply because, given the limited amount of starting material, you miss them when you amplify. They were there, but after processing all your data they were lost somewhere in the experimental or computational workflow. For the statisticians among you, that means you are dealing with a lot of zeros, and that presents a lot of problems. OK. So this is going to be the red thread through this talk: we are going to go through the computational workflow for single-cell RNA sequencing, and we start with the first layer of quality control and filtering. This is relatively straightforward. If you use a microfluidic device such as the Fluidigm C1, what you should do, ideally, is image the capture sites, so that you can verify that there is indeed one cell there. You may have heard of the Fluidigm problems: it was thought that many of the captured cells were single cells, but when people looked a bit closer, it turned out that cells were stacked, so that if you looked from above you would not see that there were two cells. OK.
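To make the sparsity point concrete, here is a minimal sketch — not from the talk, with an invented toy matrix — of how one might quantify per-gene "dropout" (the fraction of cells in which a gene is not detected) in a gene-by-cell count matrix:

```python
# Sketch: quantifying sparsity ("dropout") in a toy gene-by-cell count matrix.
# Rows are genes, columns are cells. Values are invented for illustration;
# real matrices have thousands of genes and cells and are dominated by zeros.

def dropout_fraction(counts):
    """Fraction of zero entries per gene (per row)."""
    return [row.count(0) / len(row) for row in counts]

toy = [
    [0, 3, 0, 0, 7],   # gene A: detected in only 2 of 5 cells
    [5, 4, 6, 5, 8],   # gene B: detected everywhere
    [0, 0, 0, 1, 0],   # gene C: detected in only 1 of 5 cells
]

per_gene = dropout_fraction(toy)
# overall fraction of zeros in the whole matrix (the "zero inflation")
overall = sum(row.count(0) for row in toy) / sum(len(row) for row in toy)
```

Statistical models for bulk RNA-seq do not anticipate an `overall` zero fraction this high, which is why zero-inflation-aware methods are needed downstream.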
So again, this just reveals a couple of biases — and of course, if you have two cells instead of one, you get a very different picture. OK. If you sort into plates, again, make sure you do a couple of control runs so that you can look at your plate and check that most wells contain exactly one cell. If, for example, you see a lot of wells with no cells, that probably means there is a well out there that has a lot of cells — again, not good, so you need to control for that. Assuming everything goes well there, you can move on to sequencing. Quality control I'm not going to dwell on, because this is typically the same as for bulk RNA-seq — the popular tools such as FastQC and so forth that deal with low-quality bases, failed cycles, GC biases, low complexity, and so on. So I'm not going to cover this much. Then, of course, we want to get to gene expression estimates: we want to derive a count matrix, which is basically a gene-by-cell matrix. Here you may have thousands of cells, so this will be a very large matrix. So how can you get good count data? What you need to do, of course, is align your reads to the genome or the transcriptome, so that you get reads and tag counts. And here is also where I will introduce unique molecular identifiers, because of course you can only map those reads after you have reverse-transcribed your material — you are sequencing cDNA. So what are these unique molecular identifiers, if you haven't heard of them? In the oligo-dT primer, for example, which anneals to the poly(A) tail of your mRNA, you can include a kind of unique barcode — the UMI — which is incorporated into each cDNA generated from each mRNA. So each mRNA, in a way, will carry a unique UMI.
That means that if you then look at your reads — for example, as illustrated here, where you have many reads representing the same transcript — what you see is that you actually count only four UMIs. That means there is an amplification bias: you amplified more than there was in reality, and in reality there were only four transcripts, represented by those four UMIs. The nice thing about UMIs is that, at the end, you get absolute count data: you know for sure that there were actually four mRNA molecules of this particular species. The same here: you have 11 reads, but they correspond to only three molecules, because only three distinct UMIs are represented. So every time you see a UMI multiple times, it indicates amplification bias, OK? Good. One little detail, though, is that UMIs can also contain sequencing errors, so most tools nowadays allow for one mismatch — this is important, and it is also something your primers should be designed for. And so now we have a molecular count matrix based on UMIs. So UMIs are great; what is the problem? The problem is that some of the most routinely used and most accepted methods for single-cell analysis — for example SMART-seq, which does full-length transcript analysis in single cells — do not allow you to use UMIs, because they sequence the entire transcript, from the 5' end to the 3' end. So some of the most routinely used methods, unfortunately, do not incorporate UMIs — something to think about when you design your experiments. That would favor, for example, sequencing only the 3' end, because that allows you to incorporate UMIs.
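The UMI-collapsing logic just described can be sketched as follows. This is a deliberately simplified greedy version of the one-mismatch rule — real tools such as UMI-tools use network-based methods that also weigh read counts per UMI — and the four-base UMIs are invented for illustration:

```python
# Sketch: collapsing reads mapping to the same gene/position by UMI,
# merging UMIs within Hamming distance 1 to tolerate one sequencing error.
# Greedy and simplified; invented toy UMIs.

def within_one_mismatch(a, b):
    """True if the two UMIs differ by at most one base."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1

def count_molecules(umis):
    """Greedy clustering: a UMI joins an existing cluster if it is within
    one mismatch of that cluster's representative; the number of clusters
    is the estimated number of original molecules."""
    reps = []
    for u in umis:
        if not any(within_one_mismatch(u, r) for r in reps):
            reps.append(u)
    return len(reps)

# 11 reads, but only 3 distinct molecules once errors are merged
reads = ["AACG", "AACG", "AACT",                  # AACT = AACG + 1 error
         "TTGG", "TTGG", "TTGG", "TTGG",
         "CCAA", "CCAA", "CCAA", "CCGA"]          # CCGA = CCAA + 1 error
n_molecules = count_molecules(reads)
```

So the 11 reads collapse to an absolute count of 3 molecules, exactly as in the slide example.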
But it also means, of course, that you will miss any splicing information, because you only sequence the 3' end, OK? So, on to quality control and filtering, the second step. Here the goal — and this is really critical — is to hunt for low-quality cells that may skew downstream results. We really want to eliminate any cell that looks suspicious. And this is a difficult exercise, because sometimes certain cell types look suspicious by nature, and we might eliminate them inadvertently. So what can we use for this? We can use the number or fraction of mapped reads. We can use the number of expressed genes per cell: if that is low, you might suspect that the cell is not of very high quality. You can also ask where reads map with respect to certain genomic features. For example, if many reads map to mitochondrial genes, then something may have happened to the cell: the cytoplasm may have leaked out, so you lost the mRNA from the cytoplasm, but since the mitochondria still contain their mRNA, the only thing you would see is RNA from the mitochondria. So if you see a lot of mitochondrially mapped reads, there is probably a quality issue. Similarly, if you spiked in a known amount of RNA and you see that most reads map to your spike-ins, then you know that the endogenous transcripts coming from your cell were either absent or of very low quality — so if a disproportionate fraction of reads maps to your spike-ins, you probably want to eliminate that cell. You can also use the expression level of housekeeping genes, which is also commonly used for normalization: if the expression levels of many housekeeping genes are suspiciously low, you probably want to eliminate the cell.
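The per-cell QC filters just listed can be sketched as a simple rule. This is not the SVM approach mentioned next — just a hand-thresholded illustration, and all threshold values and cell metrics below are invented assumptions:

```python
# Sketch: flagging low-quality cells from per-cell QC metrics.
# Thresholds are illustrative only; in practice they are chosen per
# dataset (or learned, as in SVM-based classifiers).

def qc_pass(n_genes, frac_mito, frac_spikein, frac_mapped,
            min_genes=2000, max_mito=0.2, max_spike=0.5, min_mapped=0.6):
    """A cell passes QC only if every metric looks reasonable."""
    return (n_genes >= min_genes
            and frac_mito <= max_mito
            and frac_spikein <= max_spike
            and frac_mapped >= min_mapped)

cells = {
    "cell_1": dict(n_genes=4500, frac_mito=0.05, frac_spikein=0.10, frac_mapped=0.85),
    "cell_2": dict(n_genes=800,  frac_mito=0.60, frac_spikein=0.10, frac_mapped=0.80),  # likely lysed: mostly mitochondrial
    "cell_3": dict(n_genes=3000, frac_mito=0.08, frac_spikein=0.70, frac_mapped=0.75),  # mostly spike-ins: little endogenous RNA
}
kept = [name for name, metrics in cells.items() if qc_pass(**metrics)]
```

Here only `cell_1` survives: `cell_2` fails on gene count and mitochondrial fraction, `cell_3` on spike-in fraction.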
Okay, and the Teichmann group actually developed a support-vector-machine-based approach to identify low-quality cells. It's a kind of all-in-one approach that automatically checks all the features and flags that allow you to remove low-quality cells, and it was built by training on single-cell RNA-seq data from multiple studies. So here are a couple of depictions of what a low-quality cell may look like. For example, here you have the fraction of aligned reads per cell plotted against density, and this one is suspiciously low — a low mapped fraction can indicate low-quality cells. Here you also have the number of mapped versus unmapped reads per cell, and those cells with many more unmapped reads again look very suspicious; you might consider eliminating them from your analysis. In the end, a good quality measure is also how many genes you actually see expressed. This was somewhat counterintuitive to me in the beginning, and it was nice to see: if you have at least a decent number of cells and you take all the genes that you detected across all cells, it turns out that at the individual-cell level you see about 4,000 to 6,000 genes if the assay is done properly, but when you then pool all the genes detected across those single cells, you typically find more genes expressed than if you had taken all the cells together and done a bulk analysis. That again shows you the power of single-cell analysis: you detect transcripts that are completely missed in a bulk analysis. So a good quality measure is exactly that.
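This "reconstructed population" check is essentially a set union. A minimal sketch (toy gene sets, invented for illustration) of comparing the union of genes detected across single cells against a matched bulk sample:

```python
# Sketch: the union of genes detected across single cells should match or
# exceed the genes detected in a matched bulk sample. Toy sets only.

cells_detected = [
    {"GeneA", "GeneB", "GeneC"},
    {"GeneB", "GeneD"},
    {"GeneA", "GeneE", "GeneF"},   # GeneF: e.g. a rare-cell transcript diluted away in bulk
]
bulk_detected = {"GeneA", "GeneB", "GeneC", "GeneD", "GeneE"}

# pooled ("reconstructed population") detection across all single cells
union = set().union(*cells_detected)

# transcripts seen only at the single-cell level
missed_in_bulk = union - bulk_detected
```

If `union` is smaller than `bulk_detected`, complexity was lost somewhere and the experiment deserves scrutiny; here the pooled cells recover everything in bulk plus one extra transcript.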
What you want to see at the end of the day, if you have the capacity to compare to a bulk RNA-seq sample, is that the single-cell reconstructed population should, in terms of complexity, be equal to or even better than your bulk RNA-seq sample. That is an important measure as well. And then we come to this very intriguing point of single-cell analysis, which is that there might be systematic differences in the number of RNA molecules per cell. This is really common. Why might this be? Well, cells might not be equal in size, and there is of course a scaling factor here: smaller cells might simply have less RNA, and therefore less cDNA, and it might therefore be more difficult to amplify that cDNA. This is illustrated here: here you have a small embryonic stem cell, and here a fibroblast — a mouse embryonic fibroblast. The fibroblast tends to be larger, and therefore you find, on average, many more mRNA molecules per cell than in the smaller embryonic stem cell. The problem, though, is that here I could nicely show you a depiction of the two cells, so you can figure out that one is smaller than the other and understand the data. For many of the cell types we deal with, we simply don't know what they are, so we don't know a priori whether they are smaller or larger — sometimes we just took 10,000 cells from the brain and we don't know what they look like. So the real goal here is to distinguish technical artifacts, which could happen: maybe one cell type is equally large but more difficult to lyse, so you get less cDNA and you would assume the cell is smaller, even though in reality the cells are equal in size.
So the goal really is to find out whether this is technical or biological. The amount of RNA in a cell is fundamentally linked to the cell type, but at the same time to the technical ability to obtain measurements. This effect can therefore be purely technical or biological, and it has enormous downstream implications during analysis, for example for normalization, clustering, and differential gene expression. And again, we would argue that spike-ins can aid in resolving this issue. So let's look at spike-ins. Artificial RNA spike-ins can be added at a known concentration and volume, so for these spike-ins we know the number of molecules per cell. And here we make a big assumption, namely that the endogenous transcripts will behave similarly to the spike-ins. Now, there have already been reports that, unfortunately, spike-ins do not behave like endogenous transcripts, and that their use can introduce yet another technical bias; but for good measure, what we assume here is that they do resemble the endogenous transcripts reasonably well. So now you have the number of molecules for each endogenous transcript, and the total number of molecules per cell can be estimated, which for mammalian cells ranges between 100,000 and 1 million molecules per cell. What we plot here is the mean estimated expression versus the spike-in concentration — we know this information. What you can do then is say that, given a mean expression level x, a transcript will have a concentration that can be represented by a gamma distribution, as can be seen on this graph — modeling the distribution of molecular concentrations, i.e., the log concentration given the log expression, for example in fragments per kilobase per million (FPKM).
So using your spike-ins, you can estimate the number of molecules per cell. Putting it all back together: on the one hand you may have UMIs, on the other, spike-ins. With the UMIs you can estimate the number of transcript molecules independently of amplification bias. And then, using the difference between the UMI-based molecule count and the initial spike-in molecule number, you can also estimate how much sample material you lost during or after the reverse transcription. So now we come to probably the most important part, which is normalization and noise removal. Again, you will see that spike-ins can be useful here. As you know, expression typically needs to be normalized to make samples and genes comparable; this is of course true for bulk RNA-seq as well. Why are there between-sample differences? The starting material can differ; there are differences at each experimental step — several people might be involved, or one person does it a bit differently than another; there are cell-state differences — maybe one cell is dying in apoptosis while another is going into G2/M phase, so there are cell-cycle differences — and differences in RNA content, as already mentioned. The sequencing depth can of course also differ: if you sequence samples in two different Illumina lanes, that again might introduce a bias. And then you also have within-sample differences, such as gene length and gene GC content. And so here we argue — and again, this is true for bulk RNA-seq as well — that the experimental design is critical for efficient normalization.
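The spike-in accounting just described can be sketched in a few lines: estimate an overall capture efficiency from the known spike-in input, then scale observed endogenous UMI counts back to molecule numbers. All numbers below are invented for illustration:

```python
# Sketch: estimating capture efficiency from spike-ins of known input,
# then converting endogenous UMI counts into estimated molecule numbers.
# Invented toy numbers; real designs use many spike-in species (e.g. ERCCs).

def capture_efficiency(input_molecules, observed_umis):
    """Overall fraction of input spike-in molecules recovered as UMIs."""
    return sum(observed_umis) / sum(input_molecules)

# known spike-in molecules added per cell vs UMIs actually observed
spike_input = [1000, 500, 100, 50]
spike_umis  = [ 102,  48,  11,  4]

eff = capture_efficiency(spike_input, spike_umis)   # fraction surviving RT + amplification + sequencing

# scale observed endogenous UMIs back to an estimated initial molecule count
endogenous_umis = 25_000
estimated_molecules = endogenous_umis / eff
```

With ~10% of spike-in molecules recovered, 25,000 endogenous UMIs would correspond to roughly 250,000 initial molecules — within the 100,000 to 1 million range quoted above for mammalian cells.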
I would really be happy here to refer you to the excellent paper by Hicks et al., published last year, which gives a good overview of how you should design your study in order to avoid, as much as possible, any kind of batch effect or technical noise. And just to make sure we understand: many people, when I hear them discuss this, consider each single cell a replicate — as if, once you have thousands of cells, you no longer need to design your experiment with batch effects in mind. This is of course false. It is really good practice, for example, to divide your cells over different runs and to minimize any technical effect as much as possible. By incorporating biological replicates in the experimental design and processing the replicates across multiple batches, the observed variation can actually be attributed to biology or to batch effects. So what about normalization? Here you see a couple of popular approaches used to normalize population RNA-seq data. For example, you can normalize counts by total library size, or use TPM (transcripts per million mapped reads); or you can normalize counts by gene length and total library size, i.e., FPKM/RPKM (fragments or reads per kilobase per million mapped reads). The one gives relative proportions that are comparable across samples, the other relative proportions that are not comparable across samples. Anyway, these are probably familiar tools to those of you involved in bulk RNA-seq. So what is the issue when you deal with single-cell RNA-seq? Well, first of all, we redefine, in a way, what we mean by total library size.
This is assuming you use UMIs, because then you get absolute counts, and you can simply add up all the UMI counts to get your library size — so this is quite different. And again, if you use UMIs, we are also assuming that you only sequence the 3' end of genes, so dividing by gene length does not make a lot of sense; those kinds of methods you would typically not use. So what is used now — and again, there are problems with all of these, but it's the best we have at the moment — is based on total counts: library size (the UMI count, if available), upper-quartile, or median-based size factors. There was an interesting study in Briefings in Bioinformatics that compared different normalization approaches, and even for the more popular tools currently used for single-cell RNA-seq, the entire field acknowledges that they are still not ideal for single-cell RNA-seq data. The assumption behind all these tools is that the variation is technical and that the total amount of RNA processed in each sample is approximately the same. As I already mentioned, this is not the case for single-cell RNA-seq data, and that is why many of these tools are still not very appropriate for normalization here. So a common practice, if you have spike-ins available, is to derive technical size factors — which can deal with sequencing-depth differences — and also biological size factors, for example by normalizing by average gene expression, which then accounts for differences in RNA content that are directly related to cell-size differences.
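The count-based size factors mentioned above (library size and upper-quartile) can be sketched on a toy UMI matrix. This is a generic illustration of the idea, not any specific tool's implementation:

```python
# Sketch: simple count-based size factors on a toy UMI-count matrix.
# Rows = genes, columns = cells. Values invented for illustration.

import numpy as np

counts = np.array([
    [10,  20,  0],
    [ 0,  40,  5],
    [ 5,  10,  5],
    [85, 130, 40],
])

# library-size factors, scaled so their mean is 1
libsize = counts.sum(axis=0)
sf_lib = libsize / libsize.mean()

# upper-quartile factors: per-cell 75th percentile of nonzero counts
# (zeros are excluded, since single-cell data are zero-inflated)
uq = np.array([np.percentile(col[col > 0], 75) for col in counts.T])
sf_uq = uq / uq.mean()

# divide each cell's counts by its size factor
normalized = counts / sf_lib
```

After normalization, every cell has the same total count, which is exactly the assumption that breaks down when cells genuinely differ in RNA content — hence the separate biological size factors discussed above.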
Okay, over to noise. What about cell-to-cell variability, rather than sample-to-sample variability? Let's look at the graph on the right. This is basically a pool of single cells plotted against another pool of single cells, which in a way reflects a bulk RNA-seq sample correlation, and this looks very good: the correlation is 0.92. However, if you now take individual cells out of this pool and look at the correlation between two single cells, you see a totally different picture: the correlation is drastically lower. This is actually some of our own data on mouse embryonic stem cells. So what are we looking at here? We have high-magnitude outliers: genes that are very high in one cell and much lower in others. We have a lot of dropout events, where genes are simply not detected at all in one cell versus another. And there is also quite a bit of overdispersion: there is far more variation between individual cells than there ever will be between bulk RNA-seq samples. Overdispersion, as I just mentioned, is the presence of much greater variability in a dataset than you would expect based on a given statistical model — the observed variance is higher than the variance under the theoretical model. And this leads me back to what I mentioned at the beginning of my talk: there are a lot of dropout events, meaning a lot of zeros, so there is zero inflation, which again has a large impact on many of the tools typically used for bulk RNA-seq analysis. So how can we reduce this noise? Can we tell whether the noise is biology or simply technology? Is it interesting for our question, or simply unrelated?
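Overdispersion has a simple operational check: under a Poisson model, a gene's variance across cells equals its mean, so the variance-to-mean ratio should be about 1. A minimal sketch with invented toy counts:

```python
# Sketch: detecting overdispersion via the variance-to-mean ratio.
# Under a Poisson model this ratio is ~1; single-cell counts are
# typically far above 1. Toy counts invented for illustration.

import statistics

def dispersion_ratio(gene_counts):
    """Variance / mean across cells; > 1 indicates overdispersion."""
    return statistics.pvariance(gene_counts) / statistics.mean(gene_counts)

steady_gene = [4, 5, 3, 6, 4, 5, 4, 5]       # tight around its mean
bursty_gene = [0, 0, 30, 0, 1, 45, 0, 2]     # dropouts plus high outliers

r_steady = dispersion_ratio(steady_gene)
r_bursty = dispersion_ratio(bursty_gene)
```

The bursty gene's ratio is an order of magnitude above 1, which is exactly the pattern (dropouts plus high-magnitude outliers) that breaks Poisson-based bulk tools and motivates negative-binomial and zero-inflated models.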
And so here is an overview of how to try to deal with this kind of variation. Here is the total variation observed. We want to find the technical noise, which can be estimated based on spike-ins or on a set of non-variable genes, housekeeping genes. Again, this comes with big assumptions, but it's the best we can do. And then here you have the biological noise, which could be interesting: for example, arising from subpopulations, which we can detect by clustering. There might be differences in transcriptional kinetics, and there might also be differences due to biological processes such as the cell cycle. And it is therefore the task of the analyst to find out which kind of noise he or she is dealing with, okay? So for technical noise there are actually ways to deal with it, again assuming you have technical spike-ins, right? This is a bit of a busy graph, so let me guide you through it. What we plot here is the average normalized read count, and here is the square of the coefficient of variation, right? And the higher something is expressed, the less you expect it to be variable, okay? And so what we can do now is base the overall distribution on the spike-ins. Here you have the spike-ins: you know how much you spiked in, so you can plot this very nicely, right? You can also see that some spike-ins are actually completely useless; even in the spike-ins there's still quite a bit of noise, okay? But at least you can now do a technical noise fit, right? And then you can derive a statistical model to select only genes with high biological variation. And this is plotted here: you can build your own threshold of what you then consider truly biological versus technical variation.
So here in purple are the genes that go above this threshold of expected technical variation, and therefore you assume that they show truly biological variation, okay? So the principle again is that you estimate the technical variation from the spike-ins, and for each mean expression level you then expect a certain variation based on those spike-ins. And if the observed variation is above this expectation, such as for these purple genes here, then it is considered truly biological variation, and these are the genes that you want to look at for further downstream analysis. And again, there are some interesting tools, for example the single-cell latent variable model (scLVM), that actually deal with this kind of noise elimination, okay? So what about biological factors? Again, there are some interesting studies that have already been done here, but I think we're still only scratching the surface. For example, this is a study that looked at T cell differentiation. And we knew, because of very similar studies beforehand, that the cells in principle should cluster into naive versus TH2-committed cells, as marked by GATA3 expression. But the interesting thing is that when one looked at this data, this clustering was actually not observed, okay? So it was concluded that there was a lot of heterogeneity that seemed to mask the interesting biology. And the variance was attributed to different sources, some of it biological but some of it technical, and then, interestingly, one assumed that a lot of the variation was also due to differences in cell cycle, okay? And this is illustrated here. So here is the observed expression profile, and where you should normally see a nice gradient in color, as shown in the differentiation plot here from naive to TH2, these colors are clearly intermixed, okay?
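The spike-in based noise fit can be sketched as follows: fit CV² ≈ a1/mean + a0 on simulated spike-ins, then flag genes whose CV² exceeds the technical expectation. The data and the margin used as threshold are invented for illustration, not the published values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated spike-ins: CV^2 follows the technical model a1/mean + a0.
spike_mean = rng.uniform(1.0, 100.0, size=50)
spike_cv2 = 3.0 / spike_mean + 0.1 + rng.normal(0.0, 0.01, size=50)

# Least-squares fit of the technical-noise curve on (1/mean, 1).
X = np.column_stack([1.0 / spike_mean, np.ones_like(spike_mean)])
(a1, a0), *_ = np.linalg.lstsq(X, spike_cv2, rcond=None)

# Endogenous genes: mostly technical noise, 10 truly variable genes.
gene_mean = rng.uniform(1.0, 100.0, size=200)
gene_cv2 = a1 / gene_mean + a0 + rng.normal(0.0, 0.01, size=200)
gene_cv2[:10] += 2.0  # extra biological variation

# Flag genes well above the technical expectation (margin is arbitrary).
expected = a1 / gene_mean + a0
highly_variable = gene_cv2 > expected + 1.0
```

The flagged genes are the "purple" ones in the slide's terms: variation above what the spike-in fit predicts for their mean expression.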
So what was done then was to build this kind of latent variable model to try to remove the confounding factor from the analysis, and in this case that was the cell cycle, okay? When the effect of cell cycle is removed, all of a sudden the cells group nicely according to what one would expect, right? So the solution here clearly was the use of these latent variable models to reconstruct hidden factors from the observed data, for example based on the expression of known cell cycle genes. But that phrase already betrays an assumption: known cell cycle genes, right? So here we assume immediately that these are the critical cell cycle genes, we look at how they vary across the different cells, we take that variation out of the equation, and then we find the biological signal in our data. But of course this is again an assumption. Do we actually really know which genes are truly implicated in the cell cycle, and are we maybe throwing out the baby with the bathwater? Because taking variation out of your dataset is always very dangerous, right? So again, this is by no means a perfect solution. Of course, one can think of other hidden factors, such as stress and apoptosis, that maybe one also should take into account, okay? So to summarize normalization and noise: single-cell gene expression is noisy. Maybe that's the most important message I can give you today, so really be aware. Several approaches have been developed to address diverse aspects of the specific requirements of single-cell RNA-seq in terms of data normalization, but due to the novelty and diversity of these methods, as I already mentioned, there is currently still no systematic comparison of their performance, in particular in terms of their ability to provide improvements towards biological discoveries.
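A heavily simplified stand-in for the latent-variable idea: estimate a confounding factor and regress it out, revealing a group effect it was masking. This is plain linear regression on simulated data, not scLVM itself:

```python
import numpy as np

rng = np.random.default_rng(2)
n_cells = 100

# Hidden confounder (think: cell-cycle stage) and a real group effect.
cycle = rng.normal(size=n_cells)
group = np.repeat([0.0, 1.0], n_cells // 2)

# One gene whose expression mixes both signals; the confounder dominates.
expr = 3.0 * cycle + 1.0 * group + rng.normal(0.0, 0.1, size=n_cells)

# Regress the estimated confounder out and keep the residuals.
X = np.column_stack([cycle, np.ones(n_cells)])
beta, *_ = np.linalg.lstsq(X, expr, rcond=None)
residual = expr - X @ beta

corr_before = abs(np.corrcoef(expr, group)[0, 1])
corr_after = abs(np.corrcoef(residual, group)[0, 1])
```

The group signal is far clearer after correction; but note the bathwater problem: any biology correlated with the removed factor is discarded with it.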
So I just listed, for those who are interested, some possible approaches, but as I already said, I think these will evolve rapidly, and I'm really looking forward to seeing how the field will deal with this kind of data in the coming years. Okay, good. Well, assuming you did a great job in removing all the technical variation and the noise and so forth, what you really want to do now is look at whether there is actually some biology in your data, right? So here what you want to do is look at cell types, identify them, characterize them, do gene network analysis and so forth, okay? So here you have your worked-out matrix: these are the genes you want to look at, these are normalized read counts, you did everything properly. So what can you do? We can categorize this into three types of analysis. The first is cell grouping, the second is finding markers of cell types, and the third is functional enrichment analysis, okay? So let me first focus on cell grouping. There are basically three different steps here again, three different vignettes. The first thing you want to do is dimensionality reduction. Again, this should be pretty obvious to you: you're dealing with a huge matrix of thousands of cells by thousands of genes. This is a high-dimensional dataset, and you want to try to make sense out of it, to visualize it in a way that is intuitive to you. So you want to reduce the dimensionality, okay? Then you want to do clustering, and then, if you're interested or the data lends itself to it, you want to do cell path and hierarchy discovery, okay? What about dimensionality reduction? Here we basically have two possibilities. Either you use linear approaches, right, a linear combination of the variables.
So here you assume that the data lies close to a lower-dimensional subspace, and there are a couple of popular tools one can use; principal component analysis is probably the most intuitive or the most recognized. It turns out that nonlinear approaches are actually doing very well, so they're getting increasingly popular, and I will come back to that in a moment. In fact, one of the most used methods is t-distributed stochastic neighbor embedding (t-SNE), which I'm going to come back to in a moment. And with stars I indicated here which ones are already employed in single-cell RNA-seq analysis, okay? So with principal component analysis, you want to find the directions of maximum variance in your high-dimensional data, and that allows you to project the data into a lower-dimensional subspace while keeping the most information, okay? The principal components are simply linear combinations of the initial variables, okay? If you don't really remember how PCA works, there are a couple of very cool tutorials; for example, doing a PCA on a teapot is particularly insightful for understanding how it works. So this, for example, is a dataset from Barbara Treutlein and Stephen Quake's group that was published in Nature in 2014, where PCA was applied to visualize the data they were dealing with; this was for mouse lung development. I'm going to come back to this dataset in a moment. Now it turns out that using non-parametric tools, where you don't attach a lot of significance to the actual values but are just trying to find ranks, is also very useful for visualizing the data, projecting it into two or three dimensions while preserving the high-dimensional relationships. The advantage here is that it works well with large amounts of heterogeneous data.
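A minimal PCA sketch via SVD, on simulated data where a handful of genes separates two cell types (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# 60 cells x 50 genes; two cell types differ in the first 5 genes.
cells = rng.normal(size=(60, 50))
cells[:30, :5] += 4.0

# PCA by SVD on the centered matrix; each principal component is a
# linear combination of the initial variables (the genes).
centered = cells - cells.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
pcs = centered @ Vt[:2].T  # projection onto the first two PCs
```

The leading component captures the direction of maximum variance, which here is exactly the cell-type difference, so the two groups land on opposite sides of PC1.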
The disadvantage is that this is an optimization-based method, so there's no unique solution. And t-SNE, the nonlinear method I already alluded to, was actually not developed for single-cell RNA-seq analysis at all. It was developed to solve a particular question: can we group all handwritten digits of the same type in an intuitive fashion, okay? And again, there's no systematic value you can assign to these digits. You have to somehow sort through all the sixes, which each loop somewhat differently, to make sense of them and say, okay, all of these are indeed sixes. So this was developed in 2008, and the method actually did a good job in grouping the different types of digits, as you can see with the colors; each color represents a different digit, okay? And without going into too much detail, we can now see several seminal papers in the field that indeed use t-distributed stochastic neighbor embedding, this nonlinear way of visualizing the data, to make sense out of it: for example, cell types in the mouse brain, cell types in the mouse gut, and cell types in the mouse retina, okay? So this deals with dimensionality reduction, but of course this just visualizes your data, right? What you don't know yet is to which cluster each cell belongs. All it does is visualize different types of clusters, but where does each cell actually belong? Just to point this out: this looks obvious only because it's the user of the analysis who has colored it, right? In reality, you don't get colors from your dimensionality reduction analysis. That is shown here.
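For completeness, a t-SNE sketch using scikit-learn (assuming it is installed; the perplexity and toy cluster layout are arbitrary choices). As noted, the embedding gives positions, not cluster labels; the colors have to come from the user or from a clustering step:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(4)

# Two well-separated toy "cell types" in 20 dimensions.
X = np.vstack([rng.normal(0.0, 1.0, (20, 20)),
               rng.normal(8.0, 1.0, (20, 20))])

# t-SNE preserves neighborhood (rank) structure, not distances; it is
# optimization-based, so reruns with other seeds give other layouts.
emb = TSNE(n_components=2, perplexity=5, random_state=0,
           init="pca").fit_transform(X)
```

On data this cleanly separated, the two groups end up as two distinct blobs in the 2-D embedding.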
Here you basically have the raw data, the raw clustering. So is this cell now belonging to this cluster, or is this cell belonging to that cluster? You don't really know, right? And that's exactly what you want to find out using clustering methods. So how do we set boundaries between individual cells? Can we systematically group cells into types, which are then represented by different clusters, okay? Yes, you can do this. This is again intuitive from bulk RNA-seq analysis. You can, for example, use hierarchical clustering based on similarities between the data points, and this is again the Treutlein and Quake data I just showed for cell types in the mouse lung. Using hierarchical clustering, this finally pushed the cells into different clusters, which corresponded to the different precursors that lead to a fully functional lung cell. The typical cell-gene matrix here is therefore displayed as a hierarchically clustered heatmap, and this works really well. Here you have the genes on the horizontal axis and the cells on the vertical axis, okay? You can also use k-means; again, there are different types of approaches to cluster your data, to group these cells into discrete neighborhoods. And this, of course, is based on trying to minimize intra-cluster distances, here in the case of cells, and to maximize inter-cluster distances, right? So k-means is one of the very many partition clustering methods one can use, which aims to directly decompose the data into a set of disjoint clusters. And again, there are a lot of different approaches one can use, and I'm not going to be the one who tells you to use one over the other; you just have to play around with the data to see what works best, okay? So finally, cell types and hierarchies. What is being done here?
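A minimal Lloyd's-style k-means sketch in plain NumPy (initial centers are seeded by hand for determinism; real implementations use smarter initialization such as k-means++):

```python
import numpy as np

rng = np.random.default_rng(5)

# Two groups of cells in a reduced 2-D space (e.g. after PCA).
pts = np.vstack([rng.normal(0.0, 0.5, (25, 2)),
                 rng.normal(5.0, 0.5, (25, 2))])

def kmeans(X, init_idx, iters=20):
    """Lloyd's algorithm: alternate assignment and center updates,
    shrinking intra-cluster distances at every step."""
    centers = X[init_idx].astype(float)
    for _ in range(iters):
        # assign each point to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned points
        centers = np.array([X[labels == k].mean(axis=0)
                            for k in range(len(centers))])
    return labels

labels = kmeans(pts, init_idx=[0, 25])
```

On well-separated data the assignment converges immediately; on real single-cell data the choice of k and of the distance space (which PCs, which genes) matters far more than the algorithm itself.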
The cool thing about single-cell data is that you can infer some pseudo-time, right? If you take an entire organ which is, for example, under development, it's one snapshot in time in principle, but what you get are cells at different stages of their development, okay? So what you can try to do is rank those cells according to pseudo-time, as if you had done an analysis over an extended period of time. Here, you just do one analysis of all the cells, and then you try to make sense of the data by saying, okay, the most likely starting cells are put first, and thereafter I try to apply this sort of pseudo-time to understand the trajectory of a cell from an undifferentiated cell into a differentiated cell, okay? And there are different methods to do so. I'm just going to briefly mention Monocle, but there are also Wanderlust, Waterfall, SCUBA and so forth that have been developed exactly to map cell types and hierarchies, okay? So again, this brings the point home: can we automate the building of these differentiation trees? Here, you need bioinformatics for the reconstruction of the transcriptomic shifts, which you can use to try to make sense of the differentiation timing. Of course, the bias here is that the user has to define what is the first cell, the early cell or the origin of the trajectory. And so here is one example where the Monocle framework was used to go from, in this case, a proliferating cell all the way to, for example, a differentiated myoblast, okay? So what is this using? This is an independent component analysis, right? You see the two different components, and then a minimum spanning tree is constructed. So what you have here is a graph of N nodes, right?
And what you then try to build is a tree with N − 1 edges that has the lowest total weight. So you try to find the shortest path, so to speak, through your data in order to find the trajectory along which, in principle, a cell beginning here will move through the different transcriptional states to its end state. And that's what you're trying to find, and this, I think, is really cool and something that single-cell analysis really allows you to do. I actually went to the single-cell genomics conference last week, and there are a lot of new tools coming online that seem very powerful, and I think it's going to be exciting to explore those and apply them to the data. Okay, what about finding markers of cell types? Of course, now that you have clusters, what you want to know is which genes distinguish these particular clusters in terms of their expression, right? And that tells you about markers: are there markers that allow you to identify this cluster, versus other markers that are characteristic of the other cluster, right? So the question here is: I have three groups of cells and I need to find the characteristic genes for group B, that is, genes with specific expression in group B in contrast to A and C. What method should I use, okay? Again, here it's still very rudimentary. Many of us actually resort to DESeq and edgeR; those of you doing bulk RNA-seq will recognize them. But there are newer tools coming online; I'm mentioning one from Kharchenko et al., which is single-cell differential expression (SCDE) analysis. What is this? This is a two-component mixture analysis where you use a Poisson distribution to model the many zeros in your data, which are characteristic of single cells.
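The minimum-spanning-tree step can be sketched directly: over N cells it always yields N − 1 edges, and on cells sampled along a noisy trajectory those edges mostly link developmental neighbors. Prim's algorithm below is a plain illustration on simulated cells, not Monocle's implementation:

```python
import numpy as np

rng = np.random.default_rng(6)

# 30 cells sampled along a 1-D differentiation trajectory in a 2-D space.
t = np.sort(rng.uniform(0.0, 1.0, 30))
cells = np.column_stack([t, t ** 2]) + rng.normal(0.0, 0.005, (30, 2))

def mst_edges(X):
    """Prim's algorithm: grow the tree one lowest-weight edge at a time."""
    n = len(X)
    dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        _, i, j = min((dist[i, j], i, j)
                      for i in in_tree for j in range(n) if j not in in_tree)
        edges.append((i, j))
        in_tree.add(j)
    return edges

edges = mst_edges(cells)
# Walking this tree from a user-chosen root gives the pseudo-time ordering.
```

The root still has to be chosen by the user, which is exactly the bias mentioned above: the algorithm orders cells, but the biologist decides which end is "early".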
And then you have a negative binomial to model the non-zero values in your data, to better model the overall distribution of gene expression values. So again, I expect that several new tools will come online to better deal with single-cell gene expression differences between clusters. I'm not going to go into gene ontology; I think most of you will be familiar with it. Once you have a set of genes that are characteristic of your cluster, you can of course do gene ontology enrichment and find out what these genes actually represent: what kind of biological process, what kind of pathway. And this is what this tool, PAGODA, is really good at. So what PAGODA does, as it says in its acronym, is pathway and gene set overdispersion analysis. What it looks for are genes that seem to be co-varying, right? And then it relates that co-variation to specific pathways, and so it can really find out which pathways seem more enriched in one cluster versus the other. And this was a recently published paper in Nature Methods, okay? So I'm coming to the end of my talk here. A very interesting and intuitive review that gives you a general overview of all single-cell RNA-seq aspects is listed here, a Genome Biology paper that was recently published. Okay, just one little pointer. I think what I've tried to convince you of is that single-cell analyses are highly complex, and I think there's still going to be a lot of computational development required to properly analyze the data.
The one thing we argue is that there's probably already some degree of standardization possible that perhaps allows non-initiated users to get their hands wet and start doing single-cell analyses, because we've heard of many labs that are in principle interested in single-cell analyses but simply feel they're lacking the bioinformatics skills to dive into the data, and so they're simply not doing them. Realizing that, we've started to develop a new pipeline in the lab, which is called ASAP: an Automated Single-cell Analysis Pipeline. Again, it has its caveats, right? It's nowhere near going to give you the complexity that many of the recently developed tools provide, but we argue that at least some standardization can be done, and you can already get a first classification of your data, and that could already be very interesting, right? So there is existing software out there, but the disadvantages are that it offers a restricted set of algorithms and methods, and there's a lack of interactivity and visualization. This is very important: it's very difficult to play around with your data, to visualize it, to show it to your experimental colleague. The required knowledge of programming and statistics is still very high, and there are non-standardized inputs and so forth, right? And so here again, I really have to thank Petra and Vincent, who came up with the idea to establish this kind of pipeline, and we had a very talented master student who really got his hands wet developing this pipeline, Edwin Shakovchi, a master student at the EPFL. So what is this pipeline? It's ASAP, and it combines state-of-the-art single-cell-specific algorithms written in R, Python, or Java. It's meant to be very interactive and user-friendly, with a web interface with 2D and 3D visualization.
It has a robust parser for a wide range of input data, so it's really not restricted in that sense, and there's nothing to install for the end user. We're also going to make it possible to install it on your own computer, where in principle you can use the web interface, okay? It has centralized computational resources, and it's meant to be a multi-user platform, right? So what used to take considerable time for a skilled bioinformatician (and again, most bioinformaticians will dive into the data and provide you with complex analyses that go way beyond this platform), at least some of the initial steps in your analysis can now be done in a few minutes without any prior bioinformatics knowledge. So that's the pipeline. Here you have the input file: this is the read-count data, which can already be a normalized matrix, but you can also do some normalization within the platform. Then you can apply filtering algorithms, such as expression-based filtering or filtering on the coefficient of variation. As I said, there is normalization based on simple scaling: for example, if you calculate the average expression across your data from one cell to the next, you can try to scale them accordingly. You can do 2D and 3D interactive visualization, dimensionality reduction, clustering, and you can also do some initial differential gene expression analysis. And finally, as I already mentioned in the overall pipeline, gene ontology. For anyone watching online, you're not going to see the demo now, but I was promised it will be added at the end of the YouTube video when it comes out, so apologies to any online user who is watching right now. For those who were brave enough to come to the hall, we're going to show you how this works in real time by playing a little video of the entire pipeline. This is better than doing it live, right? So here she already has her matrix, right?
In our case we're interested in adipose stem cells. So this is a text file, and we can now upload it, right? We indicate that this is mouse, and then we just process it. Here you can do some initial filtering; we actually decided not to do any filtering in this case, right? And the goal is now to generate a working matrix, for example by doing a scaling that tries to normalize the data across all the different cells. And there's also a little legend for each step. So this already allows you to look at your data in two-dimensional, even three-dimensional space, okay? And once you've done that, you can indicate whether you want to apply some kind of clustering to this data, right? So here we used PCA; we don't know how many clusters we want, so we use k-means and say, okay, maybe between three and eight clusters, let's have a look at what the tool finds, okay? So now it's processing. And what we found are three different clusters, right, and they're now colored accordingly. You can zoom in, you can really play around with the data, and it already kind of tells you which cells belong where; this of course will be more intuitive for you, because I'm assuming you will know which cell is in which plate and in which well, at least, right? And then you can look at gene expression: you can visualize specific genes, for example here you visualize one specific gene in your data, if you already have some hunch as to what you might expect to see, right? And here we performed a clustering on the data; we did hierarchical clustering, which is perhaps the most intuitive, okay, in order to see if we really can put cells into specific clusters. Here we see the different clusters, with each cluster containing a couple of cells. There's actually a third cluster in this case, but it only contains two cells.
So we can eliminate it, or you can assume that this one is not really important. And then you can do differential gene expression on your data: for example, you say, okay, cluster one and cluster two, you want to compare them, and it immediately gives you a list of genes that are clearly differentially expressed. And with the genes you find differentially expressed, you can then look at, for example, gene ontology biological process enrichment, and it will tell you which process is the most enriched. Importantly, you can export all the data, and you can go back and visualize it again; we can also save the plots in all kinds of formats. So this basically concludes the talk. This platform will hopefully soon be online; we're just making it robust enough so that many users can use it at the same time, and we're hoping to release it somewhere in October of this year. We will keep you posted on that. But if you would like to give feedback on the platform already, there's some toy data that you can play around with, so you can already get your hands wet, but wait a little bit before uploading your own data. Okay? So with that, thanks for your attention, and I'd be happy to answer some questions.