Thank you so much for the introduction, Kevin, and thank you for inviting me here to present my work at BIDS. It's been a pleasure to get to know the fellows and members of BIDS, and I look forward to more interaction with the group. So today I will talk about a project that's been funded by the Obama BRAIN Initiative, and it concerns the classification of cortical neurons using single-cell transcriptomics. I've had a great time working on this project with my postdoc Davide Risso and my PhD student Kelly Street, and I'd especially like to thank the principal investigator for the project, John Ngai, here in Molecular and Cell Biology. He's motivated all of our work in this area, and he has also assembled a great team of colleagues to work on this project: Elizabeth Purdom in Statistics, Nir Yosef in Computer Science, and on the biology side Hillel Adesnik, Helen Bateup, Jennifer Doudna, Daniel Feldman, Dirk Hockemeyer, and Russell Vance.

So here's an outline of my presentation. I'll begin with a few words on the BRAIN Initiative. Then I'll discuss some experimental design considerations and the main steps in the analysis of these single-cell data. I'll present one of the main data sets that we've been working with so far, cortical neurons from layers 4 and 5, and the rest of the talk will be devoted to a step-by-step analysis of these data, starting with preprocessing all the way to cluster analysis. I should say that this is very much work in progress, as we try to make sense of these large data sets that raise a whole bunch of problems.

So BRAIN stands for Brain Research through Advancing Innovative Neurotechnologies, and it's a new presidential focus aimed at revolutionizing our understanding of the human brain. We're concerned with classifying cortical neurons using single-cell RNA-seq, or transcriptomics, and our goal is to provide an approach that can be scaled up to obtain a complete census of cell types in the brain. We're focusing, as a model system, on layer 5 cells from the mouse cortex. We're applying single-cell transcriptome sequencing, or RNA-seq for short, to identify and characterize neuronal cell types and also to obtain transcriptome signatures for these cell types. We're using the Fluidigm C1 and Illumina HiSeq platforms, and once that step is done, we'll be using genetic engineering to create transgenic reporter lines of mice that mark specific neuronal subtypes.

So just as a recap for those of you with very little biology: genetic information is expressed in two main stages, following the central dogma. DNA is transcribed into RNA, and RNA is translated into protein, so protein is the product of the genetic information. RNA-seq is a way to obtain, for thousands of genes at a time or for the entire genome, expression levels in terms of RNA abundance. You may have heard of microarrays; those were developed 15-20 years ago as a high-throughput approach for obtaining transcription levels for entire genomes. Single-cell RNA-seq is another approach for achieving that same goal, but at the resolution of single cells, as opposed to measuring expression for mixes or pools of cells. So at the end of the day, what we want to have after all the preprocessing steps is, for each gene and each of a bunch of different cells, an expression level for RNA. And in any of these high-throughput assays, we have to face a variety of experimental design considerations.
Proper experimental design is essential to make sure that we can actually answer the question that motivated the study, and that we provide an accurate answer to that question. Unfortunately, experimental design is often neglected, so that trivial, easily avoidable errors invalidate the conclusions of many studies. In the case of single-cell RNA-seq assays, you open the literature and you find a whole bunch of studies where the biological effects of interest are completely confounded with nuisance technical effects. In particular, you have biological effects, like different cell types, that are confounded with batch, with C1 batch effects. That means that when you see a difference between two cell types, you cannot conclude that it's due to the biology; it could very well be due to the fact that you have two different batches. At the end of the day, you can't answer the question that motivated the study.

So among the design considerations that we're facing here, the first one is of course which cells we want to sequence: what kind of mouse genotype, treatment, time point, brain regions. The issue here is biological replication, which affects the generalizability of the results and also the power, and there are confounding issues at that stage as well. Once we've selected the kind of cells we want to sequence, the question is how to allocate these cells to components of the sequencing process. These experiments have a whole bunch of steps that are very important: there's FACS, fluorescence-activated cell sorting, the C1 runs, and the HiSeq runs. So how do we distribute the cells? It's really experimental design: you can go back to the original principles that were laid out by Fisher when he was doing agricultural experiments, and the same principles apply here. There's also a variety of sequencing protocols available: for example, what sequencing depth do we want to use? Do we use paired-end versus single-end reads? And then of course the inclusion of quality control elements. It's useful to have spike-in control sequences to provide true positives and true negatives and obtain ROC curves and other measures to validate our results.

So here are the main steps in the analysis pipeline. When we do an RNA-seq experiment, what we end up with is basically chopped-up pieces of the transcriptome. It's as if you had taken the RNA in a particular cell and just cut it up, and the first step is of course to find out where each of these little pieces originates. That's the alignment step: you take the short reads, which are a few hundred base pairs, and you figure out where they originated in the transcriptome, so you allocate them to genes. Once you have that, you compute read counts for each of the genes. That's the first step: alignment and estimation. Then you want to look at the data for quality control purposes, normalize the data, quantify the expression of different genes in different cells, find which genes are differentially expressed between different cell types, and identify new cell types using clustering techniques.

So here's a little bit more detail about these different analysis steps. The raw data for our purposes are FASTQ files that provide all the sequencing reads. We also have gene-level annotation metadata; by metadata I mean data on data, so the GC content of the genes, the length of the genes, which chromosome they're on, functional annotation for the genes. And we also have sample-level annotation metadata.
So which region of the brain these cells came from, which batch, the date of collection, and so forth. And as output from the pipeline, we want to have cluster labels that correspond to different cellular subtypes and confidence measures for the clustering, as well as lists of differentially expressed genes that will give us the transcriptome signatures for the novel subtypes.

So the first step in the analysis is preprocessing: we want to align the reads to the transcriptome and quantify the expression level of each gene in each of the different cells, and we also compute quality assessment and quality control measures for these read counts. Then exploratory data analysis: it's essential to look at the data before we just throw them into a black-box algorithm. By just looking at very basic numerical and graphical summaries of the data, we can identify major problems with the data or find the main features of the data. We'll see that it's important to do sample filtering based on quality control measures, as well as gene filtering, because there are a lot of zero counts in these data. Then comes normalization, to adjust for unwanted technical effects like batch effects. Then cluster analysis, which includes dimensionality reduction, because we have data on thousands and thousands of genes for each of our cells; that also involves the choice of a distance measure between cells, a clustering procedure, and then how to assess the validity of the cluster assignments. And finally, differential expression analysis, to find out which genes are differentially expressed between the different cell types. So those are the main steps that we'll be looking at.

This is something that I care deeply about, and I think I'm preaching to the choir here, so I'll just go very quickly through this slide: reproducible research, and here I'm using the very narrow notion of computational reproducibility, not biological reproducibility, which is much more ambitious. We're using a version control system for software development and data analysis: we're using Git. We're storing the data in R packages with standard object structures, in particular the S4 class ExpressionSet from Bioconductor. That keeps together the gene-level expression measures as well as the gene-level and sample-level annotation metadata. And we're also using systems like Sweave, knitr, and RStudio to produce our reports and papers; that's a way of keeping together all the pieces of a publication along with the code and the data.

Okay, so a little bit about the data set that I'll be presenting today. Here we're focusing on a very specific region of the brain: we're looking at layer 4 and layer 5 cortical neurons. Our goals are as follows. As a proof of principle, we want to see whether we can recover the known distinction between layers 4 and 5. Next, we'll try to identify new subtypes within layer 4 and within layer 5, and identify differentially expressed genes between the different cell classes. For this data set we have just a little over 700 cell samples, individual cells: almost 300 from layer 4 and a little over 400 from layer 5. They come from 26 mice that constitute our biological replicates. There's one dissection, one FACS sort, and one C1 run per mouse. There are eight sequencing runs, with 96 cells per lane, and we're using single-end reads. So you can see right away that, having said all these things about experimental design, there are experimental design issues with these data.
We have confounding between biological and technical effects, and batch effects are nested within layer effects; I'll show a picture that will make that a little bit clearer in a moment. Also, we've compiled a list of marker genes for layers 4 and 5: we have about 80 genes that we know are expressed either in layer 4 or in layer 5, and we have a set of about 300 housekeeping genes that are expected to be constantly expressed across cells. So the housekeeping genes are our negative controls and the markers are our positive controls to validate our results.

This is a diagram of the cortical neuron layers; it's a slice taken from the center of the brain. There are six layers, starting at the top with layer 1, and here layer 5 is the one that's highlighted in red. Using different types of mouse Cre driver lines, we can distinguish between the different layers of the brain. Here we'll be looking at layer 4 and layer 5 using two different types of markers.

Okay, so now let's start looking at some data. This is a bar plot of the number of cells per batch, colored by layer. We have twenty-something batches. The first batch here has a little over 70 cells, and it consists entirely of layer 4 cells. Here we have the fourth batch, which has about 5 cells, and they come from layer 5. So that's what I mean: the batches are nested within layers. Yes, the batches are C1 runs, FACS plus C1 runs, and there are very large batch effects. If we had had only one batch per layer, we wouldn't be able to conclude that a gene was differentially expressed between layers 4 and 5, because the difference could have been due to batch. Here, fortunately, for each cell type we have several batches, so we can get at the question of differential expression between layers 4 and 5. However, the batches are nested within layers. A better design would have been a factorial design where each of the bars here would have had some red and some blue. In this case, mouse is also nested: it's one extraction, one C1 run, one FACS sort per mouse, and a bunch of cells, because we have a little over 700 cells from 26 mice. Okay, so that's the data that we have: batches nested within layers.

Looking a little bit deeper at the data: on the left is a bar plot of the total number of reads per cell. We have about 700 cells, and we've colored the cells by batch. You can see that the total number of reads you get for a particular cell varies greatly between batches, but also within a batch. In this red batch, some cells have way more reads than others. There's a batch here, the green one, that has fewer reads than the other batches. This is before we do the mapping: we just have little pieces of a few hundred base pairs; we haven't mapped them to the transcriptome yet. The bar plot on the right shows the percentage of reads that were mapped for each of the cells. Again, that varies between batches (this batch here has a pretty low percentage of mapped reads) and it can vary within a batch as well. The colors correspond to different batches.

Okay, now what we've done is we've taken all the genes in the mouse genome and we've counted how many reads mapped to each of them for each of our samples. So you can think of a data matrix that has as its rows the genes and as its columns the cells, tens of thousands of genes and 700 cells or so, and in each cell of the matrix you count the number of reads that map to the particular gene.
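To make that concrete, here is a minimal sketch of how such a genes-by-cells count matrix and its metadata might be bundled into the Bioconductor ExpressionSet structure mentioned earlier. All object names and values here are hypothetical toy stand-ins, not the actual pipeline:

```r
library(Biobase)

# Hypothetical inputs: a genes-by-cells matrix of read counts,
# plus gene-level and sample-level annotation (metadata).
counts <- matrix(rpois(6, 10), nrow = 3,
                 dimnames = list(paste0("gene", 1:3), paste0("cell", 1:2)))
gene_meta   <- data.frame(gc_content = c(0.41, 0.55, 0.38),
                          length = c(1200L, 800L, 2300L),
                          row.names = rownames(counts))
sample_meta <- data.frame(layer = c("L4", "L5"),
                          batch = c("b01", "b02"),
                          row.names = colnames(counts))

# Bundle expression measures and metadata so they travel together.
eset <- ExpressionSet(assayData = counts,
                      phenoData = AnnotatedDataFrame(sample_meta),
                      featureData = AnnotatedDataFrame(gene_meta))
dim(eset)  # 3 features, 2 samples
```

Keeping the counts and both layers of annotation in one object is what makes the later batch- and QC-aware analyses straightforward to script reproducibly.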
So here, this bar plot on the left shows, for each cell, the proportion of genes that have a zero count. It's huge: over 80 percent of the genes have no reads. The plot on the right is just another way of looking at the same data: it's the number of cells in which a gene is detected, and that's very low. So what do we do when 80 percent of the genes in a sample have zero counts? There's not much we can do with those genes, so we're going to do some filtering for zero counts, for zero inflation.

Single-cell RNA-seq data have many more zeros than bulk RNA-seq. The zeros can occur for both biological and technical reasons: a biological reason would be that the gene is simply not expressed; a technical reason would be low capture efficiency, a problem with the technology. It's essential to filter out some of these zeros, because any downstream method will have problems. When we do normalization, we typically scale the data, and a scaling factor computed based on a lot of zeros won't work. For example, a widely used normalization method is the one implemented in the Bioconductor package DESeq, and that method discards any gene that has a zero in any sample. If we applied this normalization method to our data, we would base the normalization on only 38 out of almost 50,000 genes. The zeros are also problematic for more aggressive normalization methods like full-quantile normalization (full-quantile means you just match all the quantiles between your samples), because all the zeros cause problems with ties.

So we applied a pretty simple filter: we decided to retain only the genes that have at least 20 reads in at least 20 samples (see the sketch below). That leaves us with just below 10,000 genes out of the 47,000 genes. And this is the same plot as I showed before, but now having filtered out the zeros: for each cell, the proportion of genes that have zero counts. It's a little more manageable, but still pretty high, right? 40% here, and some cells still over 80%. So it's a problem that we'll keep having over and over again, and we'll need to address it later.

Okay, so as part of the mapping step we also record some quality control measures for each of the cells, and here's a look at them. One of them is the total number of reads; another is the percentage of mapped reads. These are box plots of the quality control measures stratified by batch. So this red box plot, for example, is the box plot of the total number of reads for that batch. There's a lot of variation between batches. If we look at the percentage of mapped reads, this green batch has a really low percentage of mapped reads compared to the rest. Other QC measures are the number of duplicate reads, reads that occur twice. Yes, Jared? Oh, the widths? It's the number of points that go into each box, because the batches have different numbers of cells in them. That's a good question, yeah. So there are batches that have very few cells, like this narrow one, and there's a thicker one here that has more cells in it. It goes back to the earlier plot showing, for each batch, the number of cells in it. Okay, here again this green batch is unusual compared to the other batches in the proportion of reads that contain the primer. It's also unusual in terms of the proportion of intergenic bases. And we could keep going and observe similar patterns with other QC measures.
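Going back to the filtering rule above, here is a minimal sketch of how the two summaries just discussed might be computed from a genes-by-cells count matrix. The variable names are mine, not the actual analysis code:

```r
# counts: genes-by-cells matrix of read counts (as built above)

# Per-cell proportion of genes with zero count (the bar plot on the left).
zero_prop <- colMeans(counts == 0)

# Gene filter: keep genes with at least 20 reads in at least 20 samples.
keep <- rowSums(counts >= 20) >= 20
filtered <- counts[keep, ]
```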
This plot also illustrates similar findings: we've done principal component analysis on the QC measures. This is a plot of the second principal component versus the first, colored by batch, and this is a box plot of the first principal component scores by batch. So, substantial variation between batches. Now we're relating the sample QC measures to the count data: we've done principal component analysis on the counts and on the QC measures. This is a scatter plot of the first principal component of the QC measures versus the first principal component of the counts. The counts are the expression measures. What this plot shows is that there's a very strong correlation between the quality control measures of the samples and the expression measures. The expression measures are what you care about; you don't want them to be driven by quality. This is another way of looking at the same phenomenon: these are bar plots of correlations between the count PCs and the QC measures, for the first count PC, the second count PC, and so on.

So just a recap of what we've seen here. The distribution of the QC measures can vary substantially between batches. Some of the QC measures clearly point to low-quality samples; there's this batch, DS11, that had some PCR amplification problems. And there's a strong association between QC measures and read counts. So it seems reasonable to filter samples based on the QC measures, because normalization methods may not be able to adjust for these QC effects, and some samples just have very low quality to begin with. So in addition to the zero-count filtering, we also used the following QC-based filter: we filtered out the cells that have either fewer than 25,000 reads, or less than a 50% alignment rate, or less than 20% of the highly expressed genes detected. This yielded 583 cells out of the 704 that we started with. Yes? So how do we define them: here, in the gene filter, it was at least 20 reads in at least 20 samples, and here 20% of the samples. These cutoffs are pretty ad hoc, right? Why not 25%? That would be fine as well. We have to make some kind of decision, so we do some kind of sensitivity analysis to figure out how much these choices affect the results. And then we'll also consider a normalization method that directly incorporates the QC measures.

Okay, so after we've applied this QC-based sample filtering, this is the number of cells that we filter out, by batch. In blue are the cells that we retain and in red are the cells that we filter out. That varies substantially between batches; again our weird batch, batch 11, has a large proportion of cells filtered out.

So now we've done both zero-count gene filtering and sample filtering based on QC. Let's look at the gene-level counts. We have this matrix of genes by samples, and we have counts. These are box plots of the gene-level counts on a log scale, colored by batch, and we want to compare read counts between samples. You can see there's substantial variation in the read count distributions between samples within the same batch, and also between batches. A better way of looking at these data is the RLE, the relative log expression: that's the log ratio of a gene's read count to the median read count for that gene across samples. If the samples had similar read count distributions, the RLEs would be centered at zero and the box plots would have approximately the same width. That's clearly not the case here: a lot of the distributions are not centered at zero and there's a lot of variation in scale, and that reveals the need to normalize the data.
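As an aside, the RLE measure just described is simple to compute; here is a minimal sketch with my own helper function, not the actual analysis code:

```r
# Relative log expression: for each gene, the log-ratio of its count
# in each sample to its median count across samples.
rle_matrix <- function(counts, pseudo = 1) {
  logc <- log(counts + pseudo)            # pseudo-count to handle zeros
  sweep(logc, 1, apply(logc, 1, median))  # subtract per-gene median
}

# Box plots per sample; well-normalized data would be centered at zero.
boxplot(rle_matrix(filtered), las = 2, outline = FALSE)
```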
So this is the same kind of display, but looking at only the 332 housekeeping genes, the genes that are expected to be equally expressed across cells. That's clearly not the case here: the RLEs are not centered at zero and the box plots have different widths.

Now we're looking at the data for the layer 4 and 5 markers. This is the second principal component of the counts versus the first principal component of the counts, for the marker genes. We would hope that the cells would separate according to layer; layer 4 is in red here and layer 5 is in blue. That's not the case; there's a lot of mixing. So before normalization we're not able to recapitulate the known distinction between layers 4 and 5. And here's the same plot as before, a scatter plot of the first principal component of the quality control measures versus the first principal component of the counts: there's still substantial association, even after having done the QC-based filtering.

Also, when we look at count principal component 2 versus count principal component 1, colored by type, there appears to be an outlying group of cells here, and that group of cells is a mix of layer 4 and layer 5. We were a little puzzled by that group of cells, so we decided to take a closer look at them. We applied a quick and dirty clustering method, PAM, partitioning around medoids. This is a measure of the average silhouette width, which measures within- versus between-cluster similarity, and it can be used to guide the choice of the number of clusters. This measure tells us we should pick two clusters, and these are the two clusters: one that's very small and one that's much larger. When we looked at the composition of these two clusters, each a mix of layers 4 and 5, we ran this past a collaborator, and he was able to conclude that these outlying cells were glial cells that had somehow snuck into the process with the neurons. So we decided to filter those out before looking more deeply at the layer 4 and layer 5 neurons. They looked at the genes, they looked at a heat map of the genes. And that's something that has happened with other projects funded by the BRAIN Initiative too: they somehow were not able to purify the neurons, and there were a lot of glia in them.

So that's why all these different plots, I think, show that you really have to look at the data before you throw everything into a clustering algorithm. We could have just taken the matrix of genes by samples and clustered, but then we would have had all these glia, and what would we have done with the zeros? So it's kind of detective work that we do at this stage.

Okay, so just a recap of what we've done so far. After gene and sample filtering, and before normalization, there are large differences in gene-level count distributions within and between batches. There's still a lot of zero inflation. We're not able to recover the layer 4 versus 5 class distinction, even looking only at the markers. And there's still a strong association between the counts and the QC measures. So it's really essential to do normalization before any downstream analysis like clustering or differential expression analysis, because we want to make sure that when we see a difference in expression measures for the same gene between different cells, that difference is really due to the biology of interest and not to technical effects. Okay.
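For reference, the quick-and-dirty PAM-plus-silhouette check described above can be done with the cluster package. This is a minimal sketch assuming a log-scale expression matrix (in practice one would first reduce dimension, as we do later with principal components); the names are mine:

```r
library(cluster)

# expr: cells-by-genes matrix (cells as rows), e.g. log counts.
expr <- t(log1p(filtered))

# Average silhouette width over a range of k; larger is better.
avg_sil <- sapply(2:6, function(k) pam(expr, k = k)$silinfo$avg.width)
k_best  <- (2:6)[which.max(avg_sil)]

fit <- pam(expr, k = k_best)
table(fit$clustering)  # cluster sizes, e.g. one tiny outlying cluster
```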
So now I'll present the general model that we've been exploring for both normalization and differential expression analysis. We have to deal with two striking features of these count data. One of them is the zero inflation; that was obvious from the plots I showed. Even after zero filtering, we have cell samples with over 60 percent zeros. The other one, and I haven't shown a plot of that, is that the count data are overdispersed. So the default model for count data, the Poisson model, doesn't apply here, because the variance exceeds the mean substantially. So we'll be looking at a class of models called zero-inflated negative binomial (ZINB) models, which can handle both zero inflation and overdispersion. And by using a suitable parameterization of the mean of the negative binomial component and of the zero inflation probability, we'll be able to perform normalization and differential expression within the same framework.

So here's the model. We let Y_ij denote the observed read count for gene j in sample i. Z_ij is an unobserved detection indicator: it's one when the gene is not detected and zero when it is detected. The count then follows a zero-inflated negative binomial distribution, with a zero component and a negative binomial component. Pi is the zero inflation probability, and mu is the mean of the negative binomial component. When pi is zero, you get the usual negative binomial model; when phi, the dispersion parameter, is zero, you get the Poisson distribution.

And here's how we parameterize the two key parameters of the model, the mean mu of the negative binomial distribution and the zero inflation probability pi. The log of the mean is parameterized in terms of four components. X beta corresponds to the covariates of interest: if we have two cell types, X would be a design matrix with a first column of ones for the intercept and then an indicator for the cell type, layer 4 or 5, so beta would be the parameter of interest, representing the effect of layer 4 versus 5. The other terms deal with normalization; those are all unwanted effects. U are known sample-level unwanted factors, with gamma the associated nuisance parameter. V are known gene-level unwanted factors, like GC content or length, with delta the associated nuisance parameter. And W is a matrix of unknown factors of unwanted variation, with alpha the associated nuisance parameter.

If no normalization were needed at all, we could just use the X beta component and standard GLM techniques to fit the model, and look at beta to find which genes are differentially expressed. If we believed that the only effects that needed to be normalized out were the quality control measures, we would just add the U term and forget the rest. But maybe there are other factors that we don't know about; that's the W component, and we can estimate W using factor analysis on control genes. It's also been observed that the zero inflation probability depends on gene-level features like the proportion of GC bases or the length; that's why we have the term with V. So the term of interest is X beta; all the other ones are nuisance parameters that we're going to try to adjust for. Okay, so that's the general model.
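To write out what was just described, here is the model in LaTeX, a sketch reconstructed from the verbal description with my own notation rather than the authors' exact formulation:

```latex
% Zero-inflated negative binomial for count Y_{ij}, gene j in sample i:
f_{\mathrm{ZINB}}(y_{ij}) \;=\; \pi_{ij}\,\delta_0(y_{ij})
  \;+\; (1-\pi_{ij})\, f_{\mathrm{NB}}(y_{ij};\, \mu_{ij}, \phi_j)

% Linear predictors for the two key parameters:
\log \mu_{ij} \;=\; x_i^\top \beta_j \;+\; u_i^\top \gamma_j
  \;+\; v_j^\top \delta_i \;+\; w_i^\top \alpha_j

% and analogously for logit(pi_{ij}) with its own coefficients, where
% x_i are covariates of interest (e.g., layer 4 vs. 5),
% u_i are known sample-level unwanted factors (e.g., QC measures),
% v_j are known gene-level factors (e.g., GC content, length), and
% w_i are unknown factors estimated by factor analysis on control genes.
```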
Let's look at the results of the different normalization methods that we've applied to the data; we've applied six different methods. These are box plots of the RLEs for each of the cells, colored by batch. The first normalization method is the one implemented in the package DESeq. It's a global-scaling method: you take the expression levels for a particular cell and divide by the same number for each gene. So it's not very aggressive; it's just scaling by one number. The second method is a regression on batch effects, as implemented in ComBat. The third method is full-quantile normalization, where you match the quantiles across samples; this is much more aggressive. And the three normalization methods at the bottom are based on the model that I just showed you: full-quantile normalization followed by regression on an unwanted factor estimated by factor analysis on control genes, the housekeeping genes (RUV); the same with regression on batch added; and regression on the first three principal components of the QC measures. So these are very different normalization methods: this is the least aggressive, and they become more aggressive as you move to the right and to the bottom.

You can see that, in terms of the RLEs, global scaling alone is not sufficient; there's a bit of improvement with full-quantile normalization and then with these regression-based methods. This is looking at the RLEs, the relative log expression, of the housekeeping genes: again, global scaling is not sufficient, with some improvement from the regression-based methods. This is looking at the correlation between the QC measures and the counts: a lot of correlation still with global scaling, with batch normalization, and with full-quantile; and of course this correlation goes away when you explicitly normalize for QC. And this is the principal component analysis of the counts, PC2 versus PC1, colored by layer. There's still a lot of mixing of layers 4 and 5 for the methods at the top, and better separation with the RUV-based method, the regression on unwanted factors, and with the QC-based method. So in the remaining analyses, the cluster analysis and the differential expression analysis, I'll use one of the RUV-based methods.

Okay, so just a summary on normalization. Different normalization methods can lead to very different distributions of the gene-level counts within and between batches. Not all methods are aggressive enough to adjust for the large differences in gene-level count distributions. And even after preliminary QC-based sample filtering, not all normalization methods can adjust for the substantial correlation between counts and QC measures. So in the remaining analyses, I'll be looking at the data that were normalized by full-quantile normalization followed by RUV normalization with the housekeeping genes, so factor analysis on the housekeeping genes.
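For what it's worth, this kind of two-step normalization can be sketched with the EDASeq and RUVSeq Bioconductor packages. A minimal sketch, assuming the filtered count matrix from before and a hypothetical character vector of housekeeping gene names; the choice of k and the other settings here are illustrative, not the talk's actual values:

```r
library(EDASeq)
library(RUVSeq)

# Full-quantile normalization between samples (lanes).
fq <- betweenLaneNormalization(filtered, which = "full")

# RUVg: estimate unwanted factors by factor analysis on control genes
# (here, housekeeping genes assumed roughly constant across cells).
hk  <- intersect(housekeeping_genes, rownames(fq))  # hypothetical list
ruv <- RUVg(fq, cIdx = hk, k = 1)

norm_counts <- ruv$normalizedCounts  # adjusted counts
W <- ruv$W                           # estimated unwanted factor(s)
```

The estimated W can either be used to produce adjusted counts, as here, or carried along as a covariate in the downstream model, which is closer to the unified framework described above.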
Okay, so we still haven't done cluster analysis, and we still haven't done differential expression analysis. So now what I'll be looking at is: can we find genes that are differentially expressed between layer 4 and layer 5? But I'll also revisit this zero inflation issue. This is a plot of the proportion of zero counts versus the average count, on a log scale. The lowess fit to the observed data is in red, and the three other curves are fits for different models: in cyan is the fit for the Poisson model, a horrible fit, and in green and in blue are the fits for the zero-inflated negative binomial and the negative binomial.

So here it seems like you do just as well with the negative binomial; why zero inflation? Well, when we look deeper at the data, the fit is much worse when you don't account for the zero inflation. This is a mean-difference plot of the estimated versus the observed mean count, for the negative binomial and for the zero-inflated negative binomial, and note the very different scales on the two plots. There's very poor fit with just the negative binomial, when you don't account for zero inflation. Here are similar plots of the estimated versus the observed zero probability: poor fit for the negative binomial, much better fit for the zero-inflated negative binomial. So it does matter whether or not you account for the zero inflation, and we're going to use the zero-inflated negative binomial model in our differential expression analysis.

On the left here is a mean-difference plot, on the log scale, of the estimated mean of the negative binomial component for layer 4 versus layer 5, and on the right is the mean-difference plot of the estimated zero inflation probability for layer 4 versus layer 5. I've indicated in magenta the genes that were differentially expressed in terms of the mean of the negative binomial, and in cyan the genes that were differentially expressed in terms of the zero inflation probability. If we go back to the model: if I didn't have the zero inflation probability, the gene counts would be just negative binomial, so when I looked for differences in expression between layers 4 and 5, I would test for the mean of the negative binomial. Now that I have the zero inflation probability, I can also ask which genes have different zero inflation probabilities in layer 4 and in layer 5. So I'm really looking at two types of differential expression: differential expression in terms of the mean of the negative binomial, and differential expression in terms of the zero inflation probability. That's what's represented in these plots. The genes that are differentially expressed in terms of the mean, in magenta, tend to span the whole spectrum of expression, and they tend to have low zero inflation probabilities. The genes that are differentially expressed in terms of the zero inflation probability can have high zero inflation probabilities, but they tend to have fairly low overall mean expression.

And that appears also when we look at heat maps. On the left here is a heat map for the top 100 genes that are differentially expressed in terms of the mean of the negative binomial. On the right is a heat map for the top 100 DE genes in terms of the zero inflation probability, and here we tend to pick up an on-or-off type of differential expression. Okay, here are your layer 4 and your layer 5 cells. This is heatmap.2, the heat map with the dendrograms on both sides; it's called heatmap.2, I forget which package it's in, it might be base R. Is it clustering on top of each class? Yes, on top of each class. Though I think at this point it's more displaying the genes than actually clustering them; that's the issue. We've kind of reached the limit here, because with a few hundred genes it's already ugly; you can't really see the tree too well. It's more representing them than actually clustering them. So I'm sorry for running very quickly through these; I'll try to sum up and make the main points clear in a moment.
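As a rough illustration of this kind of two-part test (not the project's actual implementation, which builds the normalization terms into the model), a per-gene zero-inflated negative binomial regression can be fit with the pscl package. The gene name and the layer factor below are illustrative:

```r
library(pscl)

# For one gene: counts y across cells, with layer as the covariate.
# Left of "|" models the NB mean, right of "|" the zero-inflation probability.
df  <- data.frame(y = norm_counts["Pcp4", ],
                  layer = cell_layers)  # hypothetical factor, "L4"/"L5" per cell
fit <- zeroinfl(y ~ layer | layer, data = df, dist = "negbin")

# Likelihood-ratio test against the null of no layer effect in either part.
fit0 <- zeroinfl(y ~ 1 | 1, data = df, dist = "negbin")
lrt  <- 2 * (as.numeric(logLik(fit)) - as.numeric(logLik(fit0)))
pval <- pchisq(lrt, df = 2, lower.tail = FALSE)
```

Testing the two parts separately, rather than jointly as in this sketch, is what distinguishes the mean-DE genes from the zero-inflation-DE genes in the plots above.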
So let's look at some of the genes that are differentially expressed in terms of their zero inflation probability. One of them is Cux1, a known marker for layer 4. In this plot on the left are the log counts for Cux1: in red are the layer 4 cells and in blue are the layer 5 cells. You can see that the layer 4 cells have pretty high expression of Cux1, while a lot of the layer 5 cells have zero counts. So this tends to pick up an on-or-off kind of differential expression, and this is a picture of the in situ hybridization showing expression of Cux1 in layer 4. And this is a gene that's differentially expressed in terms of the negative binomial mean only, the gene Pcp4, and it has a more continuous type of differential expression: it's more highly expressed in layer 5, but there's still a little bit of expression in layer 4, and here's the in situ for the gene.

Okay, so just a recap of this. The negative binomial models, which were really successful for bulk RNA-seq (that's RNA-seq where you sequence RNA from mixes of cells) don't work anymore for single-cell RNA-seq, because of the zero inflation. The zero-inflated negative binomial model works much better: it gives us more stable estimators of the dispersion parameter, and better fits for the mean and the zero inflation probability. The genes that are differentially expressed in terms of the zero inflation probability are quite different from the genes that are differentially expressed in terms of the negative binomial mean: they reflect binary differential expression between the cell types, on-or-off expression. By contrast, the genes that are differentially expressed in terms of the mean reflect more of a continuum of differential expression. And we can see the on-off nature of the genes that are differentially expressed in terms of the ZI probability when we look at the in situs.

Okay, so let me try to show you a little bit of the results we've had for cluster analysis. We started with a simple method for clustering, PAM, partitioning around medoids. We did dimensionality reduction using principal component analysis, and applied PAM to the Euclidean distance matrix for the first 20 principal components. These are bar plots of the average silhouette width; a larger silhouette means a better clustering. PAM's silhouette width suggests picking three clusters, and these are the individual silhouette widths for the three PAM clusters. When we look at the composition of the three PAM clusters, we get a first cluster that's mainly layer 4; the second and third clusters are mainly layer 5, but with quite a few layer 4 cells still in them. We were hoping to get homogeneous layer 4 and layer 5 clusters, and that's not quite the case. So we looked a little more deeply into these clusters, and it turns out that a lot of the layer 4 cells that end up in the layer 5 clusters could actually be layer 5 cells that express the layer 4 marker, or they could be layer 4 cells that have migrated to layer 5. So PAM is a nice, simple, intuitive method, but it has a lot of limitations for this problem. In particular, for finding more subtle distinctions within layers 4 and 5, it's problematic, because it can be sensitive to the choice of samples.
If we had done PAM on the entire set of cells, without having filtered out the glia, we would have only picked up the glia, and it would have been really hard to refine the clustering. PAM is also sensitive to the choice of normalization method, the number of principal components, and the distance function, and we have to pick k, the number of clusters. It turns out that in our application, and many other applications, the number of clusters k is really not of primary interest, so we'd like not to be forced into that choice. So instead we're exploring methods that we hope will lead to more stable and tighter clusters: resampling-based sequential clustering methods. They're similar in spirit to the tight clustering method developed about 10 years ago by Tseng and Wong, and they're also similar to bagged clustering methods.

Here are the goals we're trying to achieve with this approach. We want clustering that's robust to the choice of samples. With PAM, we had to successively prune out clusters to get to the finer structure: we had to get rid of the glia, and then we would also have had to first cluster into layers 4 and 5 and then zoom into each of layer 4 and layer 5. We want robustness to the choice of tuning parameters. We don't want to force all the samples into clusters: some samples may be outliers, so why cluster them? We'd like to be able to leave them out, which might also improve the quality of the remaining clusters. And we'd like not to have to focus on the number of clusters k.

So the method we're exploring is related to consensus clustering, the idea of aggregating multiple clusterings. Yes? Yes... no, not me, I'm focusing on the statistical analysis; that's going to be more on the biology side, so you might want to contact John Ngai. Okay, that would be great, thank you. So, consensus clustering: it's aggregating multiple clusterings, essentially an unsupervised version of the ensemble methods that you've seen in supervised learning. What we're doing, basically, and I'll just say this in a couple of minutes and then wrap up, is resample the data, apply the clustering method to a bunch of resampled versions of the data, and aggregate the results; we also apply the clustering method with different parameters and aggregate those results. That's the main idea, and it seems to lead to more stable, tighter, and more robust clusters. So I'm sorry not to have had a chance to run through these results, and for running over time, but I thank you very much for your attention and I'm happy to answer questions. Yeah, yeah, I'd love to talk more about the clustering.
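To illustrate the resampling idea in its simplest form, here is my own toy sketch, not the group's method: cluster many random subsamples of the cells, record how often each pair of cells lands in the same cluster, and then cluster that co-clustering (consensus) matrix.

```r
library(cluster)

# expr: cells-by-genes matrix (e.g., the first PCs of normalized data).
consensus_pam <- function(expr, k = 3, B = 100, frac = 0.8) {
  n   <- nrow(expr)
  co  <- matrix(0, n, n)   # times each pair co-clusters
  tog <- matrix(0, n, n)   # times each pair is sampled together
  for (b in seq_len(B)) {
    idx  <- sample(n, size = floor(frac * n))
    cl   <- pam(expr[idx, ], k = k, cluster.only = TRUE)
    same <- outer(cl, cl, "==")
    co[idx, idx]  <- co[idx, idx] + same
    tog[idx, idx] <- tog[idx, idx] + 1
  }
  co / pmax(tog, 1)  # proportion of co-clustering per pair
}

cons <- consensus_pam(expr)
# Cluster the consensus matrix; cells that rarely co-cluster with anything
# could be left unassigned rather than forced into a cluster.
hc <- hclust(as.dist(1 - cons), method = "average")
```

Pairs of cells with consensus proportions near 1 form stable, tight clusters across subsamples, which is exactly the robustness to sample choice and tuning parameters described above.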