An Introduction to scRNA-seq Data Analysis. Before diving into this slide deck, we recommend you have a look at the following questions: How are samples compared? How are cells captured? How does bulk RNA-seq differ from scRNA-seq? Why is clustering important? The aims are to understand the pitfalls in scRNA-seq sequencing and amplification and how they are overcome, to know the types of variation in an analysis and how to control for them, to grasp what dimension reduction is and how it might be performed, and to be familiar with the main types of clustering techniques and when to use them. Greetings everybody, and welcome to the Galaxy Single Cell RNA-seq Analysis Workshop. Here we will walk you through some of the basics and concepts of dealing with single-cell data. Let's start with the differences between bulk RNA-seq and single-cell RNA-seq data. With bulk RNA-seq, we compare two tissues by looking at the average expression of each gene detected across each of the tissues. Due to the number of RNA molecules being considered, the sequencing depth and the statistical strength of the analysis are reasonably high. Differential expression is then measured as the relative expression of a given gene between one tissue and another. With single-cell RNA-seq analysis, the focus shifts away from measuring the average expression of a tissue and towards measuring the specific gene expression of individual cells within those tissues. Here we are no longer comparing tissue against tissue, but cell against cell. Each cell is assigned a gene profile, which describes the relative abundance of genes detected within it. Many cells share the same gene profile, and a gene profile ideally describes a cell type. Sometimes we need to compare single-cell datasets across tissues, and we see that many cells across tissues share the same cell type. For example, look at the purple and green gene profiles, which are shared across both tissues.
New technologies mean new methods and techniques to harness the new features that come with them. Single-cell RNA-seq data requires different means of library preparation, sequencing, quality control and analysis. For example, how are cells captured and sequenced? In bulk RNA-seq analysis, the process involves taking a sample, removing unwanted molecules and sequencing everything else. For single-cell analysis, the process is much the same, except that each sample is a cell and must therefore be sequenced separately from other cells. Once isolated, unique barcodes are added to each cell, which is then sequenced. The level of resolution in single cell is at the cell level, and each cell is unique; therefore, the concept of biological replicates is not quite the same as in bulk RNA-seq. Cell isolation can be performed in different ways. One method is manual pipetting, where wet-lab scientists suction up individual cells using a long thin tube. They can do this hundreds of times to isolate hundreds of cells, but it is error-prone, and often multiple cells are isolated together. Another method is flow cytometry, which reduces the human-error component of this stage. Flow cytometry floats cells in a shallow liquid bath and streams them along a narrow channel, just narrow enough for one cell to pass through at a time. Cells can be screened by a variety of properties this way, such as their light-scatter properties, and with fluorescent labelling, cells can be tagged and isolated in this manner. Optical scatter properties can be used to probe the size and consistency of the cell, where cells smaller than the laser wavelength yield lower intensities and more inconsistent scatter patterns. There are two main types of optical scatter: forward scatter and side scatter.
Forward scatter is aligned with the main laser and measures the diameter of the cell, which is ideal for distinguishing different cells by their size profiles, for example monocytes, which are typically larger than lymphocytes, as seen on the x-axis of the example image. Side scatter is perpendicular to the main laser and measures the granularity of the cell, ideal for distinguishing cells with defined internal structures, such as the granulocytes on the y-axis of the example image. Cells can also be gated and characterized by their cell-surface markers via FACS (fluorescence-activated cell sorting). By plotting different surface-marker intensities against one another, cells can be separated, gated, and labelled based on these fluorescent properties. Once isolated, cells can be barcoded. Barcodes are unique sequences that are added to each RNA molecule. They are not unique to the molecule, but unique to the cell, such that any two RNA molecules will be tagged by the same cell barcode if they exist in the same cell. RNA molecules from different cells will have different cell barcodes. Once the RNA molecules have been tagged by cell barcodes, they can be amplified, either separately or pooled together, where the amplified products share the same cell barcodes as their original counterparts. PCR amplifies the gene products to make them more easily detectable during sequencing. When there is a lot of gene product to amplify, as is the case for bulk RNA-seq, PCR works quite well, amplifying all products in a reasonably well-represented manner. However, in the case of single-cell products, the amount to amplify is very small, and many unique reads might be missed during this phase while others may be over-amplified, as shown by the blue and red transcripts in the example. To guard against this type of amplification bias, we can add a random element to the barcoding.
These random barcodes, known as UMIs (unique molecular identifiers), uniquely tag transcripts such that any two transcripts of the same gene are likely to have different random barcodes. Let us consider the example to the left. We have two red transcripts and two blue transcripts inside the cell, which after amplification equate to six red transcripts and three blue transcripts. If we were to compare the differential gene expression between the red and blue transcripts just by looking at the amplified reads, we would come to the false conclusion that the red transcripts are expressed twice as much as the blue. However, if we group the reads by their UMIs and then count only the number of unique UMIs per transcript, deduplicating the reads which share the same transcript and UMI, we arrive at two red reads and two blue reads, which represents the true number of transcripts. UMIs are relatively random, but not truly random. Notice that the pink UMI appears twice, once in a blue transcript and once in a red transcript. This is because there are often more transcripts than available UMIs, which depends both on the number of transcripts in a cell and on the length of the barcode. Consider a set of barcodes of length 5 with an edit distance of 1 between adjacent barcodes, and another set with an edit distance of 2. The former is not robust against common sequencing errors of one base pair, but the latter only allows for half the number of barcodes. This trade-off between the number of available barcodes and guarding against sequencing errors is instrumental in the design of cell barcodes and UMIs. In the context of amplification, UMIs do not need to be truly unique, only random enough to deduplicate transcripts in order to give a more accurate estimate of the number of transcripts within a cell. So let's just recap what we've learned. First, cell barcodes are added to each RNA molecule in each cell. Then we add random UMIs to all transcripts, which further tag the molecules.
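The UMI deduplication just described can be sketched in a few lines of Python. This is a minimal illustration, not a real pipeline: reads are assumed to be (cell barcode, gene, UMI) triples, and the function name is made up for this sketch.

```python
# Deduplicate amplified reads: PCR duplicates of the same original
# transcript share the same (cell barcode, gene, UMI) triple, so each
# unique triple counts as one transcript.

def count_transcripts(reads):
    """reads: iterable of (cell_barcode, gene, umi) tuples.
    Returns a dict mapping (cell, gene) to the number of unique UMIs."""
    unique = set(reads)  # collapses reads sharing the same cell, gene and UMI
    counts = {}
    for cell, gene, _umi in unique:
        counts[(cell, gene)] = counts.get((cell, gene), 0) + 1
    return counts

# As in the slide's example: the red gene was amplified to 6 reads and the
# blue gene to 3, but both started from 2 transcripts (2 distinct UMIs each).
reads = [
    ("cellA", "red", "u1"), ("cellA", "red", "u1"), ("cellA", "red", "u1"),
    ("cellA", "red", "u2"), ("cellA", "red", "u2"), ("cellA", "red", "u2"),
    ("cellA", "blue", "u3"), ("cellA", "blue", "u3"),
    ("cellA", "blue", "u4"),
]
counts = count_transcripts(reads)  # both genes come out to 2 transcripts
```

Counting unique UMIs rather than raw reads is what removes the two-fold amplification bias in the example.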
These can then be used to deduplicate the transcripts after amplification. After amplification, we need to perform some quality control. One way to do this is to set thresholds on the limits of detectability for genes and for cells. Consider an analysis governed by only three genes, G1, G2 and G3, and five cells, A, B, C, D and E. The first row of the top table defines the library size, which is the total number of messenger RNAs across all genes in each cell. The subsequent rows are the thresholds of gene detectability, displaying how many genes are detected in each cell at thresholds from greater than zero to greater than four transcripts. We see that even a threshold of greater than three transcripts detected in a given cell still keeps three cells in the analysis: B, C, and E. In the lower table, the opposite is represented, with the total number of transcripts across all cells for each gene. By setting thresholds of detectability, we can see how many cells are described by each gene at that threshold. In both cases, we can see that if we set the thresholds too low, then we risk keeping low-quality genes or cells, but if we set the thresholds of detectability too high, then we risk losing too many. Filtering can be a luxury, however, as many single-cell RNA-seq datasets typically have low sequencing depth compared to bulk RNA-seq. During the process of normalization, samples are scaled against one another to make them more comparable. This is normally performed using median values. For example, in DESeq normalization, the geometric mean of each gene's counts across all cells is taken, each gene count in a cell is divided by that gene's geometric mean, and the cell's size factor is the median of these ratios. If median gene expression is high, then this normalization method works quite well. But if the median gene expression is zero, as is often the case with single-cell data, then we have the problem of dividing by zero. There are methods to get around these zero counts.
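The median-of-ratios normalization just described can be sketched with numpy. This is a simplified illustration of the DESeq-style size factor, not the DESeq implementation itself; the function name is made up for this sketch.

```python
import numpy as np

def deseq_size_factors(counts):
    """counts: genes x cells matrix. Returns one size factor per cell.
    Genes with a zero count in any cell are excluded, since their
    geometric mean (and hence the ratio) would be zero."""
    counts = np.asarray(counts, dtype=float)
    nonzero = (counts > 0).all(axis=1)       # keep genes detected in every cell
    log_counts = np.log(counts[nonzero])
    log_geo_means = log_counts.mean(axis=1)  # per-gene geometric mean, in log space
    # ratio of each gene count to its gene's geometric mean, per cell
    log_ratios = log_counts - log_geo_means[:, None]
    # size factor = median ratio per cell
    return np.exp(np.median(log_ratios, axis=0))

# toy example: the second cell is sequenced exactly twice as deeply
counts = np.array([[10.0, 20.0],
                   [30.0, 60.0],
                   [50.0, 100.0]])
sf = deseq_size_factors(counts)  # size factors in a 1:2 ratio
```

Note how the `nonzero` filter is exactly where single-cell data breaks this method: with many dropout zeros, few or no genes survive the filter, which motivates the pooling approach described next.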
One such method is the scran method, which works by creating overlapping pools of cells such that any individual cell is characterized by cells of similar library sizes. The method involves splitting all cells into an odd and an even group by their library size and arranging them on a ring structure, where neighbouring cells on the ring have similar library sizes. Overlapping pools of fixed sizes are defined, resulting in each cell being covered by multiple pools. A linear model for each cell can then be built from the pools it occurs within, and normalization factors for all cells can be determined this way. By this method, the issue of low sequencing coverage is worked around by turning cells with low library sizes into useful components of a size factor that can be applied to similar cells. Such novel normalization methods were commonplace a few years ago, but as sequencing technologies have improved, the issue of many zero counts in a matrix has become less important, and normalization size factors can be derived using bulk RNA-seq methods once again. Other factors that we need to take into account during a single-cell RNA analysis are the unwanted factors that can confound the analysis. Ideally, we wish the gene profiles of different types of cells to be driven by biological variance. There is, however, confounding variation from both technical and biological sources that is not useful to the analysis but does contribute to the variance. Confounding biological variation appears in two forms: transcriptional bursting and cell-cycle variation. Transcriptional bursting is a phenomenon in which transcription occurs in discrete states of active and inactive, where the interval between these states is hard to model. In bulk RNA-seq, this phenomenon is unnoticeable, as the effects are averaged out over many cells.
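Returning briefly to the pooling normalization above, the core idea can be sketched in numpy. This is a deliberately simplified illustration in the spirit of scran, not the actual scran algorithm: it omits the odd/even split and other refinements, and the function name is made up for this sketch.

```python
import numpy as np

def pooled_size_factors(counts, pool_size=3):
    """counts: genes x cells. Approximate per-cell size factors via
    overlapping pools of cells with similar library sizes."""
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[1]
    order = np.argsort(counts.sum(axis=0))   # ring ordered by library size
    pseudo = counts.mean(axis=1)             # average "pseudo-cell" reference
    rows, pool_factors = [], []
    for start in range(n_cells):             # one overlapping pool per position
        pool = [order[(start + k) % n_cells] for k in range(pool_size)]
        pooled = counts[:, pool].sum(axis=1)
        ratios = pooled[pseudo > 0] / pseudo[pseudo > 0]
        pool_factors.append(np.median(ratios))  # pool-level size factor
        row = np.zeros(n_cells)
        row[pool] = 1.0    # a pool's factor is roughly the sum of its cells'
        rows.append(row)
    # linear model: solve for the per-cell factors from the pool factors
    A, b = np.array(rows), np.array(pool_factors)
    factors, *_ = np.linalg.lstsq(A, b, rcond=None)
    return factors / factors.mean()          # scale to a mean of 1
```

Summing cells into pools makes the pooled counts large enough for the median ratio to be well defined even when individual cells are full of zeros; the least-squares step then distributes the pool factors back to the individual cells.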
But in single cell, two cells of the same type may exhibit different gene profiles simply because one cell was actively transcribing and the other was not. This is not something we can control for in the analysis, but it is something we should be aware of when understanding why cell clusters can be noisy. Cell-cycle variation, on the other hand, is a much better understood process: the amount of RNA in a cell in the M phase is approximately double that of a cell of the same type in the early G1 phase. There are genes which are known to co-vary with the cell cycle, and so by regressing out the effect of these genes, we can control for the cell cycle. Confounding technical variation appears in three forms: amplification bias, dropout events, and library-size variation. Amplification bias can be mitigated by UMIs, as demonstrated before. Dropout events give rise to the prevalent zeros in the count matrices, and their effect can be reduced by using clever normalization techniques such as the pooling method shown previously, as well as by using better sequencing methods. Library-size variation arises for a variety of different reasons, but it is the main source of variation within an analysis. As in bulk RNA-seq, it is reduced with good normalization methods. Once we have removed unwanted confounders from the analysis, we face the issue of quantifying the relationships between cells. From a data-analysis standpoint, we treat each cell as an observation and each gene as a variable. For large genomes this means extremely high-dimensional datasets. Cells exist as points that sparsely populate this extremely high-dimensional space, making it difficult to see the natural groupings. The dimensionality can be reduced a lot by simply filtering out genes that do not appear to be differentially expressed across cells.
To find the relationships between these cells, however, we need to define the distances between them. A distance matrix does just this, defining the distance between any two cells as a single score. We can use the Euclidean distance on a three-dimensional dataset of three genes, G1, G2 and G3, and three cells, R, P and V. The distance between any two cells is calculated as the square root of the sum of squared differences in their gene values. Note how the distance matrix is symmetric along the diagonal, confirming that, for example, the distance from cell R to V is the same as the distance from V to R, as expected. Once a distance matrix is generated, we can perform k-nearest neighbours upon it, where directed edges are generated between cells. For each row of the distance matrix, the k cells with the smallest distance values are selected, representing the nearest neighbours that the current row's cell has among the column cells. If the edges are mutually shared between neighbouring cells, then this is called a shared-nearest-neighbour approach. We can represent this three-dimensional space easily as three independent axes with points that denote the cells, but extrapolating this relatively low-dimensional example to a real dataset with thousands of dimensions is beyond human visualization. Dimension reduction is a class of techniques that take a high-dimensional dataset and produce a low-dimensional representation, usually two-dimensional, that tries to preserve the distances between the data points. Here, the relative differences between cells are retained in both the high- and low-dimensional representations. There are many different kinds of dimension-reduction techniques, each with their own strengths and weaknesses depending on the type and the dimensionality of the data. Once the number of variables in the dataset has been sufficiently reduced via filtering and dimension reduction, clustering can be performed more easily.
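The distance matrix and neighbour search described above can be sketched with numpy. The three cells R, P and V match the slide's setup, but the expression values and the helper name are illustrative, not taken from the slides.

```python
import numpy as np

# three cells, three genes (G1, G2, G3); values are made up for illustration
cells = {"R": [1.0, 4.0, 2.0],
         "P": [2.0, 1.0, 3.0],
         "V": [5.0, 4.0, 1.0]}
names = list(cells)
X = np.array([cells[n] for n in names])

# Euclidean distance: square root of the sum of squared differences
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))   # symmetric 3x3 distance matrix

def k_nearest(D, k=1):
    """For each row, the indices of the k cells with the smallest
    distance, excluding the cell itself (distance 0 on the diagonal)."""
    neighbours = {}
    for i in range(D.shape[0]):
        order = np.argsort(D[i])            # closest first
        neighbours[i] = [j for j in order if j != i][:k]
    return neighbours

nn = k_nearest(D, k=1)   # directed edges: each cell points to its neighbour
```

With these toy values, R and P point to each other, so that edge would survive in a shared-nearest-neighbour graph, while V's edge to R is not reciprocated.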
Here in this 2D projection, each circle is a cell, and the unique colors depict the clusters they are assigned to. The physical distances between the groups of colored cells tell us how good the clustering is for this projection. By inspecting the top differentially expressed genes in each cluster against all other clusters, we can find clues to the type of cell that the cluster describes. Cell types are often characterized by the expression of specific marker genes, and the presence of these genes is a strong indicator of type. Marker-gene discovery can then be used to identify the clusters. We can also further derive the relationships between these clusters by computing lineage trees based on the amount of noise in each cluster, with the expectation that stem cells have noisy expression profiles yielding broader clusters, while mature cells have very clear expression profiles yielding tighter clusters. The type of clustering you are likely to encounter in an analysis depends on the input datasets: cells taken from late-stage samples are less likely to be bunched together and are more likely to yield large visible gaps, known as hard clusters, that clearly define different cell types. Earlier-stage datasets are more likely to yield softer clusters, where neighbouring clusters share soft boundaries and intermingle slightly with one another. Soft clustering is to be expected: although clustering is a statistical method for discretely partitioning data, the underlying cell biology is a continuous process, where cells transition from one well-defined state to another through intermediate stages, which are represented in between two cluster centres. Because of the continuous nature of these single-cell datasets and the extremely high dimensionality of the data, discrete partitioning is often a poor model for the data.
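As a brief aside, the marker-gene inspection mentioned above can be sketched by scoring each gene's mean expression in a cluster against its mean in all other cells. This is a bare-bones illustration (a log fold-change ranking with no statistics); the function name and toy data are made up for this sketch.

```python
import numpy as np

def top_markers(X, labels, cluster, n=2, eps=1e-9):
    """X: cells x genes matrix; labels: cluster id per cell.
    Returns indices of the n genes with the highest log2 fold change
    between this cluster and all remaining cells."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    inside = X[labels == cluster].mean(axis=0)    # mean within the cluster
    outside = X[labels != cluster].mean(axis=0)   # mean everywhere else
    lfc = np.log2((inside + eps) / (outside + eps))  # eps avoids log(0)
    return list(np.argsort(lfc)[::-1][:n])        # best-scoring genes first

# toy data: gene 0 marks cluster "a", gene 2 marks cluster "b"
X = [[9, 1, 0],
     [8, 2, 1],
     [1, 1, 7],
     [0, 2, 9]]
labels = ["a", "a", "b", "b"]
markers_a = top_markers(X, labels, "a", n=1)   # picks out gene 0
```

If a top-ranked gene is a known marker (say, a gene known to mark monocytes), that is the clue used to label the cluster with a cell type.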
If we instead assume that cell clusters are related to one another via transitional cells, which would naturally lie in between clusters, then manifold-learning techniques are better suited. These techniques derive an expression landscape that can not only be used to relate clusters to one another, but also to infer lineage and hierarchy. To actually perform the clustering, there are three commonly used methods: k-means, hierarchical, and community clustering. K-means and k-medians follow the same method. The number of clusters is defined beforehand, and cluster positions are initialized at random. The positions are then updated by the contribution of the cells that are closer to them than to any other position. This process occurs multiple times, until the positions no longer change significantly or until a set number of iterations has been reached. The final assignment of each cell then becomes the cluster assignment. Hierarchical clustering is more flexible and does not need an initial parameter to define the number of resulting clusters. Here, the two closest points in a distance matrix are joined into a single group, distances are recalculated, and the two closest points are once again joined. This process repeats until all the data has been consumed into one group. By tracing the process backwards, a hierarchy can be established, represented by a dendrogram. Louvain clustering is a widely used type of community clustering for single-cell data. Here, each cell is assigned a neighbourhood of its own, and the number of internal and external links between neighbourhoods is counted. For each iteration, a random cell is selected and brought within the neighbourhood of another cell, and the internal and external links are once again counted. If the new configuration has reduced the number of external links in favour of more internal links, then the configuration is kept.
If the new configuration has instead increased the number of external links, then the configuration is rejected, and another cell is picked and tested. By performing this multiple times, a community structure of cells is built to whichever degree of specificity the user desires. Single-cell analysis is non-trivial, and each stage, from the filtering to the normalization to the dimension reduction and the clustering, can drastically affect the outcome of the analysis. Due to the variability in the analysis, one should not panic when faced with uncertainty. The goal is to play around with the data until it begins to reflect the biology. This can take many, many tries to achieve, and it may never be perfect, but the idea is to try as many different approaches as possible to see what robust conclusions you can come to. In this regard, the vast Galaxy resources can be put to good use by testing out the many different paths of the analysis, and the Galaxy Training Network provides tutorials and hands-on trainings to assist you. Please explore them to better develop your understanding. To recap: scRNA-seq requires much pre-processing before analysis can be performed; groups of similarly profiled cells are compared against other groups; detectability issues require careful consideration at all stages; and clustering is an integral part of an analysis. Thank you for watching.
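As a closing aside, the k-means loop described earlier can be sketched in a few lines of numpy. This is a minimal illustration of the algorithm, not a production clustering tool; the function name and toy data are made up for this sketch.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """X: cells x features. Returns (cluster assignment per cell, centroids)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # initialize cluster positions at k randomly chosen cells
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each cell to its nearest cluster position
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        # move each position to the mean of the cells assigned to it
        new = np.array([X[assign == j].mean(axis=0) if (assign == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):   # converged: positions stopped moving
            break
        centroids = new
    return assign, centroids

# two obvious groups of three cells each, in two dimensions
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assign, cents = kmeans(X, k=2)   # the two groups land in separate clusters
```

Note how k (the number of clusters) must be chosen up front, which is exactly the limitation that hierarchical and Louvain clustering avoid.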