Single-cell data sets can suffer from a lot of technical variability. Each cell generally captures a different number of reads, so some cells will have too weak a signal to be useful. Additionally, genes range from ever-present housekeeping genes to specialized genes that are expressed only in particular cell types or under certain conditions. Using normalization and filtering techniques, we can preprocess the data to make it suitable for downstream analysis.

In this video, we'll analyze a mouse embryonic stem cell data set, which contains single-cell sequencing data for different cell cycle phases. Our data contains a very high number of genes. How many of those are actually useful? We can use the Filter widget and retain genes that have been detected in at least 20 and at most 170 cells. This way, we discard the housekeeping genes and the genes that are hardly ever detected. We retained about 12,000 genes, reducing our data set to a quarter of the original size.

For each cell, our data still reports the expression strength for each gene. Typically, the total gene expression differs from cell to cell. Let's visualize this with another Filter widget. The Filter widget can also count the number of expressed genes per cell or, vice versa, the number of cells expressing each gene. Using the log scale for the count axis, we see that the cells express genes at substantially different rates. If we were to process the data in this form, some cells would have more say in the analysis than others.

To solve this problem, we'll use preprocessing. We'll do the preprocessing in the Single Cell Preprocess widget, which we will connect to the output of the gene filtering widget. Let's also rename the latter for convenience and interpretability of the workflow. In the Single Cell Preprocess widget, we can specify an ordered list of steps for data preprocessing and transformation.
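
For those who prefer to see the filtering step as code, here is a minimal NumPy sketch of the same idea. It is not how the widget is implemented. It assumes a cells-by-genes count matrix in which a gene counts as "detected" in a cell when its value is above zero, and the function names and toy data are made up for illustration.

```python
import numpy as np

def filter_genes_by_cell_count(counts, min_cells=20, max_cells=170):
    """Keep genes detected in at least min_cells and at most max_cells cells.

    counts: 2-D array, rows are cells, columns are genes (assumed layout).
    Returns the filtered matrix and the boolean mask of kept genes.
    """
    # A gene is treated as detected in a cell if its count is above zero.
    cells_per_gene = (counts > 0).sum(axis=0)
    keep = (cells_per_gene >= min_cells) & (cells_per_gene <= max_cells)
    return counts[:, keep], keep

def expressed_genes_per_cell(counts):
    """Count how many genes have a non-zero value in each cell."""
    return (counts > 0).sum(axis=1)

# Toy data, generated only for illustration: 200 cells, 300 genes.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(lam=0.3, size=(200, 300))

filtered, kept_mask = filter_genes_by_cell_count(toy_counts)
print("genes kept:", int(kept_mask.sum()), "of", toy_counts.shape[1])
print("expressed genes per cell (first 5 cells):",
      expressed_genes_per_cell(filtered)[:5])
```

The upper bound is what removes genes found in nearly every cell, and the lower bound removes genes that are hardly ever detected.
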
By default, the widget shows some standard preprocessing steps, which we'll remove to start fresh. Typically, we would start by normalizing the gene expression of each cell. In other words, the gene expressions will sum to the same number for each cell.

So far, we haven't done anything about the genes themselves. Let's check the distribution of expression values for each gene. We will use Orange's Distributions widget. The distribution of expression values varies wildly for most genes. Take the gene Gpr107, for example: its distribution has a very long tail. Other genes, like Pih1d2, show little variance and will have a negligible effect on the analysis. We can filter out low-variance genes by adding another preprocessing step. Let's keep only the 1,000 most variable genes, computing the statistics on the entire set of genes. We can also solve the long-tail problem by log-scaling the expression values. Looking at the data after the preprocessing, we can see that gene expressions are now more evenly distributed.

To wrap up, let's look at our data in a t-SNE plot. We can see that the cells separate based on cell cycle stage. Now this may or may not be desirable, so stay tuned for our next episodes, where we will discuss the infamous batch effects.
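
The three preprocessing steps described above (normalizing cells, keeping the most variable genes, and log-scaling) can be sketched in NumPy roughly as follows. This is an illustration under stated assumptions, not the widget's actual code: here each cell is scaled to the median total, genes are ranked by plain variance, and the log transform is log(1 + x). The widget may use different statistics, and all function names are hypothetical.

```python
import numpy as np

def normalize_cells(counts, target=None):
    """Scale each cell (row) so its expression values sum to the same total.

    If target is None, the median of the per-cell totals is used (an assumption).
    """
    totals = counts.sum(axis=1, keepdims=True)
    if target is None:
        target = np.median(totals)
    # Avoid division by zero for cells with no reads at all.
    safe_totals = np.where(totals == 0, 1, totals)
    return counts / safe_totals * target

def select_most_variable(data, n_genes=1000):
    """Keep the n_genes columns with the highest variance across all cells."""
    variances = data.var(axis=0)
    n = min(n_genes, data.shape[1])
    # Indices of the top-n genes, returned in their original column order.
    top = np.sort(np.argsort(variances)[::-1][:n])
    return data[:, top], top

def log_scale(data):
    """Compress long-tailed distributions with log(1 + x)."""
    return np.log1p(data)

# Toy data, generated only for illustration: 200 cells, 3000 genes.
rng = np.random.default_rng(1)
toy_counts = rng.poisson(lam=rng.gamma(1.0, 2.0, size=3000), size=(200, 3000))

normalized = normalize_cells(toy_counts)
variable, kept_genes = select_most_variable(normalized, n_genes=1000)
processed = log_scale(variable)
print("processed shape:", processed.shape)
```

The order mirrors the ordered list of steps in the widget: normalize first, so that the variance is computed on comparable cells, then select genes, then log-scale.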