Hello, my name is Pawan. I'm from the Freiburg Galaxy team. In this video, I will show you how to preprocess single-cell RNA sequencing data from the 10x Genomics platform. The training material for this tutorial can be accessed from the GTN website or directly within Galaxy itself. I prefer to access it within Galaxy so that I can always choose the correct versions of the tools from the tutorial. First, go to the Single Cell topic and then scroll down to find the tutorial named "Pre-processing of 10X Single-Cell RNA Datasets". Open the tutorial. In it you are going to learn how to map 10x data and quantify genes, and you will also learn how to filter out noisy data to get a high-quality count matrix.
The data that we are going to use in this tutorial is of course from 10x Genomics and contains 1,000 PBMCs extracted from a healthy donor. Data from 10x usually comes in bundles of three FASTQ files, where the first two are the mate pairs, read 1 and read 2, and there is an additional index file that is used for multiplexing the data. We are not going to use this third FASTQ file in this tutorial, so we need only the forward and reverse reads for this analysis.
So let's first get the data into Galaxy: copy the Zenodo links here, and then we will paste them into Galaxy. Before that, create a history and rename it. Now click on Upload Data, then Paste/Fetch data, and paste the links here. Click on Start. We also need two more files for this analysis. The first one is a GTF file that contains the gene annotations, including information about the splice junctions, and the second one is a cell barcode whitelist file, which contains all possible barcodes that were used during the library preparation. So let's copy the links to these two files, paste them here as well, and start uploading the data. The data is now uploaded. Let's first look at the data before jumping into the analysis.
So here we have four FASTQ files from the same sample, sequenced on two different lanes. If you look at the forward reads of one of the lanes, they have really short sequences, because the forward read contains only the cell barcode and the UMI. The first 16 bases here represent the cell barcode and the remaining 12 bases are the UMI. In the reverse reads, we have much longer sequences. In this case they are 91 bases long and represent the cDNA, that is, the mRNA. The lengths of read 1 and read 2 also depend on the Chromium chemistry version that was used during the library preparation. The sample that we are going to analyze today is from Chromium v3 chemistry, hence we have a first read of 28 bases and a second read of 91 bases. If the samples are from Chromium v2 chemistry, then we have a slightly longer cDNA read, but read 1 is shorter because the UMI is shorter.
Here we also have two more files. The first one is a gene annotation file in GTF format. It contains the positions of the genes on the reference genome, as well as information such as the gene type, gene IDs, gene names, etc. Then we have the cell barcodes in this last file here: this is a whitelist of all barcodes that are used in Chromium v3 chemistry during the library preparation. Here we have about 6.8 million lines, and each of these lines contains a barcode 16 bases long. In the end, our FASTQ files will contain only about 1,000 of these barcodes, because our sample contains only 1,000 cells.
Now let's begin with the analysis part. The first step that we are going to perform is mapping, demultiplexing and quantification. For this we will use a tool called RNA STARsolo. To run this tool we need to choose a reference genome. In this case we will use the human hg19 genome assembly and its corresponding GTF file.
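As a brief aside on the read layout described above, here is a minimal Python sketch of how read 1 splits into a 16-base cell barcode and a 12-base UMI, and how barcodes can be checked against a whitelist by exact match. The function names and the example sequences are made up for illustration; this is not how RNA STARsolo is implemented, only a way to picture what the two parts of read 1 mean.

```python
# Read 1 layout as described for Chromium v3 chemistry:
# 16 bases of cell barcode followed by 12 bases of UMI (28 bases in total).
BARCODE_LEN = 16
UMI_LEN = 12


def split_read1(sequence):
    """Split a read 1 sequence into (cell_barcode, umi)."""
    if len(sequence) < BARCODE_LEN + UMI_LEN:
        raise ValueError("read 1 is shorter than barcode + UMI")
    barcode = sequence[:BARCODE_LEN]
    umi = sequence[BARCODE_LEN:BARCODE_LEN + UMI_LEN]
    return barcode, umi


def count_whitelisted(read1_sequences, whitelist):
    """Count reads per barcode, keeping only exact whitelist matches."""
    counts = {}
    for sequence in read1_sequences:
        barcode, _umi = split_read1(sequence)
        if barcode in whitelist:
            counts[barcode] = counts.get(barcode, 0) + 1
    return counts


# Hypothetical sequences, not taken from the tutorial data.
whitelist = {"AAACCCAAGAAACACT", "AAACCCAAGAAACCAT"}
reads = [
    "AAACCCAAGAAACACT" + "ACGTACGTACGT",
    "AAACCCAAGAAACACT" + "TTTTGGGGCCCC",
    "G" * BARCODE_LEN + "ACGTACGTACGT",  # barcode not in the whitelist
]

print(split_read1(reads[0]))
print(count_whitelisted(reads, whitelist))
```

In the real whitelist there are about 6.8 million entries, so a set lookup like the one above is a reasonable way to picture the exact-match check.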
We will also need the reads and the barcode whitelist file. We have all the files in our history, so let's run the tool. First we have to select the reference genome here; its corresponding GTF file is already selected there. Then we need to choose the input type as separate barcode and cDNA reads. As we already know, the barcodes are in the forward read, so let's select only the R1 files here, and the cDNA is in R2, so let's select only the FASTQ files with R2. Here, select the barcode whitelist file. As I already mentioned, the library was prepared using Chromium v3 chemistry, so select the corresponding option. Now we will use the CellRanger 2-4 algorithm, because we are trying to mimic the Cell Ranger analysis here. We also use the Cell Ranger options for UMI filtering as well as for matching cell barcodes to the whitelist. For counting the UMIs we will use the Gene feature, so in this case we count only the reads that map to the exons of the genes. If you have single-nucleus data, you would have to select one of the options that start with GeneFull. In that case you count all the reads that map to exons as well as introns. We are not going to filter any cells at this point, so we set the filtering to "Do not filter". In the later steps of the analysis we will use different tools and methods for filtering. For now, run the tool.
Finally, after waiting for a while, we have the mapping results ready. STARsolo produces six different output files. We have a log file with mapping statistics. There is a genes raw file that lists all the gene IDs and gene symbols from the input GTF file. There is also a barcodes raw file, which contains all the barcodes from the whitelist. There is a count matrix. There is also a BAM file with all the alignments, and finally we have summary statistics on barcodes and genes. To visualize the mapping quality we could use the MultiQC tool, but here I'm not going to use it because we have only one sample.
It's probably easier to look directly at the log file to check the mapping quality. Here we see that about 87.5% of the reads are uniquely mapped, which is a good indication that our mapping was successful. Now we will also look at this final file with summary statistics on barcodes and genes. It contains two sections, namely barcodes and genes. In the first section we have to check this row here, which gives the number of reads that have an exact match to the barcode whitelist. In this section this number should be the dominant one. Then we can move to the next section. In the genes section we have to check this number here, which gives the number of reads that match the barcode whitelist and also map uniquely to one of the genes. In this section, too, this number should be the dominant one. There is another number that is interesting for us, which is this row here: it gives the number of cell barcodes detected by STARsolo.
So you might wonder why 5,200 barcodes were detected if we started with 1,000 cells. These are all the barcodes that STARsolo can detect, and they include noisy cells, barcodes carrying only ambient RNA, and so on. So we have to filter out all these noisy or lowly expressed barcodes, such as those with ambient RNA, to generate a high-quality count matrix that can be used for clustering and further downstream analysis.
Now let's proceed to the filtering of the cells. For filtering out the noisy data we will use a tool called DropletUtils. Here we will try out two different methods for filtering. The first is the Cell Ranger method, which filters out low-quality cells based on the expected number of cells, and the other is the EmptyDrops method, which selects the high-quality cells based on the UMI counts per cell.
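To make the difference between the two filtering ideas more concrete, here is a small Python sketch under stated assumptions. The first function is loosely modelled on the Cell Ranger style rule as documented for DropletUtils: take the top `expected` barcodes by total UMI count, compute an upper quantile of their totals, and keep every barcode above a fixed proportion of that value. The second function is only a plain lower-bound cutoff on total UMI counts; the real EmptyDrops method additionally runs a statistical test against an ambient RNA profile, which is skipped here. The function names, parameter names and toy numbers are all hypothetical.

```python
def quantile(values, q):
    """Linearly interpolated quantile (0 <= q <= 1) of a non-empty list."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("quantile() needs at least one value")
    position = q * (len(ordered) - 1)
    lower_index = int(position)
    upper_index = min(lower_index + 1, len(ordered) - 1)
    fraction = position - lower_index
    return ordered[lower_index] * (1 - fraction) + ordered[upper_index] * fraction


def expected_cells_filter(totals, expected, upper_quant=0.99, lower_prop=0.1):
    """Keep barcodes whose total UMI count exceeds lower_prop times the
    upper_quant quantile of the top `expected` barcodes (simplified rule)."""
    top = sorted(totals.values(), reverse=True)[:expected]
    threshold = lower_prop * quantile(top, upper_quant)
    return {barcode for barcode, total in totals.items() if total > threshold}


def lower_bound_filter(totals, lower):
    """Keep barcodes with more than `lower` total UMIs.
    Simplification: no ambient RNA test is performed here."""
    return {barcode for barcode, total in totals.items() if total > lower}


# Toy total UMI counts per barcode, invented for illustration.
totals = {
    "cell_a": 5000,
    "cell_b": 4000,
    "cell_c": 3000,
    "weak_d": 300,
    "empty_e": 40,
    "empty_f": 5,
}

print(sorted(expected_cells_filter(totals, expected=3)))
print(sorted(lower_bound_filter(totals, lower=200)))
```

With these toy numbers the first rule keeps only the three largest barcodes, while a lower bound of 200 also keeps `weak_d`. The first rule depends on knowing roughly how many cells to expect; the second depends on choosing a sensible UMI cutoff.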
Let's first try the Cell Ranger method. Open the DropletUtils tool and choose the format as Bundled. Here we have to choose the count matrix, here the genes list, and here the barcodes list. We filter for barcodes, use DefaultDrops, which is the Cell Ranger method, and leave the numbers at their defaults. We will again produce a bundled output. Run the tool.
The results from the Cell Ranger filtering method are now ready. If we look at the output file, this method found 272 high-quality cells. If you remember, STARsolo initially reported about 5,200 non-empty barcodes, so we ended up with only a small fraction of these 5,200 as valid, high-quality cells.
Now let's proceed to the next method of filtering, which is EmptyDrops filtering. For EmptyDrops filtering we have to find a cutoff on the minimum number of UMIs per cell. To find this cutoff we first have to plot the distribution of the number of UMIs per cell. For this we will again use the DropletUtils tool, and again select the bundled input. Now select the initial results from STARsolo, not the output from DropletUtils: again the matrix, the genes and the barcodes. This time the operation is not filtering but ranking, so choose Rank Barcodes and run the tool. This will produce a plot.
Now let's look at this plot. This kind of plot is called a barcode rank plot. Each circle in the plot represents a barcode, that is, a droplet. On the x-axis we have the barcode ranks, which are computed from the UMI counts, and on the y-axis we have the total UMI counts per cell. We also have two horizontal lines representing the knee and inflection points. The knee and inflection points indicate the transition from high-quality cells to low-quality or even empty droplets. In this case we have a knee at 4861.
So all the cells with more than 4860 total UMI counts can be considered high-quality cells in this sample, and all the barcodes towards the bottom of this plot are most likely empty droplets. We can use the total UMI count at the inflection line as a cutoff for filtering low-quality cells from our dataset. In this case it's 260. We can also estimate the number of cells that will remain after filtering based on this plot. For example, if we choose 260 as the threshold for the minimum UMI count, then we drop a line from this intersection onto the x-axis, which lands here, close to 100. That means we will end up with somewhere between 200 and 500 cells if we filter with this threshold.
Now let's proceed to the actual filtering. We will again use the DropletUtils tool. Again select the bundled input. Be careful here: again select the output of STARsolo, not of DropletUtils. So select the matrix, the genes and the barcodes from STARsolo. We filter for barcodes again, but the method we use now is EmptyDrops. Here we could put a lower-bound threshold of, for example, 260 from the inflection line, but I will go for a slightly more lenient cutoff of 200 in this case. We will also output bundled data.
Now let's find out how many high-quality cells we have after EmptyDrops filtering. Click on any of the output datasets to expand it and scroll down. Here we see there are 282 cells, which is 10 cells more than with the Cell Ranger method.
If you already know how many cells to expect from your sample, or if you prepared the library yourself, then you can use the Cell Ranger method to select the high-quality cells. But if you are unsure of the expected number of cells, or if you are analyzing a published dataset, then you can use the EmptyDrops filtering method and find a suitable threshold on the total UMI counts for filtering out low-quality cells.
And that's it for this tutorial. I hope you enjoyed it.
Have fun with your data analysis. Thank you.