All right, YouTube says that I'm live, so if anyone can hear me, please say something in chat. Welcome back to session number two. Let me switch to the screen so that people can see me as well. So welcome, welcome, another Sunday. We skipped last Sunday, and this Sunday we will be doing more of RNA-seq: building your own pipeline from scratch. I was a little bit disappointed by the previous session; I was expecting to get a lot further, but we were just installing programs and we didn't align a single read. So I hope that today we will actually start aligning reads, and that's what this whole lecture is focused on. Very good, Leon says he can hear me well, so welcome to the stream, Leon.

All right, so RNA-seq: building your own pipeline from scratch. What are we going to do today? We are going to set up our genome, which is one of the important steps, and I want to say a couple of words about that. Then we will be downloading some reads from the Short Read Archive. And then we will be using R to create an alignment pipeline, which means that we will start trimming our reads — cutting off the bad ends, the ends without much quality — and removing adapters and such. Afterwards we will start aligning; for this we will use STAR, which we installed last time. And of course we're going to do a whole bunch of QC afterwards, to make sure that the reads are OK and that the alignment is OK. So the overview for today, very basically, is making a basic pipeline in an R script. And then I think we will spend one more lecture on how you can really turn it into a pipeline: take all the individual parts and make them into building blocks, so that you can combine them in more or less any way you want.

All right, with that out of the way, I hope everyone still has the Debian instance that we installed last time, or is sitting behind a Linux computer, and has all of the software installed. If not, the previous lecture is down in the description. I also put the slides there, and the code on GitHub is updated with the code that we will be using today, so everything should be up to date. If anything is missing, or you think we should add something else, then definitely let me know and we can fix it directly. So, unfortunately, we forgot some software last time. Well, we forgot to install one piece of software, the Integrated Genome Viewer, all the way at the bottom, and we forgot to make one additional symbolic link into the HTSlib library: the tool tabix. Tabix is required when you use VCF files, because VCF files need to be indexed as well. I thought that with samtools we could index the genome, which is true, but we also need tabix to index the VCF file that we will be using to do SNP realignment. Besides that, we forgot some additional R packages: ggplot2, gplots and the gsalib library. These are required by GATK — when you get to the GATK part of the analysis, you need ggplot2, gplots and gsalib installed, because it will try to create some PDFs, which will fail if these libraries are not there. So let's just do this, and also install the IGV so that we can look at the reads once we're done.
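If you want to do the R part of that catch-up in one go, it's roughly this (a minimal sketch; ggplot2, gplots and gsalib are all on CRAN, so a plain install.packages call is enough):

  # run inside R started with sudo, so the packages are installed for all users;
  # these are the three libraries the GATK plotting steps expect
  install.packages(c("ggplot2", "gplots", "gsalib"))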
So I will switch to the screen so that people can see my VM window. Again, I just booted up the virtual machine and haven't changed anything. Last time we created our bin folder, where all of our binaries are, and as you can see, I already linked tabix. In case you didn't do that yet, you can just use the script that was provided: it's called install_software.sh, it's on the GitHub, and it's just the code, so you can copy-paste it. It starts at line 100; I put it all the way at the bottom so as not to mix it in with the code we already had. So what you can do is do it like this — I will actually call it tabix2, so that we have two links pointing to the same executable. Now if I type tabix, I can run tabix, and if I type tabix2, it starts the same program. So that's the first step: link one more executable into our bin folder so that we have it.

Then we need to do the R part. We can just sudo into R, because we want to install the packages for all users; the password is one, two, three, four. Then we just use install.packages. I think I already installed them as preparation, but let me make sure they are there — it will just unpack and install, which is nice because R packages are relatively quick to download and install. Then we need gplots, so let's install gplots as well very quickly, and the last one is gsalib, which should also be relatively quick. Of course, on your machine it may take a little more time. After we've installed these three, we can quit R — well, we will be using R again later, of course.

Then we go back: we want to go into the software folder, and in the software folder we want to get the IGV. We are just going to download the IGV from this location, line number 108 of install_software.sh. It's just downloading a zip file. It's a big program, but it's one of those programs which is really, really useful when you start doing genome alignments, because it gives you a very good overview of what's going on with your reads and whether there are any issues, so I always love using it. All right, after we have it, we can just unzip it — unzipping doesn't take too long; yeah, just overwrite everything, I already installed the IGV, but that doesn't matter too much. If we want to start it, we go into the IGV folder and just run igv.sh, and it will start up the IGV, because everything it needs is in there. It cannot locate my genome, which is expected, because I deleted that while preparing. All right, so this is how the IGV looks. We have not loaded anything yet, so it's just the empty IGV window, but that's fine. Good, so if that is working, then it's time for the next step, which is getting your reference genome, because the reference genome is what we're going to align our reads against. We need to index it for fast alignment, and it comes in different flavors: on Ensembl we generally have the primary assembly versus the toplevel assembly. Let me show you an example of that. I'll get my Firefox window open, search for Ensembl, and go to the FTP server — Ensembl FTP.
When we scroll down, we see all of the different species available. For example, let's look at mouse — not the mouse genome browser page, but the DNA sequence of mouse. So we click DNA under the FASTA sequences, it goes to the FTP site, and you see all of these files. The first thing you will notice is that there are three different versions of each file: we have chromosome 1 dna, chromosome 1 dna_rm, and chromosome 1 dna_sm. The sm is the soft-masked version: base pairs that were unreliable have been masked, meaning they have been transformed to lowercase. The rm version is repeat-masked: there, the repeats are masked as well. So when you use those files for an alignment, they don't contain the whole DNA sequence — well, they contain the DNA sequence, but some parts of it have been blanked out. If we scroll further, we see the whole-genome files — let me get the correct ones — the primary assembly and the DNA toplevel file. Now, people always ask me which one of the two they should use; you can see that they're around the same size. That is a relatively easy question: you should always use the primary assembly, because the primary assembly contains all of the chromosomes but does not contain any of the patches. Sometimes when pieces of DNA are re-sequenced, the version of the genome is not changed; they just add a little content to the bottom of the file. So what you want to do when you do DNA or RNA sequencing is not use the toplevel but always use the primary assembly, because the primary assembly is just the chromosomes.

All right, so I explained what the dna file is — just the DNA sequence — and then we have soft masking and hard masking, which Ensembl calls sm and rm, for some reason. So today we will be setting up our own genome. We'll be using Saccharomyces cerevisiae, which is baker's yeast, and the reason I chose it is that it has a relatively small genome: the genome size in total is only 12 Mb, so 12 megabases, 12 million base pairs, but it has 16 chromosomes, and it was one of the first eukaryotes to be sequenced, in 1996. And the reference strain — if you start looking at Ensembl you can figure this out — is a specific strain of baker's yeast, S288C. Just to picture how it looks: you see the yeast cell and you see a new yeast cell budding off, and that's the way it reproduces. Good, so the problem here is that, if you look, Ensembl only has the toplevel available for yeast. I can show you that in Firefox as well: let's go back a page and look for baker's yeast — I'm just going to copy the name into the search field, otherwise it takes too long to scroll through all of the species. So here is the Saccharomyces cerevisiae genome. If we look at the DNA sequences, so we click on DNA, we see that only the toplevel is available as a whole-genome file. We don't want the repeat-masked version, we don't want the soft-masked version, we just want the plain, unmasked toplevel version.
But we don't really want the toplevel as such, because we want the primary assembly. The toplevel has all kinds of junk at the bottom — scaffolds which have not been placed on the chromosomes yet — and that is not what we want. What we want is to build the primary assembly ourselves, and fortunately that is relatively easy. So that is what we are going to do: we are going to use R to make our own primary assembly, and it really is just three steps. We have to download the individual chromosomes, then we need to unpack them and make one big FASTA file — not one big chromosome, but one big FASTA file — and then we have to repack this big FASTA file using bgzip.

So let's start doing that. I wrote a little R script; if you look online in the gist it is called build_genome.R. This build_genome.R, let me show you, looks like this: it's a very, very small file, and it does exactly the three steps I was talking about. The first step is the download; we can just use wget for that under Linux, and we give it the URL plus the file name. I build the file name from the chromosome name, so I go through all of the chromosomes that are there and download them one by one. Then I create a new, empty file and paste all of these chromosomes into it. I use zcat for that, because the downloads are of course compressed: zcat unpacks each one, and I append it to the primary assembly file that I'm creating myself. Then I compress the file that I made, but I keep the original. I'm using the system() call here for executing system commands. And at the end I delete all of the individual chromosome files, because we don't need them anymore and they just create a lot of junk.

All right, so let's do this; let's go to our R window in Debian. We start up R, and I'm just going to use the script as is, going through it step by step. The first three lines set up the parameters that we need. The URI is where the files are located — ftp, Ensembl, and so on; you can just get that from Firefox when you click through to the folder. Then the base name — let me show you what I mean — the base name is the part that is the same every time, the whole part up until the Roman numeral I, II, III and so on. That's why I call it the base name; we will be using base names a lot. The base name is the file name up to the point where it starts changing: chromosome 1, 2, 3, 4, and so forth. So the base name in this case is Saccharomyces_cerevisiae.R64-1-1.dna.chromosome. and after that I need to put the Roman numerals. The nice thing about R is that I can just say: there are 16 chromosomes, so make a sequence of 1 to 16, and then as.roman, because R understands Roman numerals. This creates a list containing the values 1 to 16 coded as Roman numerals. And then I add the word Mito, because there's also a mitochondrial genome available. If there is a mitochondrial genome, you want to add it to the list, and mitochondrial genomes always get added to the back.
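To make that concrete, here is roughly what build_genome.R boils down to — a sketch, not the literal gist, and the Ensembl release in the URI is whatever you copied from the FTP page:

  # build_genome.R (sketch): download the yeast chromosomes, concatenate them, compress the result
  uri  <- "ftp://ftp.ensembl.org/pub/.../fasta/saccharomyces_cerevisiae/dna/"   # copy the full path from the FTP page
  base <- "Saccharomyces_cerevisiae.R64-1-1.dna.chromosome."
  chromosomes <- c(as.character(as.roman(1:16)), "Mito")   # I to XVI, plus the mitochondrial genome

  # step 1: download every chromosome with wget
  for (chr in chromosomes) {
    fname <- paste0(base, chr, ".fa.gz")
    system(paste0("wget ", uri, fname))
  }

  # step 2: create an empty primary-assembly file, then zcat every chromosome into it
  assembly <- "Saccharomyces_cerevisiae.R64-1-1.dna.primary_assembly.fa"
  system(paste0("cat /dev/null > ", assembly))
  for (chr in chromosomes) {
    system(paste0("zcat ", base, chr, ".fa.gz >> ", assembly))
  }

  # step 3: compress with bgzip (keep the uncompressed copy) and remove the individual chromosomes
  system(paste0("bgzip -k ", assembly))
  for (chr in chromosomes) {
    system(paste0("rm ", base, chr, ".fa.gz"))
  }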
All right, so the next step is to download. Let me just paste in the code so you can see it; it will just start rolling wget calls across the screen. But let's focus on the code for a moment, because I do want people to understand what we're looking at. I'm saying for chromosome in chromosomes, so this is a for loop: the first time around it's Roman numeral I, the second time Roman numeral II, and so on. Inside the loop I paste together the base name, the chromosome, and the extension of the file, which is .fa.gz, and then I download it: I paste together the wget command — the executable in Linux that I'm using — the URI, and the file name I just created. I had a cat() here to print the command to the screen so I could see what was happening, but I've tested it, so now I just do a system() call: execute this text string as if it were a Linux command. So it runs wget for Saccharomyces cerevisiae chromosome one, chromosome two, three, four, and so forth, all the way up to sixteen.

Right, so now I have all of these files downloaded. In theory, if I look at my hard drive — let's go to Files, then go to the genome folder — ah, I see I already did something wrong. Let's just quit, because I downloaded them into the IGV folder, which is not what I want, so I'm just going to delete those. Let me go back first, because I'm not sticking to my own script: I'm going to make a directory called genome, and that's where the genome will go. I go into this folder, start R, do the same thing as before, and download the genome here and not in the IGV folder. So now we have a folder called genome, and the different chromosomes are inside; the mitochondrial chromosome is also there.

All right, so the next step is to use zcat to paste these things together. Let me show you my Notepad++ window — I think that's the wrong Notepad++; yes, it is, so let me switch to the other Notepad++ window... this one, no, this one. Yes, that's it. So the next step is to create an empty file. I'm going to keep the naming scheme that Ensembl uses, because it tells me which version it is: R64-1-1, dna, and we are creating a primary_assembly. I empty out this file by cat-ing nothing into it. Then I go through the chromosomes again, build the file name the same way as before, and say zcat the file and append it to the Saccharomyces cerevisiae primary assembly file that we are creating; and again I use system() to execute the zcat command. All right, let's do this: get rid of Notepad++ and paste it all in. If I now look at the drive and open this file up with a text editor, it just looks like this: I took the first file, extracted it, and appended it, and I can scroll all the way down, and all the way down at the end the mitochondrial genome should be there somewhere.
I'm not seeing where the header is, but trust me, the header is in there somewhere — it's 12 million base pairs, so the header lines are hard to find by scrolling, although we can just search for them. So let's search: here it is, the mitochondrial genome, starting at 201,000. All right, the next step, of course, is to zip it. I'm just going to call bgzip on this file — using R again — and then I'm going to delete the chromosome files; that's what the last part of the script does. Let me show you the last part: compress the file, keeping the original, because I want to keep the plain FASTA file as well, and then delete the individual chromosomes. This should just work, because it's very simple code: instead of calling wget or zcat we're now calling rm on each file, and that deletes them. Good, so now when we quit — and I'm not going to save my workspace — and look at the folder, I see that I have my primary assembly, I have my bgzipped assembly, and all of the individual chromosome files are gone, just to save a little bit of disk space.

Good, so that's how you build your own primary assembly when there is none available, and you can reuse this script quite easily. If you wanted to do it for mouse — which is not necessary, because mouse has a primary assembly available — it would just be a matter of updating the URL, updating the base name, and making sure that your chromosome list matches the chromosomes of your species: in yeast they are Roman numerals, but for mouse and human they are just numbers, and of course you would add things like X and Y as well. All right, so first step done: we have our own primary assembly. Let me go back to the PowerPoint. So we have the primary assembly that we created — a very simple script, you can update it, and you could even put it in your pipeline in case you want to work with multiple species at the same time. But we're only going to use yeast, so that's why I have one folder with one genome in it.

All right, so the next step is the transcriptome, and we need the transcriptome because we have to deal with the intron–exon boundaries. RNA-seq is slightly different from DNA sequencing, because when you sequence messenger RNA you are generally sequencing mature messenger RNA, which means that the introns are not there. So when a read starts in exon 1 and runs into exon 2, that sequence is not found as such on the genome, because at the genomic level there is intron sequence in between exon 1 and exon 2. For this reason we need the transcriptome, so the aligner can actually deal with that: when we use STAR to align the reads, we need to tell it that reads can span from exon 1 into exon 2 without this being a massive insertion in the read. Otherwise it would try to put the read in exon 1, see that a whole part of the read doesn't match, and throw the read away because it does not align to the genome. But if it knows that exon 1 and exon 2 sit next to each other in a messenger transcript, it knows that it needs to keep the read. All right, so again we go to the Ensembl FTP; let me open it up.
This time we are looking for a GTF file, which is a file that looks a bit like a BED file and describes all of the introns and exons and the genes in baker's yeast. It's a massive URL, so I didn't put it on the slide. Again we're just going to go into the genome folder and wget this long URL, and in this case we want to extract it as well, because STAR does not like gzipped annotation: it can only deal with plain GTF files, not compressed ones. Good, so let me get the VM window for you and we'll do exactly that. Sorry, I know where this is — it's not in build_genome.R, it's in prep_genome.sh; if you go to the GitHub, to the gist, you can find the code there. So what we're going to do, let me show you, is grab this whole link. We are already in the genome folder, so here is the URL: we download the GTF file, which is called Saccharomyces_cerevisiae and so on, and we want exactly the same version as the genome, so R64. And yes, it's a very small file, because although yeast has relatively many genes, it is a very small genome compared to mouse.

All right, we unzip it, keeping the original gtf.gz, so now it looks like this: we have our transcriptome and we have our genome. If we open the GTF up in a text editor it looks a little odd, because it doesn't put everything on one line, but it tells you where exons start and end, which transcript and which gene they belong to, and it's just filled with exons and introns, and sometimes things like start codons and stop codons. We will also use this GTF file later on in the IGV to provide annotation of the genome: when we browse through the genome, we want to have the genes drawn underneath, and we can use the GTF file for that.

All right, then there's one more thing we need before we can start building and indexing our genome, and that is all of the SNPs. All of the known SNPs are also on the Ensembl FTP, as a VCF file, so you can just get the URL from the download page — not the GTF this time, but the variation data. So we want to download the VCF file, again with wget; fortunately we don't have to extract it, we can just download the file and that is it. Let me do that. If you're following along, this is again in prep_genome.sh, line number four: just a wget of this very long URL where the VCF file is located. We download that as well, and now, when we look either with ls or in the file manager, we can see that we have the VCF, which contains all of the SNPs; we have the GTF, which contains all of the introns and exons; and we have the primary assembly that we used R to make — plus, of course, the gzipped version of the genome and the gzipped version of the transcriptome still available.
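Written out — and here I'm writing the prep_genome.sh commands as R system() calls to stay in the style of the rest of the lecture, with the release portion of the URLs left out because it depends on what you see on the FTP page — the two downloads are simply:

  setwd("~/genome")
  gtf <- "Saccharomyces_cerevisiae.R64-1-1.gtf.gz"        # exact name as shown on the Ensembl FTP page
  vcf <- "saccharomyces_cerevisiae.vcf.gz"
  system(paste0("wget ftp://ftp.ensembl.org/pub/.../gtf/saccharomyces_cerevisiae/", gtf))
  system(paste0("gunzip -k ", gtf))                       # STAR wants a plain .gtf, but keep the .gz around
  system(paste0("wget ftp://ftp.ensembl.org/pub/.../variation/vcf/saccharomyces_cerevisiae/", vcf))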
All right, so the next step is indexing all of these things. Indexing is required because we want our tools to run very quickly, and that means we have to build what are called indexes. An index simply tells the program that chromosome one starts at this position in the file, chromosome two starts at that position, and so on, so a tool can very easily skip whole chromosomes to end up at the chromosome where it needs to be. So basically what this means is: we use samtools to index the primary assembly file, and then we use STAR to generate a genome and transcriptome index, and we're going to use this command for that. We say STAR, use two threads, run STAR in a certain run mode, and this run mode is genomeGenerate. We provide it with a directory where to store all of the files, and we set the genomeSAindexNbases parameter. This is a bit of a magic number which you have to figure out yourself, and it is based on the genome size: for very large genomes you need a bigger number than for very small genomes. I think the default is 14, but I always specify it; if the number is not correct for your genome, STAR will tell you — it will say that your SA index number of bases is too large for this genome. It's basically the parameter that determines the k-mer size used for the genome: when we do genome alignment the genome is cut up into little pieces called k-mers, and this is where we set the k-mer size. Then I have to give it the GTF file and the genome FASTA file, so I give it the GTF and the primary assembly file for the genome. The next step after that is to use tabix to index the SNPs, and we also need to index the genome using Picard: we can use the Picard tools that we installed last time to create a sequence dictionary for our primary assembly. So that's what we're going to do, and again, the code is available — if you are following along, just look in prep_genome.sh, where all of these commands are; there is also a sketch of the whole indexing block below.

All right, let's take this code, and I'm going to close this window so that we don't have it hanging around. The first thing is using samtools to index the genome, so I can just paste this in. It indexes the genome very, very fast — it's a very small genome, I chose it for that reason. When we do an ls, you will see that it created a .fai and a .gzi, and these are the two files that hold the index. We can actually look into them: the primary_assembly .fa.gz.fai looks like this. It tells you how many characters there are on a line, how many characters including the newline there are on each line, the length of each chromosome, and exactly where it begins: chromosome one has this many base pairs and starts at character number 54 in the file; the FASTA file has 60 base pairs per line, 61 when you count the newline. Very basically, this is so that samtools can very quickly skip to a certain chromosome: if it has to skip to chromosome X here — chromosome ten — it knows that it needs to skip about 5.9 MB of data. All right, the next one is STAR, so I'm just going to copy that in. Again, STAR indexes our genome, and it doesn't throw any errors, which is good: if the number 10 that we chose for the SA index size had not been correct, it would have told us, and fortunately 10 works. It writes a small folder, so if we go into the star folder, we can see that it has a whole bunch of additional files: the lengths of the chromosomes, the names, the start positions, the number of exons, a genome file, and a couple of index files, so that it can do the alignment without reading through the whole file over and over again — it can just skip to the position in the file where it needs to go.
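Concretely, the indexing block of prep_genome.sh comes down to the following four calls — again written here as R system() calls; the Picard jar location is an assumption based on last session's install, so adjust it to wherever yours lives:

  setwd("~/genome")
  fa  <- "Saccharomyces_cerevisiae.R64-1-1.dna.primary_assembly.fa"
  gtf <- "Saccharomyces_cerevisiae.R64-1-1.gtf"
  vcf <- "saccharomyces_cerevisiae.vcf.gz"

  # samtools: .fai / .gzi index of the bgzipped primary assembly
  system(paste0("samtools faidx ", fa, ".gz"))

  # STAR: genome + transcriptome index; an SA index of 10 suits a 12 Mb genome
  system(paste0("STAR --runThreadN 2 --runMode genomeGenerate --genomeDir ~/genome/star",
                " --genomeSAindexNbases 10 --sjdbGTFfile ", gtf,
                " --genomeFastaFiles ", fa))

  # tabix: .tbi index of the known SNPs
  system(paste0("tabix -p vcf ", vcf))

  # Picard: sequence dictionary (.dict) next to the primary assembly
  # (if your Picard build complains about the .gz, point R= at the plain .fa instead)
  system(paste0("java -jar ~/software/picard/picard.jar CreateSequenceDictionary",
                " R=", fa, ".gz O=", sub("\\.fa$", ".dict", fa)))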
Then tabix — again, very quick. Tabix just takes the VCF file and indexes the SNPs; if we do an ls, we see that we get a .tbi file, and this .tbi file is the index for the VCF. Some of these index files are actually available from Ensembl as well. Good, then the last step is to index the genome using Picard, and if we want to do that, we can just do it like this. Hmm, it failed — why did it fail? "Warn, zero bytes", okay — that's a good question, why it failed. Executing, deflating... oh, no, no, it's just done; it's just a warning. Interesting. Let me see if it created the dict correctly, because this should create a .dict file. Let me see what's in there — we can just more the dict file — and yes, that seems to be perfectly fine. It creates a file which says: chromosome one, this length, this is the MD5, and this is the URL where the sequence is located. So no reason for that warning that I can see; it all looks fine to me.

All right, so that is our genome set up: we downloaded the genome and did all of the necessary steps so that our tools can work with it. The next step, of course, is to download some reads, and I think that's what people want to see, because in the end that's what we came for: aligning reads onto the genome. From this point on we will start automating all of our steps, because we want to build a pipeline, and the pipeline has to be completely computational. So, about getting reads from the SRA: what I did is I went to NCBI to find a suitable data set. I just took a fairly small data set, and in the SRA search I ticked that I want public data sets, because we want to work on public data, not private data; I wanted it to be RNA; I wanted it to be paired, so that I have left reads and right reads; and I wanted the reads to come from an Illumina sequencing machine. Then I looked through the list and took a data set with files of around 200 to 300 MB. The way you would normally download these by hand is: make a data directory with a raw subdirectory, then download the control samples and then the other condition. In this case the data set I took has three control samples and three samples which were grown in a different medium. But of course we don't want to type all that out, because we are going to write a pipeline: the pipeline will take the name of the SRA run that we want to download, download that file, and then do all of the steps automatically from there on. We don't want to sit behind a computer like a pleb typing in all of the commands; we want the computer to generate these commands for us. It's the same as downloading all of the chromosomes.
We could have just typed wget with the whole address, or copy-pasted the address, but then we would have to do that 16 times, and it's much better to write a small script to do it for you. In this case we're going to do the same thing: we're not going to type into the command line anymore, we're just going to use R to generate the correct commands for us and then execute them. Good, so, using R to execute things — we could have just used the system() command; we used it before. "I love Muse as well" — what's this Muse thing, where is that coming from in chat? Is there something Muse behind me? I can't see it; there is a Batman lamp over there, but... oh, written on my mug, true, true. Yeah, we got this from a concert we were at — you could buy the beer in these 50-centimetre mugs, and you could either keep it or bring it back and get your euro deposit back.

Anyway, we are just going to use the system() command, but we are going to be smart about it, because these files can become very big. In our case they're relatively small — we're only taking a data set of two or three hundred MB — but sometimes you are going to download data sets of five or six gigabytes, and while you're working on your script you don't want to re-download that five or six gigabyte file every time. There are also going to be steps in the pipeline which take a large amount of time — a large amount of time meaning something like eight hours of alignment — and we don't want to redo a step we have already done. (The Simulation Theory world tour, that's the one the mug is from, right.)

So I'm going to write a small function called execute. It takes x, and x is just the command that we want to execute; I hand that directly to system(). But I also give myself the ability to define an output file: if I don't define it, the output file is NA, meaning it does not exist. And I add two more parameters: one is intern, which is a parameter of system(), and the other is quitOnError — generally, when I'm developing, I set this to TRUE, which means that if there's an error it will quit R and not save my workspace. Inside the function I check: if the output file is not NA and the file exists, I just say "the output for this step exists, I'm skipping this step". That means that if I'm trying to download a five gigabyte file but a file with that name is already on my hard drive, I'm not going to do the download. Then I write a little log: I cat() the command — the start of the command with four minuses in front of it — then I run the command, and then I write the end of the command: four greater-than symbols and then the result. That result is the return code, and it should be zero, because if a Linux command executes successfully it does not return an error code — not returning an error code means the value is zero. But if there is an error code — if the thing we called failed — it will tell me failure code 13, or failure code 9, or whatever, and in that case I just quit R, because the external process did not finish, or finished unsuccessfully.
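As a sketch, that wrapper looks something like this (the behaviour as just described; the exact argument names in the gist may differ a little):

  execute <- function(x, outputfile = NA, intern = FALSE, quitOnError = FALSE) {
    # if the expected output file is already on disk, skip this step entirely
    if (!is.na(outputfile) && file.exists(outputfile)) {
      cat("Output for step exists, skipping this step\n")
      return(invisible(NULL))
    }
    cat("----", x, "\n")                  # log the command we are about to run
    res <- system(x, intern = intern)     # hand the command string to the shell
    cat(">>>>", "Returned:", res, "\n")   # log the return code (or the captured output when intern = TRUE)
    # a non-zero exit status means the external tool failed; optionally bail out of R
    if (!intern && res != 0 && quitOnError) {
      cat("Failure code:", res, "\n")
      q("no")                             # quit R without saving the workspace
    }
    invisible(res)
  }

Calling it with an output file is what gives us the skip-if-already-done behaviour; calling it without one always runs the command.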
So, a little example. We saw fasterq-dump before, which is a very easy command: fasterq-dump, -p for progress, and --split-files so that it downloads the whole data set and splits it into left reads and right reads, a read-one file and a read-two file. If we want to run this command, we can just run it through the execute function. Without an output file this will always download the data, because we didn't give it anything to check. If, however, we give it the name of one of the output files — in this case the _1.fastq, because that's the left-read file — then, if the left-read file is there, it will not execute the command. It's just a way of being nice to your friendly neighbourhood server: you don't want to download five gigabytes if you already have the same five gigabytes on your hard drive.

All right, so this execute function is going to be the core: every command we run from now on goes through it. Next, we start building up the script. There are some static variables in the script which will not change while we are building it. The first one is the input directory: all of the input files that I download from the Short Read Archive go into a folder, /home/danny/data/raw — the raw data that we downloaded. In case you're using your own data, you can put it in here and make the directory read-only: if you get your data from a company, you make this directory, you put your files in there, and then you make it read-only. Then we have our input base; this is fixed for now, but next week we will make it variable, so that it can come from the command line. This is the input to the script: it's hard-coded while we're building a test script, but it's going to become a command-line argument from outside R, and it will be the name of the data set we are going to download. All of the output files we create will go into /home/danny/data/output/ plus the input base — the name of the FASTQ data set that I downloaded — with .aln appended. So that's just a directory inside the output folder, and all of the files for SRR139-whatever will go into SRR139-whatever.aln, for alignment. Of course, the genome needs to be specified as well: since I indexed the genome using STAR, the genome directory is /home/danny/genome/star — the same path that I gave when we indexed the genome.
The reference is going to be a long path, /home/danny/genome/ followed by the Saccharomyces cerevisiae R64-1-1 primary assembly .fa.gz, and I have a path to the reference SNPs as well, which is the same idea: /home/danny/genome/ and then the saccharomyces_cerevisiae .vcf.gz. So those are the first parts of the script. Next I use R to create the input and output folders, because I like using R for these things, and in case I have never run my script before, I want it to create them. Creating the input folder is a bit unusual — normally you would get your raw files from an external company — but since we're downloading with fasterq-dump, I just make sure that the input directory exists. I do the same for the output directory, and since the script will eventually handle not just one data set but multiple data sets, it makes its own output directory for each data set, so I have to make sure that gets created too. I set recursive to TRUE, because it needs to make the data folder, the output folder and the per-data-set folder — not just one directory, but three directories deep for the output and two directories deep for the input. Always be organised when you're doing RNA-seq; keep track of where your files are stored. My voice is breaking up — I'm getting all choked up here.

All right, so we already have a little piece of the code: we have our execute function, then we define all of the static variables, and then we call dir.create twice, once for the input and once for the output. Good, so now we can start downloading our file from the SRA. I don't know what's going on with my voice, but it's a little annoying. All right, downloading the FASTQ files: we set the working directory to the input folder, because I'm going to issue command-line calls and I need to make sure we are in the correct working directory. Then I execute fasterq-dump -p --split-files followed by the name of the input data set, and the file I check for is the _1.fastq: if that file exists, the fasterq-dump is skipped. I do the same thing for the bgzip calls, because I download plain FASTQ files but of course I want to zip them as quickly as possible — these files can be 20 or 30 gigabytes — so I bgzip them to make sure I'm not wasting a lot of disk space.

Good, and after that we are at read trimming. So let's run the first part of the script. Let me open up my Notepad++ window so you can see it: here we have my alignment script, and again, we have the execute function exactly as we described, we have the static input parameters — these will come from outside later on — an input directory, input base and output directory, then we create those folders, then we set the working directory, so we move into the input directory, and then we execute the fasterq-dump call: take the input base, the data set we want to download, but only run this command when the _1.fastq is not already there.
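Put together, the top of the alignment script looks roughly like this — a sketch with the paths as I understood them from the talk, and with the data set name left as a placeholder:

  # static variables: hard-coded for now, to be taken from the command line next session
  input.dir  <- "/home/danny/data/raw"                 # raw downloads; make this read-only for real data
  input.base <- "SRR..."                               # name of the SRA data set (placeholder)
  output.dir <- paste0("/home/danny/data/output/", input.base, ".aln")

  genome.dir     <- "/home/danny/genome/star"          # the STAR index we built earlier
  reference      <- "/home/danny/genome/Saccharomyces_cerevisiae.R64-1-1.dna.primary_assembly.fa.gz"
  reference.snps <- "/home/danny/genome/saccharomyces_cerevisiae.vcf.gz"

  # create the in- and output folders; recursive = TRUE builds the whole path in one go
  dir.create(input.dir,  recursive = TRUE, showWarnings = FALSE)
  dir.create(output.dir, recursive = TRUE, showWarnings = FALSE)

  # step 0: download the FASTQ files from the SRA, then compress them with bgzip
  # (the check is on the plain _1.fastq, as in the talk; checking the .gz instead
  #  would avoid a re-download once the files have been compressed)
  setwd(input.dir)
  execute(paste0("fasterq-dump -p --split-files ", input.base), paste0(input.base, "_1.fastq"))
  execute(paste0("bgzip ", input.base, "_1.fastq"), paste0(input.base, "_1.fastq.gz"))
  execute(paste0("bgzip ", input.base, "_2.fastq"), paste0(input.base, "_2.fastq.gz"))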
All right, so let's take this piece of code — let me zoom out a little — and go to Linux. When we go to Linux, let me close my Notepad++ window first. I'm going to start R explicitly from the genome folder, just to make sure that while I'm developing the script I can call R from any location: my script should move itself to the correct folder, so that even though I started in the genome folder, it ends up in the right place. All right, let's paste in the code and see what happens. This will take a little bit of time because, although it's a small data set, it still has to download — it's still a 200 MB download. In the meantime, we can check what's happening on the hard drive: if we go to data/raw, you can see it makes a temp directory where it downloads, and then it starts writing out the two FASTQ files. There — it downloaded both of them, one is one gigabyte, the second is also one gigabyte, so it's really fast: two gigabytes in a relatively short amount of time. Then it starts zipping the files, packing them up, because two gigabytes of hard drive space is a lot for a single RNA-seq experiment — at least when the genome is only 12 megabases. It takes a little time to compress these files; I'm using bgzip, the blocked gzip, instead of standard gzip — it's a little slower, but in the end it's better. All right, it finished the first one, and every time you can see in the log output which command it's executing, then the output of the command, and then it prints zero, meaning that the command executed successfully. The same for bgzip: executed successfully, no errors whatsoever, and it's now doing the second file. Very good. It takes a little time, but that gives us time to go to the next step, which is read trimming.

We've downloaded a whole bunch of reads, but these reads still have the adapters on them, and there may also be reads with very bad quality scores that we want to get rid of. So, read trimming with Trimmomatic: it takes two input files when we run it paired-end — the _1 FASTQ and the _2 FASTQ — and it generates four output files for us. Why four? Well, we have left reads and we have right reads. A pair can end up with a trimmed read on the left side and a trimmed read on the right side which are both still good enough; those go into the paired files, paired one and paired two. However, sometimes Trimmomatic deletes a read entirely: it could be that the left read is of such poor quality that it isn't usable, so it throws away the left read, but the right read is perfectly fine — then the right read ends up in a file called unpaired _2, because the right read survives while the left read was essentially trimmed away completely. And the same can happen in the opposite direction.
So if the right read is not good enough, the whole right read gets discarded, the left read is still left, and it goes into unpaired _1. That means that some of the reads will not be paired anymore, but all of the pairs where both mates survived trimming end up in the two paired files, paired one and paired two. And if we want to run Trimmomatic, we need to know which adapters were used, because it needs to know which adapters to trim. "Are the unpaired ones also called singletons?" — Trimmomatic calls them unpaired, but yes, in theory they are singleton reads. They are more or less the reads you would have if you had not done paired-end sequencing: if your aligner is running in paired-end mode it can't use them, but you can always align them separately later on, because they are still valid reads — they survived the trimming, it's just that the mate on the other side did not. So they are effectively single-end reads; some people call them singletons, but Trimmomatic labels them unpaired.

So, we need to know which adapters were used. In this case I looked it up, and they used TruSeq3 paired-end adapters, and I'm going to use the second version of the adapter file. I'm not 100% sure that they used the second version of the adapters, but it seems to work well. This is always tricky, because sometimes the SRA doesn't tell you which adapters were used; it just tells you which machine did the sequencing, and with the same machine you can use different adapters. But in this case I'm fairly certain they are the TruSeq3 paired-end adapters, and I use the second version of that file, which is very similar to the first one — I think it's just a little bit longer, so there are a couple more base pairs in it.

Then we get to the trimming options, and Trimmomatic has a lot of them. We can trim, say, three base pairs from the beginning, three base pairs from the end, or five from the end — always cut off small parts of the read, simply because we don't trust the first few base pairs of the sequencing or we don't trust the last few. Then you need to define the sliding window: the window moves along the read, and within the window Trimmomatic calculates the average quality; if the average quality of the window falls below a certain threshold, it cuts the read at that point. The bigger we set the sliding window, the quicker it runs — it's just less data to go through; with windows of 30 base pairs we move through the whole read much faster. And we have a minimum length, which specifies how long a read needs to be after trimming: if, after trimming, a read ends up shorter than the minimum length, the whole read gets thrown away, and if its mate does survive, the mate goes into the corresponding unpaired file. All right, so, Trimmomatic. This is generally the way I write my code when I use these execute calls: when I set up a step for the pipeline, I just say, step one, read trimming, and then it needs a whole bunch of things. "So in practice you ask the wet-lab folks which adapters they used?" Yeah, you definitely need to.
You need to know which adapters they used. Well, in theory you don't have to know: you could always use a program like FastQC to see whether there is a sequence that occurs millions and millions of times, and then add it yourself. Sometimes the wet-lab folks don't even know which adapters they used — they just say "oh, I used this kit" — and then you have to go to the kit manufacturer and see which adapters ship with the kit. But yes, you always want to know which adapters were used, and if you don't, you can still figure it out, because these adapters are literally in the file millions of times, so they are relatively easy to spot. If you just open up the FASTQ file, you would see that every read starts with more or less the same sequence of about 20 base pairs, and that's the adapter; you could copy-paste it into a separate file and give that file to Trimmomatic.

All right, back to Trimmomatic. We need to define the input and output files, we need the path to Trimmomatic, we need the executable call — how do we invoke the executable — then we specify all of our options, then I build the final command, and then I execute it. That looks like this: I have trim.files — the first two are the input files, and then there are four output files: read one paired, read one unpaired, read two paired, read two unpaired. I have my path, which is /home/danny/software/Trimmomatic, and then the executable, which is the command to run Trimmomatic without any options or parameters: java -jar followed by the path where Trimmomatic is located, dist/jar/trimmomatic-0.40-rc1.jar, which we compiled last time using ant. Then I have some options. One of them is ILLUMINACLIP, which means we want to clip the adapters; the adapter file ships with Trimmomatic. "Thank you for the tutorial, really appreciate it. What is the ideal RAM allocation for Trimmomatic?" It really depends on the number of reads and the length of your reads: if your reads are 350 base pairs, you need roughly twice the amount of RAM compared to reads of 150 base pairs. Fortunately, RAM is not the main issue with Trimmomatic; the main issue is CPU, because it trims read by read. It only keeps a limited number of reads in memory and writes them out to the four output files as it goes — it does not load all of the reads and then start trimming. It just goes through the files: trim a read, write it to the output, trim the next read, write it to the output. So RAM is not the bottleneck for Trimmomatic; CPU speed matters, and hard drive speed as well. You can get away with relatively small amounts of RAM. All right, so, the adapters: in this case I specified the TruSeq3-PE-2 adapters, and you give ILLUMINACLIP some parameters, 2, 30 and 10. I forget exactly what each stands for — roughly, they control the adapter matching, how many mismatches are allowed and how strong a match has to be before clipping — and the right values depend on the adapters.
You can just look this up for the adapters you're using — where the numbers go and what they should be. Then I always cut off the first three base pairs and the last three base pairs, just to be safe: the first ones come from the very beginning of the sequencing run, so they are unreliable, and the last ones come from the very end, so they are generally unreliable too. I take a sliding window of four bases with an average quality threshold of 15, so SLIDINGWINDOW:4:15, and I want the minimum length of the reads to be 36: only if a read is still 36 base pairs after trimming do we keep it. Then I construct the command: I take the Trimmomatic executable, I run Trimmomatic in paired-end mode, I paste all of the files together because I need to give it six files in order — first the two input files, then the four output files in the order I described — then a space, and then the options. And I only execute Trimmomatic when file number three, the _1P.fastq.gz, does not exist yet — again to make sure we only trim when the result isn't already there, because trimming the reads does take some time.

All right, let's go to the Linux VM; I've explained the code, so I'm just going to copy it, paste it in and execute it so that you can see what it starts doing. Right, it prints the long clipping sequences, which are the adapters, and the prefix pairs, and it's using two threads. We can actually follow along: it is writing to the output files, so when we go into the folder we can already see the trimmed output appearing. This is also why it doesn't take a lot of RAM: it just takes two reads, one from the left side and one from the right, analyses them, and puts them into the corresponding files. If we press F5 a few times, we can see these files growing. You can also see that more of the right reads were discarded than left reads: the left unpaired file is bigger, so there are more surviving left reads whose mate did not make it. Anyway, we will only use the paired files, of course, because STAR is a paired-end aligner, so we're not going to use the unpaired files — though you could add them back in later on. It's always a minority of the pairs that have one of the two reads cut away. All right, so it runs for a while, and at a certain point it is done. A nice thing about Trimmomatic is that it writes gzipped files directly, because we gave the output files .gz names, so we don't have to compress them afterwards. It reports that it detected the quality encoding, and then the survival numbers: out of all of the read pairs, 71% survived on both sides, which is pretty okay. Forward-only surviving was 1.5%, reverse-only surviving was 2.8%, and it actually dropped 24% of the read pairs for some reason. So the question is: why did it drop 24% of the reads? Why is a quarter of the data just not good enough?
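For reference, the whole trimming step as described over the last few minutes comes down to something like this — a sketch; the jar path and version are from our source build last time, and the adapter file location inside the Trimmomatic folder is an assumption:

  # step 1: read trimming with Trimmomatic, paired-end mode
  trim.files <- c(
    paste0(input.dir,  "/", input.base, "_1.fastq.gz"),    # input: left reads
    paste0(input.dir,  "/", input.base, "_2.fastq.gz"),    # input: right reads
    paste0(output.dir, "/", input.base, "_1P.fastq.gz"),   # output: left, paired
    paste0(output.dir, "/", input.base, "_1U.fastq.gz"),   # output: left, unpaired
    paste0(output.dir, "/", input.base, "_2P.fastq.gz"),   # output: right, paired
    paste0(output.dir, "/", input.base, "_2U.fastq.gz"))   # output: right, unpaired

  trim.path <- "/home/danny/software/Trimmomatic"
  trim.exec <- paste0("java -jar ", trim.path, "/dist/jar/trimmomatic-0.40-rc1.jar")

  # ILLUMINACLIP:<adapter file>:<seed mismatches>:<palindrome threshold>:<simple clip threshold>
  trim.opts <- paste0("ILLUMINACLIP:", trim.path, "/adapters/TruSeq3-PE-2.fa:2:30:10",
                      " LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36")

  cmd <- paste0(trim.exec, " PE ", paste(trim.files, collapse = " "), " ", trim.opts)

  # only run when the trimmed left-paired file isn't there yet
  execute(cmd, trim.files[3])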
That's probably something that happened during the sequencing run. Anyway, we still have about 71% of our data left — something like 22.4 million reads to work with — but we did lose a lot of reads in this step. So the read quality of this data set is not that good, because Trimmomatic throws away 24% of the data.

Good, so the next step is the fun step, because the next step is the alignment — yes, I have the right slide now. The alignment will be done with STAR; then we're going to do some coverage statistics, remove duplicate reads, add read groups — information like which run it was, which library was used, and the name of the sample — use the GATK to do base recalibration, and then have a look at the resulting files. Again, I'm going to use the exact same code pattern every time. Parts of it change, of course — I'm not calling Trimmomatic, I'm calling STAR — but the code is built up exactly the same way as before. The first thing I need to do is unzip the two paired files from Trimmomatic, because STAR, the way we're calling it, does not accept gzipped input: it has no internal decompressor here and assumes that the files you give it are plain FASTQ. So I execute gunzip twice, keeping the original files, and just extract them; and I only run this step when the extracted .fastq of the third trim file does not already exist — if the .fastq is there, don't run the command. Then I set up files.in, because I have two input files to give to STAR, and those are the third and the fifth trim files: the third is the _1P and the fifth is the _2P, the paired left- and right-read files. All right, let's unzip them — I'll paste this in; it takes a little time — and in the meantime I can explain the next step.

STAR again gets an output base: I'm going to write into the output directory, using the input base, and I call this star.outbase — the folder I'm writing into plus the file name prefix. The resulting file name will then be that prefix plus Aligned.sortedByCoord.out.bam: the name of the data set followed by Aligned.sortedByCoord.out.bam, which tells me what happened — it was aligned, it was sorted by coordinate, so sorted along the reference genome, and it's a BAM file, because STAR writes a BAM file as output. The executable is STAR, and not just STAR: in this case we specify that we use the run mode alignReads. Then there are some options: I need to provide the genome directory, and I have to say that I want a BAM file as output and that this BAM file should be sorted by coordinate. I provide files.in — the files I got from Trimmomatic, the third and fifth trim file names with fastq.gz changed to fastq — and I give it an output file name prefix, which is my output base; that determines what the output file will be called. And then I just execute the command, which consists of the executable, the input files, the options, and the output prefix.
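So step 2, the alignment, comes out looking roughly like this (again a sketch, following the file naming from the trimming step above):

  # step 2: align the trimmed, paired reads with STAR
  # STAR (as we call it here) wants plain FASTQ, so unpack the two paired files first, keeping the .gz
  execute(paste0("gunzip -k ", trim.files[3]), sub(".fastq.gz", ".fastq", trim.files[3], fixed = TRUE))
  execute(paste0("gunzip -k ", trim.files[5]), sub(".fastq.gz", ".fastq", trim.files[5], fixed = TRUE))
  files.in <- sub(".fastq.gz", ".fastq", trim.files[c(3, 5)], fixed = TRUE)

  star.outbase <- paste0(output.dir, "/", input.base)
  star.bam     <- paste0(star.outbase, "Aligned.sortedByCoord.out.bam")   # STAR's own output naming

  cmd <- paste0("STAR --runMode alignReads",
                " --runThreadN 2",
                " --genomeDir ", genome.dir,
                " --outSAMtype BAM SortedByCoordinate",
                " --readFilesIn ", paste(files.in, collapse = " "),
                " --outFileNamePrefix ", star.outbase)

  # skip the (expensive) alignment when the sorted BAM is already there
  execute(cmd, star.bam)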
I provide it with the files in, which are the files I got from Trimmomatic, so the third and the fifth file name from Trimmomatic with fastq.gz changed to fastq. Those are my input files, and I need to give it an output file name prefix, which is my output base; that determines how the output file will be called. Then I just execute the command: the command consists of the executable, the input files, the options and the output prefix (a small sketch of this follows below). The nice thing about this execute approach is that in Linux we can see exactly what the generated command looks like. This is the whole command it generates, and we didn't have to type any of it in; that is the advantage of using R to construct the commands, because otherwise you would have to type out STAR, the run mode alignReads, the genome directory, the read files and so on by hand. Now you can generate these commands from the input and output file names. Good, so STAR started, it loaded the genome and it is now mapping all of the reads to the genome. You can probably hear my laptop spinning up in the background, because this is quite heavy; fortunately I only gave it two cores and it's only a couple of million reads, so it will align relatively fast, but this is generally the most computationally heavy step. If you are aligning against the mouse genome with a standard RNA-seq run, this takes around eight hours and around 16 to 32 gigabytes of memory. So this is the really compute-heavy phase of the pipeline, but since our input files are small, around 190 and 180 megabytes, it's perfectly fine and only takes a couple of minutes. Still, it needs to align millions of reads to the reference genome, so it does take some time. In the meantime, let me switch back to the script, because after this is done we're going to do steps 2.1 and 2.2: we're going to use samtools to index the BAM file so that the tools afterwards have an index, which means they can jump to specific positions in the BAM file very quickly. Someone asks whether four to six CPU cores would suffice for STAR. Yes, even one would suffice; one core is enough, it just means you have to wait a lot longer. STAR is pretty well optimized for multiple cores, but I would never use more than eight cores for an alignment, because with most aligners, once you go far beyond that, other things start to come into play: the speed of your hard drive and the amount of memory every core takes. If you have a massive server with 128 cores and you say, use half of them for the alignment, it will generally not be 64 times as fast as a single core; there is always an overhead to multi-core processing. I generally limit myself to eight cores for alignment, because every core needs to write out temporary files, and writing to disk is much, much slower than doing computation.
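Here is a minimal sketch of the decompression and STAR steps just described, reusing the assumed execute() helper and files.trim vector from the trimming sketch; the genome.dir path and thread count are placeholders, and the real script may differ in names and options.

```r
# Sketch only: builds on the names from the trimming sketch above.
genome.dir <- "~/genome/star_index"   # assumed location of the STAR genome index

# Unzip the paired Trimmomatic outputs (files 3 and 5), keeping the .gz originals,
# but only when the unzipped .fastq is not already there.
files.in <- sub("\\.gz$", "", files.trim[c(3, 5)])
for (f in files.trim[c(3, 5)]) {
  if (!file.exists(sub("\\.gz$", "", f))) execute(paste("gunzip -k", f))
}

output.base <- paste0(output.dir, input.base, ".")
bam.file    <- paste0(output.base, "Aligned.sortedByCoord.out.bam")

cmd <- paste("STAR",
             "--runMode alignReads",
             "--runThreadN 2",
             "--genomeDir", genome.dir,
             "--outSAMtype BAM SortedByCoordinate",   # write a coordinate-sorted BAM
             "--readFilesIn", paste(files.in, collapse = " "),
             "--outFileNamePrefix", output.base)
if (!file.exists(bam.file)) execute(cmd)
```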
So it's much better to use eight cores and not overload your hard drive than to use sixteen cores and have your hard drive practically melting. For me, eight cores is the maximum I use for any step, unless I know there is no output being produced, because if nothing is written out the overhead is much lower. So for STAR I would say use eight cores when you have them, and make sure that when you specify the number of cores, you never specify more than the number of real cores you have, and real cores means physical cores (there is a small R sketch of querying this below). Even on my Windows machine, let me show you. This is my task manager; if I go to Performance it shows the CPU, and if I change the graph to logical processors, it says I have an i7 with twelve processors in the machine. But that is a lie, a really big lie that CPU manufacturers tell you, because underneath you can see there are actually only six cores in my machine and twelve logical processors. Every physical CPU core in my machine is virtualized into two logical processors, so one physical piece of metal is simulating two logical cores; that is hyperthreading. Even on big server machines you might see 64 logical cores while there are physically only 32 cores in the machine. So always make sure you stay within the physical core limit, not the logical processor limit: if I started ten threads on my machine, ten programs next to each other, four of them would effectively be waiting on the others, because each physical core can only really execute one of them at a time; it just looks like both are active at once. Someone in chat reports: unable to access JAR file. Is the JAR file actually located there? Step one produced the error "unable to access JAR file", and that generally means you should check that the path is exactly correct, because there might be a typo. What you can do is go to your home folder, then software, then Trimmomatic, then dist, then jar, and make sure the file is really there and named exactly the way you typed it, because "unable to access JAR file" usually means the file was not found. You can also open the file's properties, which shows the whole path, and copy-paste it from there so you make no typos. You say it isn't in the folder; did you compile it? You have to run ant to build the JAR file, that was part of the previous step. Alternatively, you can just download Trimmomatic from the Trimmomatic website; we compiled it from source, but you don't have to, you can also download the prebuilt version. All right, but you can see that the alignment finished, so the next step after alignment is some basic indexing again so that we can use the BAM file rapidly, and the way we do that is with samtools.
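If you are scripting this in R, you can query the physical core count before deciding how many threads to hand to STAR. A small sketch; note that detectCores(logical = FALSE) is OS-dependent and can return NA on some systems, and the cap of eight is just the rule of thumb discussed above.

```r
library(parallel)

physical.cores <- detectCores(logical = FALSE)  # physical cores (may be NA on some systems)
logical.cores  <- detectCores(logical = TRUE)   # what Windows calls "logical processors"

# Cap the thread count at eight, and fall back to 2 when the physical count is unknown.
n.threads <- min(8, if (is.na(physical.cores)) 2 else physical.cores)
```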
So I'm going to run samtools index on the BAM file, and the index file gets the extension .bai, for BAM index. If the BAM index file exists, skip it, otherwise execute samtools index on the STAR BAM. Let's run this step in Linux; it is really quick because this is a very small BAM file, so it takes less than a minute. Once it is indexed, we could already open it in the IGV: we have our alignment, our BAM file, our BAM index and our genome, which is everything we need to display it. However, I first want to show you two more commands which I think are really useful. The samtools flagstat command summarizes the flags in the BAM file: every read in the BAM file carries a flag saying whether it aligned successfully, whether it is paired, whether its mate aligned, and so on. From this we get some very basic information on how our alignment went. If I run flagstat on this BAM file, it tells us that about 6.6 million records in total passed QC and zero failed QC. There are roughly 4.7 million primary alignments and 1.8 million secondary alignments, so about 1.8 million alignments come from reads that fit not just one position in the genome but two. Essentially all reads were mapped: 100% had a primary mapping, and almost all were mapped as pairs, with 99.8% properly paired, which means both the left and the right read aligned. There is a small difference between those two numbers because sometimes read one aligns but read two does not, and vice versa. So out of all the reads we trimmed with Trimmomatic, 99.8% were properly mapped as pairs, and only 8,675 reads were paired reads where only one of the two mapped; those are called singletons after alignment. That is the wording issue from earlier, Leon: singletons are generally the read that remains when its mate does not align, whereas Trimmomatic calls them unpaired, because at that point there is no alignment yet. Someone asks: would you suggest using FastQC or MultiQC for better visualization of the mapped FastQ files? Well, FastQ files are by definition unmapped; they are just files full of reads. FastQC is a really good program: it takes one of these FastQ files and tells you how many reads there are and whether anything is wrong with them. We skipped that step here because going through FastQC already takes quite some time: it looks at the flow cell, whether anything went wrong with it, whether there are A/T/C/G composition discrepancies in the reads, whether adapters are still present, and so on. FastQC is mainly useful before trimming; after trimming you get new FastQ files, so you can run FastQC again to check whether the trimming actually solved most of the issues. The flagstat output is just a very basic overview of how the alignment went.
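A small sketch of how these two samtools calls could be wrapped in the same R pattern, reusing the assumed execute() helper and bam.file name from the sketches above.

```r
# Index the BAM only when the .bai is missing, then print the flag summary.
bai.file <- paste0(bam.file, ".bai")
if (!file.exists(bai.file)) execute(paste("samtools index", bam.file))

# flagstat prints the totals discussed above: primary and secondary alignments,
# properly paired reads, singletons, and so on.
execute(paste("samtools flagstat", bam.file))
```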
So we can see that the alignment of our reads went really well: 99.8% of the reads were good, meaning both the left and the right read aligned, and only about 8,700 reads were in pairs where the mate did not align, so were not properly paired. We also see that there are no read pairs where the two mates mapped to different chromosomes, which is good, because mates on different chromosomes generally point to a very poor alignment or to repeated regions in the genome. Sometimes the left read aligns on chromosome one and the right read on chromosome twelve, which is very weird for RNA-seq, because we don't expect genes to span multiple chromosomes: a chromosome is a single unit, and a gene is generally located on a single chromosome. If you do Hi-C, on the other hand, you do expect to see numbers here, because in Hi-C you do DNA sequencing after physically crosslinking chromosomal regions that are close together, so read pairs with one read on chromosome one and one on chromosome three are expected there. For RNA-seq we would call that wrong, but for other types of sequencing, like Hi-C, it can legitimately happen. In this case the alignment went pretty well: most reads were mapped, 100% were primary mapped, and there are some secondary mappings, which means those reads did not align uniquely to the genome. We generally want to get rid of secondary mappings, because if a read is not uniquely mapped we don't want to use it: it maps to a region or a gene that is duplicated, which causes problems down the line. Then there is the samtools coverage tool. Coverage is not that important for RNA-seq, it mostly matters for DNA sequencing, but I do want to show you its output. The coverage table shows the chromosome name, the start and end position on the chromosome, the number of reads aligned to it, and the number of bases covered on that chromosome. The coverage column is that number of covered bases divided by the total length of the chromosome, so on chromosome one about 74% of the bases are covered by at least one read, and the mean depth is around 39 reads per base. The mean base quality is 37.2, which means the trimming did a good job, because this is Phred+33 encoding, where scores run roughly from zero to forty. The mean mapping quality on chromosome one is really good at 251, so most of the reads on chromosome one aligned very well. That is the case for most of the chromosomes, except chromosome twelve: something is weird there, because the mapping quality of the reads on chromosome twelve is much lower than we would expect based on the other chromosomes. This might mean that a large part of chromosome twelve contains duplicated regions, but it is a little odd, and we also see that the mean depth is much higher on chromosome twelve.
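A sketch of running samtools coverage from the same R script and pulling the per-chromosome table back into R to spot outliers like chromosome twelve. The column names follow the samtools coverage header, while bam.file is the assumed name from earlier sketches.

```r
# Read the samtools coverage table straight from a pipe into a data frame.
cov <- read.delim(pipe(paste("samtools coverage", bam.file)))
# Columns (from the samtools coverage header): rname, startpos, endpos, numreads,
# covbases, coverage (% of bases covered), meandepth, meanbaseq, meanmapq.
names(cov)[1] <- "rname"   # the leading '#' in the header mangles the first column name

# Chromosomes sorted by mean mapping quality: outliers like chromosome 12 stand out.
cov[order(cov$meanmapq), c("rname", "coverage", "meandepth", "meanbaseq", "meanmapq")]
```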
If I see this, it makes me think something went wrong in the PCR step: something roughly doubled the mean depth, with many more reads stacked on top of each other, and that is generally a sign of duplicates, either PCR duplicates or optical duplicates. We also see that the coverage of the mitochondrial genome is really low, which is expected, because when you do RNA extraction you don't recover a lot of mitochondrial RNA; the little mitochondrial RNA we do have maps fine, the mapping quality there is good. But on chromosome twelve something is not right, so let's see whether we can fix it by removing PCR and optical duplicates. PCR and optical duplicates can be removed with Picard, and the structure of the code is again very similar to what we saw before. I define a new output file for the deduplicated BAM; initially I wrote "picard" in the file names everywhere, but that created lines that were far too long to show nicely on stream, so I use a short suffix. Picard also produces a metrics file, which summarizes how many reads were marked as duplicates and what kind they were. So we have the executable, the input file, the output files, which are the BAM file and the metrics file, and the options, which in this case tell Picard to remove the duplicates, really remove them rather than only mark them. Picard can also just mark duplicates, which is why the tool is called MarkDuplicates, but I specify in the options that it should remove them instead (a sketch of this call follows below). That line wasn't quite correct because I changed it a bit, it should be like this; I was still working on the code this morning, so it might not be entirely up to date, because I merged the output files into a single output variable. All right, let's see whether I did the step correctly: I'm going to mark all of the duplicates and remove them from the BAM file. If I paste it in, it executes MarkDuplicates; it immediately throws a warning, but then it goes through all of the reads and starts deleting the duplicates. Again, it is a very long command, but we can use R to generate it for us and keep every line in the script nice and short. If we look at the output data, we now get an .rd file for this alignment, the same file name with .rd for removed duplicates, which starts growing as the output is written. The original BAM file was about 370 MB, and the new one will of course be smaller, because duplicates are removed. This takes some time, and it is already done because this is a very small genome; generally this step takes on the order of an hour of compute time. So we have now removed the duplicates and have a new BAM file, and every time we get a new BAM file we are going to do the same thing.
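A sketch of the duplicate-removal call. The jar path, output suffix and variable names are assumptions, and the execute() helper and bam.file come from the earlier sketches; only the MarkDuplicates arguments themselves are standard Picard.

```r
# Sketch of Picard MarkDuplicates with actual removal of the duplicate reads.
picard  <- "java -Xmx4g -jar ~/software/picard.jar MarkDuplicates"  # assumed jar location
rd.bam  <- sub("\\.bam$", ".rd.bam", bam.file)          # "rd" for removed duplicates
metrics <- sub("\\.bam$", ".rd.metrics.txt", bam.file)  # summary of what was removed

cmd <- paste(picard,
             paste0("I=", bam.file),
             paste0("O=", rd.bam),
             paste0("M=", metrics),
             "REMOVE_DUPLICATES=true")   # delete duplicates instead of only flagging them

if (!file.exists(rd.bam)) execute(cmd)
```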
So we index the BAM file, we run flagstat, and we run samtools coverage, just to make sure everything went okay and to keep track of our statistics. First the indexing, then flagstat to give us the statistics on the different flags, and then the coverage. And now we can see that this actually addressed the problem we saw on chromosome twelve: the mean mapping quality there is still lower than on the other chromosomes, but it almost doubled compared to last time, when it was 84.1. By removing the PCR duplicates we learn that they really were PCR duplicates: the metrics file says that 1.3 million reads were marked as duplicates, but no optical duplicate clusters were found, so there were no optical duplicates from the sequencing machine, only PCR duplicates in the data. After removing them, we see the mapping quality on chromosome twelve almost double, and the depth is now more in line with the other chromosomes. It is still on the high side, and the quality is still not perfect, but it is a lot better and much less of an outlier. So there is still something a bit odd about chromosome twelve, but this fixed part of the problem: there were PCR duplicates coming from chromosome twelve. When this experiment was done, the amplification primers, the random hexamer primers and the poly(A) primers you use in RNA-seq library prep, were probably somehow biased toward amplifying genes located on chromosome twelve. At least it looks a little better now. All right, the next step after removing duplicates is preparing for the GATK. For the GATK we need to add what are called read groups: information in the BAM file about, for example, which sequencing library and which machine was used. We downloaded Illumina data and we are talking about a single run, so the platform (PL) is Illumina and the platform unit (PU) is the run. Then there is LB, which is the name of the library, and that is just based on the input base, the SRR accession. This matters, for example, when the same sample was run on multiple sequencers or on multiple flow cells of the same sequencer, because if you add the same sample information to each BAM file, tools later on can combine those BAM files. Imagine I have one library that I split over two different sequencers: by adding the read group information I can tell the tools which sample each BAM belongs to. That is the SM tag, the sample name, which here is also just the SRR name, the name of the file in the short read archive. Besides that, I have to tell Picard that I want to add or replace read groups; in this case I am adding them, because there was no read group information yet. I'm going to use the deduplicated Picard BAM as input, write a new output file called rg.bam, and provide the read group options.
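Here is a sketch of that read-group step with Picard AddOrReplaceReadGroups. The tag values mirror the ones described above, but the exact strings, the jar path and the rd.bam/rg.bam names are assumptions carried over from the previous sketches.

```r
# Sketch of adding read groups to the deduplicated BAM.
rg.bam <- sub("\\.bam$", ".rg.bam", rd.bam)

rg.options <- paste("RGID=1",                     # read group ID (a single run)
                    "RGPL=ILLUMINA",              # platform
                    paste0("RGPU=", input.base),  # platform unit, here just the run
                    paste0("RGLB=", input.base),  # library name
                    paste0("RGSM=", input.base))  # sample name (the SRR accession)

cmd <- paste("java -Xmx4g -jar ~/software/picard.jar AddOrReplaceReadGroups",
             paste0("I=", rd.bam),
             paste0("O=", rg.bam),
             rg.options)

if (!file.exists(rg.bam)) execute(cmd)
```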
Again, this is very similar to what we did before: the command is AddOrReplaceReadGroups, this is the BAM file that goes in, this is the BAM file that goes out, and these are the options, the read group fields I want to add to the file. Let's execute that; I'll copy the execute step here as well, plus the samtools index step to make a new index for this BAM file. Let's run through it. It will just add this information: the ID is 1, PL is Illumina, LB is the library name and SM is the sample name. It adds this to the BAM file so that, if you had multiple BAM files from the same individual, you could give them the same sample information, and later on, when you do SNP calling for example, the tools recognize that these BAM files contain the same sample and look at the reads from both files. In our case every file is a single sample, so it doesn't really matter here, but we still need to add the information and tell the downstream tools that this data was made on an Illumina machine. Good, so we made a new BAM file again. If we look at the files we have produced so far, we keep getting more and more: the original BAM, the one with duplicates removed, the one with duplicates removed and read groups added, and the .bai index files. The read group BAM is slightly bigger because we added information to it, and you can see that we are already using almost one gigabyte of disk space for a single sample. All right, the last step: the GATK. The GATK lets us update the quality scores based on known SNPs, and that is why we downloaded the VCF file: it contains known SNPs in yeast. These known SNPs should not be counted as mismatches against the reference when they show up in our BAM file; they should be treated as expected matches, because we know that position is variable in yeast. So when we find such a SNP in our sample, we don't want it penalized as a mismatch to the genome. This is done by the GATK BaseRecalibrator, and it needs a little preparation. I'm going to run the GATK with java -jar and give it four gigabytes of memory, I always use the same reference, the fasta.gz reference, and I give it the reference SNPs as known sites, sites that are known to be variable in a certain population. For mouse you can use known mouse SNPs and for human you can use known human SNPs, but we are using baker's yeast, Saccharomyces cerevisiae, not C. elegans. This consists of four different steps. The first step is to run the BaseRecalibrator, which writes a covariates table, the calibration information it learned from the data. The second step is to apply this information to the BAM file, producing a new, recalibrated BAM. Then we run the BaseRecalibrator again on that new BAM, creating a second covariates table: we say, this is our recalibrated BAM, recalibrate it again to see how much has changed.
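A sketch of the first two recalibration passes, assuming a GATK4-style jar and argument names; the jar path, reference and VCF locations, and the output names are placeholders rather than the actual script, and the stream may be using a different GATK version with slightly different flags.

```r
# Sketch only: GATK4-style BaseRecalibrator followed by ApplyBQSR.
gatk <- "java -Xmx4g -jar ~/software/gatk/gatk-package-local.jar"  # assumed jar location
reference.fasta <- "~/genome/Saccharomyces_cerevisiae.fa"          # assumed reference path
known.snps.vcf  <- "~/genome/known_snps.vcf.gz"                    # assumed known-sites VCF (tabix-indexed)

recal.before <- paste0(output.base, "recal.before.table")
bqsr.bam     <- sub("\\.bam$", ".bqsr.bam", rg.bam)

# Pass 1: learn the recalibration model from the read-group BAM and the known SNPs.
execute(paste(gatk, "BaseRecalibrator",
              "-I", rg.bam, "-R", reference.fasta,
              "--known-sites", known.snps.vcf,
              "-O", recal.before))

# Pass 2: apply the model, writing the recalibrated BAM.
execute(paste(gatk, "ApplyBQSR",
              "-I", rg.bam, "-R", reference.fasta,
              "--bqsr-recal-file", recal.before,
              "-O", bqsr.bam))
```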
Then we can run AnalyzeCovariates and compare the first covariates table with the second one, which was made after the adjustment (sketched below). So we run the BaseRecalibrator once to get a calibration table, we use that table to adjust the BAM file, and then we run the BaseRecalibrator again to get a second table; that gives us a before and an after, and we can analyze the difference between the two. Of course we also need to index the recalibrated BAM file, and then we run flagstat and coverage one more time. All right, let's do this. The GATK part takes some time, because the BaseRecalibrator needs to go through the whole genome, look at every variant position, and recalibrate based on whether it is a known SNP or not. So we run the first calibration step; it goes through the whole file and reports how many reads it processes per minute, and in this case it goes really fast, because there are not that many reads. Then we apply the adjustment: it writes out a new BAM file, which takes a little while, though it is already on chromosome four; a minute or two to go through the whole file, since it does around four million reads per minute. Then we are almost done and we have our final BAM file: it writes out the recalibrated BAM, runs the BaseRecalibrator once more to get the after statistics, and then we can compare the before and after. You can actually see that it spends a long time on chromosome twelve; chromosome twelve is still not aligned properly, there is something weird there, and we already saw that, not from the flagstat output but from the initial coverage output. Finally it makes a plot showing the before and after situation. This is one of the default quality control plots that you generally want to produce to show how your base recalibration went: it shows the quality scores and the sequencing cycles before recalibration, and then after. We see that the quality scores were sometimes adjusted down and sometimes up, but after recalibration the reported scores follow the expected straight line much more closely, and the distribution of quality scores looks like a normal distribution, whereas before recalibration the quality scores were somewhat overinflated. So the report tells you more or less what it did. If you are interested in reading these QC reports, the GATK documentation explains very extensively how to interpret them. But I wanted to do the last couple of steps, which are again very similar to what we did before: we are going to run the indexing.
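A sketch of the before/after comparison just mentioned, again assuming GATK4-style arguments and the placeholder names from the previous sketch. This AnalyzeCovariates step is also where the ggplot2, gplots and gsalib R packages come in: GATK calls out to R to render the recalibration report as a PDF, which fails if they are missing.

```r
# Pass 3: recalibration table for the already-recalibrated BAM ("after").
recal.after <- paste0(output.base, "recal.after.table")
execute(paste(gatk, "BaseRecalibrator",
              "-I", bqsr.bam, "-R", reference.fasta,
              "--known-sites", known.snps.vcf,
              "-O", recal.after))

# Pass 4: compare the two tables and write the before/after QC plots.
execute(paste(gatk, "AnalyzeCovariates",
              "-before", recal.before,
              "-after",  recal.after,
              "-plots",  paste0(output.base, "recal_plots.pdf")))
```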
After indexing we look at the flagstat output and the coverage, and then we can drag the file into the IGV to look at how it went. What we expect now is that the mean base quality is similar across the whole genome. It used to be around 37; the scores are a bit lower now, but there is less variance across the genome. We still see that chromosome twelve looks a little odd, still on the high side, but let's look at the reads themselves, because that is what we really want to do: from the reads we can see much more. So we have our final files: the BAM with duplicates removed, read groups added and recalibrated, plus its .bai index. We quit R, go down into the software folder and into the IGV folder, start the IGV, and look at our files. Let me make sure the window doesn't overlap with my webcam. It has already loaded the genome, because of course I used it before, but let me show you how you would normally do this. Normally you would go to Genomes, Load Genome from File, take the Saccharomyces cerevisiae primary assembly and load it. If we zoom in, it shows us the sequence. Then we say File, Load from File, and give it the annotation, the GTF file, and now it shows the genes at the bottom of the screen: we can see the different genes located on this chromosome, including the intron and exon structure. The fat parts are the exons, the narrow parts are the introns, and it also shows the three prime and five prime UTRs. The arrows denote the direction, so this is a gene on the negative strand; we can click on it and it tells us the name of the gene and that it is protein coding. Now let's take our latest aligned BAM file and drag it in, and here we see our aligned reads. We would expect reads to sit on top of exons, not on top of introns, and what we see here looks pretty okay for some genes. If we go to chromosome twelve, the weird chromosome, what do we see? We can see our reads, and for example this gene has high expression at its end and low expression at its beginning, but we also sometimes see reads that do not match the intron-exon structure of the known messenger RNA. We can browse through this file a bit and look at some of the genes, but what we want to do in the end is count how many reads fall within each of these gene regions. I always like looking at the mitochondrial genome, because there things are usually very clear: you can see that this gene is almost not expressed, there are almost no reads mapping to it, while here we see a gene with a lot of reads mapping, which might be a duplicate.
The mitochondrial genome doesn't look that impressive either: COX3 is expressed, but overall the mitochondria don't have a lot of reads. Let's look at a different chromosome and see if we can find a nice example of a highly expressed gene. We can always switch to a log scale to see more clearly where the expressed genes are. This one is expressed, here is another one that is expressed, and in between we see an intergenic region with almost no expression, just a couple of reads aligning there. Here is a tRNA that doesn't seem to be expressed, PRE1 seems to be highly expressed because there are a lot of reads on top of the gene, and PRP22 is expressed at a low level because there are not many reads there, although there are a couple of duplicates. So it doesn't look too bad. Sometimes we see SNPs: here is one that is supported by only three reads, which is not very reliable. Let me see if we can find a nice SNP somewhere in the genome, although there shouldn't really be any, because this data set comes from the reference strain. Yeah, it doesn't look too bad. Sometimes it is a little odd, for example reads mapped to an intergenic region; that could be due to read quality, but these are all reads with relatively high mapping quality. And this position actually might be a real variant, because we see two reads supporting it, so this might be a region where the individual was heterozygous. Good, that was everything I wanted to show you for today, because we have been streaming for two hours again. The script is on GitHub as a gist, so you can get it from there, and we went through the whole thing step by step. In the end we have six steps: we do read trimming; we do the alignment; after the alignment we run samtools to get some basic statistics; we do duplicate removal with the Picard tool, which removes optical duplicates as well as PCR duplicates; and every time we create a new BAM file we run samtools index, flagstat and coverage to keep an eye on the statistics. Step four is adding read groups, so that we can do step five, the GATK recalibrator, and then we do the same thing one more time: index the new BAM file, run flagstat and then coverage. I will probably still update the script; it is just one way of doing it, using R and the tools we installed to do the mapping and, later, the expression analysis. As we saw, it is not going to be perfect, especially not on chromosome twelve, but it is a start and we can work from here. Using this setup we could ask, for example, what would happen if we changed step two, the alignment with STAR, to something like Bowtie 2 or another RNA-seq aligner. Having everything in code helps you reason really quickly about what is going wrong, what is going right, whether you should use a different aligner, and these kinds of things. So next time we are going to extract RPKM values.
So we are going to take the reads and the gene annotation of the genome, and count how many reads came from each gene. We are going to test differential expression, because we have three control samples and three samples grown on a different medium, and we are going to make our pipeline a lot more flexible. Right now the pipeline is a script with the input file hard-coded, so we are going to read the input from the command line and make the R script more modular: we have different steps, and we want to be able to run step one, step two, step three and so on, and then swap out, for example, the alignment step for a different aligner. We will also add more QC to the pipeline: FastQC at the start, so we look at the reads before trimming and see what we can learn from them, and more QC on the output, so that we really know the quality of the alignment. Good, that's it for today, and next time we will do some more fun stuff: extracting RPKM or FPKM values, depending on what we want, differential expression testing, so a test that says this gene is upregulated when we switch from one medium to the other and this gene is downregulated, and then building a more flexible pipeline. Thank you all for watching. Let me see if this works, because I added this overlay and haven't used it yet. Because I forgot to say it at the beginning: the next lecture is not next week but the week after, since we alternate one week off, so I can play video games, and one week with a lecture. Next week I might actually be streaming some RimWorld on Twitch. And since I didn't say it at the start of the stream: please like and subscribe, and once you see the next stream announcement for the week after next, ring the bell icon so you get notified. Marsha Rodriguez, thanks a lot: "I will be eagerly waiting for the next lecture." Yeah, I like doing it; it is a big time investment, and today I actually wanted to prepare a lot more and make it smoother, but unfortunately I had an open day yesterday, so I spent from nine to three talking with prospective new students for the university. That was really fun, but it meant I didn't have the time to prepare the way I wanted to. Still, I think it worked out: we did one file, we did the alignment, and we went through the pipeline step by step. All right, thank you all for watching, I will see you next time, enjoy the rest of your weekend, and that's it for now. See you not next week but the week after. Bye.