All right, YouTube tells me I'm live, so if you can hear me, say something in chat. Mohammed, hey, thank you, Mohammed, welcome, thanks for joining us. I hope everything will go okay, I hope I'm audible, and I hope the technical part will work out. It's not that different from last time, just a little bit. I added this, let me check if this works: fancy animations. I hope people can hear it; I can't hear it myself, but it should be okay. So let's switch to me so that people can see me. Good, so let's start. I hope I'm audible. Yeah, sound is fine, perfect. Okay, so today we will be continuing our RNA-seq pipeline work. The first part is going to be talking a little bit about the project and the data that we use, so that we get a better understanding of the kind of data we're looking at. Besides that, we will run the pipeline for all of the files. Since that takes around 30 minutes per file in the VirtualBox, I already did that beforehand, but there are just some minor changes we have to make to the pipeline, and I'm going to talk you through them. I also put the updated pipeline on the gist, so down below there are links where you can get the presentation, the updated pipeline code, and all of the other code that I will be producing today. I will show you that. After we've run all our pipelines, because we have six samples in the experiment, so we have to do the alignment six times, once for each sample, we are going to do the read count extraction, compute RPKM values, and then test for differential expression. In the end, I hope everyone is able to make a nice looking plot like this: a custom heat map, as well as a volcano plot. And then I've added a little bit extra this morning. I don't know if that's in the presentation that I've uploaded yet, but if it is not, I will update it.
It's the analysis of gene annotation using biomaRt, and then we want to use DAVID. DAVID is a tool that lets you do over-representation analysis on your gene expression data, so it gives you better insight into what's going on in your data. All right, so with that out of the way, this is what we're going to do today. Let's start with the project. The data set that we are using, perhaps people have read the paper already, there's a link to the paper here, but it's about the mechanism of growth regulation of yeast involving hydrogen sulfide from S-propargyl-cysteine, catalyzed by cystathionine gamma-lyase. I'm not even going to pronounce that; there's a gamma in there, so I'd butcher it anyway if I tried. But it's a data set collected at Fudan University, and since we are working on RNA-seq, it's of course an RNA-seq data set of yeast, done on an Illumina HiSeq 2500 with paired-end reads. This is really nice because it allows us to do some things which we cannot do with single-end reads. The data set is here, and I actually drew out the structural formula of this compound. In the structural formula you can see that there's this sulfur atom in the middle, and there's an acid group as well as an NH2 group, which makes this an amine. The sulfur will come back later. So just remember that what we're looking at is six biological samples: three cultures grown in control medium, and three cultures grown in 2 millimolar SPRC medium, so medium which contains this additional compound, and of course the yeast has to deal with that. This will be important when we start doing statistics and looking into the data analysis.
All right, so if you look at the overview page of the Sequence Read Archive where we got our data from, you can see the accession number, and that it's a raw sequence read, transcriptome or gene expression experiment, which is what we're looking for. It also gives you some additional information, and of course you can click on all of the links to find out more. Also read the paper; it's an interesting paper, and that's why we chose this data set, besides the fact that it's relatively small and runs quickly on the computer. All right, so the changes that I made to the pipeline are very basic. The pipeline that we had, I don't know if you remember it, let me pull it up. Just open up the old pipeline, which is called alignment.R. Let me switch to Notepad. Here we see the old pipeline. I made two changes. The first change is here, so that I can read... I hope it's readable, by the way, so let me know if it's too small. Hi Leo, welcome to the stream. Let me make it a little bit bigger. I've added a single call at the beginning, which is commandArgs(trailingOnly = TRUE). The trailingOnly = TRUE means that the first element will not be the name of the script, but the first input parameter that is supplied to the script. I store this in commandLineArgs, and then all the way down here, where we used to have our hardcoded sample name, we now take the first element of the command line arguments, so that we can specify which Sequence Read Archive run we want to do the alignment for. All right, so those are the two minor changes that I made. That allows us to save the pipeline into the VirtualBox. The way that I did it is to just open up the VirtualBox; I can show you that as well. Let's start the VirtualBox. It's still off, so it will take some time to power up. Well, okay, it will start all the way from scratch, so it has to boot into Debian. That's fine, it doesn't take too much time.
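In case you want to recreate that change yourself, the pattern looks roughly like this (a minimal sketch; the variable name and the fallback run ID are my own additions for illustration, not necessarily what the pipeline uses):

```r
# Read only the trailing arguments, so element 1 is the first
# user-supplied parameter rather than the script name.
args <- commandArgs(trailingOnly = TRUE)

# Use the first argument as the sample to align; fall back to a
# default SRA run ID (illustrative) when run without arguments.
sample_name <- if (length(args) >= 1) args[1] else "SRR13978640"

cat("Aligning sample:", sample_name, "\n")
```

You would then call the script as `Rscript pipeline.R SRR13978643`, once per sample.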
So what we can do is save the pipeline there and then just run it six times, once for each of the different input files. So, password, very easy. And then it should become bigger. There we are. Right, so the thing that I did is I made a new, let me get the files, I made a scripts folder here and put a new pipeline.R file in it. This is the exact same file that I just showed you. Again, it has the commandArgs call, and instead of having the sample name hardcoded, we now read it from the command line. So that's what I did. Then the next step, of course, is to run the pipeline, which is done by just executing it, and we execute it for each of the different samples. There are three control samples, and you can just get the accession codes from the page: if you go to the Sequence Read Archive page where everything is stored, it will tell you that the three control samples are called SRR139786 and then a number, and it's 43, 44 and 45, which are the control samples. Hey, Paulo, welcome to the stream as well. We also have the three samples that were grown in SPRC medium, so again, we do the same thing and execute the pipeline with those three samples. Every one of these runs, if you do it in the VirtualBox, takes around 25 to 30 minutes, because it first downloads the data, does the read trimming, and then the other parts: the alignment, removing the duplicates, doing the recalibration. So it's just running it six times in a row. You can start doing that, and probably by the end of the stream you will have it done. But I wanted to focus on what you do once you have your BAM files, because this produces six different BAM files. So if we go quickly back: I just ran it, right? If we go to home, and then we go to, I think we put it in data, output, then here you see the different output folders.
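Running it six times just means invoking the script with each SRA run ID in turn. A small sketch of how you could generate those calls from R (the script path is an assumption; in the shell you'd simply type the six Rscript lines yourself):

```r
# The six SRA run IDs: 40-42 were grown in SPRC medium, 43-45 in control medium.
runs <- paste0("SRR139786", 40:45)

# Build the shell command for each run; system() would execute them one by one.
cmds <- paste("Rscript ~/scripts/pipeline.R", runs)
print(cmds)
# To actually run them sequentially: for (cmd in cmds) system(cmd)
```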
For each of these, you see that we have all of the files here, and at the end we have the remove-duplicates step, the read-group step, and then the recalibration step using GATK. So we get a BAM file and a BAI index file; this one is around 390 MB. The next step, of course, is to analyze this, because, hey, you have now done all of the work: you did all of the alignments, all of the trimming, you looked at the quality metrics and you saw that everything went well. So what is the next thing you want to do? In my opinion, the next thing is to get the data back into the operating system you're working on, because I don't want to continuously work in the VirtualBox; the VirtualBox is relatively slow. And generally you would do the same thing in practice: if I were not using a VirtualBox but running on a cluster, I would likewise move the data files from the cluster back to my local work machine, so that I can work with them efficiently. All right, so the next step is to copy them out. For VirtualBox I wanted to show you how to do this, because it's non-trivial: you have to add a shared folder. You go into VirtualBox and choose Devices, Shared Folders, Shared Folder Settings. I can show you how that works. So I'm inside the VirtualBox, and I go here to Devices, Shared Folders, Shared Folder Settings. This opens up a little window, and you can see that I don't have any shared folders yet. You click the little plus icon and choose where you want to mount it from: I'm going to select a shared folder on my C drive, an empty folder I just made. I select it, say auto-mount, and then you need to give the mount point, which is the name or folder name under which it will show up in Linux.
I think I used something like tilde slash shared, and then I say okay. Right, so now when I press okay, it will mount it, and if I look in my file explorer, I now have the shared folder here. If I click on it, it tells me that I don't have any rights, and that is because you can only access the shared folder as the super user. We can look into this folder and see that it's empty, but if we want to copy stuff into it, we have to use the super user, so the sudo command. How do we do that? Well, I just say sudo cp -r, everything which is in the data output folder, over to the /media/sf_shared folder. I gave it a slightly different name when I did this, by the way. And not only are we going to copy the data output, we're also going to copy the genome, so I'm copying both the whole output folder and the genome folder into the shared folder. So then when I look into the shared folder, after you did the alignment, the output folder will contain the six alignment files, and next to the output folder there should be a genome folder. Let me quickly show you that. I'll shut down the machine; yeah, just power it off, because we're not going to need it anymore. All right, so when I look into my shared folder, I'm going to go to C, then to shared, and you see the output. This is just a copy, and of course there are some other files which we will create very soon, so I'm just going to delete those. And then here we have our genome folder, because we need this GTF file: the GTF file contains all of the annotation information for us, where the exons start, where the exons end, and these kinds of things. We're going to use this to create a database. So: just run the script six times, once for each input sample.
Just copy them over; we're only going to use this BAM file here, the BAM file from the latest point in our pipeline. Of course, we could do this at any point in the pipeline: we could also take just the aligned reads that haven't had duplicates removed, or haven't had read groups added, or haven't been recalibrated. All right, good. So just copy them over and then close the VirtualBox. Once the VirtualBox is closed, we're not going to use it again, because it is relatively slow. For the rest of the lecture, or the rest of the livestream, I'm going to assume that you've been able to align the files using the pipeline. "Happy to follow the video, thank you a lot." Yeah, no worries, no worries. I hope that at the end everyone is able to make a really nice plot of their data, one that shows the volcano, so the fold changes and the p-values. All right, so once you have these six files, we can start. Since we're now working locally, I had to install all of the libraries that I need into R, because my local R didn't have them. I know we did all of these installations in the VirtualBox already, but I decided to take the analysis out of the VirtualBox because it's just easier and more efficient. And generally I do the same thing when I'm working on a cluster or a remote host: I use the cluster to do all of the heavy computation, and once most of the computations, the alignments, are done, I transfer the BAM files locally, and from the local files I make my plots and everything else. So we're just going to install the libraries needed. I already did this, but if you need to install them, you only have to do it once. We use BiocManager, so Bioconductor, to install five packages: Rsamtools, biomaRt, preprocessCore, GenomicAlignments and GenomicFeatures.
Besides that, we need two additional packages which we can get from CRAN, so we can just use the install.packages command for those. That's vioplot, for making really nice violin plots, and RColorBrewer, to have some fancy colorings so that we can do some fancy colors and stuff. All right, after that it's of course time to create a new script. The script that we will be creating is the expression script, that's what I called it. You can find the whole script online, down in the comments. Every script that we create has these comment lines at the beginning telling you what the script is for; this one says it computes expression data from aligned RNA-seq reads, and I add a copyright statement, just to be sure. Then we load all of the libraries that we just installed: GenomicAlignments, GenomicFeatures, Rsamtools, preprocessCore, vioplot, RColorBrewer and biomaRt, so that we have all of their functions available. So let's do that. I'm going to open up a new R window for you and load these different libraries. Let me make it a little bit smaller, and then I'm just going to load all of the libraries, and that's it. So now I have, for example, the biomaRt library available: if I type biomaRt, you can see that I can use tab completion to find which functions have been loaded, so I know for sure the package is there. Sometimes when you load packages they print some text on the screen, but in this case they do not, because I loaded everything silently, so it doesn't show all of the startup output. After loading the libraries, the next step is of course to do something with them. So the first part, let me show you the presentation again. The next step is to go into the shared folder that we made.
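For reference, the installation step could look like the sketch below (guarded so it only installs in an interactive session; the package names are the ones listed above, the guard is my own choice):

```r
# Bioconductor packages used in this stream.
bioc_pkgs <- c("Rsamtools", "biomaRt", "preprocessCore",
               "GenomicAlignments", "GenomicFeatures")

# CRAN packages for plotting.
cran_pkgs <- c("vioplot", "RColorBrewer")

# Only attempt installation interactively; this is a one-time setup step.
if (interactive()) {
  if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
  BiocManager::install(bioc_pkgs)
  install.packages(cran_pkgs)
}
```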
We issue the setwd command to set the working directory; I put it on my C drive, and of course the path will be different for you, but I'm just going to go into the shared folder. Then the first thing we're going to do is use this function to create a database, because we need to know where every exon starts, which exons belong to which gene, how long they are, and these kinds of things. We can do this with makeTxDbFromGFF, which takes a GFF or a GTF file. Since we have been using the GTF file, I'm using that: the GTF file is located in the genome folder and is called Saccharomyces_cerevisiae.R64-1-1.108.gtf. So this is just the path to the GTF file; I specify that the format of this file is GTF, and then you can add the organism name, which is Saccharomyces cerevisiae, and the data source. Those last two are not required, so you can also leave them empty, but it's better to fill them in, because then the database knows which organism we're working on and also which release we have been using. "Hello, it's good that I can attend at last." Oh yeah, welcome. The nice thing about the streams is that they are recorded, so if you watch them back later, there are chapters, so you can skip the parts where I'm just talking nonsense. All right, so once we have this... this will take a little while, so let me actually start that already in R. I'm going to go to R and use this code: set my working directory to the shared folder, and then make this database. This will take a little while, so I'm just going to start it now. What you see is that it starts producing some output. Oh, it's actually relatively fast, but in this case we get a warning message, and this warning message is a little bit annoying. "What are the package names? I need to install them before calling."
So, Leo, down in the description there's a link to the gist, and all of the packages are there. Actually, I don't know if I put that code there; let me see. The package names that we are going to use are, let me go back to the presentation, these packages, and I don't think I actually put them on the gist. Let me very quickly update the gist for you so that you can follow along. Where's the code? Take the code, open a Firefox window, and inside Firefox I'm going to Google a little bit: gist, and then I'm going to update mine. I hit edit, add a file called install.R, and paste the code there. Oh wait, smart quotes as well; I hate smart quotes in Windows, but it just does that. Then I say update the public gist, and I can give you the link in chat. There you go; if you click the link, you get straight to the gist. All right, back to the PowerPoint, start the presentation again. So we're loading all of the libraries, then setting the working directory, and then we issued the command to make the database. This just reads the GTF file and builds the database, and the next step is to get all of the exons. We can say: give me all of the exons from the database, grouped by gene. This means that I don't want a flat list of all the exons; I want a list of genes, where each element of the list contains that gene's exons. So if a gene has five exons and it's the first gene in the genome, the first element of the list will be a list of five exons; the second element will be the second gene with all of its exons, and so on.
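To make clear what makeTxDbFromGFF is actually reading, here's a toy parse of a couple of GTF-style exon lines in base R (fabricated coordinates for illustration; this just shows the tab-separated file format, not what the function does internally):

```r
# Two made-up exon records in the 9-column GTF layout:
# seqname, source, feature, start, end, score, strand, frame, attributes
gtf_text <- "I\tensembl\texon\t6529\t7228\t.\t-\t.\tgene_id \"GENE1\";
I\tensembl\texon\t8000\t8210\t.\t-\t.\tgene_id \"GENE2\";"

gtf <- read.delim(textConnection(gtf_text), header = FALSE,
                  col.names = c("seqname", "source", "feature", "start",
                                "end", "score", "strand", "frame", "attributes"))

# Each exon's width, using GTF's inclusive 1-based coordinates.
gtf$width <- gtf$end - gtf$start + 1
print(gtf[, c("feature", "start", "end", "width")])
```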
The next thing we need to do is calculate the length of all of the genes. We're not going to use that right away; we'll use it later, when we compute RPKM values, because for RPKM we need to know how long every gene is. And of course the length of a gene is not the distance from the start of the gene on the chromosome all the way to its end; the length of the gene is the combined length of all of the individual exons, because we're only looking at transcripts, not genomic DNA; we're doing RNA-seq. So we have to add up all of the different exons to get the length of the gene. Here we are of course ignoring the poly-A tail and any introns that might not have been spliced out yet, but this is the standard way of doing it. All right, back to R. I was talking, I think, very quickly about this warning message: it can just be ignored. There's a phase column, and in the phase column there's a value for the stop codon; sometimes the stop codon is not in the first phase but in the second or third. But that doesn't matter, it will just be ignored and the database builds perfectly fine. Right, so the next step is to get the exons; let me do that for you. Now when I look at the exons, say the first 10, we see that the first gene is called ETS1-1, and in this case it has one range, so it has only one exon, and the exon runs from here to here. This means that this gene, if I do the computation by hand, is 699 base pairs plus one, because the first base pair is part of the gene. So computing the length of a gene is generally the difference plus one base pair, because if you just subtract the coordinates, you're off by one. So there's this little piece of code which you can use to very quickly compute that for all of them.
The way that this works is that I lapply over the list of exons, which is actually not a flat list of exons but the exons for each gene. So I lapply over the exons a function of x, where I reduce x, take the width of it, and then take the sum. I'm taking the sum because of course some genes have more than one exon, and the width tells me the length of each exon. The reduce function takes the object we are dealing with, which is a GenomicRanges object, and reduces it to just a set of ranges with beginning and end values. Then I ask for the width, which does the subtraction from beginning to end, and then I sum them all up. So for each gene I get the total length, no matter how many exons there are. Hey Florian, welcome to the stream as well. All right, and then the next step is going to be moving to the shared output folder, because I want to write all of my output into the output folder, and since we've loaded everything that we need from the genome folder, we can now just go to the output. So let's go back to R and do that. I'm going to compute the gene lengths, because we still needed to do that, and this is going to take a while, because there are a few thousand genes and it needs to go through all of the exons. "Is the analysis of non-coding transcripts in a certain region a totally different topic, or easy to add in here?" It's really easy to add, because the GenomicRanges machinery can give you more than this: here we're just using the exonsBy function, which means we're looking at genes, protein-coding genes, but it is relatively easy to get other features from the GTF file, because the GTF file also contains non-coding RNA transcripts.
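The same reduce / width / sum logic can be illustrated in plain base R, without GenomicRanges (a toy gene with two exons, coordinates invented; GenomicRanges' reduce() additionally merges overlapping exons, which I sidestep here by using disjoint ones):

```r
# Exons of one hypothetical gene, 1-based inclusive coordinates.
exon_start <- c(100, 500)
exon_end   <- c(250, 799)

# width() in GenomicRanges is end - start + 1 for inclusive coordinates.
widths <- exon_end - exon_start + 1

# Gene length for RPKM purposes = sum of exon widths,
# not end minus start of the whole gene.
gene_length <- sum(widths)
cat("gene length:", gene_length, "bp\n")
```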
And we'll actually see that some of those are coded a little bit differently, because at the end of the whole analysis we'll see that things like tRNAs are also considered genes, so they are in there too, although they don't really code for proteins; they code for tRNAs, and they are listed as single-exon entries in the GTF file. All right, so now we've got our gene lengths. Of course we want to check the first couple to make sure it's all okay. Let's look at the first 10 gene lengths: indeed, we calculated that ETS1-1 was 700 base pairs, and it's true. Let's also check the third one, which should be 211: if we look at the exons, take the third one, and subtract the start position from the stop position, we get 210, but we have to remember that position 75 itself is also part of the feature, so it is indeed 211 long. It all seems to work out and match up. And this is very important: you always want to make these little checks, to recompute what the computer has just computed, just check a couple of them to make sure it actually does what you expect it to do. All right, the next step is to move to the output folder and start doing some stuff, because now we know the gene lengths, we know, for example, how many exons there are per gene, and we have our database. So the next step is to start loading in our BAM files. The first thing I do is create a little vector named samples which holds the SRA run identifiers, and to make sure that I don't mix them up, because 40, 41 and 42 are the SPRC samples and 43, 44 and 45 are the control samples, I'm going to explicitly put names on my vector: SPRC1, 2 and 3 for the three samples grown in SPRC medium, and Control1, 2 and 3 for the samples grown in control medium.
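That named vector could look like this (the exact label style, SPRC1 versus Control1, is my own choice; the run IDs are the ones from the SRA page):

```r
# SRA run IDs, labelled by condition so they can't get mixed up.
samples <- c(SPRC1    = "SRR13978640",
             SPRC2    = "SRR13978641",
             SPRC3    = "SRR13978642",
             Control1 = "SRR13978643",
             Control2 = "SRR13978644",
             Control3 = "SRR13978645")

print(samples)
```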
All right, then I check whether the BAM files for all of the samples exist, because I copied them over, and it might be that one of the alignments failed somewhere, so I want to make sure I don't try to load files that aren't there. That's why I put in this additional check with file.exists. What I do is make a new vector, initially empty, which will contain all of my BAM file names, and I iterate through my samples: for s in samples, where s is my iteration variable. What do I do? I use paste0, because if I look at my hard drive, in the output folder I have a folder called SRR13978640.aln, which is the folder where that alignment is stored; then there's a slash, the sample name again, and then the suffix aligned.sortedByCoord.removedDuplicates.readGroups.recalibrated.out.bam. So this just builds the output file name for me. Once I have the file name, I check whether this file really exists on the hard drive, and if it does, I add it to my files vector: take the files vector I had, append the file path, and store it back into files. After this, I have a list of all of the BAM files that were correctly aligned, and then I can make a BamFileList. This BamFileList is again a special function; it will not do anything yet, it just creates pointers to the BAM files, and I can specify how many reads should be read in per cycle. I also specify that I have paired-end reads, by saying asMates = TRUE; that is the parameter you have to set when you have paired-end reads. I could also read the reads in unpaired, but then it would not know that the left read and the right read belong together.
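The path construction and existence check can be sketched like this (the folder layout follows what was just described, but treat the exact file suffix as an assumption about this particular pipeline's naming):

```r
# Shortened to two samples for the sketch.
samples <- c(SPRC1 = "SRR13978640", Control1 = "SRR13978643")

# Build "SRRxxxx.aln/SRRxxxx.<long pipeline suffix>.bam" for each sample.
paths <- paste0(samples, ".aln/", samples,
                ".aligned.sortedByCoord.removedDuplicates.readGroups.recalibrated.out.bam")

# Only keep files that were actually produced, i.e. alignments that succeeded.
files <- paths[file.exists(paths)]

# On this machine none of the BAM files exist, so files stays empty;
# on the analysis machine it would hold one path per finished alignment.
print(files)
```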
It would just treat both reads as individual reads, which for our purpose might work as well, because we're just counting how many reads map to each position in the genome. But when you specify asMates, singleton reads, so pairs where the left read mapped but the right read did not, will not be read in; it will only read in properly mated pairs. And I will read 100,000 records at a time: it reads in 100,000, then the next 100,000. This is very important, because if you set your yield size too high, then depending on how many BAM files you have, you might run out of memory. This just makes sure that you process the BAM files in a way your computer can handle, and that you're not trying to load 50 GB worth of BAM files into memory all in one go, but just process 100,000 reads at a time. And of course you have to scale this number down if you have more BAM files: if you're working with something like 50 BAM files, I generally set my yield size to around 10,000, and if you're working with 1,000 BAM files, set your yield size to around 100 or 200. It doesn't really matter; it will just go a little bit slower but use less memory. And if you have a massive computer with 200 GB of RAM, then of course you can scale the yield sizes up accordingly. All right, let's go to R and do this. So: go through the file pointers. I set my working directory, because I hadn't done that yet, define my samples, build my files, and then create this list of BAM files; I'm just going to copy-paste it all in one go, so we don't waste too much time. If I now look at files, it gives me the list of BAM files that I have, all six of them.
So all of my alignments worked, but I wrote it this way so that I could already start developing the pipeline even when not all of the samples were aligned yet: I aligned one control sample first, then wrote the pipeline, and every time one of the alignments finished, I added it, because of course it takes time to align samples. So these are the files I load in, and if I look at bams, it tells me this is a BamFileList of length six; there are six names in there, the same names, but this is nothing more than a set of pointers to the BAM files. Once we start reading in the reads, it will use our yield size and treat the reads as mates, so it will only look at properly paired reads. All right, next step: summarize the overlaps. I'm going to say, summarize the overlap of the exons and the BAM files: take the exon-by-gene structure that we just created, take the BAM files, and count how many reads from the BAM files fall within each gene's exons. I set mode to union; singleEnd is FALSE, because I have paired-end data. ignore.strand is TRUE, and this depends on what kind of analysis you do: if you do a stranded analysis, with a different library prep, you might know exactly whether a read comes from the positive or the negative strand, and you might care about that, because you might want to look only at genes expressed from the negative strand, or only from the positive strand. And I set fragments to TRUE, which means that in addition to properly paired reads, it will also count singletons and reads whose mate did not map.
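To get an intuition for the counting step, here's a toy version in base R: a single gene with one exon, and a read is counted if its alignment overlaps the exon at all (this mimics the permissive spirit of the settings above; the real summarizeOverlaps logic in GenomicAlignments is considerably more involved):

```r
# One exon, inclusive coordinates (invented numbers).
exon_start <- 1000; exon_end <- 1500

# Five reads given as (start, end) alignment positions.
read_start <- c(900, 1100, 1490, 1501, 2000)
read_end   <- c(1010, 1200, 1600, 1600, 2100)

# A read overlaps the exon if it starts before the exon ends
# and ends after the exon starts.
overlaps <- read_start <= exon_end & read_end >= exon_start

count <- sum(overlaps)
cat("reads counted for this gene:", count, "\n")
```

Reads 1 to 3 touch the exon (even if only partially), read 4 starts one base too late, and read 5 is far away, so three reads are counted.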
And of course, if you just type ?summarizeOverlaps, it tells you all of the different modes you can set for reading in your BAM files, and it also gives examples of when to use which settings: if you have single-end reads, singleEnd should be TRUE; if you have paired-end, singleEnd should be FALSE, and so on. There are different modes, and depending on the mode, the read counts will vary slightly and you will get slightly different results; this really depends on your experimental setup and how your experiment was performed. The summarizeOverlaps call will just go through the BAM files, 100,000 records per step, count up all of the reads mapping into the exons, and then I get an overlap object. From the overlap object, I'm going to extract the raw read table: I just say assay(overlap), and this gives me a table with read counts. Then there's the big nuisance that my file names are really long, because they end in aligned.sortedByCoord.removedDuplicates.readGroups.recalibrated.out.bam; I'm going to get rid of that last part of the file name and keep only the SRR-plus-number part, by globally substituting that string with nothing, and I do that on the column names of the read count table. Let me show you in R how this looks. This is relatively fast, I think, because summarizeOverlaps is going through the BAM files, but the BAM files are not that big and my computer is pretty okay. So it will take some time. It starts running now, and if you had specified something like 10 million here, it would start using a massive amount of memory; but since we only do 100,000 reads at a time, it keeps the memory usage relatively constant, so it will not overflow the memory. But it will take some time; this is the crucial step.
It's now reading through all of the BAM files and summarizing the overlap with the exons in the way that I specified it should count them. And that will take a little bit of time. All right, so let's wait for that to finish. So the next step is to just write out the read counts, right? So that's going to be write.table, right? So once I've extracted the raw read count per gene, based on the exons, not based on the genomic coordinates, I'm just going to write this table into a file called readcount.raw.txt, separated by tab, and not quoted, right? Because this is the standard write.table call that I always use so that I can easily drag and drop it into Microsoft Excel, for example. All right, so it just finished. So if we go here, I can show you what this thing looks like. So this overlap, right, it's just a RangedSummarizedExperiment object. It will tell me that there are around 7,000 genes and six columns, and it tells me that there's one assay, and that assay is the counts. It will show me some of the row names and it will show me the column names. So here you can see why I want to clean them up, because this is not a proper column name, right? I'm only interested in the first part, and the rest is going to be the same for all of them. So that's why I'm going to globally substitute that out. So let's get the read counts out. So I just say, give me the assay, right? And now when I look at the read counts, it's just a read count matrix. I'm going to show you the first 10 rows, and then it looks like this, right? And now you can see that this is really annoying, because you have this massive header and then there are only read counts which are not that big. So let's substitute those away so that we have better column names. So let me do that. And now when we look at our read counts, one to ten again, it fits into a single window, right?
So now we can see that for this first gene, ETS1-1, we had 133 reads in the first sample, 133 reads in the second sample, and then we see that in the third sample we had around 400 reads all of a sudden. So you might now think, oh, in this sample this gene was much more highly expressed, but that is of course not necessarily true, because we don't know how many reads there are in total, right? So if we just do something like this, where we apply the function sum over the columns of the read counts, it will tell me the number of reads that were mapped in each whole sample, right? So the first sample had about 1.8 million reads in total, the second sample only had about 1.6 million, but you can see that the third sample actually has 2.2 million reads. So many more reads were mapped in, or obtained from, this sample. So we have to take these values and scale them, of course, because they do not represent what happens in real life: if I would just sequence more, I would get more reads, but that by itself is not interesting. I want to know what fraction of the reads was targeted, more or less, to a gene. Danny, I know it's off topic, but I sent you emails some time ago with some code I am working on. If you could take a look, I'd appreciate it. Yeah, of course. Just resend me the email so that it's on top of my email list. No worries, then I can look at that later. All right, so we have to look at the read counts, but the read counts have to be interpreted based on the total number of reads that there are. So that is going to be the next step. So I'm going to write it out as well. So let me write out the file, just so that you guys can see that I'm really saving it. And I generally save every step of the analysis, right? So that I can redo the analysis and it's traceable what happened. So read counts, save it to a file. I can actually show you guys the file.
It's not the most interesting one, but I'm just going to open it up and show you guys the file, right? So there it is. So you can see, for example, that some of the genes were completely non-expressed, right? And those didn't get any reads, or just single reads, but some genes are relatively highly expressed, like this one here, 2,000 counts, 1,000 counts, right? So this is just a single number for each sample, for each gene, listing how expressed it was. But of course, these are still raw expression levels and not expression levels scaled by the number of reads that we have. So the way that people generally deal with this is to compute RPKMs, right? So RPKM stands for reads per kilobase of exon per million reads mapped. And the way to calculate it is like this, right? So the RPKM value of a gene is 10 to the power of nine, so that's a billion, times C, and C is the total reads mapped to the gene, so that is our raw read count. And then we divide this by the total reads in the sample multiplied by the gene length in base pairs, right? And the reason this is 10 to the ninth is that it's per kilobase, which is 10 to the third, times per million reads, which is 10 to the sixth. So in total, that comes to 10 to the ninth, right? So we multiply the reads by 10 to the ninth, and then we divide by the total reads that were obtained from the sample, multiplied by the gene length in base pairs. So how do we do that in R? Well, first we need the total number of reads per sample, right? So I already showed you how you can do that: you just apply the sum over the columns of the read count matrix, and this is going to be big N. So big N is the same here, but we're of course not going to evaluate this formula for each sample individually.
We're going to do this formula row by row, right? So for each row, we're just going to apply this formula. So we loop through all of the genes and compute the RPKM, right? So I'm going to apply over the rows of the read counts a function of C, right? So each row is going to be C, which is the reads mapped to one gene. And then we're going to get our L. We already computed the gene lengths, so we're just going to take, from the gene lengths, the length of the current gene that we're looking at, right? That's why we computed the gene lengths beforehand. So we can just say: when I'm looking at the first gene, take the length of the first gene; when I'm looking at the second gene, take the length of the second gene. And then I'm just going to compute my RPKM. So I'm going to say 10 to the power of nine times C, divided by N, which is the total reads per sample, times L. And this is why R is so powerful, right? Because here we compute the RPKM for all six samples at the same time. So we're taking a whole row, and R allows us to apply these functions not on a single value, but on a whole vector, right? So C is a vector of length six, N is a vector of length six, and L is a single number, because of course the gene only has a single length. And then I'm going to update my counter; because it lives outside of the function, I need to use the double-headed arrow. And I'm just going to return these values, rounded to one digit behind the comma, right? There's no need to use five or six or ten digits, because RPKM values are not that precise anyway. So I'm just going to round them here, so that I get values like 100.5. The original number used to be in the order of 100 or 200 or 300 before it was scaled.
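The row-by-row RPKM loop can be sketched like this on invented toy counts and gene lengths; the sample and gene names are made up, but the formula is the one from the text, RPKM = 10^9 * C / (N * L), with the counter updated through the double-headed arrow.

```r
# Toy data: 2 genes, 2 samples (invented counts and lengths)
read.counts <- matrix(c(100, 200, 50, 400), nrow = 2,
                      dimnames = list(c("geneA", "geneB"), c("s1", "s2")))
gene.lengths <- c(geneA = 1000, geneB = 2000)

# Big N: total mapped reads per sample
N <- apply(read.counts, 2, sum)

# Loop over the rows (genes); C is the count vector for one gene
n <- 0
rpkm <- t(apply(read.counts, 1, function(C) {
  n <<- n + 1                    # row counter, lives outside the function
  L <- gene.lengths[n]           # length of the current gene
  round(1e9 * C / (N * L), 1)    # RPKM for all samples at once
}))
```

The division inside the function is fully vectorized: C and N are vectors of length two here (length six in the real pipeline), so each gene's RPKMs for every sample come out in one expression.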
But of course I don't keep eight digits of precision behind the comma. So that's why I'm rounding it. So this is how you compute RPKM values, right? It's just taking the gene length, multiplying it by the total number of reads in the sample, and dividing that into 10 to the power of nine times C, the number of reads mapped to the gene we're currently looking at. All right, so this takes a little while, so let's start doing it. So I'm just going to take my values. So in the file, I actually put a little comment, right? When I do these kinds of more or less crazy loops, I just write down what I'm going to do: this is the formula that I'm implementing, and these are the inputs, so that while I'm writing the code I know exactly what I'm doing, right? So I'm going to compute N, right? So big N is, as we just saw, the total number of reads for each of the samples, and then I loop through all of the genes, computing the RPKM. And this will be very, very quick because it's vectorized, so I can do this 7,000 times within more or less a second. And now when I look at my RPKM matrix, it looks like this, right? So now I can see that all of the values have been scaled by the total number of reads, right? We can see that originally we had 133 reads to this gene, and now we have an RPKM, so reads per kilobase per million mapped, of 104.6. And we also see that the sample which had 438 actually came down quite a lot, right? It became 280, and this has to do with the fact that this sample has many, many more reads than the other two samples. So it gets pulled down much more by the scaling. All right, so of course we can look at things like the correlation, right? So we can, for example, look at the correlation of the first column of the RPKM values versus the raw read counts, right?
And then I also look at the first column, and the correlation is 0.7. If I plot it, it looks like this, right? So here we see the original read count value on the y-axis and the scaled-down RPKM value on the x-axis. So of course this is based on the length of the gene and the number of reads per sample. So you can see that although there is still a reasonably good correlation, it is not exactly the same, and this is because longer genes get adjusted more than shorter genes. So that's how this works. Good. All right, so, RPKM values. Although we're not really computing RPKM, right? Because if we go back, I actually told it to count fragments as well, right? So this is not RPKM; this is actually called FPKM, fragments per kilobase of exon per million reads mapped, because I'm including fragments as well, right? For a real RPKM value, you would say fragments is false, and you would only count a read when it is fully inside the exon. But the difference is very, very minor in this case, because it just also counts these partial reads, reads which have five or six base pairs outside the exon. Good. So after computing our RPKM values, we again write them to disk. I don't think I did that yet, so let me go back to R and do that. Let's go to R and write out the table, right? So again, I can open up the table after I've computed it. So let's look at the RPKM table as well. So when we open it in Notepad, we can see that we have our raw counts and then we have our RPKM values. Of course, if you had zero before, you still have zero, right? So it's always good to check. And of course, some samples have been scaled down. All right, so the first thing that we want to do now is not so much statistics; we want to look a little bit at our data first, right? So we want to see what our data looks like, the data distribution.
So I've chosen violin plots to look at the data distribution, right? So this is the first plot that we're going to make. So I'm going to take my RPKM values and do a vioplot. And because the first three samples that I have, so 40, 41 and 42, are the SPRC samples, I'm going to make them orange, and I'm going to make the other three samples light blue. So I'm just specifying the colors so that they look a little bit nicer. So what we see here, of course, are the RPKM values for each of the genes. So you can see that some genes have RPKM values up to 10,000. On the other end, you can see that there's a large number of genes which are actually at zero, right? So the further down you get, the fewer reads there are. So some genes have no reads mapped to them, and some genes go up to 10,000 or even 15,000. So this is the way that we do the plot. So let me show you guys that in R so you see that it really works. I think I added another call. So I first call par, right, to set the margins of the plot: to make sure that all of the names on the bottom are readable, I'm going to give a little bit of additional margin at the bottom of the plot. So that is what I'm doing here, just setting the margins so that the plot looks good. And then we do the violin plot, we add the legend, and then we look at it, right? So this is the same plot that we saw before. So here we have our RPKM values, but we can still see that there's a lot of variance between the different samples. One of the things that stands out is that the last sample, so SRR 45 here, still seems, on average, not to be normalized. It seems to still have much higher expression values than the other ones, right? So for this, we want to do a normalization step. Now, we've already normalized internally, right?
So we normalized based on the gene length and the total number of reads, but we can still see that the last sample is an outlier, right? It doesn't look like the other five samples, and the third one also still seems to be a little bit different, but this last one is a little bit concerning. So I want to do a quantile normalization step, just to make sure that all of them have more or less the same data distribution. So if I would draw histograms of all of them, I want every histogram to look exactly the same, because I don't want one sample with a gene at 15,000 here, right? That would pull up the whole average, and then a gene might come out as differentially expressed just because this one sample has a very different distribution than the others. So that's the thing that we're going to do next: quantile normalization. So I'm going to quantile normalize my RPKM matrix. Again, I'm going to round everything to one digit behind the comma after normalization, I'm going to put the column names and the row names back onto the normalized matrix, write the normalized matrix out, and make a violin plot, right? Same thing as before, just make a violin plot. And now, after doing this, we can see that all of the samples are more or less normalized, right? They have the same mean, they have the same quantiles, and they have the same maximum value across all of the RNA-seq runs that we did. All right, so let me show you that in R, right? So the code is there. So we can just write the table and then make the second plot, right? So I'm just doing the quantile normalization, and now we can indeed see that our data is quantile normalized. Good, so now we can start doing some more statistics, right? Because now we know that there's no massive effect of a single sample which pulls the distribution of one of the groups in a weird direction, like we had before.
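The idea of quantile normalization can be sketched in base R like this; the stream may well use a package function for this step, and this hand-rolled version is only a simplified illustration (it uses first-occurrence ranks and ignores proper tie handling), but it shows why every column ends up with exactly the same distribution.

```r
# Simplified quantile normalization: rank each column, average the sorted
# values across columns, then hand every sample the same reference values.
quantile.normalize <- function(m) {
  ranks  <- apply(m, 2, rank, ties.method = "first")
  sorted <- apply(m, 2, sort)
  ref    <- rowMeans(sorted)          # reference distribution
  out    <- apply(ranks, 2, function(r) ref[r])
  dimnames(out) <- dimnames(m)
  out
}

# Toy matrix: 3 genes x 3 samples, invented values
m  <- matrix(c(5, 2, 3,
               4, 1, 4,
               3, 4, 6), nrow = 3)
qn <- quantile.normalize(m)
```

After this, sorting any column of qn gives the same vector, which is exactly the "every histogram looks the same" property described above.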
So now more or less all of our data is normalized; all of our data is the way that we want it. It's kind of clean now. So the next step is the log two transformation. And why am I doing a log two transformation? Well, I like doing log two transformations because then the data comes onto a more normal scale, right? And this is of course because there is a PCR step involved. It's not that there are 120,000 copies of one of these genes inside of the cell. Well, there might be 120,000 molecules, but in the end you want to get back to values which are more normally distributed. So I'm just going to take the data and do a log two. Again, I'm going to do the same thing. But the problem here is that when I take the log two of a value which is between zero and one, I get a negative value. And of course, a gene can't be negatively expressed. So this is going to be my cutoff value. I'm going to say: every RPKM value which was below one is a gene which is not expressed. So I'm just going to do a plain cutoff and say, well, I'm going to do the log two transformation, and then if the value of the log two transformed data is below zero, I'm going to set it to zero, because these genes have such low RPKM values that I'm going to say they were not expressed. Doesn't R have a package to calculate the RPKM step and the normalization? Yeah, R has packages for that, but the thing is, this is called writing your own pipeline from scratch, and that's why we're writing everything from scratch ourselves, right? So of course there are packages for RPKM values; there are even packages which will do the whole thing for you, where you just give them the FASTQ files and they do the alignment and everything. But since we're doing it from scratch, we're doing it from scratch, and we're going to go into excruciating detail on how these things work and how you can compute them yourself, right? Because if you're a bioinformatician, you might want to work on this, right?
You might think, oh, I have a much better idea of how to normalize RNA-seq data, right? So then you write code which fits into this one step of this bigger pipeline. No, don't be sorry. Don't be sorry. It's good to remind ourselves why we're doing it and why we're going step by step. All right, so the log two transformation, very easy. We just call log2 on the RPKM normalized values, we store the result, and then we make sure that all of the values which were below zero, which means the original RPKM was below one, are set to zero, right? So we're going to say those genes were not expressed. We might have had one read per 1,000 base pairs of the gene, but those are too few reads to actually care about. So we're just going to say those genes were not expressed. And then I'm going to write out our table again, just so that I can drag and drop it into Excel. All right, let's do this. Let's look at them. And then we can also play a little bit by looking at the correlations between things, right? So I'm just going to do the log two transformation, and then we see our data distribution looks like this. So what we can actually see, right, if we take the first sample of the log two transformed RPKM data and plot it on the x-axis against the first column of the RPKM normalized values, then we see a pattern like this, right? So this is just a basic log curve, right? So we see that when you had a very high value to start with, you ended up with a very high value in your log two transformed data, right? And we also see that there's a whole bunch of values which have been set to zero. So if your original RPKM was very low, you get a zero there. If we look at the raw RPKM values before normalization, we still see that this pattern is more or less exactly the same. So we only do minimal transformations on our data.
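The log two step with the "not expressed" floor can be sketched like this; the RPKM values here are invented, but the logic is the one described above: anything whose log two comes out negative (original RPKM below one) is clamped to zero.

```r
# Toy normalized RPKM matrix (invented values)
rpkm.norm <- matrix(c(0.5, 8, 1, 1024), nrow = 2,
                    dimnames = list(c("geneA", "geneB"), c("s1", "s2")))

# log2 transform, then floor: RPKM < 1 gives a negative log2, and a gene
# can't be negatively expressed, so those values are set to zero
rpkm.log2 <- log2(rpkm.norm)
rpkm.log2[rpkm.log2 < 0] <- 0
rpkm.log2
```

This floor also quietly takes care of true zeros: log2(0) is -Inf, which the comparison catches and replaces with zero as well.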
The biggest transformation that we did was computing the RPKM values. So if we look at the raw read counts, right, now we start seeing that we did make a difference. So these are the raw read values: this gene had 15,000 reads and it ended up with a log two RPKM value of around 13; this gene here had around seven and a half thousand reads and it ended up with a value of around nine, right? So that's what we do here. We take our raw read counts, we do the RPKM computation, and then we take two more steps, the quantile normalization and the log two transformation, but those steps don't really do that much. They don't change the data distribution fundamentally. The only step which really changes the data distribution is the RPKM computation, because that scales relative to the total number of reads and the length of the gene. Besides that, we're just making our distributions better behaved for statistics, but we're not changing the order in which the values occur, right? So the highest value in the RPKM normalized data is still the highest value in our log two transformed data, but our data now looks like a normal distribution, right? If we look at the violin plot again, let me get the violin plot up, right, now we can see that our data is an almost perfect normal distribution, except at the bottom, of course, because of the point here: those were all the non-expressed genes, or the genes which were very close to the noise level, so that part is cut off. We can visualize that a little bit differently as well: if we take the histogram of the first sample, you can see that it's almost a beautiful normal distribution, except for this spike, and these were the genes that were not expressed.
So in total there were around a thousand genes, out of the 7,000 genes that are in S. cerevisiae, which were not expressed, and all of the other genes are expressed, so you have the very low expressors, the medium expressed genes, and the very highly expressed genes. All right, next step, right? So the next step that we want to take is of course statistics, right? So I'm going to compute p-values by t-test, because I'm going to assume that the values that we have, so the three values of the one group and the three values of the other group, follow a normal distribution. Of course we cannot be sure, right? We only have three replicates in the one group and three replicates in the other group, so a t-test is probably going to be statistically hard to defend, although a lot of people do t-tests when they have three versus three, so that's what we're going to do. So the way that I'm going to do it, again, is using one of these apply functions: I'm going to apply over the rows of the log two transformed RPKM data a function of x, so x will be the row. And what I'm going to do is try a t-test between the first three values versus the second three, right? So I'm t-testing the treated samples versus the control samples, and then I'm going to take the p-value, and that is the thing that I'm going to return. Of course, sometimes I'm unable to do a t-test, right? Some of the genes are not expressed at all, and three zeros versus three zeros will lead to an error when we do a t-test, because the data is all the same, so there's no variance, so you can't do the statistical test. In that case I'm just going to return an NA value. So I'm just going to go through the matrix, do the t-test if it's possible, and if the t-test is not possible, I'm not going to worry about it; I'm just going to return an NA. And since we want to do a volcano plot in the end, I'm also directly going to use the exact same strategy for the fold changes, right?
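The per-gene t-test with the NA fallback can be sketched like this on simulated data; the matrix here is random toy data (with one all-zero "not expressed" gene to trigger the error path), not the real expression values.

```r
# Toy expression matrix: 10 genes x 6 samples (3 treated, 3 control)
set.seed(1)
expr <- matrix(rnorm(60, mean = 5), nrow = 10)
expr[1, ] <- 0   # a "not expressed" gene: all zeros, no variance

# t-test treated (columns 1:3) vs control (columns 4:6) per row;
# if the test errors out (constant data), return NA instead of crashing
p.values <- apply(expr, 1, function(x) {
  tryCatch(t.test(x[1:3], x[4:6])$p.value,
           error = function(e) NA)
})
```

The tryCatch is what keeps the loop from dying halfway through the matrix: t.test refuses essentially constant data, and the error handler turns that refusal into an NA.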
So for the fold changes I'm going to apply over the rows of the log two transformed RPKM data a function of x, and then I'm going to compute the mean of the treated samples divided by the mean of the control samples, and then take the log two of that ratio, right? Log two of the ratio means that a fold change of plus one is a two-fold increase and a fold change of minus one is a two-fold decrease, and this is all relative to the control samples, because the control samples come second, right? I'm dividing by the control mean, so my values are relative to the control. Of course, if it's not possible to compute the means or to do the log two, then I'm going to return an NA again, just so that it doesn't error halfway through. All right, let's go to R. Let's do the computations of the p-values and the fold changes, and then we can look at what they look like, right? So when I look at my p-values, I can just plot them, right? Then I see something like this. So there's a whole bunch of genes which have a p-value of one, and then we see these little lines, and there seem to be, at the end, some genes which all got the same p-value. A little bit strange, but this is more or less the data distribution that we expect. I also want a histogram just to show that there's no real bias. Well, there's a little bit of a bias towards slightly lower p-values, and there are rather many p-values at around 0.5, but this might be because every row might not exactly satisfy the t-test assumptions, right? So of course, generally you don't want to look at three versus three; generally you want to look at something like 20 cases versus 20 controls to really have a very clear separation of what's going on, and then you can also evaluate which kind of statistics you want. So I can of course make a histogram of the fold changes as well, right?
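The fold-change step as described can be sketched like this; the toy matrix is invented (two genes built so one doubles and one halves), and it mirrors the log two of the ratio of group means, treated over control, with the same NA fallback.

```r
# Toy matrix: 2 genes x 6 samples, 3 treated then 3 control (invented values)
expr <- matrix(c(8, 8, 8, 4, 4, 4,
                 3, 3, 3, 6, 6, 6), nrow = 2, byrow = TRUE,
               dimnames = list(c("up", "down"),
                               c("t1", "t2", "t3", "c1", "c2", "c3")))

# log2( mean(treated) / mean(control) ): +1 = two-fold up, -1 = two-fold down
fold.changes <- apply(expr, 1, function(x) {
  tryCatch(log2(mean(x[1:3]) / mean(x[4:6])),
           error = function(e) NA)
})
fold.changes  # up = +1, down = -1
```

Dividing by the control mean is what makes the signs read relative to the control medium, exactly as in the text.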
So this will tell me that, well, most of the genes do not change that much, right? There are only some genes with a negative fold change, so genes which are down-regulated relative to the control, and some genes which are up-regulated relative to the control, but we can see that most genes do not change their expression much, which is what we expect, right? We just have the same organism in a slightly different medium, so we don't expect hundreds or thousands of genes to be differentially expressed. We expect only a hundred, or perhaps 50, genes to be reacting to the single molecule that we added to the medium. All right, so let's go back, right? We've computed our statistics, and now we can make our beautiful volcano plot. For the volcano plot, I'm going to use the p-values for the colors, right? Because I'm going to say, well, I have three levels of significance. So first I'm going to make a vector which contains the word black for every gene, right? Because I want to have black dots in my volcano plot by default. But then I'm going to say: if the p-value is below 0.05, make it red; if it is below 0.01, make it gold; and the really, really significantly differentially expressed genes, below one times 10 to the minus three, so 0.001, make those blue. And then I'm just going to say: make a plot with the fold change on the x-axis, on the y-axis the minus log 10 of the p-value, and the colors are the colors that I just created. I'm going to use a filled symbol, so pch is 18, which gives me a filled dot. I'm going to add a main title saying this is a "vulcano plot", which is spelled wrong, because volcano is with an o. And then the x-label is going to be fold change, and the y-label is going to be minus log 10 p-values, because it's just going to take that from the name. And then I'm going to add a legend.
So the legend is going to be at the top left of the plot, it's going to have pch 18, so the same filled symbols, and then I'm going to add the thresholds that I selected and attach the color to each threshold, right? So how does this look? Well, let's do this in R. So let's copy-paste this in, right? So I'm going to assign my colors first. So this will just be a vector of colors. So if we look at the colors, let's look at the first 100. Then we can see that among the first 100 genes, there's only one here, gene number 82, which was gold. So that means that it is significant, or at least it's likely that this gene is changed. Of course, we want to look both at the fold change as well as the p-value, right? The p-value just tells us that there is a difference, but we also want to make sure that the difference is big enough to be biologically meaningful. All right, so once we have all of the colors, we can do our nice volcano plot. So the volcano plot will look like this. Here we see that there's one gene which is very, very differentially expressed, right? So there's a big difference: it's strongly down-regulated in the one medium compared to the control medium, but it's not significantly different. Then here we see the group between 0.05 and 0.01, this is between 0.01 and 0.001, and these are the couple of genes which are very, very significantly different, but you can see that for some of them the difference is relatively small, right? So of course, when we are going to look at differential expression, we want to take into account that they are significant. We definitely want them to be significant, but we also want to, ah, right, sorry, sorry. Very good, thanks for notifying me; I was looking at my own screen. So here we see the fold change on the x-axis, and on the y-axis the p-value. Well, the minus log 10 of the p-value.
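The color assignment for the volcano plot can be sketched like this on a handful of invented p-values; note that the order of the assignments matters, with the strictest threshold applied last so it overrides the looser ones.

```r
# Toy p-values, one per significance band (invented)
p.values <- c(0.8, 0.03, 0.005, 0.0002)

# Start with black for every gene, then recolor by increasing significance
cols <- rep("black", length(p.values))
cols[p.values < 0.05]  <- "red"
cols[p.values < 0.01]  <- "gold"
cols[p.values < 0.001] <- "blue"
cols  # "black" "red" "gold" "blue"
```

A gene at p = 0.0002 passes all three conditions, but because the blue assignment runs last, it ends up blue, which is exactly the three-level scheme described above.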
And what we see here is that there's a bunch of genes which are not significantly different. We see in red the values which are between 0.05 and 0.01, the gold values are 0.01 to 0.001, and the blue values are highly significant. But what we can see here at the top is that there are genes which are significantly different, but some of them are not that different in magnitude, right? They are very significantly different, but they're not massively changed: they don't go from an expression level of nine all the way to four; they go from an expression level of nine to eight and a half, for example. So the fold change is not that big. And we also see this one here which is very, very different, but not significant. And that, of course, has to do with the variance, right? If you have three very different values in the one group and three very different values in the other. So these volcano plots will generally look a lot nicer when you have more samples. But this is what we have to deal with. So we're now going to drill down, more or less, into the genes which are different. And of course we want to focus on the genes which are very different, right? So these five, six genes here, they are the really interesting ones, because they're significantly different and they're strongly up-regulated in the SPRC medium relative to the control medium. The same thing holds for these gold ones here, which are a bit hard to see, but these are significantly down-regulated in the SPRC medium relative to the control medium. Good, all right. So, volcano plot, I put in a picture; this is how it looks. All right, so now we want to create a subset, right? So we're going to set two thresholds, right? I'm going to take the genes for which the p-value is significant and the fold change is smaller than minus 0.3, and I'm going to call this down, right?
So I'm taking the matrix of my log two transformed RPKM data and I'm only keeping the rows for which the p-values were significant and the fold changes were below minus 0.3. And I'm going to say this is down-regulated, right? So the down-regulated ones are here, and I'm going to do the same thing for the up-regulated ones. So I'm just going to flip the sign, right? I'm going to say the fold change has to be larger than 0.3. So I'm going to ignore the genes which are significantly different but don't have a large fold change, because we assume that the genes that react the most, right, which change their expression by a factor of two or a factor of four, are much more interesting than genes which don't really change their expression that much. Of course, this is an assumption, and the assumption might not be true, but generally these are the genes that we look into first. And then I'm doing another thing, right? I'm also computing an ordering. So I'm taking my down-regulated genes, and what I'm doing is creating a distance matrix and then running hclust. So I'm using hierarchical clustering so that instead of having the genes listed in the order in which they came in the GTF file, I now have them ordered in a way that genes showing a similar pattern are together, right? And this is what hclust can do for us. It does a hierarchical clustering: it takes the distances between all of the down-regulated genes and then it clusters them. And I'm going to use this ordering when I'm plotting. And this creates this really nice-looking heat map, because otherwise, yeah, because you have different groups of genes: you have groups which are down-regulated two-fold, down-regulated four-fold, down-regulated 0.5-fold. And by grouping them together, so by grouping them by how they behave, it just looks a lot better in the end, right?
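The dist-plus-hclust reordering can be sketched like this; the three toy "genes" are invented, with two of them given a near-identical pattern so the clustering pulls them next to each other in the plotting order.

```r
# Toy down-regulated matrix: rows a and c have nearly the same pattern,
# row b is very different (all values invented)
down <- rbind(a = c(1.0, 1.0, 1.0, 5.0, 5.0, 5.0),
              b = c(9.0, 9.0, 9.0, 2.0, 2.0, 2.0),
              c = c(1.2, 1.0, 1.1, 5.0, 4.9, 5.2))

# Distance matrix between the rows, then hierarchical clustering
hc <- hclust(dist(down))

# Reorder the rows by the clustering order instead of the original order
down.clustered <- down[hc$order, ]
rownames(down.clustered)  # "a" and "c" end up adjacent
```

hc$order is just a permutation of the row indices, so indexing with it is all it takes to get the similar-pattern-together ordering used for the heat map.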
So instead of taking the original ordering, which is the order in which they sit in the genome, I take the ordering in which they are more or less similarly down-regulated. And I do the same thing for the up-regulated genes: I use the ordering based on clustering, so genes which behave similarly come together in the order, and then genes which are slightly different from those follow. Then I just take the gene IDs: the row names of the down cluster and the row names of the up cluster. That's the ordering in which I'm going to plot them. These are the gene IDs of the genes which are significantly different and have an absolute log2 fold change larger than 0.3; in total that's 53 genes. All right, so now we're going to create this really nice-looking heat map, and here we can see the groups: these are down-regulated and these are up-regulated. They go from red to yellow, or from red to orange, or from yellow all the way to green; the colour here is how highly expressed they are. By ordering them slightly differently, we see that there are roughly three groups in the down-regulated genes and two or three groups in the up-regulated genes, so we can start seeing these patterns of genes which act alike. All of these genes here seem to respond in a very similar way, being medium or relatively highly expressed in the one condition and slightly higher expressed in the other. So how do we make this?
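The subsetting and clustering steps above can be sketched roughly like this in R, using toy data; the variable names `rpkm_log2`, `pvals`, `fold_change`, `dclust` and `uclust` are assumptions about the pipeline script, not necessarily the real ones:

```r
set.seed(1)
# Toy stand-ins for the real pipeline objects (names assumed).
rpkm_log2 <- matrix(rnorm(60, mean = 6), nrow = 10,
                    dimnames = list(paste0("gene", 1:10),
                                    c(paste0("SPRC", 1:3), paste0("ctrl", 1:3))))
pvals       <- c(0.01, 0.02, 0.20, 0.01, 0.03, 0.50, 0.01, 0.04, 0.60, 0.02)
fold_change <- c(-1.0, -0.6, -0.2, 0.80, 1.20, 0.10, -0.9, 0.70, 0.30, 0.05)

# Keep only rows that are significant AND have a large log2 fold change.
down <- rpkm_log2[pvals < 0.05 & fold_change < -0.3, , drop = FALSE]
up   <- rpkm_log2[pvals < 0.05 & fold_change >  0.3, , drop = FALSE]

# Hierarchical clustering: reorder rows so similarly behaving genes sit together.
dclust <- down[hclust(dist(down))$order, , drop = FALSE]
uclust <- up[hclust(dist(up))$order, , drop = FALSE]

# The plotting order: down-regulated cluster first, then up-regulated.
gene_ids <- c(rownames(dclust), rownames(uclust))
```

The key design point is that `hclust(dist(x))$order` only gives a permutation of the rows; the data itself is untouched, which is why the clustered and unclustered heat maps contain the same information.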
Well, we use Spectral colours. But first I make sure my margins are okay, because I want to be able to read the names at the bottom: a bit more margin at the bottom, a margin of six on the side so that I can also read the gene names, two at the top, one at the other side, because I don't really care about the top. Then I say: make an image. The x is one to the number of columns of the log2 RPKM matrix, so that's six, because I have six samples along the bottom. On the y axis I of course have one to the number of rows of down plus the number of rows of up. And the z is an rbind of the down cluster and the up cluster, so that's the down-regulated cluster and then the up-regulated cluster, or the other way around, I think; we'll have to see, in the end we can just check the values. I don't want an x axis, I don't want a y axis, I don't want an x label or a y label, and I just want to use 11 colours from the Spectral palette. On the y axis I put the gene IDs; that's why we actually stored the gene IDs in that variable, so we can reuse them. So: at one to the length of gene IDs, put the gene IDs, with las = 2 to rotate them, and make the axis font a little smaller, because otherwise you don't have enough room to plot 53 genes. For the x axis I use a little trick, because you can see that these labels are orange and these are blue, based on the original colours that we used in the violin plot. I just plot the axis twice: at one, two and three I put the first three labels and make them orange, and at four to six I put the other three labels and make them blue.
You can't do this in one go because col.axis only accepts a single value, so if you want multiple colours on one axis in R, you have to plot it in parts; that's just a limitation of how R axes work. All right, so how does this look? Let me show you in R, and let's not forget to switch this time. So we do the plot like this, and when we squeeze it a little, the plot starts looking really, really nice. Of course we can see what is up and what is down, and we could use different colour scales as well: if I don't want Spectral but the blue-green palette, BuGn, then it runs from blue to green. Well, now it's just got the green colours, and I don't have enough colours: BuGn only has nine. But you can change it and then see what's happening: three different groups for the down-regulated genes, three different groups for the up-regulated genes, and that's why you get this nice block pattern. The block pattern comes from the step where you do the clustering, so that's important; if you forget it, it looks very different. I can actually show you how that looks: if instead of dclust we use down, and instead of uclust we use up, so without the clustering, then you don't see that nice block pattern of genes with a similar expression pattern sitting together; it's just a little bit more shaky, so to speak. It contains the same information though, it's just nicer to have them clustered. All right, let's go and do some more statistics, because we computed the fold changes and the p-values, but now we want to make an overview table.
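A minimal sketch of that image() heat map, again on toy data; I'm using base R's `hcl.colors(11, "Spectral")` as a stand-in for the RColorBrewer Spectral palette mentioned in the lecture, and the margins and label sides are my guesses at the real script:

```r
set.seed(1)
# Toy clustered blocks (assumed shapes: genes x 6 samples).
dclust <- matrix(rnorm(18, mean = 5), nrow = 3,
                 dimnames = list(paste0("dn", 1:3),
                                 c(paste0("SPRC", 1:3), paste0("ctrl", 1:3))))
uclust <- matrix(rnorm(18, mean = 8), nrow = 3,
                 dimnames = list(paste0("up", 1:3), colnames(dclust)))
mat <- rbind(dclust, uclust)          # down block first, then up block

pdf(tempfile(fileext = ".pdf"))       # throwaway device so this runs headless
par(mar = c(6, 6, 2, 1))              # room at the bottom and left for labels
image(x = 1:ncol(mat), y = 1:nrow(mat), z = t(mat),
      xaxt = "n", yaxt = "n", xlab = "", ylab = "",
      col = hcl.colors(11, "Spectral"))   # stand-in for the brewer.pal call
axis(2, at = 1:nrow(mat), labels = rownames(mat), las = 2, cex.axis = 0.5)
# col.axis only takes one colour, so draw the x axis in two passes:
axis(1, at = 1:3, labels = colnames(mat)[1:3], col.axis = "orange", las = 2)
axis(1, at = 4:6, labels = colnames(mat)[4:6], col.axis = "blue",   las = 2)
dev.off()
```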
So an overview table is a single table that you can add to your publication as a supplementary file, or put in your presentation as a table within the text. The first thing I want to do is compute the means: the mean for the first group and the mean for the second group, so for the treated samples and for the controls. Besides the means I also want the standard deviations, because they're important for interpreting what's going on: if you have a big difference in means but also very big standard deviations, then of course that's not going to be very interesting, while a big difference in means with very small standard deviations might be where genes really are significantly different. Right, so I'm just going to use the same strategy as we did for the p-values: take the log2 RPKM values, go through the rows, compute the mean of the first group and the mean of the second group, and round them to one digit. And if that fails for some reason, return an NA. This shouldn't fail, but I always like wrapping a tryCatch around it just to be sure it doesn't crash halfway through. For the standard deviation, same thing: we go through each row of the RPKM values, take the first three samples and compute the standard deviation, take the last three samples and compute the standard deviation, round them, and return an NA if for any reason we can't compute it. And of course I put the column names on there: the column names of the means are SPRC and control, and the same for the standard deviations, so that I know which group is which. The first three samples are the SPRC samples and the last three are the control samples.
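The apply-with-tryCatch pattern described above looks roughly like this; the matrix, helper name and column split (1:3 = SPRC, 4:6 = control) are assumptions for illustration:

```r
# Toy log2 RPKM matrix: columns 1-3 are SPRC, 4-6 control (assumed layout).
rpkm_log2 <- rbind(geneA = c(6, 7, 8, 1, 2, 3),
                   geneB = c(2, 2, 2, 5, 5, 5))

group_stats <- function(x, f) {
  # tryCatch so one bad row returns NA instead of crashing the whole run
  tryCatch(round(c(f(x[1:3]), f(x[4:6])), 1),
           error = function(e) c(NA, NA))
}

means <- t(apply(rpkm_log2, 1, group_stats, f = mean))
sds   <- t(apply(rpkm_log2, 1, group_stats, f = sd))
colnames(means) <- colnames(sds) <- c("SPRC", "control")
```

Because the per-row function returns two values, `apply` gives a 2 x n matrix back, hence the `t()` to get one row per gene again.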
All right, let's go and do that in R. Oh, I'm touching a button which I shouldn't. Good, so we compute the means and the standard deviations, put the column names on, and it looks like this. Let's look at the first 10. The first gene in the SPRC samples had a mean of 7.2 with a standard deviation of 0.8; same for the control, a mean of 7.6 and a standard deviation of 0.8. So we can see that sometimes there's a big difference in standard deviation between the one group and the other, but that is something the t-test will deal with. And now we can also make more informed decisions about whether we should use var.equal = TRUE or switch to non-parametric statistics for some of the genes, because when the variance isn't equal between the groups, you don't want to assume equal variance in the test. Anyway, we now have the means, the standard deviations, the p-values and the fold changes, so now we can make the overview table. I'm just going to cbind, column-bind, everything together: first the control means, then the standard deviation of the controls, the SPRC means, the standard deviation of the SPRC samples, the fold change and then the p-value. So now you can see that a fold change of minus 0.4 means the gene goes from an expression of 7.6 in the control samples to 7.2 in the SPRC samples, and we have this for all of the genes we're looking at. All right, let's do that in R as well, just so you see that the code works and really does something. And this is of course a little table which we could add as supplementary material.
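The cbind step can be sketched like this; the input objects and the exact column names are illustrative assumptions, not necessarily the script's own:

```r
# Toy inputs (assumed names; same row order throughout).
means <- rbind(geneA = c(SPRC = 7.2, control = 7.6),
               geneB = c(SPRC = 5.0, control = 2.0))
sds   <- rbind(geneA = c(SPRC = 0.8, control = 0.8),
               geneB = c(SPRC = 0.3, control = 0.4))
fold_change <- c(geneA = -0.4, geneB = 3.0)
pvals       <- c(geneA = 0.40, geneB = 0.01)

# Column-bind everything into one overview table, control group first.
overview <- cbind(means[, "control"], sds[, "control"],
                  means[, "SPRC"],    sds[, "SPRC"],
                  fold_change, pvals)
colnames(overview) <- c("control", "control.sd", "SPRC", "SPRC.sd",
                        "log2.fold.change", "p.value")
```

Note that cbind lines everything up by position here, which is only safe because all the inputs were built from the same matrix in the same row order.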
Because it has all of the different genes. What you can see is that sometimes the fold change is infinite. That happens when a gene has zero expression in one of the groups: the ratio then becomes zero or infinite, and since this is not the fold change itself but the log2 of the fold change, you end up with a plus or minus infinite value. These infinite values should really be fixed, for example by setting them to zero or to something very close to zero. Good, back to the PowerPoint. Of course, if we want to look at this, we now have this nice little table, but it only gives us the Ensembl gene IDs. We want to add additional information, information that we get from a database, and for that we can use the biomaRt package to query Ensembl. So besides the Ensembl gene ID I also want to get the external gene name, the chromosome name, the start and end position of the gene, and the description of the gene, so that I can go to a biologist and say: here's the list of genes which is differentially expressed and this is what they do, and they can quickly read through the list and say, oh, this is interesting, this is a gene that I know. Biologists generally don't know the Ensembl gene IDs; they know the names of the genes, which in this case is the external gene name, the real name of the gene as you would write it down in a publication. So I'm going to load the biomaRt library, connect to Ensembl, use the Saccharomyces cerevisiae dataset, and query these attributes: the ID, the name, the chromosome, the positions and the description. And then, what do I give it?
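Here is a hedged sketch of how those non-finite log2 fold changes arise, plus two common ways of neutralising them; the pseudocount variant is my addition, not necessarily what the pipeline does:

```r
# Toy per-gene RPKM values for the two groups.
sprc <- c(4, 0, 8, 0)
ctrl <- c(2, 2, 0, 0)

fc <- log2(sprc / ctrl)   # gives 1, -Inf, Inf, NaN

# Option 1: flatten non-finite values to 0 after the fact.
fc_fixed <- ifelse(is.finite(fc), fc, 0)

# Option 2: add a small pseudocount before taking the ratio,
# so zeros never produce infinities in the first place.
fc_pseudo <- log2((sprc + 1) / (ctrl + 1))
```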
So the filters tell biomaRt what I'm going to hand it: Ensembl gene IDs. And which IDs am I going to give? Only the gene IDs I already had, the ones with a p-value below 0.05 and an absolute log2 fold change above 0.3. Those I'm going to query; I'm not going to query all 7,000 of them, because I'm a friendly neighbourhood biomaRt user. I limit myself to these 53 genes which I found to be interesting, instead of asking for the whole yeast genome, which would be a download of some megabytes. That's why I'm only querying a limited amount and not all of them. Then I'm going to set the row names of the biomaRt result object to the ensembl_gene_id column, because all of my other data, if we look in R, also uses Ensembl gene IDs as row names, and I want the same thing to happen in my biomaRt data. So I take the first column, the Ensembl gene ID column, and make it the row names. All right, let's do that. This might take a while depending on how busy biomaRt is, but I hope it will be okay and that biomaRt actually allows us to do this. So, connecting to biomaRt, doing the query, and then we get something back which looks like this: here we see a gene whose Ensembl gene ID is snR18, located on chromosome one at this position. Where's the gene name? Here's the description, but these ones don't have gene names. So actually this one, GCC02, is called CIF17, and then there's CIF7, which is also differentially expressed. But you can see that not every ID has a proper name. They do have a description though, something like "small nucleolar RNA", so that one is not a protein-coding gene.
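The biomaRt call sketched below is roughly what this step looks like. It needs the Bioconductor biomaRt package and a network connection, the dataset name `scerevisiae_gene_ensembl` is my assumption about Ensembl's current naming, and `gene_ids` is assumed to hold the 53 IDs collected earlier, so treat this as a sketch rather than copy-paste-ready code:

```r
library(biomaRt)

# Connect to Ensembl and pick the yeast dataset (dataset name assumed).
mart <- useMart("ensembl", dataset = "scerevisiae_gene_ensembl")

# Only query the ~53 genes of interest, not the whole genome.
res_biomart <- getBM(
  attributes = c("ensembl_gene_id", "external_gene_name",
                 "chromosome_name", "start_position",
                 "end_position", "description"),
  filters = "ensembl_gene_id",
  values  = gene_ids,     # assumed: the IDs collected from dclust/uclust
  mart    = mart)

# Re-key the result by Ensembl gene ID so it lines up with the other tables.
rownames(res_biomart) <- res_biomart$ensembl_gene_id
```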
So you can already see, by looking at the exons, the answer to the question Florian had before: we are getting back all features which consist of exons, not just protein-coding genes. That's why not everything has a gene name. This one, though, is a real gene, so it really encodes a protein. All right, so if I look at my biomaRt result, and not at all of it but just at the first three columns, we can see that a lot of these are actually not protein-coding genes, but there are some protein-coding genes in there with differential expression; also small nucleolar RNAs and others. Good, so next step: combine these two matrices. We now have the overview matrix, which contains all of the data, so for every gene the expression level, the fold change and the p-value; and now, for the genes of interest, we have the gene names, the chromosomes and these kinds of things, but also the description. So I want to mash these two together, and that's what I did and what you should do. The overview table we have now contains all of the genes, so that's a supplementary material file, but inside my publication I also want a little table summarising something like the top 20 or top 50 differentially expressed genes. So that's what I'm going to do now: merge them together. To get them merged the way I want, I have to do some swivelling, but I'm always going to use the gene IDs that I used to query biomaRt as the ordering, so to make sure everything stays in the same order, I always select using the Ensembl gene ID.
So from the biomaRt result I'm going to take the external gene name column, the chromosome name, the start position and the end position. Those are the first four columns of my table, the front. Then I put the whole overview table after that, so the expression values, the fold change and the p-value, and the description goes at the end, because the description I might want to cut off once I get to Excel; I might not want it in the main table in my text. I'm also going to change the column names a little, because I want gene name, chromosome, start and end, and I want Description written with a capital. And then I just write out the overview table I created. All right, let's go to R and do this, just swivelling around some of the data we have. I'll look at the overview table after making it. So I do the biomaRt selection, and I always use the gene IDs, and this is the important part: when you merge tables together, you always have to merge them on the same row names. So I make sure the row names are consistent every time, for the front, for the overview and for the biomaRt result. I'm explicitly hiding the description column here, because the descriptions are very, very long, and I'm just looking at the first 10. So this is a very nice table which you can put in your publication. Of course, you might want to re-sort it, saying no, I want chromosome one first, then chromosome two, chromosome three, but I've sorted them the way they were in the original heat map, where you see these nice blocks. And we can see here that this gene goes from an expression of seven in the control to 5.3 in the SPRC medium, which is a log2 fold change of minus 1.7, with a p-value of 0.01. Do you need some p-value correction?
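A minimal sketch of stitching the annotation in front of the overview table, always indexing by gene ID rather than by position; the gene IDs, the "YFG1" gene name and the column names below are all made up for illustration:

```r
# Toy annotation (as if returned by biomaRt) and overview table;
# names and values here are hypothetical.
ann <- data.frame(gene_name = c("YFG1", ""), chromosome = c("I", "IV"),
                  start = c(1000, 5000), end = c(2200, 6100),
                  description = c("a made-up gene", "small nucleolar RNA"),
                  row.names = c("geneA", "geneB"))
overview <- data.frame(control = c(7.6, 2.0), SPRC = c(7.2, 5.0),
                       row.names = c("geneB", "geneA"))   # different row order!

# Always select by gene ID so the rows stay aligned across tables.
gene_ids <- rownames(ann)
final <- cbind(ann[gene_ids, c("gene_name", "chromosome", "start", "end")],
               overview[gene_ids, ],
               Description = ann[gene_ids, "description"])

# Write it out tab-separated so Excel can open it.
write.table(final, file.path(tempdir(), "overview_table.txt"),
            sep = "\t", quote = FALSE)
```

Had we cbound by position instead of by ID, geneA's annotation would have landed on geneB's expression values, which is exactly the bug the row-name discipline prevents.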
Yes, of course, because we're testing 7,000 genes. The thing is, I'm not doing that here, because it's only three samples versus three samples. If I did a p-value correction, nothing would be significant, for the simple reason that you don't have the power: three samples in each of the two groups don't give you enough statistical power. So, since we only have three versus three, this is the best we can do: we can rank the genes and say, well, these have the biggest fold change, they seem to be different, but we can't really prove that they are. But if you wanted to do some validation, with qPCR for example, you could say: okay, I'm going to design primers for the gene that I find most interesting. Is the study valid this way? Yes, of course: you can still rank the genes based on which one is most likely to be involved. And when we start doing the gene ontology and the DAVID annotation, you will actually see that it makes sense; the annotations that come back really point you in the right direction. So it is underpowered, but even with an underpowered study you can pick out the genes which are most different from each other. It's not a perfect study; they could have used more samples, but of course RNA sequencing used to be very, very expensive, and it's still relatively expensive: an RNA-seq experiment is going to cost you a couple of hundred euros, so you might not want to do 20 versus 20. That's just one of the drawbacks. All right, but we have our overview table, and if we look at the 11th column we also have the whole description there. This is something you can put in an Excel file, so let's do that and write it out.
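For reference, this is what a multiple-testing correction would look like in base R; the p-values below are toy numbers, and the point is simply that after Benjamini-Hochberg correction fewer genes survive the 0.05 cut-off (with the real three-versus-three data, likely none would):

```r
# Toy p-values standing in for the per-gene t-test results.
pvals <- c(0.001, 0.01, 0.02, 0.04, 0.2, 0.5)

# Benjamini-Hochberg FDR correction (base R, no extra packages needed).
padj <- p.adjust(pvals, method = "BH")

sum(pvals < 0.05)   # raw:       4 genes pass
sum(padj  < 0.05)   # corrected: 3 genes pass
```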
And then we'll quickly look at it. Let me open it up in Notepad. All right, so this is how our table looks now: 53 genes with their expression values, their p-values, and the description of what each gene does. The thing we want to do now is put it into Excel and make it look a little more beautiful. So this is what I did: this is the same thing we just generated, but in Excel, where it looks a little better; that's the way you would make a supplementary file. All right, so the next step, of course, is to start looking into which pathways might be affected: these genes which are up- and down-regulated, do they all fall into the same pathway or into different pathways? Is there a common theme between these genes? I'm going to quickly put them in Excel, because I did that for the screenshot but didn't save it, and let me show you, because I actually can. So: open up Excel, copy from my Notepad++ document, and paste it in. Now the problem is that when you do this, these were the row names, so they don't have a header. So I say insert, shift the cells, and now the gene name header is on top of the gene names, the chromosome header on top of the chromosomes, and so on. I can make them bold, centre them, make it look a little better. The next step for something like this would be to use something like conditional formatting to polish it even further. So let's use that as well: take the control means and the SPRC means and say, I want conditional formatting with a colour scheme going from white to green.
And then it nicely colours your table, and you still have the standard deviations, your fold changes and so on. So we can see that the first genes are down-regulated and then we have genes which are up-regulated. Sometimes they're "infinitely" up-regulated, but of course that infinity comes from going from zero to 2.9: if a gene is not expressed in the control condition but is expressed in the SPRC medium, that actually makes it a really, really good target gene. We see some tRNAs, and we see some sulfite reductase assimilatory subunits. So these are the most interesting genes that we can come up with. All right, back to the PowerPoint. So the next step: we now have it in Excel, our table is ready for the main text or the supplementary file, and the next thing is to go to DAVID and do an over-representation analysis. So how do we do this? Well, I can show you. I go to my browser and just Google "DAVID gene". DAVID is the Database for Annotation, Visualization and Integrated Discovery, which is a quite nice name for it. I just click Start Analysis, and then I have to provide a gene list. So let's go to Excel and copy all of the genes, both up- and down-regulated together, all 53 of them, and paste them in. Wait a second, where can I upload here? I'm going to paste the list; these are all of the genes that I found. Then I say what my identifier is, which in this case is the Ensembl gene ID, and that this is my gene list, because it's not a background list: it's all of the genes that were differentially expressed. But isn't Excel notorious for changing gene names into dates? How would you avoid that?
Ah, good question, Linus. There is actually a really, really easy way to avoid that. Let me go back. What you should do is open a new blank workbook, select all of the cells, right-click, choose Format Cells, and set them all to Text. Now, let me show you: if I paste in something like "oct-19", one of these gene names which always gets converted to October 19, it stays "oct-19", because it's a text field. Of course, the numbers that we have should then be converted back to numbers. So the best way of preventing changes by Excel is to select the whole sheet, right-click, Format Cells, and format the cells as Text, because then anything you paste in is just treated as text, and dates will not be "corrected". If I take a normal Excel sheet, one where I haven't set it to text, and I put in this "oct-19", you see that it changes it to the 19th of October. So that's the big trick with Excel: make the whole worksheet text. I should have done that before; I generally forget, but this is the way to do it, and it's going to save you a lot of pain. Could you also select only the columns where you paste your gene names? Yes, but then you run the risk that it does some smart stuff somewhere else. What I generally do is change the whole thing to text, paste it in, and then format the numbers back to being numbers, because that's about the only thing left to fix. So that's the big trick; and don't save this one. In this sheet, where I think I changed everything to text, I can just paste in "oct-19" and it will not change it to a date. Just a little tip. All right, so don't save.
So this is my gene list: I just copy-pasted this whole thing, the Ensembl gene IDs that I had, and then I go to DAVID, which is here, and push them into the little list box. So 53 genes, the identifier is Ensembl gene ID, this is my gene list, and I just submit it. All right, so it recognises that the identifiers I gave it are from Saccharomyces cerevisiae. Some of the genes it doesn't recognise, but that's fine. Now I can just click here on the Functional Annotation Tool and look, for example, at the pathways. Oh, and the capture doesn't catch that, because it opens in a new window; the functional annotation clustering it also doesn't catch, so let me just drag the DAVID window onto the monitor you can see. Right, so when you open it up, it looks like this: here you have the functional annotation. So now I can ask: which KEGG pathways are affected? You can see that 10 of my genes are in KEGG pathways, and if I click on the chart, it shows me that my up- and down-regulated genes are significantly over-represented in the sulfur metabolism. Metabolic pathways is also over-represented, which makes sense: we changed the medium, so it makes sense that the metabolism of the yeast would adjust to the new medium. You also see that the porphyrin metabolism is significantly affected, and then the biosynthesis of cofactors as well, although those are not significant after correction. But you can see that the sulfur metabolism is significantly affected even after correction for multiple testing.
Then we can click on the genes, and it gives me the list of the genes that are in there. So we see that the adenylyl-sulfate kinase, the sulfate adenylyltransferase, and the reductase and reductase subunit alpha are the four genes that changed. So why did I want to show you this? Because generally, as bioinformaticians, we don't really go into what it means; for that we have a biologist. The biologist wants a certain analysis, and we do the analysis. But we can already run these kinds of analyses ourselves and see if the result makes sense. Like Florian pointed out, we don't really have the power in this experiment: it's only three versus three, and we're looking at 7,000 genes in total, so we can never prove which gene is the interesting one. But we can see that these four genes, which sit in the sulfur pathway, actually make a lot of sense, because the title of the paper talks about mechanisms of growth of yeast involving hydrogen sulfide, and hydrogen sulfide is of course a downstream product of the sulfur in the molecule that was actually put into the medium. You can see there's this big sulfur group in the middle of its structure. So if the food you're eating contains a lot more sulfur than what the other group gets, the other group just getting standard medium without a lot of sulfur or additional molecules in there, then of course it makes sense that the sulfur metabolism is affected. So we learn that these four genes, up- or down-regulated in the sulfur pathway, are the ones reacting to this molecule: they're involved in breaking it down, because the sulfurs are more or less taken out of the molecule. Then we have the porphyrin metabolism and the cofactor biosynthesis; porphyrin metabolism, if you Google around a little, has to do with iron and cofactors.
In humans, it's heavily involved in heme biosynthesis, the synthesis of the molecules which capture oxygen in red blood cells. Porphyrin is interesting here because, to break down this molecule, you need inorganic cofactors. The breakdown of this S-propargyl-cysteine inside the cell is done by big proteins, and these proteins carry copper and iron atoms in order to do their biological function: to form the active site, bind the substrate and then change it. So even though we only have three versus three, the pathways that come out make perfect sense relative to the experiment that was done. Using DAVID, we can more or less point at which genes are going to be the top candidates to look at. All right, so that was it for today; not a very long stream, we still streamed for almost two hours, but we've now run through the whole pipeline. We installed all of our software in lecture one; in lecture two we did all of the work of building this pipeline and aligning our reads; and now we did all of the downstream analysis: extracting the read counts, computing RPKM values, doing normalisation, doing a little bit of statistics. So with all of the tools I've given you in the last six, six and a half hours of streaming, I feel very confident that you should now know how to do an RNA-seq analysis. And of course you can look at other data sets: you have this pipeline, and you can download other RNA-seq data sets from the Sequence Read Archive, with more samples and higher read counts. Just make sure that when you do this, you change the name of the reference inside your pipeline, and that the reference genome, the GTF file and all of the other files come from the same release. Someone asks: just for fun, could you do some randomisation and make more data?
We could, but it would be better to just download more samples. In the end, if you want to do a real analysis, you would probably download something like 50 lung cancers versus 50 non-lung-cancers and apply the same analysis strategies. So that's what I wanted to tell you today. We looked at the project itself and at running the pipeline for all of the files; I didn't do that live, but it's very easy: you just change these two lines of code, or you download the whole script from the description. We did the extraction of the read counts, the computation of the RPKM values, we did differential expression testing, we created a volcano plot and a custom heat map, and then we did the gene annotation analysis, just downloading data from biomaRt, and then used DAVID to very quickly look into the different biological pathways affected within this yeast by adding this molecule to the medium. Good. "Thank you for the really nice lectures." Thank you, Florian, thank you for attending; I enjoy doing it. So we should actually put up a poll, which I'm going to do: there will be a poll on the channel about what you want to see next. Same as last time, when we had four or five choices, like, do you want me to interview someone working in the field; I'm just going to put up a couple of topics again. So if you have any topics you want to suggest, put them down in the comments or throw them in the chat, which is still possible now, but of course not after you've seen this video later; if you're watching this later and want to suggest a topic for a lecture series, do it in the comments. And no, I'm not going to talk about Dutch soccer; that's not my expertise. But I will run a poll on my channel again, and then we'll see what the next four or five lectures are going to be. I'm trying to get into the flow of doing this on a Saturday or Sunday.
So we will have a poll, and you get to decide what I'm going to talk about. And for now, this is all that I wanted to tell you. Thank you so much for your attention, for being here, for supporting the channel, liking, subscribing and all of these things. We're doing really well; the channel has almost 3,000 subscribers, and I'm really, really happy about that. I never expected that when I started, so thank you for that. I will see you very soon, and when the poll goes up, don't forget to vote for what you want to see next. All right, see you next time.