Okay, morning everyone. My name is Siabash and I'm happy to be with you for the next 90 minutes while we go over some gene expression analysis in R. I guess we are about 51 people, so we can start. The logistics here: you can unmute yourself, and you can turn your camera on, it's up to you. This is an interactive workshop, so if you want to ask a question, first raise your hand with any sign and then I'll let you unmute and go ahead. There is no question period at the end; you can ask your questions at any time in between. Just raise your hand and then unmute yourself. So we can start.

The general pipeline for bulk RNA-seq analysis in R is: usually we first get the raw data, the FASTQ files, then do quality control (FastQC), then do alignment to produce a SAM file or its binary version, which is called a BAM file, and then assign reads to genes and create a count matrix. The first four steps are time consuming and beyond the scope of this workshop; you can see the other tutorial that we put in the GitHub link, which uses STAR. What we are focusing on in this workshop is differential expression in R, and we use the DESeq2 library for that. We already processed the FASTQ files, made the BAM files, aligned them, and created the count matrix; we just load that. So, the experimental design and the data that we use... let me answer the question first: yes, the slides are going to be available afterwards, I'm going to post them. Thank you for your question. The experimental design: the dataset we use for this analysis has five different patients, all at a very advanced stage of prostate cancer.
I guess they took biopsy samples from the tumor cells and treated them in two different conditions: first a control condition with DMSO, and then treatment with a drug called SP-2509, and then did RNA-seq for each of them. So five patients, each treated with control and with drug: that's the experimental design. I hope you all installed the libraries in R; let me know if you need any help. Yes, these slides I'm going to put in the workshop link, or I'll send them in an email to you anyway. That's true, it's only one slide, so we're going to add it. Everything is under the files and everything is on GitHub, so you can use the terminal, create a directory for yourself, and clone it: any directory, anywhere, like on your desktop. Just go to your directory, and make sure you are on the correct git branch: run git branch and check. There are some libraries, and we put instructions for installing them. Let me see the questions. I hope you already installed everything, because there's no time for installation now; we assume you cloned the repo from GitHub, and we also put instructions there for how to install each library: some of them from CRAN, some from Bioconductor, and some from GitHub. I hope you did it beforehand; if not, just watch this tutorial (it's recorded) and do it yourself later, it's pretty easy and straightforward. Just clone it into a directory, everything is there, then install all the packages needed based on the instructions there. Assuming you did that, we take it from here: we just load all the libraries needed, some from CRAN, some from Bioconductor, some from GitHub.
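As a rough sketch of that loading step (the exact package list lives in the repo's install instructions, so treat these names as assumptions from context), it looks something like:

```r
# Load the libraries used in this walkthrough.
# Package names are assumptions based on what the workshop mentions:
# DESeq2, AnnotationDbi, org.Hs.eg.db and ComplexHeatmap come from
# Bioconductor; ggplot2 comes from CRAN.
library(DESeq2)          # differential expression analysis
library(ggplot2)         # PCA and other plots
library(AnnotationDbi)   # gene ID mapping helpers
library(org.Hs.eg.db)    # human gene annotation database
library(ComplexHeatmap)  # sample clustering heatmaps
```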
The next step: set your working directory to the directory you are in. You can use setwd(), or go to Session > Set Working Directory and choose your directory from there; either way works. Then, if you want to make sure you are in the correct directory, you can check it with getwd(). I already did that, so I don't need to set it for myself, but I'll do it again. Then we need to create some folders for output: we create one output directory and then another directory inside it for figures, as you can see here; we already did that and it's on the GitHub, where we saved all the figures as PDFs. About how to clone the repo, it's pretty straightforward; I can show you since you asked. Go to the GitHub page, click on Code, and under HTTPS you can get the link. Copy that, come back to your terminal, make sure you are in the correct folder, and then use git clone followed by what you copied. That's it: just git clone plus the link from here. But I assume you already did that, so let's get back to the rest. Yes, Brian, you can unmute yourself and ask your question. You're not on mute. "Thanks. Many of us who work with these kinds of data try to keep a local repository up to date. I'm just wondering if you can walk us through how to do that. You just showed us how to clone your repo, but if you make changes in your GitHub repository, how do we keep a local version of the repository up to date?" So, after you make changes on your local copy, there are a few steps.
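A minimal sketch of the directory setup (the paths here are assumptions; use wherever you cloned the repo):

```r
# Set and verify the working directory.
setwd("~/rnaseq-workshop")  # hypothetical path; point at your clone
getwd()                     # confirm you are in the right place

# Create the output directory and a figures subdirectory inside it.
# showWarnings = FALSE keeps re-runs quiet if the folders already exist.
dir.create("output", showWarnings = FALSE)
dir.create(file.path("output", "figures"), showWarnings = FALSE)
```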
git add, then git commit, then git push: that sends all your changes back to the remote. Just make sure you first set up the remote repository, and I guess you need to do that on your own GitHub, because you cannot push to our GitHub; you can only pull from it. If you want to push later, you have to set another remote repository for yourself and push to that one; this one is the current remote. You can ask your question, yes. "Yes, so if you updated your repository, and we wanted to keep track of any changes in your repository?" That's what I'm saying: you cannot push into our repository, but you can create another repository for yourself on your own GitHub, for example. And you can go here, for example, and see the history of any commits that happen; after git add, git commit, and git push you can keep track of the changes. But that's, I guess, beyond the scope of this course; we can chat offline if you like. So, getting back to the main point: the directories, and then the next step is getting the input data. We put it in the data folder already, so you can load it with these lines, and then we want to look at what we loaded and review it. There are two tables. I described the experimental design: five different patients, treated once with DMSO as control and once with the drug. Two tables are necessary for the analysis, and we put them in the data folder: the first one is the read count table, and the second one is the sample info. After I View() them, I can go through each and explain what it means. A count matrix usually has genes in the rows and the different conditions in the columns: rows are genes, columns are conditions. In the sample info,
we get the same samples in rows: the columns of the read counts are going to be the rows of our sample info. As you can see, for example, we have DMSO number one, two, three, five, and likewise for SP. So for sample info we have DMSO one to five and SP one to five. We have another column called condition, which is either DMSO or SP-2509 treated; replicate, which means the biological replicate number, one to five; and mutation count, which came from whole genome sequencing analysis, the total mutation count in each sample. Diagnosis age is the age at which the patient was diagnosed, this is the PSA (prostate-specific antigen) level, and then the stage of the cancer. The good thing is that all five patients are at the same, very advanced stage, which for our analysis is a good thing: we don't have variability across different stages. And we have the mutation burden, which is another factor that might be important. Going to the question: yes, it's still about git. For our analysis you can just clone it, but you cannot push back to our repository because you don't have the credentials; you can make a remote repository for yourself on your own GitHub and, as people mentioned, set origin master there. And the question is: would different stages affect it? Yes: if the samples were at different stages, we would have an additional factor of variability. We have five biological replicates; imagine if they were at different stages. I'll go through the data and the plots and you'll understand better later, but that would add another factor of variability and make the data harder to interpret. So that's the data that we load. The next step is a little bit of preprocessing.
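The loading step itself might look like this sketch (file names and the read.csv format are assumptions; check the repo's data folder for the actual files):

```r
# Load the two input tables from the data folder.
# File names here are hypothetical placeholders.
read_counts <- read.csv(file.path("data", "read_counts.csv"),
                        stringsAsFactors = FALSE)
sample_info <- read.csv(file.path("data", "sample_info.csv"),
                        stringsAsFactors = FALSE)

head(read_counts)  # rows: genes, columns: samples (DMSO_1..5, SP_1..5)
head(sample_info)  # rows: samples; condition, replicate, mutation count, ...
```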
We want to clean up the read count matrix: we put the gene ID column into the row names and then get rid of that column. If we run this, you can see the read counts here: we put the Ensembl ID as the row name, which makes it cleaner for further analysis. The next step is to increment all the read counts by one, because zero read counts can cause errors in the downstream analysis. And because this is differential expression, that's okay: we increment all genes, all counts, in all conditions, by one. We do that, and if you View() again you can see, for example, this value used to be zero but now it's one. We do the same thing for the sample info: we put the sample column into the row names. We do that and view it again; if you look at the sample info, you can see it's now the row name. There is one question... no, no question. Okay. Before we start our DESeq2 analysis, there are some checks that should be done; otherwise we can hit problems afterwards, and that's a waste of time. We want to make sure, as I mentioned, that the sample info rows are the same as the read count columns. So first we check that all the row names of the sample info are in the columns of the read counts and are equal. Then we make a temporary count matrix from our read count matrix, reordered by the row names of the sample info, and the column names of that temporary matrix should again be equal to the row names of our sample info. We do these three checks, and all should be TRUE: if we see TRUE, that means we are good and the data loaded correctly. Yeah, that's a good point.
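Those preprocessing steps and sanity checks could be sketched like this (the column names `gene_id` and `sample` are assumptions for illustration):

```r
# Move the gene IDs (Ensembl IDs) into the row names, drop the column.
rownames(read_counts) <- read_counts$gene_id  # column name assumed
read_counts$gene_id <- NULL

# Add a pseudocount of 1 everywhere so zero counts
# don't cause trouble downstream.
read_counts <- read_counts + 1

# Same idea for the sample info: the sample names become row names.
rownames(sample_info) <- sample_info$sample   # column name assumed

# Sanity checks: sample info rows must match count matrix columns.
all(rownames(sample_info) %in% colnames(read_counts))  # should be TRUE
all(rownames(sample_info) == colnames(read_counts))    # should be TRUE
tmp <- read_counts[, rownames(sample_info)]            # reorder to match
all(colnames(tmp) == rownames(sample_info))            # should be TRUE
```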
Yeah, we could do that; it's still in the sample info, but I didn't. That's a good point, we could also get rid of that sample column, but since it does no harm I guess I'll leave it there. Okay, so now we want to start our DESeq2 analysis and create the DESeq2 object. There's no time to go deep, but briefly, I'll explain how DESeq2 works. There is a paper from 2010 about DESeq, and if you are interested and want to know more, you can go read it; I think it's the first paper. Based on our read counts, they try to model the variation among our biological replicates as a function of the mean of the read counts. At first you might say a Poisson distribution is the best way to describe it, but Poisson, as you know, has only one parameter: the variance is equal to the mean, so the relationship is forced to be linear, and that doesn't fit real-world data. Another method, edgeR, came out, which is the dashed line here. And then DESeq came, and as you can see it's the best fit, the best prediction of the variation in the read counts given the mean. I'm not going to go through the math, but suffice it to say that instead of Poisson it uses a negative binomial, and a negative binomial has a mean and a dispersion. Dispersion is proportional to variance, but we call it dispersion here, and the means are tied to the size factors. So what DESeq2 does is: first estimate the size factors, and then estimate the dispersions; as I mentioned, based on the mean counts it tries to predict the dispersion using maximum likelihood estimation, and it fits a generalized linear model. And then afterwards we do hypothesis testing to see if the differential expression is significant or not.
The limitation in this kind of data is that we usually have a small number of replicates, in our case five, which makes per-gene estimates unreliable. DESeq2 tries to solve this problem by sharing information across genes: it assumes that genes with a similar expression level also have a similar dispersion. If you look here, genes at the same expression level are assumed to have the same dispersion level; that's the main assumption in the DESeq2 model we use. That's a brief intro; getting back to R, let's see how we can implement it. First we create the DESeq2 object, but for easier interpretation of the results in the later analysis, we name DMSO as the control factor and SP-2509 as the treatment factor; just different names. The first function in DESeq2 is the creator of the DESeq data set, a function called DESeqDataSetFromMatrix, and we have to pass it three things. First, countData: that's our read count matrix that we loaded. Second, colData: that's our sample info, the other file we loaded. And the third thing is our design formula. For example, if you go to the sample info there is a column called condition; we want to say, okay, we want to design our experiment based on this column, comparing just two things: DMSO or drug treated. That's why here we write a tilde and then condition. Now imagine we have another column; like someone asked, what if the stages were different, and we want to take stage into account. We don't have time for that in this session, but you can do it: if you have a similar experiment of your own, with the plus sign you can add any other columns you have into the design and make it multivariate.
But to keep our analysis simple for this workshop, we only use condition. Yes, if the stages were different, you could add stage into the design with a plus sign; it depends on your experimental design and on what you're looking for. You want to study something, and this case is very simple: we want to see how the drug-treated and untreated conditions differ. But imagine a case where you have the drug, you have other types of conditions, as you mentioned, like batch; you sometimes have technical replicates, and you want to see how contamination in your lab can affect the data. It depends on what you are looking for, so you can just add it to your design; but we keep it simple and just use condition here. We create the data set from the matrix; now we have the dds object and we can view it. As you can see here there are different slots, which might be confusing, but that's fine: we can go through each of them. The design you can see here is only condition. The dispersion function, as I mentioned, tries to fit the dispersion based on the means. Then there are rowRanges, colData, and assays. We only have one assay here, but imagine an experiment with multiple assays; for this experiment it's just one, though sometimes there are several. If you click on colData and go to listData, you can see the sample, the condition, the replicate, the mutation count, diagnosis age, PSA, and the stage: the same columns that are in the sample info. So you can go to colData and then listData and see them, but we can also access them another way, which I'll show a few lines later. And because we have only one assay, as I mentioned; if we had different assays we could add them, but I just want to show you it's the same thing.
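Putting that together, creating the DESeq2 object looks roughly like this (variable names follow the sketches above; `stage` in the comment is a hypothetical extra column):

```r
# Create the DESeq2 data set from the count matrix and sample info.
# 'condition' is the sample_info column we want to test on.
dds <- DESeqDataSetFromMatrix(countData = read_counts,
                              colData   = sample_info,
                              design    = ~ condition)

# With more factors (e.g. a hypothetical 'stage' column in sample_info),
# the design could be extended: design = ~ stage + condition
dds  # print the object: slots for design, rowRanges, colData, assays
```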
You can get the counts of the dds object using the counts() function, or using assays() and the dollar sign, assays(dds)$counts. As you can see, they're the same object created here; I just wanted to show they're identical, so either way you can access your count data. I showed you that there are multiple columns you can reach via colData and listData, but there's an easier way to access them, for example PSA, replicate, and condition: just use dds and the dollar sign. As you can see the different conditions are there, DMSO, DMSO, DMSO, SP, and so on, along with the five replicates, and PSA as well. Another step we want to do: because low read counts can be noise or contamination, we want to narrow down our analysis, partly just to save time. Here we set a threshold of at least 10 reads; you can go further, to any other minimum you want. We set this threshold because we want to make sure only genes with higher counts come into our final matrix. We define a variable keep and subset our dds object to retain only the rows we want; if we view the counts afterwards, you can see the low-count genes are gone. The next step is to reorder the condition levels. We want to set DMSO as the control, using the relevel() function from the DESeq2 workflow, applied to the dds object. You can change your condition factor and set a reference: we want the reference to be DMSO, our control; you have to set one of the levels as the baseline. And, as I mentioned, we defined the control factor as a variable, which is DMSO, created earlier to make life easier.
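A sketch of those three steps (the `rowSums(...) >= 10` filtering rule follows the standard DESeq2 vignette; the workshop's exact rule may differ, and the level name "DMSO" assumes the labels shown in the sample info):

```r
# Two equivalent ways to pull the count matrix out of the dds object.
head(counts(dds))
head(assays(dds)$counts)

# Columns of the sample info are directly accessible with '$'.
dds$condition
dds$replicate

# Keep only genes with at least 10 reads in total across samples.
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]

# Make DMSO the reference level so fold changes read as drug vs. control.
dds$condition <- relevel(dds$condition, ref = "DMSO")
```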
Now, after this releveling and after creating the data set from the matrix, we need to run the main function in the DESeq2 library, which is called DESeq(), on the dds object we have. We just run it, and as you see it prints: estimating size factors, estimating dispersions, fitting the model; that's what DESeq2 does, and now it has fit. We can view it again: here is the dds object with roughly 34,000 genes and the different columns we can see. About the threshold question: actually, it's up to you. Sometimes you need to play around with it; sometimes you don't have much time for your analysis and you want to make sure you only get the top genes, so you set it higher. It's not even strictly necessary to have a threshold at all; you can keep all the genes, but to keep things cleaner, simpler, and faster, that's what we usually do. Sometimes you're looking for the top genes and you know that anything under 100 counts, for example, doesn't matter to you; it depends on what you're looking for. After we're done with DESeq() and have the fitted dds object, we want to check the DESeq2 analysis by visualizing it. The first reason is that we want to make sure everything is correct in our experiment: sometimes you get data from the lab and you want to see whether there is substantial contamination before you analyze further, or you do the check and realize the experiment didn't work, so you go try different things. We can do some easy checks, like plotting. One of them is the dispersion estimate plot. If you plot the dispersions, here's the plot you see: as I mentioned, the x axis is the mean of normalized counts and the y axis is the dispersion. DESeq2 tries to fit the dispersion based on the mean of normalized counts: this is the fitted line, and these are the real data.
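In code, these two steps are short (a sketch; `plotDispEsts` is the DESeq2 built-in behind the dispersion plot described here):

```r
# Run the main DESeq2 pipeline:
# size factors -> dispersion estimates -> GLM fit -> Wald tests.
dds <- DESeq(dds)

# Visual check of the dispersion fit: black points are per-gene
# estimates, the red line is the fitted trend, and blue points are
# the final estimates after shrinkage toward the trend.
plotDispEsts(dds)
```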
And, as I mentioned, there is an assumption in DESeq2 that genes with similar expression levels also share similar dispersion; that way it tries to avoid some of the noise and outliers. This is the final fitted line, and it looks good; this is the shape you should be looking for, because it makes sense: anything with higher counts has lower dispersion, and anything with lower counts has higher variability. For example, high-count genes like housekeeping genes are usually here, because they have high counts and lower variability; some transcription factors, genes that are not expressed that highly, are in this part of the plot. So this is a plot where everything looks okay. The next thing we do is use the results() function on the dds object: we get the results of the DESeq2 analysis and put them in a variable res. To view it better, I create a data frame from my res variable, and then we can View() the data frame. You can see here the Ensembl IDs as row names for the genes, and you see baseMean and log2FoldChange: for each gene you can see how it changed between the control and drug-treated conditions. And you can see the other columns: the test statistic and the p value. As I mentioned, after calculating the log2 fold change, DESeq2 also does hypothesis testing to see if the change is significant or not, and produces a p value and a q value, an adjusted p value, based on it. Because we have a multiple comparison situation here: we have a lot of genes.
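Extracting and inspecting the results table is one line each (a sketch using the variables from the walkthrough):

```r
# Extract the differential expression results. With DMSO set as the
# reference level, fold changes are drug (SP-2509) vs. control.
res <- results(dds)

# Convert to a plain data frame for easier viewing.
res_df <- as.data.frame(res)
head(res_df)  # baseMean, log2FoldChange, lfcSE, stat, pvalue, padj
```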
So we need to take the adjusted p value, or q value, into account; I believe DESeq2 uses the Benjamini-Hochberg procedure for the adjusted p values. We care about the adjusted value because there are 30,000 to 40,000 genes, and with a 5% false discovery rate applied gene by gene that would be a lot of false positives, so using the adjusted p value makes the analysis more stringent. Okay, going to a question: some of the padj values are NA. That's a good question; I'm not sure how to answer it off the top of my head, so I should check offline afterwards and we can be in touch over email. Where does the plot get its data? Just pass the dds object, because this function is in the DESeq2 library; you pass dds itself, not dds's elementMetadata. Can we compare limma-voom and DESeq2? We can; the comparison is not that hard, just a few lines of code, but I haven't done it here. That's a good idea, we could go over it. Question: is there a way to determine which library a data frame or method call is coming from? All the functions I mentioned here, the main ones, DESeqDataSetFromMatrix and counts(), and also results(), are in the DESeq2 library in R; they're built-in functions, and you can just pass the dds object and it's recognized as a DESeq2 object. Yeah, that's true. Some of them, 2,000 of them, are the NA ones; I'd have to check. Thank you for the link with the explanation. About finding results() in the help: it's a method, so if it doesn't show up directly in the help, go through the DESeq2 documentation. Yes, you can unmute. "Hi, thank you so much for answering the question."
"Why does the data frame, when I call dds's elementMetadata, have zero columns? Yours didn't have zero columns. Why did you use element...?" For which function do you mean, the results() function? "This was one of the calls that we had; let me see, when you're calling dds. I'm trying to find the function; maybe I'll look for it again and paste it for you." Yes, just pass dds itself. What I showed works because these functions are built-in functions of the DESeq2 library, and they recognize dds as a DESeq2 object. You don't have to dig into the slots like elementMetadata; results() knows what dds is and where to look, because it's a built-in function of DESeq2. "I see, I actually got it now. Thank you so much." No problem. Okay. Another plot we can use is the MA plot; it's very useful, and it's what people show most of the time in courses on DESeq2. You've probably seen it before: a funnel-shaped plot with the mean of normalized counts on the x axis and the log2 fold change on the y axis, where the blue dots are the significant ones. That means if the counts are lower, the fold change needs to be larger to be significant, but if the counts are higher, even a smaller fold change can come out significant. What we look for is in the positive or negative direction at high counts, the top right or bottom right: those are the highly significant genes we're after. Also, we can get a summary of our results: with summary() you can see the total number of genes with nonzero read counts, and, at an adjusted p value of 0.1, what percentage of them are up-regulated and what percentage are down-regulated. It's just a general overview of your data and your results; summary() gives you that. So, another step.
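Both the MA plot and the summary are one-liners on the results object (a sketch; the y-axis limits are an arbitrary choice):

```r
# MA plot: mean of normalized counts (x) vs. log2 fold change (y).
# Significant genes (default threshold padj < 0.1) are highlighted.
plotMA(res, ylim = c(-4, 4))

# Text overview: counts of up- and down-regulated genes at padj < 0.1,
# plus outliers and low-count genes.
summary(res)
```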
Another thing in DESeq2 is shrinkage. Shrinkage is essentially a correction: we try to improve the estimates and avoid biases and outliers using a posterior distribution. We want to make sure that when the log2 fold change is large, it's because of the data, not because of low counts and noise; so we reduce the bias and shrink it. It's a similar concept to what I mentioned for the dispersion, with the assumption that genes with similar expression levels have similar dispersion. For this fitting we use apeglm (approximate posterior estimation for GLM, generalized linear model, coefficients) to fit our estimates better and get rid of biases. We pass the dds object to the lfcShrink() function, which again is a built-in function in DESeq2, and we set the type to apeglm. So we shrink the fold changes. In general, what shrinkage does is act as a correction factor: it reins in outliers and biases. Let's apply this. If you run it, you can see there's a paper; you can read it if you want to know more about the shrinkage analysis. You can follow this link; it's a good paper on this. Then we plot the MA plot again based on resLFC, the shrunken results, and you can see the points that were presumably outliers and noise are gone; our results are more stringent. The remaining points, as you see, are mostly highly significant. That's the advantage of using shrinkage in DESeq2. Yeah, thank you for answering: the blue points are the statistically significant ones. So, the next step is that we want a cutoff on the FDR: we set a cutoff for the false discovery rate.
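A sketch of the shrinkage step (assumptions: the apeglm package is installed, and the coefficient name below matches the "SP2509 vs DMSO" contrast; always check resultsNames() for the exact name in your session):

```r
# List the model coefficients so we can name the one to shrink.
resultsNames(dds)

# Shrink the log2 fold changes to damp noisy estimates from
# low-count genes. Coefficient name is an assumption; copy it
# from the resultsNames() output above.
resLFC <- lfcShrink(dds, coef = "condition_SP2509_vs_DMSO",
                    type = "apeglm")

# Re-plot the MA plot: the noisy extreme fold changes are pulled in.
plotMA(resLFC, ylim = c(-4, 4))
```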
It depends on you: if you want to be stringent you can set 5%, or 10%. We set it to 5% here because, as you can see, we got a lot of significant genes, so we can afford to go further. But sometimes in your analysis you don't see a lot of significance and you still want something on your plate, so you go with 10%; as you see, some papers even go to 20% as the FDR cutoff. We set it to 5%, and you can see res with the cutoff applied: now our results keep only the genes whose false discovery rate is less than 5%. The next step is a little bit of data preprocessing. As you see here we use Ensembl IDs, but we want the gene names, because nobody remembers what an Ensembl ID refers to; it's better and cleaner, once you plot and work with genes, to see the gene names. So we get them from HGNC, the human gene nomenclature database. There is a library in Bioconductor that we already installed, AnnotationDbi, and another one, org.Hs.eg.db; using these two libraries we map Ensembl IDs to HGNC gene symbols, just mapping the names. If you run it again, that does the mapping. The warning you see here, about one-to-many mapping for example, is nothing to worry about: among the Ensembl IDs, not all are protein-coding genes; some of them are non-coding RNAs or microRNAs, while the HGNC symbols here are for the protein-coding genes, so some entries just remain empty. Then we reorder so the gene ID column, which now holds the HGNC gene name, comes first. Now we can look at the head: geneID is now the first column, holding the name, and the Ensembl ID is still the row name. Again we can make a data frame and take a look at it: geneID, Ensembl ID.
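These two steps might be sketched as follows (the new column name `gene_id` is an assumption; the keytype/column values follow org.Hs.eg.db conventions):

```r
# Keep only genes below the FDR cutoff (5% here).
resFDR <- subset(res, padj < 0.05)

# Map Ensembl IDs to HGNC gene symbols. Non-coding transcripts
# without a symbol come back as NA, which matches the warning
# seen in the workshop.
res$gene_id <- mapIds(org.Hs.eg.db,
                      keys      = rownames(res),
                      column    = "SYMBOL",
                      keytype   = "ENSEMBL",
                      multiVals = "first")
head(as.data.frame(res))
```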
And now we have the gene names here too. Next we want to create a data frame, a lookup table between the gene symbol and the Ensembl ID, which comes in handy later; you'll see it in the plotting, because you'll want to plot a particular gene, and you know the gene name, but the row names are Ensembl IDs. To make life easier, we create a data frame mapping the two, called gene_list, and we can view it: it's just a mapping from HGNC gene ID to Ensembl ID. We also save the results table into our output folder, for future use; that's the result file, and if you want a spreadsheet view of your data, you can open it and take a look. We're about to start the visualization part, so let me just answer the question first: yes, we could delete the genes that are not uniquely mapped. But the main thing we care about is that, for this analysis, we want to plot a gene and know which Ensembl ID it corresponds to; if some entries are unmapped or empty, we don't really care. If you want to make your results cleaner, though, yeah, that's a good idea. For visualization, especially for PCA (principal component analysis), it's better to first do a transformation called the variance stabilizing transformation, VST. With VST, starting from the fitted dispersion model, we transform the data in our dds object so it has roughly constant variance along the range of mean values; that property is called homoskedasticity, when the variance is constant across a similar range of means. And because this is again differential expression, we apply the transformation to all of it, so it doesn't matter; it's just a transformation of the data to make it better for visualization.
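A self-contained sketch of the lookup table and the save step (the output file name is a hypothetical placeholder; the symbol mapping is recomputed here so the block stands alone):

```r
# Build a symbol <-> Ensembl ID lookup table for use in later plots.
symbols <- mapIds(org.Hs.eg.db, keys = rownames(res),
                  column = "SYMBOL", keytype = "ENSEMBL",
                  multiVals = "first")
gene_list <- data.frame(gene_id    = symbols,
                        ensembl_id = rownames(res))
head(gene_list)

# Save the full results table to the output folder for later inspection.
write.csv(as.data.frame(res),
          file.path("output", "deseq2_results.csv"))
```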
So we do this: there is a built-in function, varianceStabilizingTransformation, and you pass it your dds object, and there is a blind argument; make sure to pass it FALSE. If you set blind to TRUE, it recalculates the dispersion function from scratch and ignores the fact that the conditions are biologically different. In our case the two conditions are biologically different and we want to take that into account; sometimes you do want the transformation to be blind to the experimental design, but not here, because for example the variance might be higher in the drug-treated samples than in the control. So make sure to keep it FALSE for this analysis. We run it and we can view the VST object; as you can see, the data is transformed: the numbers are on a different scale from the raw read counts you had before, they are closer together, and the variance has been handled. If you want to draw the PCA, make sure to pass the VST variable here; we combine condition and replicate into a group in our PCA data, and for the percent variance we round the proportion of variance and multiply by a hundred to make it a percentage. Then we prepare the PCA data, pass it to the ggplot function, and see what the PCA looks like. As I mentioned, this is real-world data, not some made-up data of the kind people sometimes show where everything clusters perfectly. For real-world data it's actually pretty nice, because there is always lab contamination and other issues with samples; when you do experiments, everything is not always clean. But even so, you can see one cluster for DMSO and one cluster for the SP-treated samples in the PCA plot.
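In DESeq2 syntax, the transformation and PCA steps just described look roughly like this; `dds` is the DESeq2 object built earlier, and the metadata column names `condition` and `replicate` are assumptions about the workshop's sample sheet:

```r
library(DESeq2)
library(ggplot2)

# Variance stabilizing transformation; blind = FALSE keeps the design
# (DMSO vs SP-treated) in view instead of re-estimating dispersions blindly.
vsd <- vst(dds, blind = FALSE)

# plotPCA with returnData = TRUE returns coordinates plus the percent
# variance explained, stored as an attribute.
pca_data <- plotPCA(vsd, intgroup = c("condition", "replicate"),
                    returnData = TRUE)
percent_var <- round(100 * attr(pca_data, "percentVar"))

ggplot(pca_data, aes(PC1, PC2, color = condition)) +
  geom_point(size = 3) +
  xlab(paste0("PC1: ", percent_var[1], "% variance")) +
  ylab(paste0("PC2: ", percent_var[2], "% variance"))
```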
Yeah, that's a good point, Zach. And yes, you can save it to PDF with this instruction. We already did it, it's on the GitHub, but you can do it yourself in your repository or on your local machine, so if you want to present the plot anywhere you can just make it a PDF. The next step is clustering. For clustering, we take the VST variable, pass it through assay(), transpose it, and use this function to create the sample distances; then with some preparation we make it a matrix and set a few options for clustering. Now we can draw our clustering plots. Here, for example, we use Pearson as the distance function; you can use anything, Euclidean, Manhattan, Canberra, Spearman, or any other distance. We use Pearson; whatever looks better for your data, just come here and change it. And here is the clustering. You can see patients four and five: the drug-treated samples of four and five and the DMSO samples of four and five cluster together, and for patients one, two, and three you see the DMSO and drug-treated samples also cluster together, which is nice because they are the same patient, the same person; the condition is different, but it's still the same person. If you study the plot further you can capture other interesting facts. And again, you can save it as a PDF. Now we can use the ComplexHeatmap library that we loaded; let me first see the question. Yes, in the heat map we are actually doing hierarchical clustering. You can see, for example, that these two are first clustered with each other, those two with each other, and then the pairs are clustered together; that's hierarchical clustering, and you can see the dendrogram lines here.
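The "one minus Pearson correlation" distance used for the sample clustering can be reproduced in base R; here is a small sketch on a synthetic matrix standing in for the VST data:

```r
set.seed(1)
# Synthetic VST-like matrix: 100 genes x 6 samples, with a shared per-gene
# treatment effect added to the three "SP" samples.
mat <- matrix(rnorm(600, mean = 5), nrow = 100, ncol = 6)
effect <- rnorm(100)
mat[, 4:6] <- mat[, 4:6] + effect
colnames(mat) <- c("DMSO_1", "DMSO_2", "DMSO_3", "SP_1", "SP_2", "SP_3")

# Pearson correlation between samples, turned into a distance object.
sample_dists <- as.dist(1 - cor(mat, method = "pearson"))

# Hierarchical clustering of the samples; plot(hc) draws the dendrogram.
hc <- hclust(sample_dists)
```

Swapping `"pearson"` for `"spearman"`, or using `dist()` with `"euclidean"` or `"manhattan"`, gives the alternative distances mentioned above.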
And also in the rows you can see these two, replicates four and five of the drug condition, and then they cluster with the same patients untreated. You can cut the tree anywhere you want and say, okay, these are clustered together; that's the nice thing about hierarchical clustering. It depends on your application and what you find useful here. Sometimes it's just a check that you did your experiment correctly, that the data makes sense: if your drug-treated and untreated samples don't separate at all, you see, okay, there might be no effect. It looks better in this ComplexHeatmap version. First we take the top 30 differentially expressed genes and order them, basically ranking them, to create the heat map data; these are the top 30 genes. Then again we map them to HGNC symbols, because they are Ensembl IDs. Now if you look at the heat map data, you can see we have gene_id here too. The NA here is probably a gene that is not protein-coding. Then we do some cleaning: we omit some columns we don't want and set the row names to gene_id. Now we subset the heat map data to those top 30 gene IDs. The other thing is, for the heat map to look nice, we scale and center the rows with the scale function; we scale the data so outliers don't dominate and so it's centered. For example, we have heat_map_data, and heat_map_2 is the scaled version; we can look at both. In heat_map_data a value here is something like 10 or 11, but the same row in the scaled version is centered, roughly between minus one and one. It's still the differential expression signal, but scaling avoids outliers and makes your heat map look better.
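The row scaling is plain base R: `scale()` works on columns, so we transpose, scale, and transpose back. A minimal sketch with a toy matrix:

```r
# Toy expression matrix: 3 genes x 4 samples on a raw-ish scale.
heat_map_data <- matrix(c(10, 11,  9, 12,
                           2,  1,  3,  2,
                           5,  8,  6,  7),
                        nrow = 3, byrow = TRUE,
                        dimnames = list(c("GENE_A", "GENE_B", "GENE_C"),
                                        paste0("S", 1:4)))

# Center and scale each gene (row) so the heat map colors are comparable
# across genes and outliers don't dominate the color scale.
heat_map_2 <- t(scale(t(heat_map_data)))

rowMeans(heat_map_2)       # approximately 0 for every gene after centering
apply(heat_map_2, 1, sd)   # 1 for every gene after scaling
```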
So we do this, and here we just set the colors and everything for the heat map. Again we use Spearman distance with centroid linkage for the clustering; you can change this if something else looks better, feel free, for example switch the distance method from Spearman to Manhattan, Euclidean, Pearson, or anything you like. And here is the nicer version of our heat map. You may need to zoom in because we have 30 genes; maybe with 20 it would look better, you can try for yourself. The left part of the map is your control, DMSO, and the right part is drug-treated. You can see, for example, which genes are up-regulated and which are down-regulated; these are the top 30 most significant ones. Then you go through the gene names looking for the most interesting ones. For example, CDKN1A is, I believe, related to prostate cancer, or VEGF, which drives angiogenesis to accelerate blood flow to the tumor. So you look for genes that seem interesting based on the literature and that you want to analyze further. Yeah, thank you for answering; yes, dev.off() is what closes the PDF device. So again, we can save it as a PDF. Another useful plot is a volcano plot. We plot it for all genes here, but we specify some cutoffs: for example a p-value cutoff of 0.05, a log2 fold change cutoff of plus or minus one, or the combination of the two. We use the EnhancedVolcano function from one of the libraries we loaded, and here you can see your volcano plot. The red points are the ones where both the p-value is below 0.05 and the log2 fold change is above plus one or below minus one; other colors pass the p-value cutoff only or the fold change cutoff only, and the gray ones are not significant.
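The volcano plot call is roughly this; `res` is the results table from earlier, and using the `gene_id` column for labels is an assumption about the workshop code:

```r
library(EnhancedVolcano)

# Volcano plot of all genes: red = passes both cutoffs, gray = neither.
# Labels use the HGNC symbols added earlier instead of Ensembl IDs.
EnhancedVolcano(res,
                lab      = res$gene_id,
                x        = "log2FoldChange",
                y        = "pvalue",
                pCutoff  = 0.05,   # p-value cutoff
                FCcutoff = 1)      # |log2 fold change| cutoff
```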
So basically, the right side of the plot holds the up-regulated genes and the left side the down-regulated ones, and we are looking for points that are far to the right or left and high up. For example, this one looks very significant in our analysis, but we need to follow up on this gene and check the literature to see whether it makes sense; CDKN1A, which I mentioned, is here for example. It requires knowledge of your study, what you are doing and which genes you are interested in, and then you can get some nice candidates to work on further. That's the volcano plot, and again you can save it as a PDF. For the last plots, we pick some genes that we think are interesting, like CDKN1A, or VEGF, or BRCA1, which is important in all types of cancer. You can select any of them to start with; you don't have to run all of them, just pick one, and that is the gene you will plot. Remember the data frame called gene_list that I said would come in handy later: it maps the gene name to the Ensembl ID. We use it to create another variable called gene_counts, looked up by gene ID. Now we can plot different things, such as dot plots. For example, in the ggplot aesthetics we pass condition and count: the x axis is the condition, DMSO and SP, for the different replicates, and the y axis is the normalized read counts. For this gene, CDKN1A, you can see how it's expressed across replicates, and it matches what we saw before. Now we can play around with the other columns I mentioned in the sample data, for example tumor mutational burden, TMB; you can add another jitter layer that uses it as the color and size.
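The per-gene count plot can be sketched with DESeq2's `plotCounts`; `dds` and `gene_list` are the objects from earlier, and the `condition` column name is an assumption about the sample sheet:

```r
library(DESeq2)
library(ggplot2)

gene_name <- "CDKN1A"
# Translate the HGNC name to the Ensembl ID used as row names in dds.
gene_ens <- gene_list$ensembl_id[gene_list$gene_id == gene_name]

# returnData = TRUE gives a data frame of normalized counts per sample
# instead of drawing the base-graphics plot.
gene_counts <- plotCounts(dds, gene = gene_ens,
                          intgroup = "condition", returnData = TRUE)

ggplot(gene_counts, aes(x = condition, y = count)) +
  geom_jitter(width = 0.1, size = 3) +
  labs(y = "normalized read count", title = gene_name)
```

Mapping an extra metadata column (TMB, PSA) to `color` or `size` inside `aes()` gives the color-coded variants discussed next.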
So for example, here you have DMSO, your drug, and the normalized read count, but color-coded: the darker points have lower mutational burden, and the brighter blue ones higher. As you can see, the gene is expressed more under treatment, but there is no correlation with mutational burden, which makes sense, because for example this dot and this dot are the same person; the drug doesn't change the mutational burden, and at the transcription level it's still the same person. These two are the same person, and these three are the same person. You can play around with other columns, for example PSA, and you see the same thing: the gene is highly expressed, and PSA is higher for this one, but this is the same person, and so is this, so there is no correlation to find. But you can look at other conditions; if you had different stages, for example, you could color-code by stage, plot any of them, go through them manually one by one, check which one looks more interesting, and analyze it further. Then you can save these three plots to PDF, go back, select another gene, VEGF for example, and repeat the same process for any gene you choose and any of the plots. For VEGF, it's again highly expressed, but with the same pattern: this is the same person, this is the same person. So you can basically go through the different columns in your sample data and try to see whether there is any correlation between the control and the drug-treated condition. As for generating the count matrix: you can use STAR, as I mentioned, but there is also a subread-based option; there is a library for it in R.
If I remember correctly it's the Rsubread package: you pass your BAM files and a gene annotation, and it gives you the count matrix as output. The Rsubread package is on Bioconductor; you can do it that way, yes. Okay, so that's it for today. If you have any questions, I'm happy to answer, and I'm going to make sure you get the slides. Yeah, we wanted to include pathway analysis based on GO, but first there was the time limitation, and second, installing some of the packages for pathway analysis, like igraph, can be difficult; the installation sometimes doesn't go well on Mac. So we decided to leave the pathway analysis out of this session. But we are happy to share the code with you if you want to work on it, because we already have it. No, it's not in the repo; we removed it because it would have been confusing, but I can add another file, like this one called analysis, I can add an analysis-with-pathway version and push it again. Yeah, I tried to do that; we had the code, but if you're using a Mac the installation can be a problem, because for pathway analysis you need to build a graph over the genes, and it's especially tricky on the M1 chip; with an Intel chip it might be easier. For the alignment, if you go to our GitHub repository, at the end there is a link to his GitHub; he has the complete STAR tutorial. If you go to the additional notes, there is a link at the end. So yeah, for advanced RNA sequencing analysis, I guess now is the time to transition from bulk RNA sequencing to single-cell RNA sequencing; if you learn single-cell RNA sequencing, that would be great, it's a very hot topic. Yeah, and it's Seurat, that's right, Seurat in R that you can use. It's very popular right now.
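The Rsubread route mentioned above would look roughly like this; the BAM and GTF file names are placeholders:

```r
library(Rsubread)

# Count reads per gene from aligned BAM files using a GTF annotation;
# fc$counts is the gene-by-sample count matrix you would feed to DESeq2.
fc <- featureCounts(files = c("sample1.bam", "sample2.bam"),
                    annot.ext = "genes.gtf",
                    isGTFAnnotationFile = TRUE,
                    GTF.featureType = "exon",
                    GTF.attrType = "gene_id")
head(fc$counts)
```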
People are trying to transition from bulk RNA sequencing to single-cell RNA sequencing, so in the future you're going to hear more about it. Okay, if there are no more questions, thank you for your attention, everybody, and we can end the session. I hope you learned something from it.